Headless Browsing with Puppeteer & Playwright on Pipedream
You can use any package from the npm ecosystem in Pipedream's Node.js code steps (and any PyPI package in Python code steps). Simply import
a package in your code, and Pipedream will handle installing and bundling it for you.
No package.json required; just focus on writing code.
However, supporting the Puppeteer and Playwright browser automation libraries has been an ongoing challenge, because these packages rely on a full Chromium browser running locally to interface with.
But we've just overcome this hurdle with a specialized npm package, @pipedream/browsers, that's compatible with Pipedream out of the box.
This package contains the exact combinations of Puppeteer, Playwright and Chromium that are compatible with Pipedream's Node.js code steps. Simply import the @pipedream/browsers package, and launch a Puppeteer or Playwright browser.
Below is an example of scraping a webpage's title & HTML content in a few lines of code:
import { puppeteer } from '@pipedream/browsers';
export default defineComponent({
  async run({steps, $}) {
    const browser = await puppeteer.browser();
    // Interact with the web page programmatically
    const page = await browser.newPage();
    await page.goto('https://pipedream.com/');
    const title = await page.title();
    const content = await page.content();
    // The browser needs to be closed, otherwise the step will hang
    await browser.close();
    return { title, content };
  },
})
What can you do with Puppeteer and Playwright?
Puppeteer and Playwright fully emulate a browser, including JavaScript rendering. This allows you to programmatically interact with websites as a real user would, not just retrieve the static HTML content.
Potential actions include:
- Automate login/registration for websites
- Click on JavaScript-powered links
- Smoke test web applications
You can chain these browser actions with other actions in your Pipedream workflows to:
- Notify you over Slack or email when a website's HTML changes
- Store scraped content in a database like RDS, MongoDB, or Supabase
- Automate login, checkout, or other actions on websites on a schedule
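As a rough illustration of the login automation mentioned above, here is a minimal sketch written against Puppeteer's page API. The `login` helper, its option names, and every URL and selector passed to it are hypothetical placeholders; a real site will need its own selectors and may require extra steps.

```javascript
// Hypothetical helper: fills in a login form and submits it using a Puppeteer page.
// None of the selectors or URLs come from a real site; supply your own.
async function login(page, options) {
  const { url, usernameSelector, passwordSelector, submitSelector, username, password } = options;

  await page.goto(url);
  await page.type(usernameSelector, username);
  await page.type(passwordSelector, password);

  // Start waiting for the navigation before clicking, so it is not missed
  await Promise.all([
    page.waitForNavigation(),
    page.click(submitSelector),
  ]);

  // Return the URL the browser landed on after submitting the form
  return page.url();
}
```

In a code step, you would pass it the page returned by `browser.newPage()` and return or export the resulting URL for downstream steps. Credentials are best read from environment variables rather than hard-coded.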
Difference between HTTP Requests and Puppeteer/Playwright
Pipedream's HTTP Request action can send an HTTP request to any website, and you can scrape the raw HTML in downstream steps. However, there are a few downsides to this approach:
- Not all websites render all of their content on the server. SPAs (Single Page Applications) retrieve content in the browser from an API or some other source.
- The HTML content isn't rendered in a browser, so clicking links, submitting forms, and following the resulting redirects is not possible.
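To make the first downside concrete, the sketch below reads an element's text only after the page's JavaScript has had a chance to render it, which a plain HTTP request cannot do. It assumes a Puppeteer page (the `networkidle0` option is Puppeteer's name; Playwright uses `networkidle`), and the `getRenderedText` helper name is made up for illustration.

```javascript
// Hypothetical helper: returns the text of an element once client-side rendering has produced it.
async function getRenderedText(page, url, selector) {
  // 'networkidle0' resolves once there have been no network connections for at least 500 ms
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Wait until the element exists in the DOM (it may be injected by JavaScript)
  await page.waitForSelector(selector);

  // $eval runs the callback inside the page against the first matching element
  return page.$eval(selector, (el) => el.textContent.trim());
}
```

Waiting on a specific selector is usually more reliable than a fixed delay, since it returns as soon as the content appears.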
Examples
Just like any other npm package, simply import it at the top of a Node.js code step in a Pipedream workflow to install it. This package exports two modules, playwright and puppeteer, with a common interface to launch either browser.
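Because both exports share a launch interface, the same scraping logic can in principle be handed either one. The sketch below assumes the `playwright` export offers the same async `browser()` method that the `puppeteer` export shows in this post; the `getTitleAndHtml` helper is a hypothetical name used only for illustration.

```javascript
// Hypothetical helper: accepts either the puppeteer or playwright export from @pipedream/browsers.
// Assumption: both expose an async browser() method, as the post describes.
async function getTitleAndHtml(launcher, url) {
  const browser = await launcher.browser();
  const page = await browser.newPage();
  await page.goto(url);

  // page.title() and page.content() exist in both Puppeteer and Playwright
  const title = await page.title();
  const content = await page.content();

  // The browser needs to be closed, otherwise the step will hang
  await browser.close();
  return { title, content };
}
```

Inside a code step, you would import `puppeteer` or `playwright` from `@pipedream/browsers` and pass it as the first argument, for example `getTitleAndHtml(playwright, 'https://pipedream.com/')`.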
Take a screenshot of a webpage
import { puppeteer } from '@pipedream/browsers';
export default defineComponent({
  async run({steps, $}) {
    const browser = await puppeteer.browser();
    const page = await browser.newPage();
    await page.goto('https://pipedream.com/');
    const screenshot = await page.screenshot();
    // Exports a Buffer instance of the screenshot
    // Can be saved to /tmp or uploaded to other services
    $.export('screenshot', screenshot);
    // The browser needs to be closed, otherwise the step will hang
    await browser.close();
  },
})
Taking a screenshot is also available as a no-code pre-built action.
Get a PDF of a webpage
import { puppeteer } from '@pipedream/browsers';
export default defineComponent({
  async run({steps, $}) {
    const browser = await puppeteer.browser();
    const page = await browser.newPage();
    await page.goto('https://pipedream.com/');
    const pdf = await page.pdf();
    // Exports a Buffer instance of the PDF
    // Can be saved to /tmp or uploaded to other services
    $.export('pdf', pdf);
    // The browser needs to be closed, otherwise the step will hang
    await browser.close();
  },
})
Generating a PDF of a webpage is also available as a no-code pre-built action.
Get the HTML of a webpage
import { puppeteer } from '@pipedream/browsers';
export default defineComponent({
  async run({steps, $}) {
    const browser = await puppeteer.browser();
    const page = await browser.newPage();
    await page.goto('https://pipedream.com/');
    // Retrieve the HTML content of the rendered webpage
    const content = await page.content();
    $.export('content', content);
    // The browser needs to be closed, otherwise the step will hang
    await browser.close();
  },
})
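When the full HTML is more than you need, it may be simpler to pull out just the elements of interest inside the page. The sketch below collects every link using `page.$$eval`, which is available in both Puppeteer and Playwright; the `extractLinks` helper name and the shape of its return value are illustrative choices, not part of the package.

```javascript
// Hypothetical helper: returns the text and target of every link on a page.
async function extractLinks(page, url) {
  await page.goto(url);

  // $$eval runs the callback inside the page against all matching elements
  return page.$$eval('a[href]', (anchors) =>
    anchors.map((a) => ({ text: a.textContent.trim(), href: a.href }))
  );
}
```

The result is a plain array of objects, so it can be exported with `$.export` and consumed by downstream steps without further parsing.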
Click on an HTML element
import { puppeteer } from '@pipedream/browsers';
export default defineComponent({
  async run({steps, $}) {
    const browser = await puppeteer.browser();
    const page = await browser.newPage();
    // Open the Pipedream Blog
    await page.goto('https://pipedream.com/blog');
    // Click the first blog post link and wait for the resulting navigation,
    // so the URL read below is the post's URL rather than the blog index
    await Promise.all([
      page.waitForNavigation(),
      page.click('.post-feed .post-card-content-link'),
    ]);
    // page.url() is synchronous in Puppeteer
    const url = page.url();
    $.export('url', url);
    // The browser needs to be closed, otherwise the step will hang
    await browser.close();
  },
})
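The examples above close the browser as their last statement, which means an error thrown earlier (a failed navigation, a missing selector) would skip the close and could leave the step hanging. One way to guard against that is a small wrapper built on `try`/`finally`. This is a sketch only: `withBrowser` is a hypothetical helper, and it assumes the launcher exposes the async `browser()` method shown in this post.

```javascript
// Hypothetical helper: runs a task with a browser and always closes it afterwards.
async function withBrowser(launcher, task) {
  const browser = await launcher.browser();
  try {
    // Whatever the task returns becomes the wrapper's return value
    return await task(browser);
  } finally {
    // Runs whether the task resolved or threw, so the browser is never left open
    await browser.close();
  }
}
```

In a code step, the body of `run` could then become a single call such as `withBrowser(puppeteer, async (browser) => { ... })`, with the page logic inside the callback.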
Learn more about browser automation
See the Puppeteer and Playwright documentation on how to leverage headless browser automation.