Why am I Getting '[Errno 30] Read-only File System' Error While Using Pyppeteer in Python?

This topic was automatically generated from Slack. You can find the original thread here.

guys, I’m trying to use puppeteer in python (pyppeteer) but I’m getting this error: [Errno 30] Read-only file system: '/home/sbx_user1051'
What might be the problem?

this is my code:

```python
import asyncio
from pyppeteer import launch

def handler(pd: "pipedream"):
    async def main():
        browserObj = await launch({"headless": True})
        url = await browserObj.newPage()
        await url.goto('https://scrapeme.live/shop/')
        htmlContent = await url.content()
        print(htmlContent)
        await browserObj.close()

    asyncio.get_event_loop().run_until_complete(main())
```

perhaps I’m not importing it correctly?

Hi, only the /tmp directory is writable within the workflow's execution environment.

Perhaps try configuring the Puppeteer instance to write its user data to /tmp instead of the default directory under /home.

We don’t officially support Puppeteer within Python steps yet, because it requires a Chromium instance that’s compatible with serverless environments.

But we do have Puppeteer available in Node.js code steps and pre-built actions.

can you show me how to use Puppeteer in Node.js code steps using the @pipedream/browsers package?

yeah I’ve tried using it with node js but it’s giving me tons of problems

Sure, here's an example of how you can use Puppeteer in Node.js code steps using the @pipedream/browsers package:

```javascript
import { launch } from "@pipedream/browsers";

export default defineComponent({
  async run({ steps, $ }) {
    const browser = await launch();
    const page = await browser.newPage();
    await page.goto("https://example.com");
    const title = await page.title();
    console.log(title);
    await browser.close();
    return title;
  },
});
```

In this example, we're launching a new browser instance, opening a new page, navigating to https://example.com, getting the title of the page, logging it to the console, and then closing the browser. The title of the page is also returned as the result of the step.

Please note that the @pipedream/browsers package is a wrapper around Puppeteer that's optimized for use in Pipedream's serverless environment. It provides the same API as Puppeteer, so you can refer to the Puppeteer documentation for more information on how to use it.

Can you describe some of the problems?

Did you upgrade the memory to at least 2 GB on the deployed workflows using Puppeteer?

I did. I'm trying to loop through an array of URLs and get each one's HTML code.
It's very hard to do because it's an async function.

It sounds like your workflow retrieves this list of URLs from a database of some kind perhaps.

Have you tried splitting the querying & iteration portion of the workflow from the individual processing of each URL?

my list of URLs is just a const I've set, for example:
const websites = ['https://www.google.com', 'https://www.facebook.com'];

Got it, same principle applies.

You could also use await within a for loop to await each iteration, or use Promise.all to await all the promises at once.

then I’m using forEach to scrape each one

I'll try Promise.all since await just doesn't work for me

It's really confusing in Node.js, but you can't just make the forEach callback function async to make it wait for each promise serially.

You have to use a plain for (const url of websites) loop with await inside it for that kind of one-at-a-time processing.
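To make the difference concrete, here's a minimal sketch contrasting the two approaches. It uses a hypothetical stand-in scrape function instead of Puppeteer so the snippet is self-contained and runs anywhere:

```javascript
// Stand-in for a real Puppeteer page.content() fetch (hypothetical helper).
const scrape = async (url) => `<html>${url}</html>`;

const websites = ['https://www.google.com', 'https://www.facebook.com'];

// BROKEN: forEach fires every async callback and moves on immediately,
// so nothing has been pushed into results by the time the function returns.
async function withForEach() {
  const results = [];
  websites.forEach(async (url) => {
    results.push(await scrape(url));
  });
  return results.length; // 0, the pushes happen later
}

// WORKS: a plain for...of loop awaits each URL one at a time, in order.
async function withForOf() {
  const results = [];
  for (const url of websites) {
    results.push(await scrape(url));
  }
  return results;
}
```

The forEach version doesn't error, it just silently returns before any of the scraping has finished, which is why it feels like await "doesn't work" there.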

But Promise.all would be better in your case anyway, since it launches all the requests at once, making it faster.
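The concurrent pattern looks like this, again sketched with a hypothetical stand-in scrape function rather than Puppeteer so it's self-contained:

```javascript
// Stand-in for a real Puppeteer fetch (hypothetical helper).
const scrape = async (url) => `<html>${url}</html>`;

const websites = ['https://www.google.com', 'https://www.facebook.com'];

async function scrapeAll() {
  // map() kicks off every scrape immediately; Promise.all resolves once
  // all of them have finished, preserving the input order of the array.
  return Promise.all(websites.map((url) => scrape(url)));
}
```

The results array comes back in the same order as the input URLs, regardless of which request finishes first.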

given an array of URLs, can you show me how to map over each one and use the puppeteer export from @pipedream/browsers to retrieve the HTML content from each URL, passing the array of Promises to Promise.all so the step execution waits until the promises are finished before code control moves to the next step in the workflow?

Sure, here's an example of how you can use the Puppeteer export from @pipedream/browsers to retrieve the HTML content from each URL in an array of Promises passed to a Promise.all. This ensures the step execution waits until all the promises are finished before moving to the next step in the workflow:

```javascript
import { launch } from "@pipedream/browsers";

export default defineComponent({
  async run({ steps, $ }) {
    const websites = ['https://www.google.com', 'https://www.facebook.com'];
    const browser = await launch();
    const promises = websites.map(async (url) => {
      const page = await browser.newPage();
      await page.goto(url);
      return await page.content();
    });
    const htmlContents = await Promise.all(promises);
    await browser.close();
    return htmlContents;
  },
});
```

In this example, we're launching a new browser instance, mapping over each URL in the websites array, opening a new page for each URL, navigating to it, and getting the HTML content of the page, which adds one Promise per URL to the promises array. We then use Promise.all to wait for all the Promises to resolve, storing the resulting array of HTML contents in htmlContents. Finally, we close the browser and return htmlContents as the result of the step.
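One caveat with the approach above: Promise.all rejects as soon as any single promise rejects, so one unreachable URL would fail the whole step. If you'd rather get a result slot per URL regardless of individual failures, Promise.allSettled is an option. A minimal sketch, using a hypothetical stand-in scraper instead of Puppeteer:

```javascript
// Stand-in scraper that fails for one of the URLs (hypothetical helper).
const scrape = async (url) => {
  if (url.includes('bad')) throw new Error(`could not reach ${url}`);
  return `<html>${url}</html>`;
};

const websites = ['https://www.google.com', 'https://bad.example'];

async function scrapeAllSettled() {
  const settled = await Promise.allSettled(websites.map(scrape));
  // Each entry is { status: 'fulfilled', value } or { status: 'rejected', reason };
  // map failures to null so the step still returns one slot per URL.
  return settled.map((r) => (r.status === 'fulfilled' ? r.value : null));
}
```

This keeps the concurrency of Promise.all while letting you inspect or retry just the URLs that failed.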

That should do the trick