How to scrape Keepa with WebScraping.ai

sideaswithyou · October 30, 2022, 7:13pm

Hello there.
I am trying to extract (scrape) information from a web page and put that information into Google Sheets.

The web page is similar to this one:

From this page I would like to retrieve the title and the 6 values (circled in red in the attached picture) from the table with the CSS selector #statsTable.

How can I set up the code in Pipedream to do this?

It could be in Node.js or Python.
I need the country to be Italy, and I don’t think I need to render JS.

PS I’ve tried WebScraping.ai, but if you have other ideas/methods, they are welcome!

Thanks in advance

Ricky

pierce · October 31, 2022, 12:52pm

Hi Ricky

First off, welcome to the Pipedream community. Happy to have you!

That’s an interesting idea. I have a few suggestions.

1. Use the Amazon API

Instead of scraping a web page, it might be possible to retrieve this data directly from the Amazon API. That would save you the step of crafting CSS selectors and setting up an HTML scraper.

2. BeautifulSoup (Python) or jsdom or Cheerio (Node.js)

If this table is server-side rendered on Keepa, then you can fetch the page with any HTTP client such as requests (Python), axios or fetch (Node.js), or just use the built-in HTTP request builder step with no code.

Once the HTML has been retrieved, you can use a data extraction tool like BeautifulSoup, jsdom or Cheerio in a code step to search the HTML and extract the data.
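
To make the extraction step concrete, here is a minimal sketch in Python. It assumes the HTML has already been fetched by an earlier step and that the table really is present in that HTML with id statsTable. To keep the example dependency-free it uses the standard library html.parser instead of BeautifulSoup; with BeautifulSoup the same lookup would be a CSS selection on #statsTable. The names StatsTableParser, extract_stats and SAMPLE_HTML are made up for illustration, and the sample markup and numbers are invented, not Keepa's real page.

```python
from html.parser import HTMLParser


class StatsTableParser(HTMLParser):
    """Collect the page <title> and the cell text of the table with a given id."""

    def __init__(self, table_id="statsTable"):
        super().__init__()
        self.table_id = table_id
        self.title = ""
        self.rows = []         # one list of cell strings per table row
        self._in_title = False
        self._depth = 0        # > 0 while inside the target table
        self._cell = None      # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "table":
            if self._depth:
                self._depth += 1          # nested table inside the target
            elif dict(attrs).get("id") == self.table_id:
                self._depth = 1           # entered the target table
        elif self._depth:
            if tag == "tr":
                self.rows.append([])
            elif tag in ("td", "th"):
                self._cell = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "table" and self._depth:
            self._depth -= 1
        elif tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            text = " ".join("".join(self._cell).split())
            if not self.rows:
                self.rows.append([])
            self.rows[-1].append(text)
            self._cell = None

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._cell is not None:
            self._cell.append(data)


def extract_stats(html):
    """Return the page title and the non-empty rows of the stats table."""
    parser = StatsTableParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(),
            "rows": [row for row in parser.rows if row]}


# Invented sample page, used only to show the shape of the result.
SAMPLE_HTML = """
<html><head><title>Example product</title></head>
<body>
<table id="other"><tr><td>ignore me</td></tr></table>
<table id="statsTable">
  <tr><th>Type</th><th>Current</th><th>Average</th></tr>
  <tr><td>Amazon</td><td>19.99</td><td>21.50</td></tr>
  <tr><td>New</td><td>18.49</td><td>20.10</td></tr>
</table>
</body></html>
"""

if __name__ == "__main__":
    print(extract_stats(SAMPLE_HTML))
```

In a Pipedream Python code step, the same function could be called on the HTML exported by the previous step and its result returned, so that a Google Sheets step can pick out the cells it needs. Which rows and columns hold the six values depends on the real page, so the indexes have to be checked against the actual HTML.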

3. Browserless.io

However, if that table is rendered by JavaScript in the browser, a plain HTTP client will not suffice, because the table will not be in the HTML the server returns.

If that’s the case, you can use a tool like Browserless.io, which runs a real headless browser and actually executes the JavaScript on the page.

Here’s an example:

restyler · November 16, 2022, 4:58am

Here is a YouTube video you might find useful: scraping website data to Google Sheets via Pipedream and ScrapeNinja

ScrapeNinja has a Cheerio selector sandbox for quickly testing your JS extractors against the HTML output of the website: ScrapeNinja Cheerio Live Sandbox

sideaswithyou · November 16, 2022, 6:06pm

Thank you guys for your kindness.
I will reply here:

@pierce

I still do not have access to the Amazon API because I do not have enough qualifying sales.
BeautifulSoup (Python), jsdom or Cheerio (Node.js): I have no idea how to set them up in Pipedream.
I tried Browserless, but I could only take a screenshot or export the HTML to PDF. And even then, I can’t figure out how to filter the information in Pipedream to extract only the numbers I want.

@restyler
Saving to Google Sheets with Pipedream is not a problem. The difficulty for me is configuring tools such as ScrapeNinja (or similar) so that they extract only the desired information.

I know nothing about Python, Node.js or other languages. I would just like to know which lines of code I have to put into Pipedream to get the desired information.

Many thanks to all