How can I efficiently extract "title", "description", and "image_url" from a webpage using Pipedream, Supabase, and Google Sheets?

This topic was automatically generated from Slack. You can find the original thread here.

After following a “Connecting and inserting data into a Supabase Table” from Pipedream I was able to get a basic expression to work but am stuck on how to be more efficient with my code.
The goal is to get “title”, “description” and “image_url” from webpage.
My Steps:
Step 1: Trigger on adding a URL to Google Spreadsheet
Step 2: custom_request (HTTP request)
–get / {{steps.trigger.event.newRow[2]}}
– returns html page
Step 3: insert_row (Postgres)
– this works but is using a match but probably not the best option (inefficient, multiple title tags, etc.)
title / {{${steps.custom_request.$return_value}.match(/(.**?)</title>/)[1]}}
Also tried using a code step to parse the HTML but received an error.
– tried this with import instead of require and “steps” in place of "event’
**const** cheerio = require('cheerio');
**const** html = event.custom_request.body;
**const** $ = cheerio.load(html);
**const** title = $('title').text();
**this**.title = title;
Got this error for all of the different varieties I tried:
*ReferenceError
steps is not defined


After following a “Connecting and inserting data into a Supabase Table” from Pipedream I was able to get a basic expression to work but am stuck on how to be more efficient with my code.
The goal is to get “title”, “description” and “image_url” from webpage.
My Steps:
Step 1: Trigger on adding a URL to Google Spreadsheet
Step 2: custom_request (HTTP request)
–get / {{steps.trigger.event.newRow[2]}}
– returns html page
Step 3: insert_row (Postgres)
title / {{${steps.custom_request.$return_value}.match(/(.*?)</title>/)[1]}}

To make your code more efficient and extract the “title”, “description”, and “image_url” from the webpage, you can use a custom code step with a package like cheerio to parse the HTML and extract the required data. Here’s an example of how you can modify your workflow: Step 1: Trigger on adding a URL to Google Spreadsheet (same as before) Step 2: custom_request (HTTP request) --get / {{steps.trigger.event.newRow[2]}} – returns html page Step 3: Add a new code step to parse the HTML and extract the required data: import { axios } from "@pipedream/platform"; import cheerio from "cheerio"; const html = steps.custom_request.$return_value; const $ = cheerio.load(html); const title = $("title").text(); const description = $('meta[name="description"]').attr("content"); const image_url = $('meta[property="og:image"]').attr("content"); return { title, description, image_url, }; Step 4: insert_row (Postgres) - title / {{steps.parse_html.title}} - description / {{steps.parse_html.description}} - image_url / {{steps.parse_html.image_url}} This should make your code more efficient and easier to maintain. Make sure to replace steps.parse_html with the actual name of the code step you added in Step 3.

After trying this code I am getting an error on Step3.
I also ran this several times using ChatGPT4 to check the errors but nothing I tried resolved the error:
I simplified things for testing to only get the “title”
here is my revised script:
**import** cheerio **from** "cheerio";

// Get HTML from input object
**const** html = input.custom_request.$return_value;

// Load HTML into Cheerio
**const** $ = cheerio.load(html);

// Extract title
**const** title = $('title').text();

// Assign to exports
**this**.title = title;
here is the error:
ReferenceError
input is not defined
DETAILS

DETAILS

ReferenceError: input is not defined
    at file:///tmp/__pdg__/dist/code/39e80a526133c5e6317deb8d779f6bdfef7d9f187f72b5ef891c00342fc06ba3/index.mjs:4:14
    at ModuleJob.run (internal/modules/esm/module_job.js:183:25)
    at async Loader.import (internal/modules/esm/loader.js:178:24)
    at async Object.loadESM (internal/process/esm_loader.js:68:5)
    at async handleMainPromise (internal/modules/run_main.js:59:12)

For the code that was first suggested I received this error:
Code Error
'return' outside of function. (9:0)
DETAILS

SyntaxError: 'return' outside of function. (9:0)
    at _raise (/var/task/node_modules/@babel/parser/src/parser/error.js:147:45)
    at raiseWithData (/var/task/node_modules/@babel/parser/src/parser/error.js:142:17)
    at raise (/var/task/node_modules/@babel/parser/src/parser/error.js:91:17)
    at parseReturnStatement (/var/task/node_modules/@babel/parser/src/parser/statement.js:672:12)
    at parseStatementContent (/var/task/node_modules/@babel/parser/src/parser/statement.js:288:21)
    at parseStatement (/var/task/node_modules/@babel/parser/src/parser/statement.js:242:17)
    at parseStatement (/var/task/node_modules/@babel/parser/src/plugins/flow/index.js:1842:26)
    at parseBlockOrModuleBlockBody (/var/task/node_modules/@babel/parser/src/parser/statement.js:980:25)
    at parseBlockBody (/var/task/node_modules/@babel/parser/src/parser/statement.js:956:10)
    at parseProgram (/var/task/node_modules/@babel/parser/src/parser/statement.js:129:10)

fix the errors above and rewrite step 3 as a valid Pipedream component

I apologize for the confusion earlier. Here’s the corrected code for Step 3 as a valid Pipedream component: import cheerio from "cheerio"; export default defineComponent({ props: { html: { type: "string", label: "HTML content", }, }, async run({ $ }) { const $ = cheerio.load(this.html); const title = $("title").text(); const description = $('meta[name="description"]').attr("content"); const image_url = $('meta[property="og:image"]').attr("content"); return { title, description, image_url, }; }, }); Make sure to replace this.html with the actual HTML content from the previous step (Step 2) in your workflow. Then, you can use the output from this step in Step 4 to insert the data into your Postgres table.

After multiple tries, still not getting this to work. Not sure I understand these instructions:
Make sure to replace this.html with the actual HTML content from the previous step (Step 2) in your workflow.
in Step 2 here are my settings/values:
custom_request
Get HTTP
{{steps.trigger.event.newRow[2]}}
in the “Results” Export tab:
full html output with these paths:
steps.custom_request
steps.custom_request.$return_value
and trying to return this from the html code:

How to Restore a Deck on Your Own | Family Handyman

Try pasting the code from this step into a Node.js code step. Then press the button above the code step labeled Refresh Fields. You’ll see an input field appear above the step.

Pass {{steps.custom_request.$return_value}} to the HTML input field. The code will reference that value. Press Test on that code step and you should see the title returned

I did have this in the HTML input field
and got this error:
Code Error
Identifier ‘$’ has already been declared (11:10)
I then run this revised the code:

import cheerio from "cheerio";
import { defineComponent } from "@pipedreamhq/pd-backend";

export default defineComponent({
  props: {
    html: {
      type: "string",
      label: "HTML content",
    },
  },
  async run({ html }) {
    const $ = cheerio.load(html);

    const title = $("title").text();

    return {
      title,
    };
  },
});
and get this error

Code Error
Identifier ‘defineComponent’ has already been declared. (23:10)
After a few revisions I am getting closer with this code:

import cheerio from "cheerio";

export default {
  props: {
    html: {
      type: "string",
      label: "HTML content",
    },
  },
  async run({ html }) {
    const $ = cheerio.load(html && html.toString ? html.toString() : "");

    const title = $("title").text();

    return {
      title,
    };
  },
};

and here is my output from the lastest code provided above:
Step 3: parse_html

it then inserts a “empty” value into Supabase:
here is the screenshots of Step 4: insert_row

here is the content between the “title” tags I am trying to parse/capture:

How to Restore a Deck on Your Own | Family Handyman

Also got an “object” in one of the code iterations:
inputs into the Supabase table:

This code worked:

import cheerio from "cheerio";

export default {
  async run(event) {
    const html = event.steps.custom_request.$return_value;
    const $ = cheerio.load(html);

    const title = $("title").text().trim();

    return {
      title,
    };
  },
};