How to Efficiently Retrieve and Insert "title", "description", and "image_url" from a Webpage into a Supabase Table Using Pipedream?

This topic was automatically generated from Slack. You can find the original thread here.

After following a “Connecting and inserting data into a Supabase Table” from Pipedream I was able to get a basic expression to work but am stuck on how to be more efficient with my code.
The goal is to get “title”, “description” and “image_url” from webpage.
My Steps:
Step 1: Trigger on adding a URL to Google Spreadsheet
Step 2: custom_request (HTTP request)
–get / {{steps.trigger.event.newRow[2]}}
– returns html page
Step 3: insert_row (Postgres)
– this works but is using a match but probably not the best option (inefficient, multiple title tags, etc.)
title / {{${steps.custom_request.$return_value}.match(/(.**?)</title>/)[1]}}
Also tried using a code step to parse the HTML but received an error.
– tried this with import instead of require and “steps” in place of "event’
**const** cheerio = require('cheerio');
**const** html = event.custom_request.body;
**const** $ = cheerio.load(html);
**const** title = $('title').text();
**this**.title = title;
Got this error for all of the different varieties I tried:
*ReferenceError
steps is not defined


After following a “Connecting and inserting data into a Supabase Table” from Pipedream I was able to get a basic expression to work but am stuck on how to be more efficient with my code.
The goal is to get “title”, “description” and “image_url” from webpage.
My Steps:
Step 1: Trigger on adding a URL to Google Spreadsheet
Step 2: custom_request (HTTP request)
–get / {{steps.trigger.event.newRow[2]}}
– returns html page
Step 3: insert_row (Postgres)
title / {{${steps.custom_request.$return_value}.match(/(.*?)</title>/)[1]}}

To make your code more efficient and extract the “title”, “description”, and “image_url” from the webpage, you can use a custom code step with a package like cheerio to parse the HTML and extract the required data. Here’s an example of how you can modify your workflow: Step 1: Trigger on adding a URL to Google Spreadsheet (same as before) Step 2: custom_request (HTTP request) --get / {{steps.trigger.event.newRow[2]}} – returns html page Step 3: Add a new code step to parse the HTML and extract the required data: import { axios } from "@pipedream/platform"; import cheerio from "cheerio"; const html = steps.custom_request.$return_value; const $ = cheerio.load(html); const title = $("title").text(); const description = $('meta[name="description"]').attr("content"); const image_url = $('meta[property="og:image"]').attr("content"); return { title, description, image_url, }; Step 4: insert_row (Postgres) - title / {{steps.parse_html.title}} - description / {{steps.parse_html.description}} - image_url / {{steps.parse_html.image_url}} This should make your code more efficient and easier to maintain. Make sure to replace steps.parse_html with the actual name of the code step you added in Step 3.

After trying this code I am getting an error on Step3.
I also ran this several times using ChatGPT4 to check the errors but nothing I tried resolved the error:
I simplified things for testing to only get the “title”
here is my revised script:
**import** cheerio **from** "cheerio";

// Get HTML from input object
**const** html = input.custom_request.$return_value;

// Load HTML into Cheerio
**const** $ = cheerio.load(html);

// Extract title
**const** title = $('title').text();

// Assign to exports
**this**.title = title;
here is the error:
ReferenceError
input is not defined
DETAILS

DETAILS

ReferenceError: input is not defined
    at file:///tmp/__pdg__/dist/code/39e80a526133c5e6317deb8d779f6bdfef7d9f187f72b5ef891c00342fc06ba3/index.mjs:4:14
    at ModuleJob.run (internal/modules/esm/module_job.js:183:25)
    at async Loader.import (internal/modules/esm/loader.js:178:24)
    at async Object.loadESM (internal/process/esm_loader.js:68:5)
    at async handleMainPromise (internal/modules/run_main.js:59:12)

For the code that was first suggested I received this error:
Code Error
'return' outside of function. (9:0)
DETAILS

SyntaxError: 'return' outside of function. (9:0)
    at _raise (/var/task/node_modules/@babel/parser/src/parser/error.js:147:45)
    at raiseWithData (/var/task/node_modules/@babel/parser/src/parser/error.js:142:17)
    at raise (/var/task/node_modules/@babel/parser/src/parser/error.js:91:17)
    at parseReturnStatement (/var/task/node_modules/@babel/parser/src/parser/statement.js:672:12)
    at parseStatementContent (/var/task/node_modules/@babel/parser/src/parser/statement.js:288:21)
    at parseStatement (/var/task/node_modules/@babel/parser/src/parser/statement.js:242:17)
    at parseStatement (/var/task/node_modules/@babel/parser/src/plugins/flow/index.js:1842:26)
    at parseBlockOrModuleBlockBody (/var/task/node_modules/@babel/parser/src/parser/statement.js:980:25)
    at parseBlockBody (/var/task/node_modules/@babel/parser/src/parser/statement.js:956:10)
    at parseProgram (/var/task/node_modules/@babel/parser/src/parser/statement.js:129:10)

fix the errors above and rewrite step 3 as a valid Pipedream component

I apologize for the confusion earlier. Here’s the corrected code for Step 3 as a valid Pipedream component: import cheerio from "cheerio"; export default defineComponent({ props: { html: { type: "string", label: "HTML content", }, }, async run({ $ }) { const $ = cheerio.load(this.html); const title = $("title").text(); const description = $('meta[name="description"]').attr("content"); const image_url = $('meta[property="og:image"]').attr("content"); return { title, description, image_url, }; }, }); Make sure to replace this.html with the actual HTML content from the previous step (Step 2) in your workflow. Then, you can use the output from this step in Step 4 to insert the data into your Postgres table.

After multiple tries, still not getting this to work. Not sure I understand these instructions:
Make sure to replace this.html with the actual HTML content from the previous step (Step 2) in your workflow.
in Step 2 here are my settings/values:
custom_request
Get HTTP
{{steps.trigger.event.newRow[2]}}
in the “Results” Export tab:
full html output with these paths:
steps.custom_request
steps.custom_request.$return_value
and trying to return this from the html code:

How to Restore a Deck on Your Own | Family Handyman

Try pasting the code from this step into a Node.js code step. Then press the button above the code step labeled Refresh Fields. You’ll see an input field appear above the step.

Pass {{steps.custom_request.$return_value}} to the HTML input field. The code will reference that value. Press Test on that code step and you should see the title returned

I did have this in the HTML input field
and got this error:
Code Error
Identifier ‘$’ has already been declared (11:10)
I then run this revised the code:

import cheerio from "cheerio";
import { defineComponent } from "@pipedreamhq/pd-backend";

export default defineComponent({
  props: {
    html: {
      type: "string",
      label: "HTML content",
    },
  },
  async run({ html }) {
    const $ = cheerio.load(html);

    const title = $("title").text();

    return {
      title,
    };
  },
});
and get this error

Code Error
Identifier ‘defineComponent’ has already been declared. (23:10)
After a few revisions I am getting closer with this code:

import cheerio from "cheerio";

export default {
  props: {
    html: {
      type: "string",
      label: "HTML content",
    },
  },
  async run({ html }) {
    const $ = cheerio.load(html && html.toString ? html.toString() : "");

    const title = $("title").text();

    return {
      title,
    };
  },
};

and here is my output from the lastest code provided above:
Step 3: parse_html

it then inserts a “empty” value into Supabase:
here is the screenshots of Step 4: insert_row

here is the content between the “title” tags I am trying to parse/capture:

How to Restore a Deck on Your Own | Family Handyman

Also got an “object” in one of the code iterations:
inputs into the Supabase table:

This code worked:

import cheerio from "cheerio";

export default {
  async run(event) {
    const html = event.steps.custom_request.$return_value;
    const $ = cheerio.load(html);

    const title = $("title").text().trim();

    return {
      title,
    };
  },
};

Posted thread to Discourse: How can I efficiently extract "title", "description", and "image_url" from a webpage using Pipedream, Supabase, and Google Sheets?

for step #1
Step 1: Trigger on adding a URL to Google sheet
after testing and getting the desired results the Pipedream script is not updating with the new trigger item.
The new trigger item is adding a URL to a Google Sheet.
How can I clear the “cache” and/or refresh the script so that it processes the new URL that I enter

It sounds like you want the workflow to trigger when a new URL is added to a Google Sheet. To achieve this, you can use the Google Sheets trigger “New Row” in your Pipedream workflow. Here’s how to set it up: 1. In your Pipedream workflow, add a new trigger by clicking on the “+” button next to the current trigger. 2. Search for the Google Sheets app in the trigger menu. 3. Select the “New Row” trigger. 4. Connect your Google Sheets account. 5. Choose the specific Google Sheet and the worksheet within it that you want to monitor for new rows. 6. Save the trigger. Now, whenever a new row is added to the specified Google Sheet, your workflow will be triggered. Make sure to remove or disable the previous trigger if you don’t want it to run anymore. If you still face issues, please provide more information about your workflow or visit https://pipedream.com/support for more support options.

the script won’t add new URL, it is stuck with a prior reference when I run it.