Convert URL To LLM-Friendly Input with Jina Reader API on New Scraping Completed from WebScraper.IO API

Pipedream makes it easy to connect APIs for Jina Reader, WebScraper.IO and 3,000+ other apps remarkably fast.

Trigger workflow on

New Scraping Completed from the WebScraper.IO API

Next, do this

Convert URL To LLM-Friendly Input with the Jina Reader API


Getting Started#

This integration creates a workflow with a WebScraper.IO trigger and Jina Reader action. When you configure and deploy the workflow, it will run on Pipedream's servers 24x7 for free.

Select this integration
Configure the New Scraping Completed trigger
  1. Connect your WebScraper.IO account
  2. Configure timer
Configure the Convert URL To LLM-Friendly Input action
  1. Connect your Jina Reader account
  2. Optional: Configure URL
  3. Optional: Select a Content Format
  4. Optional: Configure Timeout
  5. Optional: Configure Target Selector
  6. Optional: Configure Wait For Selector
  7. Optional: Configure Excluded Selector
  8. Optional: Configure JSON Response
  9. Optional: Configure Forward Cookie
  10. Optional: Configure Proxy Server URL
  11. Optional: Configure Bypass Cache
  12. Optional: Configure Stream Mode
  13. Optional: Configure Browser Locale
  14. Optional: Configure Iframe
  15. Optional: Configure Include Shadow DOM Content
  16. Optional: Configure PDF File Path or URL
  17. Optional: Configure HTML File Path or URL
  18. Optional: Configure syncDir
Deploy the workflow
Send a test event to validate your setup
Turn on the trigger

Details#

This integration uses pre-built, source-available components from Pipedream's GitHub repo. These components are developed by Pipedream and the community, and verified and maintained by Pipedream.

To contribute an update to an existing component or create a new component, create a PR on GitHub. If you're new to Pipedream component development, you can start with the quickstarts for trigger and action development, and then review the component API reference.

Trigger#

New Scraping Completed on WebScraper.IO

Description: Emit new event when a page scraping job has completed. [See the docs here](https://webscraper.io/documentation/web-scraper-cloud/api)

Version: 0.0.1

Key: webscraper_io-new-scraping-completed

View on GitHub

WebScraper.IO Overview#

The WebScraper.IO API allows you to programmatically perform web scraping tasks, extracting structured data from websites. With the API, you can automate the gathering of web content for analysis, monitoring, and integration with other data sources. In Pipedream, you can leverage this API to build workflows that process, analyze, and act on the data you scrape without writing code for backend infrastructure.

Trigger Code#

import common from "../common/base.mjs";

export default {
  ...common,
  key: "webscraper_io-new-scraping-completed",
  name: "New Scraping Completed",
  description: "Emit new event when a page scraping job has completed. [See the docs here](https://webscraper.io/documentation/web-scraper-cloud/api)",
  version: "0.0.1",
  type: "source",
  dedupe: "unique",
  methods: {
    ...common.methods,
    async emitHistoricalEvents({ limit }) {
      const jobs = await this.getScrapingJobs();
      if (!(jobs?.length > 0)) {
        return;
      }
      jobs.reverse().slice(0, limit)
        .forEach((job) => this.emitEvent(job));
    },
    isRelevant(job, previousIds) {
      return job.status === "finished" && !previousIds[job.id];
    },
    emitEvent(job) {
      const meta = this.generateMeta(job);
      this.$emit(job, meta);
    },
    generateMeta(job) {
      return {
        id: job.id,
        summary: job.sitemap_name,
        ts: job.time_created,
      };
    },
    async getScrapingJobs() {
      const jobs = [];
      const previousIds = this._getPreviousIds();

      const results = await this.webscraper.paginate(this.webscraper.getScrapingJobs);
      for (const job of results) {
        if (this.isRelevant(job, previousIds)) {
          previousIds[job.id] = true;
          jobs.push(job);
        }
      }

      this._setPreviousIds(previousIds);

      return jobs;
    },
  },
  async run() {
    const jobs = await this.getScrapingJobs();
    jobs.forEach((job) => this.emitEvent(job));
  },
};
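
The deduplication performed by `isRelevant` and `getScrapingJobs` can be illustrated in isolation. The following is a minimal sketch, not the component itself: a plain object stands in for the state the source presumably persists through `$.service.db`, the function names `selectNewFinishedJobs` and `toMeta` are hypothetical, and the sample jobs (including the `"started"` status value) are made up for illustration.

```javascript
// Sketch only: mirrors the isRelevant / previousIds logic from the trigger code.
// `previousIds` is an in-memory stand-in for persisted state (assumption).
function selectNewFinishedJobs(jobs, previousIds) {
  const fresh = [];
  for (const job of jobs) {
    // A job is emitted once: it must be finished and not seen before.
    if (job.status === "finished" && !previousIds[job.id]) {
      previousIds[job.id] = true;
      fresh.push(job);
    }
  }
  return fresh;
}

// Same shape as generateMeta in the trigger code.
function toMeta(job) {
  return {
    id: job.id,
    summary: job.sitemap_name,
    ts: job.time_created,
  };
}

// Hypothetical sample data for two consecutive polls.
const seen = {};
const firstPoll = [
  { id: 1, status: "finished", sitemap_name: "shop", time_created: 1700000000 },
  { id: 2, status: "started", sitemap_name: "blog", time_created: 1700000100 },
];
const secondPoll = [
  firstPoll[0],
  { ...firstPoll[1], status: "finished" },
];

const firstBatch = selectNewFinishedJobs(firstPoll, seen);
const secondBatch = selectNewFinishedJobs(secondPoll, seen);

console.log(firstBatch.map(toMeta));
console.log(secondBatch.map(toMeta));
```

In this sketch the first poll yields only job 1, and the second poll yields only job 2 once it has finished; job 1 is skipped because its ID is already recorded.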

Trigger Configuration#

This component may be configured based on the props defined in the component code. Pipedream automatically prompts for input values in the UI and CLI.

Label	Prop	Type	Description
WebScraper.IO	`webscraper`	`app`	This component uses the WebScraper.IO app.
N/A	`db`	`$.service.db`	This component uses `$.service.db` to maintain state between executions.
	`timer`	`$.interface.timer`

Trigger Authentication#

WebScraper.IO uses API keys for authentication. When you connect your WebScraper.IO account, Pipedream securely stores the keys so you can easily authenticate to WebScraper.IO APIs in both code and no-code steps.

To retrieve your API token:

1. Navigate to your Web Scraper account and sign in
2. Go to “Account Info” > “API settings”

About WebScraper.IO#

Making web data extraction easy and accessible for everyone.

Action#

Convert URL To LLM-Friendly Input on Jina Reader

Description: Converts a provided URL to an LLM-friendly input using Jina Reader. [See the documentation](https://github.com/jina-ai/reader)

Version: 1.0.2

Key: jina_reader-convert-to-llm-friendly-input

View on GitHub

Action Code#

import {
  ConfigurationError, getFileStream,
} from "@pipedream/platform";
import app from "../../jina_reader.app.mjs";

export default {
  key: "jina_reader-convert-to-llm-friendly-input",
  name: "Convert URL To LLM-Friendly Input",
  description: "Converts a provided URL to an LLM-friendly input using Jina Reader. [See the documentation](https://github.com/jina-ai/reader)",
  version: "1.0.2",
  annotations: {
    destructiveHint: false,
    openWorldHint: true,
    readOnlyHint: true,
  },
  type: "action",
  props: {
    app,
    url: {
      type: "string",
      label: "URL",
      description: "The URL to convert to an LLM-friendly input.",
      optional: true,
    },
    contentFormat: {
      type: "string",
      label: "Content Format",
      description: "You can control the level of detail in the response to prevent over-filtering. The default pipeline is optimized for most websites and LLM input.",
      optional: true,
      options: [
        "markdown",
        "html",
        "text",
        "screenshot",
        "pageshot",
      ],
    },
    timeout: {
      type: "integer",
      label: "Timeout",
      description: "Maximum time to wait for the webpage to load. Note that this is NOT the total time for the whole end-to-end request.",
      optional: true,
    },
    targetSelector: {
      type: "string",
      label: "Target Selector",
      description: "Provide a list of CSS selectors to focus on more specific parts of the page. Useful when your desired content doesn't show under the default settings. E.g., `body, .class, #id`.",
      optional: true,
    },
    waitForSelector: {
      type: "string",
      label: "Wait For Selector",
      description: "Provide a list of CSS selectors to wait for specific elements to appear before returning. Useful when your desired content doesn't show under the default settings. E.g., `body, .class, #id`.",
      optional: true,
    },
    excludedSelector: {
      type: "string",
      label: "Excluded Selector",
      description: "Provide a list of CSS selectors to remove the specified elements of the page. Useful when you want to exclude specific parts of the page like headers, footers, etc. E.g., `header, .class, #id`.",
      optional: true,
    },
    jsonResponse: {
      type: "boolean",
      label: "JSON Response",
      description: "The response will be in JSON format, containing the URL, title, content, and timestamp (if available). In Search mode, it returns a list of five entries, each following the described JSON structure. Keep in mind **JSON Response** will take priority over **Stream mode** if both are enabled.",
      optional: true,
    },
    forwardCookie: {
      type: "string",
      label: "Forward Cookie",
      description: "The API server can forward your custom cookie settings when accessing the URL, which is useful for pages requiring extra authentication. Note that requests with cookies will not be cached. E.g., `<cookie-name>=<cookie-value>, <cookie-name-1>=<cookie-value>; domain=<cookie-1-domain>`. [Learn more here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie).",
      optional: true,
    },
    useProxyServer: {
      type: "string",
      label: "Proxy Server URL",
      description: "The API server can utilize your proxy to access URLs, which is helpful for pages accessible only through specific proxies. E.g., `http://your_proxy_server.com`. [Learn more here](https://en.wikipedia.org/wiki/Proxy_server).",
      optional: true,
    },
    bypassCache: {
      type: "boolean",
      label: "Bypass Cache",
      description: "The API server caches both Read and Search mode contents for a certain amount of time. To bypass this cache, set this header to `true`.",
      optional: true,
    },
    streamMode: {
      type: "boolean",
      label: "Stream Mode",
      description: "Stream mode is beneficial for large target pages, allowing more time for the page to fully render. If standard mode results in incomplete content, consider using **Stream mode**. [Learn more here](https://github.com/jina-ai/reader?tab=readme-ov-file#streaming-mode). Keep in mind **JSON Response** will take priority over **Stream mode** if both are enabled.",
      optional: true,
    },
    browserLocale: {
      type: "string",
      label: "Browser Locale",
      description: "Control the browser locale to render the page. E.g., `en-US`. [Learn more here](https://developer.mozilla.org/en-US/docs/Web/API/Navigator/language).",
      optional: true,
    },
    iframeContent: {
      type: "boolean",
      label: "Iframe",
      description: "Returning result will also include the content of the iframes on the page.",
      optional: true,
    },
    shadowDomContent: {
      type: "boolean",
      label: "Include Shadow DOM Content",
      description: "Returning result will also include the content of the shadow DOM on the page.",
      optional: true,
    },
    pdf: {
      type: "string",
      label: "PDF File Path or URL",
      description: "The path or URL to the pdf file.",
      optional: true,
    },
    html: {
      type: "string",
      label: "HTML File Path or URL",
      description: "The path or URL to the html file.",
      optional: true,
    },
    syncDir: {
      type: "dir",
      accessMode: "read",
      sync: true,
      optional: true,
    },
  },
  methods: {
    streamToBase64(stream) {
      return new Promise((resolve, reject) => {
        const chunks = [];
        stream.on("data", (chunk) => chunks.push(chunk));
        stream.on("end", () => {
          const buffer = Buffer.concat(chunks);
          resolve(buffer.toString("base64"));
        });
        stream.on("error", reject);
      });
    },
    streamToUtf8(stream) {
      return new Promise((resolve, reject) => {
        let data = "";
        stream.setEncoding("utf-8");
        stream.on("data", (chunk) => data += chunk);
        stream.on("end", () => resolve(data));
        stream.on("error", reject);
      });
    },
  },
  async run({ $ }) {
    const {
      app,
      url,
      contentFormat,
      timeout,
      targetSelector,
      waitForSelector,
      excludedSelector,
      jsonResponse,
      forwardCookie,
      useProxyServer,
      bypassCache,
      streamMode,
      browserLocale,
      iframeContent,
      shadowDomContent,
      pdf,
      html,
    } = this;

    if (!url && !pdf && !html) {
      throw new ConfigurationError("You must provide at least one of **URL**, **PDF File Path or URL**, or **HTML File Path or URL**.");
    }

    const data = {
      url,
    };

    if (pdf) {
      const stream = await getFileStream(pdf);
      data.pdf = await this.streamToBase64(stream);
    }

    if (html) {
      const stream = await getFileStream(html);
      data.html = await this.streamToUtf8(stream);
    }

    const response = await app.post({
      $,
      headers: {
        "X-Return-Format": contentFormat,
        "X-Timeout": timeout,
        "X-Target-Selector": targetSelector,
        "X-Wait-For-Selector": waitForSelector,
        "X-Remove-Selector": excludedSelector,
        "X-Set-Cookie": forwardCookie,
        "X-Proxy-Url": useProxyServer,
        "X-No-Cache": bypassCache,
        "Accept": jsonResponse
          ? "application/json"
          : streamMode
            ? "text/event-stream"
            : undefined,
        "X-Locale": browserLocale,
        "X-With-Shadow-Dom": shadowDomContent,
        "X-Iframe": iframeContent,
      },
      data,
    });

    $.export("$summary", "Converted URL to LLM-friendly input successfully.");
    return response;
  },
};
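
The input validation and request body construction in `run` can be sketched separately. This is a simplified illustration under stated assumptions: file contents are taken to be already in memory, whereas the component reads them through `getFileStream`; the function `buildRequestBody` and its parameters `pdfBytes` and `htmlText` are hypothetical names; and a plain `Error` replaces `ConfigurationError`.

```javascript
// Sketch only: mirrors the "at least one input" check and the body layout
// from the action code, with in-memory inputs instead of file streams.
function buildRequestBody({ url, pdfBytes, htmlText } = {}) {
  if (!url && !pdfBytes && !htmlText) {
    throw new Error("Provide at least one of url, pdfBytes, or htmlText.");
  }

  // As in the action code, `url` is always present on the body, even if undefined.
  const data = { url };

  if (pdfBytes) {
    // PDF content is sent base64-encoded, like streamToBase64 does.
    data.pdf = Buffer.from(pdfBytes).toString("base64");
  }

  if (htmlText) {
    // HTML content is sent as UTF-8 text, like streamToUtf8 does.
    data.html = htmlText;
  }

  return data;
}

// Hypothetical usage with a tiny in-memory payload.
const body = buildRequestBody({
  url: "https://example.com",
  pdfBytes: Buffer.from("hi"),
});
console.log(body);
```

Keeping this logic in a pure function makes the precondition easy to check before any network request is made.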

Action Configuration#

This component may be configured based on the props defined in the component code. Pipedream automatically prompts for input values in the UI.

Label	Prop	Type	Description
Jina Reader	`app`	`app`	This component uses the Jina Reader app.
URL	`url`	`string`	The URL to convert to an LLM-friendly input.
Content Format	`contentFormat`	`string`	Select a value from the drop-down menu: `markdown`, `html`, `text`, `screenshot`, `pageshot`.
Timeout	`timeout`	`integer`	Maximum time to wait for the webpage to load. Note that this is NOT the total time for the whole end-to-end request.
Target Selector	`targetSelector`	`string`	Provide a list of CSS selectors to focus on more specific parts of the page. Useful when your desired content doesn't show under the default settings. E.g., `body, .class, #id`.
Wait For Selector	`waitForSelector`	`string`	Provide a list of CSS selectors to wait for specific elements to appear before returning. Useful when your desired content doesn't show under the default settings. E.g., `body, .class, #id`.
Excluded Selector	`excludedSelector`	`string`	Provide a list of CSS selectors to remove the specified elements of the page. Useful when you want to exclude specific parts of the page like headers, footers, etc. E.g., `header, .class, #id`.
JSON Response	`jsonResponse`	`boolean`	The response will be in JSON format, containing the URL, title, content, and timestamp (if available). In Search mode, it returns a list of five entries, each following the described JSON structure. Keep in mind JSON Response will take priority over Stream mode if both are enabled.
Forward Cookie	`forwardCookie`	`string`	The API server can forward your custom cookie settings when accessing the URL, which is useful for pages requiring extra authentication. Note that requests with cookies will not be cached. E.g., `<cookie-name>=<cookie-value>, <cookie-name-1>=<cookie-value>; domain=<cookie-1-domain>`. [Learn more here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie).
Proxy Server URL	`useProxyServer`	`string`	The API server can utilize your proxy to access URLs, which is helpful for pages accessible only through specific proxies. E.g., `http://your_proxy_server.com`. [Learn more here](https://en.wikipedia.org/wiki/Proxy_server).
Bypass Cache	`bypassCache`	`boolean`	The API server caches both Read and Search mode contents for a certain amount of time. To bypass this cache, set this header to `true`.
Stream Mode	`streamMode`	`boolean`	Stream mode is beneficial for large target pages, allowing more time for the page to fully render. If standard mode results in incomplete content, consider using Stream mode. [Learn more here](https://github.com/jina-ai/reader?tab=readme-ov-file#streaming-mode). Keep in mind JSON Response will take priority over Stream mode if both are enabled.
Browser Locale	`browserLocale`	`string`	Control the browser locale to render the page. E.g., `en-US`. [Learn more here](https://developer.mozilla.org/en-US/docs/Web/API/Navigator/language).
Iframe	`iframeContent`	`boolean`	Returning result will also include the content of the iframes on the page.
Include Shadow DOM Content	`shadowDomContent`	`boolean`	Returning result will also include the content of the shadow DOM on the page.
PDF File Path or URL	`pdf`	`string`	The path or URL to the pdf file.
HTML File Path or URL	`html`	`string`	The path or URL to the html file.
syncDir	`syncDir`	`dir`
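
To make the relationship between these props and the outgoing request more concrete, the sketch below reproduces the prop-to-header mapping from the `headers` object in the action code, including the rule that JSON Response takes priority over Stream mode. The function name `buildReaderHeaders` is hypothetical, and dropping `undefined` entries is an assumption made for the illustration; the component itself passes the values straight to `app.post`.

```javascript
// Sketch only: maps action props to the request headers used in the action code.
function buildReaderHeaders(props = {}) {
  // JSON Response wins over Stream mode when both are enabled.
  const accept = props.jsonResponse
    ? "application/json"
    : props.streamMode
      ? "text/event-stream"
      : undefined;

  const headers = {
    "X-Return-Format": props.contentFormat,
    "X-Timeout": props.timeout,
    "X-Target-Selector": props.targetSelector,
    "X-Wait-For-Selector": props.waitForSelector,
    "X-Remove-Selector": props.excludedSelector,
    "X-Set-Cookie": props.forwardCookie,
    "X-Proxy-Url": props.useProxyServer,
    "X-No-Cache": props.bypassCache,
    "Accept": accept,
    "X-Locale": props.browserLocale,
    "X-With-Shadow-Dom": props.shadowDomContent,
    "X-Iframe": props.iframeContent,
  };

  // Assumption for this sketch: omit headers whose prop was left unset.
  return Object.fromEntries(
    Object.entries(headers).filter(([, value]) => value !== undefined),
  );
}

// Hypothetical configuration: both JSON Response and Stream mode enabled.
const exampleHeaders = buildReaderHeaders({
  contentFormat: "markdown",
  excludedSelector: "header, footer",
  jsonResponse: true,
  streamMode: true,
});
console.log(exampleHeaders);
```

With both flags enabled, the `Accept` header in this sketch resolves to `application/json`; with only `streamMode` set it resolves to `text/event-stream`.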