Scraping Discourse with a custom Pipedream Source

Follow along as we build a custom Pipedream Source component that scrapes an entire database of Discourse topics via its REST API.

Open a Gitpod Development Environment

First, open a Gitpod Development Environment so you can follow along in a dedicated window.

Just open this link to start: https://gitpod.io/#https://github.com/PipedreamHQ/pipedream

After the environment has initialized, you'll be prompted to enter your Pipedream API keys, which you can find here.

Initialize a new source component

First, run these commands to create a new directory called personal for your component to live in:

cd components
mkdir personal

Next, let's create our new discourse-scraper source scaffolding so we have a file to start from:

pd init source discourse-scraper

Finally, we can open this brand new file in our code editor:

code discourse-scraper/discourse-scraper.mjs

Steal, I mean use my code

With the code scaffolding open, you can copy and paste my code from the video:

import { axios } from '@pipedream/platform';

export default {
  name: "Discourse Scraper",
  version: "0.0.1",
  key: "discourse-scraper",
  description: "Emit new events on each...",
  props: {
    discourse: {
      type: "app",
      app: "discourse"
    },
    db: "$.service.db",
    timer: {
      type: "$.interface.timer",
      default: {
        cron: "0 0 * * *", // Run job once a day
      },
    },
  },
  dedupe: 'unique',
  type: "source",
  methods: {},
  async run(event) {
    // Resume from the page stored by the previous run, starting at page 0
    const page = this.db.get('page') ?? 0;

    const data = await axios(this, {
      url: `https://${this.discourse.$auth.domain}/c/help/5.json?page=${page}`,
      headers: {
        "Api-Username": `${this.discourse.$auth.api_username}`,
        "Api-Key": `${this.discourse.$auth.api_key}`,
      },
    });

    console.log(data);

    // Emit each topic as its own event; the topic ID lets Pipedream dedupe repeats
    for (const topic of data.topic_list.topics) {
      console.log("Emitting a single topic", topic);
      this.$emit(
        { topic },
        {
          id: topic.id,
          summary: topic.title,
          ts: Date.now(),
        }
      );
    }

    // Save the next page number so the following run continues from there
    this.db.set('page', page + 1);
  },
};
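
If you want to see how the page cursor moves from run to run without deploying anything, here's a minimal sketch you can run locally with Node. It is not Pipedream's implementation: createFakeDb, fakePages, fetchPage and runOnce are hypothetical stand-ins I made up for this.db, the Discourse response and the run() method, and the topics are fake data.

```javascript
// Hypothetical in-memory stand-in for the db prop ($.service.db)
function createFakeDb() {
  const store = new Map();
  return {
    get: (key) => store.get(key),
    set: (key, value) => { store.set(key, value); },
  };
}

// Fake data: two pages of topics; any later page comes back empty
const fakePages = [
  [{ id: 1, title: "First" }, { id: 2, title: "Second" }],
  [{ id: 3, title: "Third" }],
];

// Hypothetical stand-in for the axios call to /c/help/5.json?page=N
async function fetchPage(page) {
  return { topic_list: { topics: fakePages[page] ?? [] } };
}

// Mirrors the body of run(): read the cursor, fetch, emit, advance the cursor
async function runOnce(db, emit) {
  const page = db.get("page") ?? 0;
  const data = await fetchPage(page);
  for (const topic of data.topic_list.topics) {
    emit({ topic }, { id: topic.id, summary: topic.title, ts: Date.now() });
  }
  db.set("page", page + 1);
  return page;
}

// Simulate three scheduled runs against the same db
(async () => {
  const db = createFakeDb();
  const emittedIds = [];
  for (let i = 0; i < 3; i++) {
    await runOnce(db, (event, meta) => emittedIds.push(meta.id));
  }
  // Logs the ids 1, 2, 3 and a stored page of 3
  console.log(emittedIds, db.get("page"));
})();
```

Each run handles exactly one page, so with the daily timer above the source works through the category one page per day. Note that the cursor keeps advancing even when a page comes back empty.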

Deploy this code to your Pipedream account

Back in your terminal on Gitpod, enter this command to deploy the component to your account:

pd dev discourse-scraper/discourse-scraper.mjs

The pd dev command lets you make changes to your files, and it will update the source in your Pipedream account in real time.

Finally, return to your Pipedream account's sources here, open the new Discourse Scraper, and click RUN NOW to trigger it manually.

Learn more and get connected!

🔨  Start building at https://pipedream.com
📣  Read our blog https://pipedream.com/blog
💬  Join our community https://pipedream.com/community