Creating an Event Source which emits batches of events?

I have been doing some deep thinking about how to design an integration which pulls new records from one app and references them to create records in another app.

I spent all of yesterday reading the Pipedream docs, and it seems I have two design goals which aren’t easily accomplished using the provided Pipedream tools, but maybe I’m lacking knowledge and someone here can help me.

I want to ensure my workflow processes each record exactly once. My first idea was to use a scheduled workflow that issues an HTTP request to retrieve the latest records from the source app.

While I was thinking this may work if I handle all the edge cases, I discovered that Event Source components have special support for ensuring an event is delivered only once to a workflow that uses them, via a “dedupe strategy” and event IDs.
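From the docs, I believe a minimal polling source looks roughly like this (a sketch; the source app’s endpoint and response shape are placeholders I made up):

```javascript
// A minimal polling Event Source sketch, per my reading of the docs.
// The source app's URL and response shape are made-up placeholders.
const axios = require("axios");

module.exports = {
  name: "source-app-new-records",
  version: "0.0.1",
  dedupe: "unique", // Pipedream skips events whose id it has seen before
  props: {
    timer: {
      type: "$.interface.timer",
      default: { intervalSeconds: 60 * 60 }, // poll hourly
    },
  },
  async run() {
    const { data } = await axios.get("https://source-app.example.com/api/records");
    for (const record of data.records) {
      // One event per record; the id drives the dedupe strategy
      this.$emit(record, {
        id: record.id,
        summary: `New record ${record.id}`,
        ts: Date.now(),
      });
    }
  },
};
```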

If I use an event source, the semantics of my workflow would be “operate on one record” rather than “once per hour, query a batch of records”, which are two wildly different things.

While the former seems fine, it conflicts with a nice-to-have goal of mine: reducing API usage. In my current case, I want to create records in Salesforce, and Salesforce provides a REST resource named “composite” which lets me send several REST requests in a single HTTP request. Salesforce processes each REST request in that bundle, then responds with an array of statuses, one per sub-request. A primary reason I prefer fewer HTTP requests is a vague sense (source unknown) that each HTTP request carries overhead: a server pays some cost keeping a request open, some APIs limit the number of requests per hour, and some API integration tools bill by the number of requests you make per month.
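To illustrate, here’s roughly what one composite call bundling two record creations looks like, as I read the Salesforce docs (running in a Node.js code step; `instanceUrl`, `accessToken`, and the record fields are made-up placeholders):

```javascript
// Sketch: one HTTP request carrying two sub-requests via the
// Salesforce "composite" resource. `instanceUrl` and `accessToken`
// stand in for real auth details.
const axios = require("axios");

const resp = await axios.post(
  `${instanceUrl}/services/data/v51.0/composite`,
  {
    allOrNone: false,
    compositeRequest: [
      {
        method: "POST",
        url: "/services/data/v51.0/sobjects/Contact",
        referenceId: "newContact1",
        body: { LastName: "Example", Email: "one@example.com" },
      },
      {
        method: "POST",
        url: "/services/data/v51.0/sobjects/Contact",
        referenceId: "newContact2",
        body: { LastName: "Sample", Email: "two@example.com" },
      },
    ],
  },
  { headers: { Authorization: `Bearer ${accessToken}` } }
);
// resp.data.compositeResponse is an array of statuses, one per sub-request
```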

If I wanted to reduce the HTTP requests I send, I would consider creating an Event Source which emits batches of events, allowing workflows using that source to reduce the batch to a set of REST requests, then bundle those REST requests into a few HTTP requests to the “composite” resource.
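In the source’s run(), that would mean emitting the whole batch as one event, something like this (again just a sketch; the batch id scheme is made up):

```javascript
// Sketch: emit the hourly batch as a single event instead of one
// event per record. `records` is the array returned by the poll.
this.$emit(
  { records },
  {
    id: `batch-${Date.now()}`, // made-up id scheme for the batch
    summary: `${records.length} new records`,
  }
);
```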

I don’t recall seeing a philosophy of API integration design in the docs which might answer this question, but does Pipedream have an opinion on this matter? I feel like it must, and I feel like it could easily convince me that in Pipedream-world it’s more efficient, both for the workflow builder and for Pipedream the platform, to build workflows with the semantics of “operate on one record”.

@chexxor I appreciate the time you took to read docs, and I’m happy to share some insight here.

You’re correct that we’ve optimized Pipedream workflows for the use case where you want to operate on one event at a time. That often simplifies the workflow’s logic / code, and allows you to take advantage of features like the deduper for events emitted from sources.

If Salesforce doesn’t require you to send these records in a batch, and allows you to send individual requests, I’d recommend the latter. Different APIs will apply different constraints: some will “charge” a batch request as the sum of its individual requests (Google is a notable example: they operate APIs where if you send 5 API calls in one HTTP request, they charge it as 5 API calls), and others will limit the HTTP requests themselves. But you can use our concurrency and throttling controls to stay within these limits. For example, if an API limits you to 10 requests / second, you can throttle the workflow to 10 executions / second in its settings.

Using these controls allows you to set constraints, based on the API, that help you manage request rates / load. Pipedream manages this so you don’t have to, and it can save a significant amount of code.
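With one record per event, the Salesforce step in your workflow reduces to a single create call, something like this sketch (the object type, fields, event shape, and auth variables are placeholders):

```javascript
// Node.js code step: create one Salesforce record per incoming event.
// The workflow's concurrency / throttling settings keep this within the
// API's rate limits, so no batching logic is needed here.
const axios = require("axios");

const resp = await axios.post(
  `${instanceUrl}/services/data/v51.0/sobjects/Contact`,
  { LastName: event.record.LastName, Email: event.record.Email },
  { headers: { Authorization: `Bearer ${accessToken}` } }
);
// resp.data.id is the id of the newly created record
```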

Moreover, these concurrency / throttling controls assume atomic events trigger the workflow. If each “event” is a batch of 10 records (for example), we’ll still honor the throttling settings you configure, but now Salesforce will receive up to 100 records per second: we limit the workflow to 10 executions per second, but each incoming event itself contains 10 records.

That said, there’s also nothing wrong with batching events in the manner you describe. Within a workflow, you have access to $checkpoint (workflow state), which allows you to store and retrieve any data you’d like. Check out this example for how you can use $checkpoint to dedupe data. You could modify that to operate on an array of data, checking each ID to see if you’ve observed it before. You’ll also want to apply a concurrency of 1 to your workflow to ensure you don’t process more than one batch of events at the same time:

[Screenshot: the workflow’s concurrency setting, set to 1]

Otherwise, both batches could contain the same ID. Since $checkpoint is loaded at the start of a workflow execution and only written back to state once the execution finishes, two concurrent executions would each think they hadn’t seen the ID before, and the record would get processed twice. Serializing your executions ensures this doesn’t happen.
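Adapted to an array, the $checkpoint dedupe might look roughly like this (a sketch; it assumes your source emits `{ records: [...] }` with an `id` on each record):

```javascript
// Node.js code step: drop record IDs we've already processed, using
// $checkpoint (workflow state) to remember them across executions.
// Run the workflow with a concurrency of 1, as described above.
if (!$checkpoint) $checkpoint = { seenIds: [] };
const seen = new Set($checkpoint.seenIds);

// Assumed event shape: { records: [{ id, ... }, ...] }
const newRecords = event.records.filter((r) => !seen.has(r.id));

for (const r of newRecords) seen.add(r.id);
$checkpoint = { seenIds: [...seen] };

// Later steps can operate on only the unseen records
return newRecords;
```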

Let me know if that helps or if you have any other questions.

That helps, yes; it makes me more comfortable to have my understanding verified.

The example for the second de-duping approach is great!

Thanks for your time!