2021-12-07 Outage

Today, AWS had an outage in their us-east-1 region, and that affected Pipedream for much of the day. Users could not access https://pipedream.com, and workflows were running intermittently, between 15:35 UTC on Dec 7 and 0:58 UTC on Dec 8.

It is easy to blame AWS for incidents like this, and the scale of this AWS outage is certainly unprecedented. But we can do a better job of building resiliency for these events into the core service, and we own the downtime. We plan to improve this as we start 2022.

What happened

Today at 7:35am PT (15:35 UTC), we received our first alarm indicating that HTTP services and the Pipedream UI were down. Visiting https://pipedream.com returned a 502 error. We rely on AWS to run the platform, and upon investigation, it became clear that this was an AWS outage in the us-east-1 region, where most of our infrastructure runs.

Since AWS login and authentication were affected by the outage, we were unable to access the AWS Console or API to make changes. During this time, workflows were running intermittently. At 14:58 PT (22:58 UTC), we regained full access to our production services. Many AWS services were still unavailable at this point, so we started to migrate workloads off of failing services (e.g. Fargate) to our core Kubernetes cluster.

By around 16:00 PT (0:00 UTC), we started enqueuing the majority of incoming events, and shortly thereafter, workflows and event sources started processing these events. At that point, we started work to recover the Pipedream API and UI.

At 16:58 PT (0:58 UTC), service was restored to the Pipedream UI. We continued to work through a backlog of queued events for workflows and event sources, and added capacity to accommodate the increased load. At 17:57 PT (1:57 UTC), the backlog of events had been processed and the service was fully operational.

How to troubleshoot the impact to your workflows

Workflows and event sources were running intermittently throughout the day. To review the impact to your specific resources, visit your workflow and event source logs to see the events that were successfully processed.

If services that trigger your workflows deliver events via webhook, they may retry events that failed earlier in the day. Some services (like GitHub and Stripe) provide interfaces that let you see these queued events and retry them manually.

If you have any questions at all or observe any lingering issues from this incident, please let us know.

Is there possibly an account setting that could be added to control what happens to workflows and sources when there’s an outage? A number of services on the internet run on us-east-1. It would be useful if I could tell Pipedream to halt all or specific workflows in the event of an outage.

Not currently, but there are three relevant things we’re discussing:

  1. Automated, step-level retry to handle failures. So for example, you’d be able to define how retries in workflows / steps happen (e.g. num retries, delay between retries, etc.). This should help when Pipedream is available but third party APIs are down / return error responses.
  2. We’re discussing our multi-region strategy, and may provide the ability to deploy workflows into specific regions in the future.
  3. We plan to provide a workflow API in the future that would let you programmatically disable / enable workflows.
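
To make the first item concrete, here is a minimal Python sketch of the retry behavior described there (a number of retries and a delay between them). It is an illustration only, not Pipedream's implementation or API: `run_with_retries`, `num_retries`, `delay_seconds`, and the injectable `sleep` parameter are all hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    step: Callable[[], T],
    num_retries: int = 3,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `step`, retrying up to `num_retries` times after a failure.

    Hypothetical helper for illustration. `sleep` is injectable so the
    delay can be stubbed out in tests.
    """
    attempt = 0
    while True:
        try:
            return step()
        except Exception:
            # Out of retries: surface the last error to the caller.
            if attempt >= num_retries:
                raise
            attempt += 1
            # Fixed delay between attempts; a backoff schedule could go here.
            sleep(delay_seconds)
```

A step that calls a third-party API would be passed in as `step`, for example `run_with_retries(call_api, num_retries=5, delay_seconds=2.0)`, where `call_api` is whatever function wraps the request. A fixed delay keeps the sketch short; exponential backoff is a common alternative when the remote service is overloaded.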

Let me know if there are other dev tools you use that provide comparable features, too. I’d love to know if you like the implementation for retry / halting in another tool.

Those changes would be stellar!

Re: step-level retry - I can also see value in being able to trigger another workflow or route to another step after N failed attempts. Something else I’ve been hoping to see pop up is multiple error workflows. If there’s a beta group for this feature, I’d love to be a part of it.

I multi-regionalized an AWS application a few months ago. It is not always the easiest thing to do, but I’m glad this is on Pipedream’s roadmap.

Most no-code/low-code tools I’ve used don’t allow region-specific configuration. If I need to build a more resilient system, I reach for one of two stacks: Laravel + Laravel Vapor, or AWS CloudFormation + Lambda (Node.js) + DynamoDB/PostgreSQL + SNS/SQS.

This is all great feedback, and it’s great to see what your go-to tools are outside PD.

re: error workflows, were you aware that we expose a $errors stream in sources and workflows? This means you can forward errors for one or multiple workflows to another custom workflow. You can continue forwarding errors to the global error workflow, as well, or send errors only to that custom error workflow (or multiple error workflows).

Give it a try and let me know if that works for your use case, or if you have any suggested improvements.