Today, AWS had an outage in their us-east-1 region, and that affected Pipedream for much of the day. Users could not access https://pipedream.com, and workflows were running intermittently, between 15:35 UTC on Dec 7 and 0:58 UTC on Dec 8.
It is easy to blame AWS for incidents like this, and the scale of this AWS outage is certainly unprecedented. But we can do a better job of building resiliency for these events into the core service, and we own the downtime. We plan to improve this as we start 2022.
What happened
Today at 7:35am PT (15:35 UTC), we received our first alarm indicating that HTTP services and the Pipedream UI were down. Visiting https://pipedream.com returned a 502 error. We rely on AWS to run the platform, and upon investigation, it became clear that this was an AWS outage in the us-east-1 region, where most of our infrastructure runs.
Since AWS login and authentication was affected by the outage, we were unable to access the AWS Console or API to make changes. During this time, workflows were running intermittently. At 14:58pm PT (22:58 UTC), we regained full access to our production services. Many AWS services were still unavailable at this point, so we started to migrate workloads off of failing services (e.g. Fargate) to our core Kubernetes cluster.
By around 16:00 PT (0:00 UTC), we started enqueuing the majority of incoming events, and quickly thereafter, workflows and event sources started processing these events. At that point, we started work to recover the Pipedream API and UI.
At 16:58 PT (0:58 UTC), service was restored to the Pipedream UI. We continued to work through a backlog of queued events for workflows and event sources, and added capacity to accommodate the increased load. At 17:57 PT (1:57 UTC), the backlog of events had been processed and the service was fully-operational.
How to troubleshoot the impact to your workflows
Workflows and event sources were running intermittently throughout the day. To review the impact to your specific resources, visit your workflow and event source logs to see the events that were successfully processed.
If services that trigger your workflows deliver events via webhook, they may retry events that failed earlier in the day. Some services (like GitHub and Stripe) provide interfaces that let you see these queued events and retry them manually.
If you have any questions at all or observe any lingering issues from this incident, please let us know.