Why Has My Workflow Stopped Processing Despite Not Exceeding Queue Size, and What New Settings Can Be Used to Avoid Future Issues?

This topic was automatically generated from Slack. You can find the original thread here.

Hi, I’ve been waiting 14 days for my support call to be investigated. It seems to be a similar issue to two of my past support calls. I’m eager to resolve this before the server logs and event history expire, which has occurred with previous support calls.

One of my workflows recently stopped executing for approximately one week without error notifications or recorded event history, despite the endpoint receiving over 200 webhooks per day. After I manually removed the concurrency and rate limit settings, the workflow immediately processed over 2,500 triggers.

We typically limit concurrency to 1 worker and execution rate to 1 every 2 seconds, to avoid flooding upstream APIs. According to the documentation, events should be lost only if the queue size is exceeded, which should trigger an error in the workflow and an email. Since I didn’t see an error in the workflow or receive an email, I assume the queue was not full.
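As a rough sanity check on that assumption, here is a back-of-the-envelope sketch in Python. It uses only the figures quoted in this post (1 execution every 2 seconds, roughly 200 webhooks per day, roughly 2,500 backlogged triggers). The function names are hypothetical, and it assumes each execution finishes within its 2-second slot, which may not match how the platform actually schedules events.

```python
# Back-of-the-envelope check using only the figures from this post.
# Assumption: each execution completes within its rate-limit slot.
SECONDS_PER_DAY = 24 * 60 * 60


def daily_capacity(seconds_per_execution: float) -> int:
    """Maximum executions per day at a fixed rate of one per slot."""
    return int(SECONDS_PER_DAY // seconds_per_execution)


def drain_time_minutes(backlog: int, seconds_per_execution: float) -> float:
    """Minutes needed to work through a backlog at the same fixed rate."""
    return backlog * seconds_per_execution / 60


capacity = daily_capacity(2)                  # 1 execution every 2 seconds
utilization = 200 / capacity                  # ~200 webhooks arriving per day
backlog_minutes = drain_time_minutes(2500, 2)  # ~2,500 queued triggers

print(f"Daily capacity: {capacity} executions")
print(f"Share of capacity used: {utilization:.2%}")
print(f"Time to drain backlog at the limited rate: {backlog_minutes:.0f} minutes")
```

Under these assumptions, the incoming volume uses well under 1% of what the rate limit allows per day, so a steadily processing workflow should not have built up a backlog of this size from load alone.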

Furthermore, the documentation suggests that under our current settings, events should continue to process regardless of load. (“If an event takes longer than [x seconds] to process, the next event in the queue will begin processing immediately.”)

Because the workflow stopped entirely, it appears there may be a bug in how concurrency and rate limits are handled or how they interact. Could you please have the technical team investigate this? Additionally, would you recommend using different settings to avoid this issue in the future?

It has now been over three weeks since I submitted a support ticket for this. Has it been looked at at all? Most of the missed executions appear to have come through in that large batch after I removed the concurrency and rate limit settings, but we would still like to know how to prevent this from happening again. We also found that at least one webhook is unaccounted for: it has no trace in the event history, even though the endpoint returned “HTTP 200 Success” when it was sent. Could this be due to the same issue, and would different settings help to avoid it?