Why Does the Workflow that Calls `pd.flow.delay()` from a Python Cell Sometimes Pause and Never Wake Up, and How Can the Use of Workflow Execution Rate Limiting Resolve This Issue?

This topic was automatically generated from Slack. You can find the original thread here.

Workflow that calls `pd.flow.delay()` from a Python cell sometimes pauses and then never wakes up again.
[Resolved: Going to try using workflow execution rate limiting instead of using explicit delays]

Background:

I have a workflow that triggers itself as its last step in order to “loop” over a spreadsheet and do some work for each row.

Near the end of the workflow, immediately before the HTTP self-triggering step, there’s a Python step that delays for a few seconds. The exact delay varies: if the workflow took a while to reach this step, the delay might be skipped entirely, but if the workflow executed faster, the delay might be longer. The max delay is ~16 seconds.
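The delay computation described above might look something like this sketch in plain Python (the function name, the 16-second target interval, and the cap are illustrative assumptions, not the actual step code; in a Pipedream Python step the result would be passed to `pd.flow.delay()` in milliseconds):

```python
import time

MAX_DELAY_MS = 16_000  # the ~16 second cap mentioned above


def compute_delay_ms(started_at, target_interval_s=16.0, now=None):
    """Delay only for whatever is left of the target interval.

    If the workflow already took longer than the interval to reach this
    step, the delay is skipped (returns 0); otherwise the remainder is
    returned, capped at MAX_DELAY_MS.
    """
    now = time.time() if now is None else now
    elapsed_ms = int((now - started_at) * 1000)
    remaining = int(target_interval_s * 1000) - elapsed_ms
    return max(0, min(remaining, MAX_DELAY_MS))


# Inside the workflow, something like pd.flow.delay(compute_delay_ms(...))
# is where the execution sometimes pauses and never resumes.
```

With a 16-second target, a workflow that reached the step after 5 seconds would delay ~11 seconds, matching the ~11–12 second delays seen in the paused executions.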

Issue:

Sometimes, a workflow execution pauses at this step and never resumes.

The event history shows a sequence of green-checkmark executions and then a grey paused execution at the end. The history of the paused execution shows that all steps completed successfully up to the delay step, and the logs I put into the delay step show the intended duration of the delay. Today I saw two cases of this: one intended to delay for ~11 sec and the other for ~12 sec. But in both cases the execution had remained paused at that step for many minutes, sometimes hours.

Right now, my workaround is:

  1. notice that the work has stopped (time-consuming to keep checking back)
  2. delete the paused execution (which is why I don’t have any screenshots, sorry)
  3. restart the workflow manually so it keeps working

Any ideas about what might be causing this to happen?

Let me know what info might be useful to help debug this, if any. It might take me some time to reproduce, because I don’t have workloads for this workflow every day and it would be pretty expensive to set up a dummy workload just to trigger this issue.

We’ve been dealing with this issue for a long time, unfortunately. :disappointed:

Our current workaround is:
• Create a datastore entry before the delay (store the current timestamp in it along with the resume_url).
• Delete the datastore entry after the resume.
• Have a workflow running every X minutes, which checks for leftover entries in the datastore (and checks if the timestamp is too old).
• Call the resume_url multiple times (because a single call sometimes fails to resume the workflow).

It’s yucky, but :man-shrugging:
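The sweep part of that workaround can be sketched in plain Python (the datastore shape, the staleness threshold, and the retry count here are all assumptions; `call_resume` stands in for an HTTP request to the stored resume_url):

```python
import time

STALE_AFTER_S = 10 * 60   # assumed threshold: entries older than 10 min are stuck
RESUME_ATTEMPTS = 3       # call the resume_url a few times, per the workaround


def sweep_stale_delays(datastore, call_resume, now=None):
    """Scan the datastore for delay entries whose workflow never resumed.

    `datastore` maps an execution key to {"ts": started, "resume_url": url}.
    The delayed workflow deletes its own entry after resuming, so anything
    old enough to still be here is assumed stuck. Returns the keys resumed.
    """
    now = time.time() if now is None else now
    resumed = []
    for key, entry in list(datastore.items()):
        if now - entry["ts"] >= STALE_AFTER_S:
            for _ in range(RESUME_ATTEMPTS):  # a single call sometimes fails
                call_resume(entry["resume_url"])
            del datastore[key]
            resumed.append(key)
    return resumed
```

The workflow that runs every X minutes would call something like this against the shared datastore.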

For your use case, could you maybe use the concurrency limits and just throw out the delay entirely?

Something like this?

(screenshot attached: image.png)

:joy: mm, I get it re: the workaround.

It’s funny, because I already have a watchdog workflow + datastore entry system set up: the watchdog workflow is subscribed to errors from the working workflow and restarts it, and it uses the datastore to record how many failures have occurred in a row so it can apply exponential backoff (the work workflow clears the datastore entry when it succeeds).
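A minimal sketch of that backoff calculation from the failure counter kept in the datastore (the base and cap values are assumptions for illustration):

```python
def backoff_delay_s(consecutive_failures, base_s=5.0, cap_s=300.0):
    """Exponential backoff from a consecutive-failure counter.

    The first retry waits base_s, doubling for each further consecutive
    failure, capped at cap_s. The working workflow resets the counter to
    zero on success, which brings the delay back down.
    """
    return min(base_s * (2 ** max(consecutive_failures - 1, 0)), cap_s)
```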

More of the same, I guess.

The delay is partly there to ensure it doesn’t put too much pressure on a downstream component, so removing it and relying on concurrency control alone might be too fast for that component to handle. But a good thing for me to reconsider, I guess.

Well, if you limit the execution rate as well, it should act the same as a delay, no?
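The equivalence is just arithmetic: a rate limit of N executions per window spaces runs about window/N apart, which is what the explicit delay was approximating (the helper below is only an illustration of that):

```python
def equivalent_spacing_s(limit, window_s):
    """Average spacing between executions under a rate limit.

    A throttle of `limit` executions per `window_s` seconds spaces runs
    roughly window_s / limit apart, e.g. 1 per 16 s matches the ~16 s
    max delay used above.
    """
    return window_s / limit
```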

:man-facepalming: Where did that setting come from? I could have sworn that didn’t exist a month ago! Haha, that (execution rate) might actually be all that’s needed. :pray: