Why is My Workflow Occasionally Timing Out with No Steps Executed and No Automatic Retry despite "Automatically Retry on Errors" Setting Enabled?

This topic was automatically generated from Slack. You can find the original thread here.

I have a workflow that occasionally fails with the following error message:

> Timeout
> You've exceeded the default timeout for this workflow. You can raise the timeout in your workflow's Settings. See https://pipedream.com/docs/troubleshooting/#timeout

The thing is, the total duration is listed as 0s, and 0 steps have been executed. I only have one trigger and one Python code block, and it doesn’t seem that the Python code block ever gets executed. The Python code block does contain an API call to Salesforce, but it’s not apparent that the call ever gets made. Plus, prior logging steps also don’t get executed. Can someone help me troubleshoot this issue? Here are a few sample failed events:
(links to three failed event executions)
There is no discernible pattern to these events - they are not cold starts, and there are successful events before and after. If I replay any of these events, they are always successful.

Additionally, under workflow settings, I have Automatically retry on errors checked, but these don’t seem to auto-retry.

It’s just a UI glitch. It doesn’t show you what was processed before ~~it timed out~~ the OOM error, or where ~~it timed out~~ the OOM error happened exactly. :disappointed:

Hopefully they’ll fix that eventually.

And possibly auto-retry just doesn’t work on ~~timeouts~~ OOM errors. :thinking_face: :man-shrugging:

Try increasing the ~~timeout~~ memory of your workflow and see if that helps.

I tried increasing the timeout from 2 minutes to 4 minutes recently, with no success. The workflow itself typically takes less than 3 seconds, or up to 16 seconds from a cold start, so 2 minutes should be more than enough.

If the duration is 0s, it’s most likely a memory issue. Try raising the memory and see if it helps.

Ah yeah, that’s right!

I got confused between timeouts & OOM errors. :man-facepalming:

Everything I described above actually applies to OOM errors, not timeouts.

Many users come here for help with this very issue… might be worth adding a note to the UI (for those failed runs) until it is resolved?

Thanks and @U05LWJMUX0A. I’ll try bumping up the memory. Just curious though, should memory usage change significantly from one run to the next? Why would an OOM error be resolved by retrying?

Usually not, but it depends a lot on the workflow. Retrying without increasing the memory will usually fail.

Depending on what your workflow processes, memory usage can be vastly different from one run to the next.

Especially if it processes files (for example with Google Drive) and the size of the files is unpredictable (video files, for example).
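To illustrate why file size matters so much (my own sketch, not from the thread; `file_url` and the per-chunk processing are placeholders, and I’m assuming Pipedream’s Python handler signature): loading a whole file into memory makes peak usage scale with the file size, while streaming it in chunks keeps memory roughly flat.

```python
import requests

def handler(pd: "pipedream"):
    file_url = pd.steps["trigger"]["event"]["file_url"]  # hypothetical trigger field

    # Loading the entire file into memory: peak usage scales with file size,
    # so an unexpectedly large video can blow past the workflow's memory limit.
    # data = requests.get(file_url).content

    # Streaming keeps memory roughly constant regardless of file size.
    with requests.get(file_url, stream=True) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=1 << 20):  # 1 MB chunks
            pass  # placeholder for real per-chunk processing
```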

Re-reading your specific case now, and I’m wondering if this might be a sign of a memory leak (i.e. you said “they are not cold starts” and that replays “are always successful”). :thinking_face:

Because if the executor remains warm for many consecutive runs, and memory is leaked from one run to the next… memory usage will keep going up until it hits an OOM error in an apparently random run.
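To make that concrete (my own illustration, not code from the thread, assuming Pipedream’s Python handler signature): anything kept at module scope survives between invocations while the executor stays warm, so state that only ever grows accumulates run after run until it hits the memory limit.

```python
# Module-level state persists across warm invocations of the same executor.
_seen_events = []

def handler(pd: "pipedream"):
    event = pd.steps["trigger"]["event"]

    # Leak: the list is never cleared, so every warm run holds on to one more
    # payload. Memory creeps up until an apparently random execution OOMs.
    _seen_events.append(event)

    return {"events_held_in_memory": len(_seen_events)}
```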

Does that make sense? :point_up_2: What do you think?

But until we have some stats on the memory usage of our workflow runs… we can’t really diagnose this at all. :man-shrugging:

Just recording how much memory is used vs free at the beginning & end of a workflow would be a huge help.

Then we could easily identify memory leaks if we see that the free memory keeps going down across successive runs.
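Until something like that exists, a rough workaround (my sketch, assuming Pipedream’s Python handler signature and the standard-library `resource` module) is to log the process’s memory high-water mark at the start and end of the step. If the value logged at the start of each warm run keeps climbing, that points to a leak.

```python
import resource

def _peak_rss_mb():
    # ru_maxrss is the peak resident set size; on Linux it's reported in KB.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

def handler(pd: "pipedream"):
    print(f"peak RSS at step start: {_peak_rss_mb():.1f} MB")

    # ... the actual work of the step goes here ...

    print(f"peak RSS at step end: {_peak_rss_mb():.1f} MB")
```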

Hmm, I wonder if there could be a pattern in the spacing of these failures. That might tell us something about memory leaks at the executor level. To give more context, this particular workflow is pretty uncomplicated: someone submits something on Jotform, which triggers the workflow; I take the Jotform JSON and do some manipulation, send an API call to Salesforce to gather some info, then another to write the output to Salesforce. Overall memory usage, I imagine, would be very small. Variable, yes, from the varying text submitted to Jotform, but it’s all just a few lines of text.
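For reference, here’s a rough sketch of that shape (my own code, not the actual step; the Jotform fields, Salesforce endpoints, object names, and auth handling are all placeholders). Nothing here should ever hold more than a few kilobytes of text in memory.

```python
import requests

def handler(pd: "pipedream"):
    # Hypothetical Jotform payload; real field names will differ.
    submission = pd.steps["trigger"]["event"]
    answers = {k: str(v).strip() for k, v in submission.get("answers", {}).items()}

    sf_base = "https://yourInstance.my.salesforce.com/services/data/v58.0"
    headers = {"Authorization": f"Bearer {pd.inputs['salesforce_token']}"}  # placeholder auth

    # 1) Gather some info from Salesforce.
    lookup = requests.get(
        f"{sf_base}/query",
        params={"q": "SELECT Id FROM Contact LIMIT 1"},  # placeholder SOQL
        headers=headers,
    )
    lookup.raise_for_status()

    # 2) Write the manipulated output back to Salesforce.
    write = requests.post(
        f"{sf_base}/sobjects/Task",  # placeholder object
        json={"Subject": "Jotform submission", "Description": str(answers)},
        headers=headers,
    )
    write.raise_for_status()

    return {"lookup": lookup.json(), "write": write.json()}
```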

Yeah, hard to imagine how that could result in a timeout or an OOM error.

But do you really get enough submissions for the executor to remain warm for long periods of time? (and thus process multiple successive requests without restart)

You have to get a new submission at least every 5 minutes, otherwise it would go cold.

Although… now that I think about it, we do occasionally run into timeouts in our workflows because of API requests which are stalling.

That might be a more likely culprit.
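If stalled requests are the culprit, one mitigation (again just a sketch, with placeholder endpoint and credentials) is to set an explicit per-request timeout on the `requests` calls, so a stalled call fails fast and raises a visible error instead of silently consuming the whole workflow timeout. A visible error is also something the “Automatically retry on errors” setting can act on.

```python
import requests

def handler(pd: "pipedream"):
    headers = {"Authorization": f"Bearer {pd.inputs['salesforce_token']}"}  # placeholder auth

    try:
        resp = requests.get(
            "https://yourInstance.my.salesforce.com/services/data/v58.0/query",
            params={"q": "SELECT Id FROM Contact LIMIT 1"},  # placeholder SOQL
            headers=headers,
            timeout=(5, 15),  # 5s to connect, 15s to read, instead of hanging indefinitely
        )
        resp.raise_for_status()
        return resp.json()
    except requests.Timeout as exc:
        # Fail loudly so the stall shows up as an error rather than a 0-step timeout.
        raise RuntimeError("Salesforce request stalled past the 15s read timeout") from exc
```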