Engineering · 8 min read

Designing checkpoints for resumable agent workflows

Long-running agent runs fail in the middle. Here's how we model checkpoints so a 30-step workflow can resume from step 18 instead of starting over.

A multi-agent run is a sequence of expensive, side-effecting operations. A model call times out, a tool returns a 503, a rate limit trips — and if your only option is to start the whole workflow over, you've thrown away every token and every minute spent on the steps that already succeeded. For a 30-step research workflow, restarting from scratch isn't an inconvenience; it's the difference between a feature that ships and one that quietly gets turned off.

The fix is to make runs resumable. That means treating progress as durable state rather than something that lives only in memory for the lifetime of a request. In LoopLlama, the unit of durability is the step.

Why a step is the natural checkpoint boundary

LoopLlama executes a crew sequentially: each agent takes a turn, sees the original input plus the output of every agent before it, and produces a result. That hand-off point — the moment one agent's turn ends and the next begins — is the cleanest boundary to checkpoint on. The output of a completed step is final; nothing downstream can change it. So once a step finishes, we can write it down and never recompute it.
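The sequential hand-off described above can be sketched in a few lines of Python. This is an illustration only, not LoopLlama's actual API: the `Agent` signature, `run_crew`, and the three toy agents are all hypothetical names invented for the example.

```python
from typing import Callable, List

# Hypothetical agent signature (an assumption for this sketch): an agent
# receives the original input and the outputs of all earlier agents.
Agent = Callable[[str, List[str]], str]


def run_crew(agents: List[Agent], original_input: str) -> List[str]:
    """Run agents one after another, accumulating their outputs."""
    outputs: List[str] = []
    for agent in agents:
        # Each agent sees the original input plus every earlier output.
        result = agent(original_input, list(outputs))
        # Hand-off point: the step is complete and its output is final,
        # so this is where a checkpoint would be written.
        outputs.append(result)
    return outputs


# Toy agents standing in for real model calls.
def researcher(task: str, prior: List[str]) -> str:
    return f"notes on {task}"


def writer(task: str, prior: List[str]) -> str:
    return f"draft from [{prior[0]}]"


def editor(task: str, prior: List[str]) -> str:
    return f"edit of [{prior[-1]}] after {len(prior)} steps"


outputs = run_crew([researcher, writer, editor], "checkpoints")
```

Because each agent receives a copy of the accumulated outputs, a later step can read earlier results but cannot alter them, which is what makes the end of a turn a safe place to checkpoint.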

Checkpointing mid-step is far messier: you'd have to capture partial model output, in-flight tool calls, and provider-side state you don't control. By making the step atomic (it either completes and is persisted, or it is treated as if it never ran), we keep the resumption logic simple and correct.
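One way to get that all-or-nothing behavior is to write a step's output inside a single database transaction, and only after the step has fully completed. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the table layout and the `run_step_atomically` helper are assumptions made for illustration, not LoopLlama's storage layer.

```python
import sqlite3
from typing import Callable


def make_store() -> sqlite3.Connection:
    """Create an in-memory store with a hypothetical steps table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE steps ("
        "run_id TEXT, step_index INTEGER, output TEXT, "
        "PRIMARY KEY (run_id, step_index))"
    )
    return conn


def run_step_atomically(
    conn: sqlite3.Connection,
    run_id: str,
    step_index: int,
    step_fn: Callable[[], str],
) -> bool:
    """Run a step; persist its output only if it completes."""
    try:
        output = step_fn()  # may raise: timeout, 503, rate limit
    except Exception:
        # No partial output is stored. A fuller version could also
        # record a "failed" status row here.
        return False
    # The connection context manager commits on success and rolls
    # back if the block raises, so the row is written whole or not at all.
    with conn:
        conn.execute(
            "INSERT INTO steps VALUES (?, ?, ?)",
            (run_id, step_index, output),
        )
    return True


def flaky_step() -> str:
    raise TimeoutError("model call timed out")


conn = make_store()
ok_first = run_step_atomically(conn, "run-1", 0, lambda: "outline")
ok_second = run_step_atomically(conn, "run-1", 1, flaky_step)
persisted = conn.execute(
    "SELECT step_index, output FROM steps ORDER BY step_index"
).fetchall()
```

After this runs, only step 0 has a row. The failed step left no partial output behind, so there is nothing to clean up before retrying it.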

What we persist at each checkpoint

After every step, LoopLlama records a durable row capturing exactly what's needed to continue the run later:

  • The step index and the agent role that produced it.
  • The agent's output — the artifact later steps build on.
  • Token counts and timing, for usage metering and for reading the run back like a trace.
  • The step status, so a failed step is distinguishable from one that never ran.

We deliberately do not persist transient model state. The contract is that, for the purposes of resumption, a step depends only on the accumulated context (the original input plus the outputs of earlier steps). So we only need to store that context and each step's output, not the internal scratchpad of any single model call.
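As a rough picture of such a row, here is a minimal sketch using a frozen dataclass. The field names, the two status values, and the `status_of` helper are assumptions chosen to mirror the list above; the real schema may differ.

```python
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Dict, Optional


class StepStatus(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass(frozen=True)
class StepRecord:
    """Hypothetical shape of one persisted checkpoint row."""
    step_index: int
    agent_role: str
    status: StepStatus
    output: Optional[str]  # None when the step failed
    input_tokens: int
    output_tokens: int
    duration_ms: int

    def to_row(self) -> dict:
        """Flatten to plain values suitable for a database row."""
        row = asdict(self)
        row["status"] = self.status.value
        return row


def status_of(records: Dict[int, StepRecord], step_index: int) -> str:
    """A step with no row never ran; a row says how it ended."""
    record = records.get(step_index)
    return "pending" if record is None else record.status.value


records = {
    16: StepRecord(16, "analyst", StepStatus.SUCCEEDED, "summary", 1200, 350, 4200),
    17: StepRecord(17, "writer", StepStatus.FAILED, None, 900, 0, 30000),
}
row = records[16].to_row()
```

Here `status_of(records, 17)` reports a failed step while `status_of(records, 18)` reports a pending one, which is the distinction the status field exists to make. Note that nothing in the record holds model-internal state.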

Resuming a run

When a run resumes, LoopLlama loads the persisted steps, reconstructs the accumulated context from their outputs, and dispatches the next pending step. Steps that already succeeded are never re-executed, so a run that died at step 18 picks up at step 18 — not step one. Because the resume path reuses the same execution machinery as the initial run, there's no separate, lightly tested 'recovery mode' to drift out of sync.
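The resume path can be sketched as a single function that serves both the first attempt and every later one. Everything here is hypothetical: `resume_run`, the dict standing in for the durable store, and the toy agents exist only to illustrate the load, rebuild, dispatch sequence.

```python
from typing import Callable, Dict, List

Agent = Callable[[str, List[str]], str]


def resume_run(
    agents: List[Agent],
    original_input: str,
    persisted: Dict[int, str],
) -> List[int]:
    """Run all pending steps and return the indices executed this time.

    `persisted` maps step index -> output of a succeeded step and stands
    in for the durable store. An empty dict means a fresh run.
    """
    # Rebuild the accumulated context from steps that already succeeded.
    context: List[str] = []
    next_index = 0
    while next_index in persisted:
        context.append(persisted[next_index])
        next_index += 1

    executed: List[int] = []
    for index in range(next_index, len(agents)):
        output = agents[index](original_input, list(context))
        persisted[index] = output  # checkpoint before moving on
        context.append(output)
        executed.append(index)
    return executed


calls: List[str] = []


def make_agent(name: str, fail_once: bool = False) -> Agent:
    state = {"failed": False}

    def agent(task: str, prior: List[str]) -> str:
        if fail_once and not state["failed"]:
            state["failed"] = True
            raise TimeoutError(f"{name} timed out")
        calls.append(name)
        return f"{name}({len(prior)})"

    return agent


agents = [make_agent("a"), make_agent("b"), make_agent("c", fail_once=True), make_agent("d")]
store: Dict[int, str] = {}

try:
    resume_run(agents, "task", store)  # first attempt dies at step 2
except TimeoutError:
    pass

executed = resume_run(agents, "task", store)  # picks up at step 2
```

The second call executes only steps 2 and 3; agents `a` and `b` are each called exactly once across both attempts. The initial run is simply a resume with an empty store, so there is only one code path to test.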

This same checkpoint model is what makes human-in-the-loop pauses and per-step retries possible: they're all just a run that stops at a step boundary and is asked to continue. Get the boundary right, and resumability, pausing, and replay all fall out of the same design.

Written by The LoopLlama team.
