Resumability and Durable Artifacts

A real agent run can outlive a single process: pods restart, users go to bed, sessions get handed off. This topic covers the durable artifacts and checkpoints that let an agent resume mid-task instead of starting over.

After this topic, you'll be confident about four concepts: durable artifact, checkpoint, idempotent step, and resume token.

A long-running agent is a distributed system. If your design assumes the process lives forever, the first restart costs the user an hour of work. The fix has two parts: durable artifacts that record what the agent has produced, and resume tokens that let a new caller pick the run back up.

What "durable" means here

A durable artifact is an output that survives the process. The canonical set:

The thread / run record in a managed agent service (Foundry threads, Copilot SDK sessions).
The plan document written to repo, blob, or thread.
The draft PR with the partial diff and a reference to the run ID.
The memory writes that have already been committed to the store.

If you can't enumerate the durable artifacts of a run, you can't resume it.
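
One way to make that enumeration concrete is to keep a small manifest per run. The sketch below is only an illustration: `RunManifest`, `ArtifactRef`, and `missingKinds` are hypothetical names chosen for this example, not part of Foundry, the Copilot SDK, or any other agent service.

```typescript
// Hypothetical manifest types: none of these names come from a real SDK.
type ArtifactKind = "thread" | "plan" | "draft-pr" | "memory-write";

interface ArtifactRef {
  kind: ArtifactKind;
  // Where the artifact lives: a thread ID, file path, PR URL, or store key.
  location: string;
}

interface RunManifest {
  runId: string;
  artifacts: ArtifactRef[];
}

// Lists which required artifact kinds the run has not produced yet.
function missingKinds(manifest: RunManifest, required: ArtifactKind[]): ArtifactKind[] {
  const present = new Set(manifest.artifacts.map((a) => a.kind));
  return required.filter((kind) => !present.has(kind));
}

// Example values are made up for illustration.
const manifest: RunManifest = {
  runId: "run-123",
  artifacts: [
    { kind: "thread", location: "thread-abc" },
    { kind: "plan", location: "docs/plan.md" },
  ],
};

const missing = missingKinds(manifest, ["thread", "plan", "draft-pr"]);
```

A resuming process could read a manifest like this first and decide which steps still need to run from what is missing, rather than from anything held in memory.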

The resume contract

// 1. Caller starts a run and gets back a resume token (here, the runId/threadId pair).
const { runId, threadId } = await agent.start({ task });

// 2. Caller (or a fresh process) resumes later.
const result = await agent.resume({ runId, threadId });

The contract requires two properties:

Durable thread store — the run's history is reconstructible from (runId, threadId).
Idempotent step boundaries — replaying the last incomplete step is safe.
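
The two properties can be sketched as follows. This is a minimal illustration, not the API of any managed agent service: `startRun`, `resumeRun`, and `RunRecord` are hypothetical, the `Map` is an in-memory stand-in for a store that would really have to outlive the process, and the calls are synchronous for brevity.

```typescript
// In-memory stand-in for a durable thread store. A real store must survive
// the process (a database, blob storage, or a managed thread service).
interface RunRecord {
  threadId: string;
  task: string;
  completedSteps: string[];
}

interface ResumeToken {
  runId: string;
  threadId: string;
}

const store = new Map<string, RunRecord>();

// Hypothetical start: persists the run record before handing back the token.
function startRun(task: string): ResumeToken {
  const runId = `run-${store.size + 1}`;
  const threadId = `thread-${store.size + 1}`;
  store.set(runId, { threadId, task, completedSteps: [] });
  return { runId, threadId };
}

// Hypothetical resume: rebuilds what is left to do from the store alone,
// using nothing held in memory by the original caller.
function resumeRun(token: ResumeToken, plan: string[]): string[] {
  const record = store.get(token.runId);
  if (!record || record.threadId !== token.threadId) {
    throw new Error(`unknown run: ${token.runId}`);
  }
  const done = new Set(record.completedSteps);
  return plan.filter((step) => !done.has(step));
}

// Example: one step completed before the interruption.
const plan = ["write-plan", "apply-diff", "open-pr"];
const token = startRun("refactor auth module");
store.get(token.runId)!.completedSteps.push("write-plan");
const remaining = resumeRun(token, plan);
```

Here `remaining` holds the steps that were not recorded as complete, so the resumed run picks up at `apply-diff` instead of starting over.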

Checkpoint where it's safe to replay

The cheapest way to make resume safe is to checkpoint between idempotent steps, never in the middle of a side-effect. A good rule:

Checkpoint after a side-effect has produced a verifiable artifact (PR opened, comment posted, memory written) — never during one.
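
A minimal sketch of that rule, assuming a run is a list of steps and the checkpoint is just the index of the next step to execute. `Step`, `runFrom`, and the in-memory `checkpoints` map are hypothetical; a real checkpoint would be written to durable storage.

```typescript
interface Step {
  name: string;
  run: () => void; // performs the side-effect
}

// Stand-in for durable storage: index of the next step to run, per run ID.
const checkpoints = new Map<string, number>();

function runFrom(runId: string, steps: Step[]): void {
  const startAt = checkpoints.get(runId) ?? 0;
  for (let i = startAt; i < steps.length; i++) {
    steps[i].run();                // the side-effect happens first
    checkpoints.set(runId, i + 1); // checkpoint only after it has landed
  }
}

// Demo: the second step fails once, simulating a killed pod.
const log: string[] = [];
let crashOnce = true;
const steps: Step[] = [
  { name: "open-pr", run: () => { log.push("open-pr"); } },
  {
    name: "post-comment",
    run: () => {
      if (crashOnce) {
        crashOnce = false;
        throw new Error("simulated pod kill");
      }
      log.push("post-comment");
    },
  },
];

try {
  runFrom("run-1", steps);
} catch {
  // The process died here; only the checkpoint survives.
}
runFrom("run-1", steps); // a fresh process resumes from the checkpoint
```

In this demo the first step runs once, not twice, because its checkpoint was written before the crash. If a crash instead falls between a side-effect and its checkpoint write, that step is replayed on resume, which is why each step also needs to be safe to repeat.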

If a step isn't naturally idempotent, give it an idempotency key (e.g. a run-derived hash on the request) so re-execution becomes a no-op when the side-effect already landed.
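
As a rough sketch of the idempotency-key idea: the key below is a plain `runId:stepName` string rather than a hash, and the dedupe table is an in-memory `Map`. Both are simplifying assumptions. `idempotencyKey`, `openPullRequest`, and the `example.invalid` URL are hypothetical, and in practice the record of applied keys would live with the service receiving the request.

```typescript
// Stand-in for the receiving service's record of requests already applied.
const applied = new Map<string, string>();

// Hypothetical key scheme: the same run and step always yield the same key.
// A production key might be a hash of these fields instead.
function idempotencyKey(runId: string, stepName: string): string {
  return `${runId}:${stepName}`;
}

// Opens a PR at most once per key; a replay returns the original result.
function openPullRequest(key: string): string {
  const existing = applied.get(key);
  if (existing !== undefined) {
    return existing; // the side-effect already landed, so this is a no-op
  }
  const prUrl = `https://example.invalid/pr/${applied.size + 1}`;
  applied.set(key, prUrl);
  return prUrl;
}

const key = idempotencyKey("run-123", "open-pr");
const first = openPullRequest(key);
const replay = openPullRequest(key); // e.g. re-executed after a crash
```

Both calls return the same URL and only one entry is recorded, so replaying the step after a crash does not open a second PR.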

Quick check

An agent is mid-way through a 20-minute refactor when its pod is killed. What is the *minimum* infrastructure that lets it resume without losing work?

Where this shows up on the exam

Look for "the run was interrupted — what should happen?" prompts. The right answer combines a durable thread, a resume token, and step-level idempotency. Restart-from-scratch and "make the run faster" are the trap answers.

Anchor concepts

Key terms

Durable artifact: An output written to durable storage during a run — a thread, plan document, draft PR, partial diff — that lets the agent (or a human) resume from a known good point.
Checkpoint: A serialised snapshot of agent state taken at a well-defined point in the loop, so a resumed run can rehydrate exactly where it left off.
Idempotent step: A step whose effect is the same whether it runs once or many times — the precondition for safe resume after a crash.
Resume token: A stable handle (thread ID, run ID, session ID) that lets a caller reconnect to an in-flight or paused agent run.

Watch out

Common pitfalls

Storing state only in process memory: a pod restart, a deployment, or a timeout means the entire run is lost and the user starts over.
Checkpointing inside a non-idempotent step: resume replays the side-effect and you double-charge the card / double-open the PR.
Issuing no resume token: even with durable artifacts, the user has no way to point a fresh session back at the right run.