Recovery: Rollback and Human-in-the-Loop

Once a failure is detected, the orchestrator must choose between retry, resume from checkpoint, rollback, and human escalation. Learn how reversibility, blast radius, and idempotency drive that choice.

Detection is half the job. Once you know an agent has failed, you have to pick a recovery action. The exam evaluates this on three axes: reversibility of the failed step, blast radius of the side-effects, and idempotency of any retry.

The four recovery actions

Action	Preconditions	Use when
Retry (with backoff and attempt cap)	The action is idempotent or guarded by an idempotency key	Transient errors, network blips, rate limits
Resume from checkpoint	A durable, validated artifact exists before the failure point	Multi-step pipelines where earlier steps passed
Rollback (compensating transaction)	The side-effects of the failed step are reversible	Failed writes that partially committed
Human-in-the-loop escalation	Failure is silent, irreversible, or ambiguous; high blast radius	Bad action already applied; orchestrator can't safely choose
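
The retry row can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: `TransientError`, `retry_with_backoff`, and the `idempotent` flag are hypothetical names, and the caller is assumed to know whether the action is idempotent or guarded by an idempotency key.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, rate limits)."""


def retry_with_backoff(
    action: Callable[[], T],
    *,
    idempotent: bool,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a transient failure with an attempt cap and exponential backoff."""
    if not idempotent:
        # Precondition from the table: never auto-retry an unguarded action.
        raise ValueError("refusing to auto-retry a non-idempotent action")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except TransientError:
            if attempt == max_attempts:
                raise  # cap reached: let the orchestrator pick another action
            # Exponential backoff with jitter: up to 0.5s, 1s, 2s, ...
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")
```

Only errors classified as transient are retried; anything else propagates immediately so the orchestrator can consider resume, rollback, or escalation instead.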

Decision rules

If the action is non-idempotent and lacks an idempotency key, never auto-retry. Either add the key or escalate.
If a green checkpoint exists upstream of the failure, prefer resume over restart. Restart discards correct work and can drift.
If side-effects have already escaped (email sent, payment charged), rollback alone is insufficient: you need a compensating action (apology email, refund) and probably human-in-the-loop (HITL) escalation.
If the failure was silent (eval caught it, logs didn't), escalate. The orchestrator can't trust its own state.
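
The rules above can be read as a small decision function. The sketch below is one possible encoding: the `Failure` fields and the order in which the checks run are assumptions made for illustration, not something the rules themselves prescribe.

```python
from dataclasses import dataclass
from enum import Enum


class Recovery(Enum):
    RETRY = "retry"
    RESUME = "resume from checkpoint"
    ROLLBACK = "rollback"
    ESCALATE = "human-in-the-loop escalation"


@dataclass(frozen=True)
class Failure:
    # Hypothetical description of a detected failure.
    transient: bool = False             # timeout, rate limit, network blip
    idempotent: bool = False            # idempotent or guarded by an idempotency key
    silent: bool = False                # caught by an eval, not by logs
    side_effects_escaped: bool = False  # e.g. email sent, payment charged
    bad_write: bool = False             # the failed step wrote bad or partial state
    reversible: bool = False            # that write can be undone
    has_green_checkpoint: bool = False  # validated checkpoint upstream of the failure


def choose_recovery(f: Failure) -> Recovery:
    # Silent failures and escaped side-effects go to a human first
    # (assumed ordering: the most dangerous conditions are checked first).
    if f.silent or f.side_effects_escaped:
        return Recovery.ESCALATE
    if f.bad_write:
        return Recovery.ROLLBACK if f.reversible else Recovery.ESCALATE
    if f.transient:
        # Never auto-retry a non-idempotent action without a key.
        return Recovery.RETRY if f.idempotent else Recovery.ESCALATE
    if f.has_green_checkpoint:
        return Recovery.RESUME
    return Recovery.ESCALATE  # ambiguous: the orchestrator can't safely choose
```

For example, `choose_recovery(Failure(transient=True, idempotent=False))` returns `Recovery.ESCALATE`, which is the case of a timed-out payment call that carries no idempotency key.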

What a good HITL escalation looks like

A reviewer awakened at 2 a.m. needs everything in one place: the trace ID, the failed step's input and output, the eval result that flagged the failure, the candidate recovery actions, and the orchestrator's recommendation. The Foundry workflow human-in-the-loop node is designed for exactly this — pause, present, await explicit approval, continue.

A bad escalation is a Slack ping that says "agent failed, please review." The reviewer has no context, defaults to "approve", and the silent failure becomes a production incident.
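
One way to keep escalations from degrading into a bare ping is to make the context a required data structure. The sketch below is illustrative only: `EscalationPacket` and its field names are hypothetical and do not correspond to the Foundry human-in-the-loop node's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationPacket:
    """Hypothetical bundle of everything a reviewer needs in one place."""

    trace_id: str
    step_input: str
    step_output: str
    eval_result: str
    candidate_actions: tuple[str, ...]
    recommendation: str

    def __post_init__(self) -> None:
        # Refuse to build a context-free escalation.
        required = {
            "trace_id": self.trace_id,
            "eval_result": self.eval_result,
            "candidate_actions": self.candidate_actions,
            "recommendation": self.recommendation,
        }
        missing = [name for name, value in required.items() if not value]
        if missing:
            raise ValueError(f"escalation is missing context: {', '.join(missing)}")
        if self.recommendation not in self.candidate_actions:
            raise ValueError("recommendation must be one of the candidate actions")

    def render(self) -> str:
        """Format the packet as plain text for the reviewer."""
        options = "\n".join(f"  - {action}" for action in self.candidate_actions)
        return (
            f"Trace ID: {self.trace_id}\n"
            f"Failed step input: {self.step_input}\n"
            f"Failed step output: {self.step_output}\n"
            f"Eval result: {self.eval_result}\n"
            f"Candidate recovery actions:\n{options}\n"
            f"Recommendation: {self.recommendation}"
        )
```

Because construction fails when the trace ID, eval result, candidate actions, or recommendation is missing, an "agent failed, please review" message cannot be produced from this type.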

Exam tip: The recovery action is a function of the failure mode (from the previous topic) and the reversibility of the side-effects. Memorise the matrix: transient + idempotent → retry; partial + checkpointed → resume; bad write + reversible → rollback; silent or irreversible → human.

Quick check

An agent's call to a payment API times out. The agent retries automatically. What's the design hazard?

Where this shows up on the exam

GH-600 recovery questions often present a failure scenario with multiple plausible actions. The trap is usually "retry" when the action is non-idempotent or "rollback" when the side-effect has already escaped. Read carefully for the words idempotent, irreversible, silent, and checkpoint — they are the keys to the right answer.

Anchor concepts

Key terms

Retry: Re-run the failing step with bounded attempts and backoff. Safe only for idempotent actions.
Resume from checkpoint: Continue from the last durable, validated artifact (passing tests, committed plan) rather than restarting from scratch.
Rollback: Revert the side-effects of the failed step to a known-good prior state (git revert, restore snapshot, compensating transaction).
Human-in-the-loop escalation: Pause the workflow and require explicit human input to choose the recovery action. Required for irreversible, high-blast-radius, or ambiguous failures.
Idempotency: The property that an action can be applied multiple times with the same effect as applying it once. The prerequisite for safe retries.
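
To make the idempotency-key idea concrete, here is a small in-memory sketch. `FakePaymentGateway` is a hypothetical stand-in, not a real payment API; it assumes the server stores the result of each key and returns it on repeat calls.

```python
import uuid


class FakePaymentGateway:
    """In-memory stand-in for a payment API that honours idempotency keys."""

    def __init__(self) -> None:
        self._results: dict[str, str] = {}        # idempotency key -> charge id
        self.charges: list[tuple[str, int]] = []  # (charge id, amount in cents)

    def charge(self, amount_cents: int, idempotency_key: str) -> str:
        # A repeated key returns the original result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        charge_id = f"ch_{len(self.charges) + 1}"
        self.charges.append((charge_id, amount_cents))
        self._results[idempotency_key] = charge_id
        return charge_id


gateway = FakePaymentGateway()
key = str(uuid.uuid4())  # generated once per logical operation, reused on retry

first = gateway.charge(2500, idempotency_key=key)
retry = gateway.charge(2500, idempotency_key=key)  # e.g. sent again after a timeout

print(first == retry, len(gateway.charges))
```

The retry returns the same charge ID and the ledger holds a single charge. Generating a fresh key for each attempt would defeat the guard, so the key has to be created once and carried through every retry of the same operation.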

Watch out

Common pitfalls

Retrying a non-idempotent action (charge card, send email) and double-billing or double-sending.
Rolling back without a compensating transaction for the irreversible side-effect (the email was already sent).
Resuming from a checkpoint that was itself produced by the failing run — replaying the same bug.
Escalating to a human without surfacing the trace, the rejected options, and the recommended action; reviewers without context default to 'approve'.
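
The rollback pitfall can be illustrated with a log of undo actions. This is a sketch under stated assumptions: `CompensationLog` is a hypothetical helper, and a step recorded with `undo=None` stands for a side-effect that has already escaped and cannot be reverted automatically.

```python
from typing import Callable, Optional

Undo = Callable[[], None]


class CompensationLog:
    """Records how to undo each completed step, newest last."""

    def __init__(self) -> None:
        self._steps: list[tuple[str, Optional[Undo]]] = []

    def record(self, name: str, undo: Optional[Undo]) -> None:
        # undo=None marks a step whose side-effect has escaped (e.g. a sent email).
        self._steps.append((name, undo))

    def rollback(self) -> list[str]:
        """Undo reversible steps in reverse order; return the steps left for a human."""
        needs_human: list[str] = []
        while self._steps:
            name, undo = self._steps.pop()
            if undo is None:
                needs_human.append(name)
            else:
                undo()
        return needs_human


db = {"order": "created"}
log = CompensationLog()
log.record("create order", lambda: db.pop("order"))
log.record("send confirmation email", None)

unresolved = log.rollback()
print(db, unresolved)
```

After the rollback the order record is gone, but the email step comes back in `unresolved`. That list is what should feed a compensating action and an escalation, since reverting state alone leaves the sent email in place.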