Recovery: Rollback and Human-in-the-Loop
Once a failure is detected, the orchestrator must choose between retry, resume from checkpoint, rollback, and human escalation. Learn how reversibility, blast radius, and idempotency drive that choice.
Detection is half the job. Once you know an agent has failed, you have to pick a recovery action. The exam evaluates this on three axes: reversibility of the failed step, blast radius of the side-effects, and idempotency of any retry.
The four recovery actions
| Action | Preconditions | Use when |
| --- | --- | --- |
| Retry (with backoff and attempt cap) | The action is idempotent or guarded by an idempotency key | Transient errors, network blips, rate limits |
| Resume from checkpoint | A durable, validated artifact exists before the failure point | Multi-step pipelines where earlier steps passed |
| Rollback (compensating transaction) | The side-effects of the failed step are reversible | Failed writes that did partially commit |
| Human-in-the-loop escalation | Failure is silent, irreversible, or ambiguous; high blast radius | Bad action already applied; orchestrator can't safely choose |
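The retry row can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: `TransientError`, `retry_with_backoff`, and the `action(idempotency_key)` calling convention are hypothetical names, and it assumes the receiving service deduplicates requests that carry the same key.

```python
import random
import time
import uuid


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, rate limits)."""


def retry_with_backoff(action, *, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call action(idempotency_key) until it succeeds or the attempt cap is hit.

    The same key is passed on every attempt so that a request which succeeded
    but whose response was lost can be deduplicated by the receiver
    (assumption: the receiver honours the key).
    """
    idempotency_key = str(uuid.uuid4())  # generated once, reused across attempts
    for attempt in range(1, max_attempts + 1):
        try:
            return action(idempotency_key)
        except TransientError:
            if attempt == max_attempts:
                raise  # cap reached: the orchestrator must pick another action
            # Exponential backoff with full jitter before the next attempt.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

Only `TransientError` is retried; any other exception propagates immediately, which keeps the retry path limited to the transient failures the table names.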
Decision rules
- If the action is non-idempotent and lacks an idempotency key, never auto-retry. Either add the key or escalate.
- If a green (validated) checkpoint exists upstream of the failure, prefer resume over restart. A restart discards correct work, and a rerun may not reproduce the earlier steps the same way.
- If side-effects have already escaped (email sent, payment charged), rollback alone is insufficient: you need a compensating action (apology email, refund) and probably human-in-the-loop (HITL) escalation.
- If the failure was silent (eval caught it, logs didn't), escalate. The orchestrator can't trust its own state.
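The rules above can be read as a small decision function. The sketch below is one possible encoding: the `Failure` fields and the order in which the checks run are assumptions made for illustration, not something the rules themselves specify.

```python
from dataclasses import dataclass
from enum import Enum


class Recovery(Enum):
    RETRY = "retry"
    RESUME = "resume from checkpoint"
    ROLLBACK = "rollback"
    ESCALATE = "human-in-the-loop escalation"


@dataclass(frozen=True)
class Failure:
    # Field names are illustrative, not taken from any framework.
    silent: bool = False                # caught by an eval, not by logs
    side_effects_escaped: bool = False  # e.g. email sent, payment charged
    transient: bool = False             # timeout, rate limit, network blip
    idempotent_or_keyed: bool = False   # safe to repeat
    partial_write: bool = False         # the failed step partially committed
    reversible: bool = False            # its side-effects can be reverted
    green_checkpoint: bool = False      # validated artifact upstream of the failure


def choose_recovery(f: Failure) -> Recovery:
    # Silent or escaped failures go to a human first (assumed precedence).
    if f.silent or f.side_effects_escaped:
        return Recovery.ESCALATE
    # Never auto-retry a non-idempotent action that lacks a key.
    if f.transient:
        return Recovery.RETRY if f.idempotent_or_keyed else Recovery.ESCALATE
    # A partial commit is rolled back only if it is reversible.
    if f.partial_write:
        return Recovery.ROLLBACK if f.reversible else Recovery.ESCALATE
    # Otherwise prefer resuming from validated upstream work.
    if f.green_checkpoint:
        return Recovery.RESUME
    return Recovery.ESCALATE  # ambiguous: the orchestrator cannot safely choose
```

For example, `choose_recovery(Failure(transient=True))` returns `Recovery.ESCALATE`, because the action is neither idempotent nor guarded by a key.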
What a good HITL escalation looks like
A reviewer awakened at 2 a.m. needs everything in one place: the trace ID, the failed step's input and output, the eval result that flagged the failure, the candidate recovery actions, and the orchestrator's recommendation. The Foundry workflow human-in-the-loop node is designed for exactly this: pause, present, await explicit approval, continue.
A bad escalation is a Slack ping that says "agent failed, please review." The reviewer has no context, defaults to "approve", and the silent failure becomes a production incident.
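One way to keep escalations from degrading into a bare ping is to model the payload explicitly and refuse to send it while fields are empty. The sketch below is a plain-Python illustration; the `Escalation` class and its field names are hypothetical and are not the Foundry node's API.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Escalation:
    # Hypothetical payload mirroring the items a reviewer needs in one place.
    trace_id: str
    step_input: str
    step_output: str
    eval_result: str
    candidate_actions: tuple[str, ...]
    recommendation: str

    def missing_context(self) -> list[str]:
        """Names of empty fields, so a thin escalation can be blocked."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def render(self) -> str:
        """Plain-text summary to present to the reviewer."""
        return "\n".join([
            f"Trace: {self.trace_id}",
            f"Failed step input: {self.step_input}",
            f"Failed step output: {self.step_output}",
            f"Eval result: {self.eval_result}",
            f"Candidate actions: {', '.join(self.candidate_actions)}",
            f"Recommendation: {self.recommendation}",
        ])
```

An orchestrator could call `missing_context()` before pausing the workflow and only page the reviewer once the list is empty.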
Exam tip: The recovery action is a function of the failure mode (from the previous topic) and the reversibility of the side-effects. Memorise the matrix: transient + idempotent → retry; partial + checkpointed → resume; bad write + reversible → rollback; silent or irreversible → human.
Quick check
An agent's call to a payment API times out. The agent retries automatically. What's the design hazard?
Where this shows up on the exam
GH-600 recovery questions often present a failure scenario with multiple plausible actions. The trap is usually "retry" when the action is non-idempotent or "rollback" when the side-effect has already escaped. Read carefully for the words idempotent, irreversible, silent, and checkpoint; they are the keys to the right answer.
Key terms
- Retry: Re-run the failing step with bounded attempts and backoff. Safe only for idempotent actions.
- Resume from checkpoint: Restart from the last durable artifact (passing tests, committed plan) rather than from scratch.
- Rollback: Revert the side-effects of the failed step to a known-good prior state (git revert, restore snapshot, compensating transaction).
- Human-in-the-loop escalation: Pause the workflow and require explicit human input to choose the recovery action. Required for irreversible, high-blast-radius, or ambiguous failures.
- Idempotency: An action can be applied multiple times with the same effect as applying it once. The prerequisite for safe retries.
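To make the rollback term concrete, here is a minimal sketch of compensating transactions. The name `run_with_rollback` and the `(do, undo)` pairing are illustrative assumptions, and the sketch assumes each `undo` is safe to call even when its `do` only partially applied.

```python
def run_with_rollback(steps):
    """Run (do, undo) pairs in order; on failure, revert in reverse order.

    Each undo is registered before its do runs, so a step that partially
    commits and then raises is also reverted. Assumes every undo tolerates
    a partially applied step. The original exception is re-raised after
    the rollback so the caller still sees the failure.
    """
    undo_stack = []
    for do, undo in steps:
        undo_stack.append(undo)
        try:
            do()
        except Exception:
            while undo_stack:
                undo_stack.pop()()  # most recent step is reverted first
            raise


# Usage: the second step writes one key, then fails before finishing.
state = {}

def write_a():
    state["a"] = 1

def write_b_then_fail():
    state["b"] = 2
    raise RuntimeError("write failed midway")

steps = [
    (write_a, lambda: state.pop("a", None)),
    (write_b_then_fail, lambda: state.pop("b", None)),
]

try:
    run_with_rollback(steps)
except RuntimeError:
    pass  # state has been restored to its prior, empty value
```

This only covers side-effects that are still reversible; effects that have already escaped, such as a sent email, need a separate compensating action.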
Common pitfalls
- Retrying a non-idempotent action (charge card, send email) and double-billing or double-sending.
- Rolling back without a compensating transaction for the irreversible side-effect (the email was already sent).
- Resuming from a checkpoint that was itself produced by the failing run, which replays the same bug.
- Escalating to a human without surfacing the trace, the rejected options, and the recommended action; reviewers without context default to 'approve'.
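The checkpoint pitfall can be guarded against when choosing a resume point. The following sketch uses a hypothetical `Checkpoint` record and assumes each checkpoint records which step produced it and whether an independent check (tests, an eval) passed.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Checkpoint:
    # Hypothetical record; field names are assumptions for illustration.
    step_index: int   # position of the step that produced the artifact
    validated: bool   # True only if an independent check passed


def pick_resume_point(
    checkpoints: Iterable[Checkpoint], failed_step_index: int
) -> Optional[Checkpoint]:
    """Return the latest validated checkpoint strictly before the failed step.

    Checkpoints written by the failed step itself, or by later steps, are
    excluded so the resume does not replay the same bug. Returns None when
    nothing qualifies, in which case resume is not an option.
    """
    candidates = [
        c for c in checkpoints
        if c.validated and c.step_index < failed_step_index
    ]
    return max(candidates, key=lambda c: c.step_index, default=None)
```

When the function returns `None`, the orchestrator falls back to one of the other recovery actions rather than resuming from an untrusted artifact.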