Recovery: Rollback and Human-in-the-Loop

Once a failure is detected, the orchestrator must choose between retry, resume from checkpoint, rollback, and human escalation. Learn how reversibility, blast radius, and idempotency drive that choice.

After this topic, you'll be confident about Retry, Resume from checkpoint, Rollback and 2 more concepts.

Detection is half the job. Once you know an agent has failed, you have to pick a recovery action. The exam evaluates this on three axes: reversibility of the failed step, blast radius of the side-effects, and idempotency of any retry.

The four recovery actions

| Action | Preconditions | Use when |
| --- | --- | --- |
| Retry (with backoff and attempt cap) | The action is idempotent or guarded by an idempotency key | Transient errors, network blips, rate limits |
| Resume from checkpoint | A durable, validated artifact exists before the failure point | Multi-step pipelines where earlier steps passed |
| Rollback (compensating transaction) | The side-effects of the failed step are reversible | Failed writes that did partially commit |
| Human-in-the-loop escalation | Failure is silent, irreversible, or ambiguous; high blast radius | Bad action already applied; orchestrator can't safely choose |
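The rollback row can be made concrete with a small sketch of compensating transactions. Everything here is illustrative: `run_with_rollback`, the `store` dictionary, and the two write steps are hypothetical names, and the sketch assumes every compensation is safe to call even when its step changed nothing.

```python
from typing import Callable

# A step pairs a forward action with the compensation that reverses it.
Step = tuple[Callable[[], None], Callable[[], None]]


def run_with_rollback(steps: list[Step]) -> bool:
    """Run steps in order; on any failure, compensate in reverse order."""
    undo_stack: list[Callable[[], None]] = []
    for do, undo in steps:
        # Register the undo first so a partially committed write is also
        # reverted. Assumption: every undo tolerates a step that did nothing.
        undo_stack.append(undo)
        try:
            do()
        except Exception:
            for compensate in reversed(undo_stack):
                compensate()
            return False
    return True


# Hypothetical two-step write against an in-memory store.
store: dict[str, int] = {}


def write_a() -> None:
    store["a"] = 1


def write_b() -> None:
    store["b"] = 2                       # the write partially commits...
    raise RuntimeError("step b failed")  # ...and then the step fails


ok = run_with_rollback([
    (write_a, lambda: store.pop("a", None)),
    (write_b, lambda: store.pop("b", None)),
])
```

After the failed run, `store` is back to its empty starting state. This only works because both writes are reversible; once a side-effect has left the system, an undo of this kind is no longer enough.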

Decision rules

  1. If the action is non-idempotent and lacks an idempotency key, never auto-retry. Either add the key or escalate.
  2. If a green checkpoint exists upstream of the failure, prefer resume over restart. Restart discards correct work and can drift.
  3. If side-effects have already escaped (email sent, payment charged), rollback alone is insufficient: you need a compensating action (apology email, refund) and probably human-in-the-loop (HITL) escalation.
  4. If the failure was silent (eval caught it, logs didn't), escalate. The orchestrator can't trust its own state.
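One way to read the four rules is as a single decision function. The sketch below is a hedged illustration, not a prescribed algorithm: the `Failure` fields and the `choose_recovery` name are hypothetical, and the order in which the rules are checked (silent first, then escaped side-effects, then retry, resume, rollback) is an assumption made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Recovery(Enum):
    RETRY = "retry"
    RESUME = "resume_from_checkpoint"
    ROLLBACK = "rollback"
    ESCALATE = "human_in_the_loop"


@dataclass(frozen=True)
class Failure:
    # Hypothetical field names chosen to mirror the rules above.
    transient: bool = False                  # timeout, network blip, rate limit
    idempotent: bool = False                 # safe to repeat as-is
    has_idempotency_key: bool = False        # repeat is deduplicated by a key
    green_checkpoint_upstream: bool = False  # validated artifact before failure
    side_effects_escaped: bool = False       # email sent, payment charged
    side_effects_reversible: bool = False
    silent: bool = False                     # eval caught it, logs didn't


def choose_recovery(f: Failure) -> Recovery:
    # Rule 4: a silent failure means orchestrator state is untrusted.
    if f.silent:
        return Recovery.ESCALATE
    # Rule 3: escaped side-effects need compensation and probably a human.
    if f.side_effects_escaped:
        return Recovery.ESCALATE
    # Rule 1: auto-retry only when repeating the action is safe.
    if f.transient:
        if f.idempotent or f.has_idempotency_key:
            return Recovery.RETRY
        return Recovery.ESCALATE
    # Rule 2: prefer resume over restart when a green checkpoint exists.
    if f.green_checkpoint_upstream:
        return Recovery.RESUME
    if f.side_effects_reversible:
        return Recovery.ROLLBACK
    # Nothing safe to automate, so hand the choice to a person.
    return Recovery.ESCALATE
```

For example, `choose_recovery(Failure(transient=True))` returns `Recovery.ESCALATE`, because a transient error on an action with no idempotency guarantee must not be auto-retried.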

What a good HITL escalation looks like

A reviewer awakened at 2 a.m. needs everything in one place: the trace ID, the failed step's input and output, the eval result that flagged the failure, the candidate recovery actions, and the orchestrator's recommendation. The Foundry workflow human-in-the-loop node is designed for exactly this: pause, present, await explicit approval, continue.

A bad escalation is a Slack ping that says "agent failed, please review." The reviewer has no context, defaults to "approve", and the silent failure becomes a production incident.
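The difference between the two escalations can be checked mechanically. The following is a minimal sketch of an escalation payload carrying the fields listed above; the `Escalation` class and `missing_context` helper are hypothetical names for illustration and are not part of any Foundry API.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Escalation:
    trace_id: str
    step_input: str
    step_output: str
    eval_result: str
    candidate_actions: tuple[str, ...]
    recommendation: str


def missing_context(e: Escalation) -> list[str]:
    """Return the names of fields the reviewer would find empty."""
    return [f.name for f in fields(e) if not getattr(e, f.name)]


# The "agent failed, please review" ping carries none of the context.
bare_ping = Escalation("", "", "", "", (), "")

# A hypothetical well-formed escalation with every field populated.
full = Escalation(
    trace_id="trace-123",
    step_input="refund order 42",
    step_output="refund issued twice",
    eval_result="duplicate-action eval failed",
    candidate_actions=("rollback", "escalate"),
    recommendation="rollback",
)
```

An orchestrator could refuse to page anyone while `missing_context` returns a non-empty list, so that an escalation with gaps never reaches the reviewer.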

Exam tip: The recovery action is a function of the failure mode (from the previous topic) and the reversibility of the side-effects. Memorise the matrix: transient + idempotent → retry; partial + checkpointed → resume; bad write + reversible → rollback; silent or irreversible → human.

Quick check

An agent's call to a payment API times out. The agent retries automatically. What's the design hazard?

Where this shows up on the exam

GH-600 recovery questions often present a failure scenario with multiple plausible actions. The trap is usually "retry" when the action is non-idempotent or "rollback" when the side-effect has already escaped. Read carefully for the words idempotent, irreversible, silent, and checkpoint; they are the keys to the right answer.

Anchor concepts

Key terms

Retry
Re-run the failing step with bounded attempts and backoff. Safe only for idempotent actions.
Resume from checkpoint
Restart from the last durable artifact (passing tests, committed plan) rather than from scratch.
Rollback
Revert the side-effects of the failed step to a known-good prior state (git revert, restore snapshot, compensating transaction).
Human-in-the-loop escalation
Pause the workflow and require explicit human input to choose the recovery action. Required for irreversible, high-blast-radius, or ambiguous failures.
Idempotency
An action can be applied multiple times with the same effect as applying it once. The prerequisite for safe retries.
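To see why idempotency is the prerequisite for retry, consider a charge that commits on the server but whose response is lost. The sketch below uses a hypothetical in-memory `FakePaymentAPI` (not a real provider's client) that deduplicates on an idempotency key, together with a bounded retry loop with exponential backoff. The key is generated once per logical operation, not once per attempt.

```python
import time
import uuid


class FakePaymentAPI:
    """Hypothetical in-memory stand-in for a payment provider."""

    def __init__(self, fail_first: int = 1) -> None:
        self.charges: dict[str, int] = {}  # idempotency key -> amount
        self._timeouts_left = fail_first

    def charge(self, amount: int, idempotency_key: str) -> str:
        # The charge is recorded before the simulated timeout: the
        # server committed, but the caller never saw the response.
        self.charges.setdefault(idempotency_key, amount)
        if self._timeouts_left > 0:
            self._timeouts_left -= 1
            raise TimeoutError("response lost after commit")
        return idempotency_key


def charge_with_retry(api: FakePaymentAPI, amount: int,
                      max_attempts: int = 3, base_delay: float = 0.01) -> str:
    key = str(uuid.uuid4())  # one key for the whole logical operation
    for attempt in range(max_attempts):
        try:
            return api.charge(amount, idempotency_key=key)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # attempt cap reached; let the orchestrator decide
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise ValueError("max_attempts must be at least 1")
```

With `FakePaymentAPI(fail_first=2)`, the call succeeds on the third attempt and `api.charges` still holds a single entry. Generating a fresh key inside the loop would instead record one charge per attempt, which is the double-billing hazard.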
Watch out

Common pitfalls

  • Retrying a non-idempotent action (charge card, send email) and double-billing or double-sending.
  • Rolling back without a compensating transaction for the irreversible side-effect (the email was already sent).
  • Resuming from a checkpoint that was itself produced by the failing run, which replays the same bug.
  • Escalating to a human without surfacing the trace, the rejected options, and the recommended action; reviewers without context default to 'approve'.
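The checkpoint pitfall above can be guarded against when selecting the resume point. The sketch below takes one reading of that pitfall: only checkpoints that were validated and lie strictly upstream of the failed step are eligible. The `Checkpoint` fields and the `pick_resume_point` name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Checkpoint:
    step: int        # pipeline position that produced the artifact
    artifact: str    # e.g. a commit hash or snapshot name
    validated: bool  # True only if its checks (tests, evals) passed


def pick_resume_point(checkpoints: list[Checkpoint],
                      failed_step: int) -> Optional[Checkpoint]:
    """Return the latest validated checkpoint strictly before the failure."""
    candidates = [
        c for c in checkpoints
        if c.validated and c.step < failed_step
    ]
    # With no eligible checkpoint there is nothing safe to resume from.
    return max(candidates, key=lambda c: c.step, default=None)
```

A `None` result means resume is off the table, and the orchestrator has to fall back to one of the other recovery actions.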