Detect Failed, Partial and Stalled Agents

Learn the failure taxonomy for agent runs — hard failure, partial success, stalled, and silently wrong — and match observable signals (HTTP codes, traces, eval scores) to each mode so the orchestrator can react correctly.

An agent run that ends "successfully" isn't the same as a successful run. The exam expects you to recognise four distinct failure modes and the signals that distinguish them — because each mode demands a different recovery path.

The four failure modes

Mode	Process state	Output state	Typical signals
Hard failure	Crashed or exited non-zero	None or partial	Exception traceback, 5xx, runner timeout
Partial success	Exited cleanly, early	Some artifacts present	Exit 0 with a checkpoint but no final deliverable
Stalled	Still running	No progress	Repeated identical tool calls, flat artifact graph, heartbeat-only logs
Silent failure	Exited cleanly	Wrong	Logs claim success; eval / downstream metric disagrees

The mistake every team makes once: treating "no exceptions" as success. Silent failures by definition produce no exceptions. Catching them needs an external check — an eval, a downstream metric, a human spot-check — not a more verbose log.
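
The table above can be read as a decision order. The sketch below is one possible encoding, not a standard API: the `RunObservation` fields, the `classify` function, and the extra `UNVERIFIED` state (clean exit, deliverable present, but no external check has run yet) are all illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    OK = "ok"
    HARD_FAILURE = "hard failure"
    PARTIAL_SUCCESS = "partial success"
    STALLED = "stalled"
    SILENT_FAILURE = "silent failure"
    UNVERIFIED = "unverified"  # assumption: not in the table, added for this sketch


@dataclass
class RunObservation:
    # Hypothetical field names for signals an orchestrator might collect.
    running: bool                        # is the agent process still alive?
    exit_code: Optional[int]             # None while running, or if the runner killed it
    making_progress: bool                # new artifacts or diverse tool calls recently
    final_deliverable: bool              # was the final artifact produced?
    eval_passed: Optional[bool] = None   # external check; None if it has not run


def classify(obs: RunObservation) -> FailureMode:
    if obs.running:
        # While the process is alive, the only detectable problem is a stall.
        return FailureMode.OK if obs.making_progress else FailureMode.STALLED
    if obs.exit_code != 0:
        # Covers non-zero exits and a missing exit code (e.g. runner timeout).
        return FailureMode.HARD_FAILURE
    if not obs.final_deliverable:
        return FailureMode.PARTIAL_SUCCESS
    if obs.eval_passed is None:
        # A clean exit alone is not evidence of success.
        return FailureMode.UNVERIFIED
    return FailureMode.OK if obs.eval_passed else FailureMode.SILENT_FAILURE
```

Note that a clean exit with a deliverable never maps to `OK` on its own; only a passing external check does.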

What an orchestrator should monitor

Liveness: a heartbeat from the agent process.
Progress: are new artifacts being produced, and are tool calls diverse?
Resource budget: tokens, wall-clock, tool quota.
Outcome validation: did the post-action eval pass?
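
Liveness and resource budget are the two simplest checks in the list above to implement. A minimal sketch follows; the `Budget` and `Usage` types, the function names, and the 30-second heartbeat timeout are assumptions chosen for illustration, not values from any particular framework.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Budget:
    max_tokens: int
    max_wall_clock_s: float
    max_tool_calls: int


@dataclass
class Usage:
    tokens: int = 0
    wall_clock_s: float = 0.0
    tool_calls: int = 0


def is_alive(last_heartbeat_at: float, now: float, timeout_s: float = 30.0) -> bool:
    """Liveness: the agent is alive if a heartbeat arrived within timeout_s."""
    return (now - last_heartbeat_at) <= timeout_s


def exceeded_limits(budget: Budget, usage: Usage) -> List[str]:
    """Resource budget: return the name of every limit the run has exceeded."""
    over = []
    if usage.tokens > budget.max_tokens:
        over.append("tokens")
    if usage.wall_clock_s > budget.max_wall_clock_s:
        over.append("wall_clock")
    if usage.tool_calls > budget.max_tool_calls:
        over.append("tool_calls")
    return over
```

Neither check says anything about whether the work is correct or even advancing; they only bound how long and how expensively a run can go wrong.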

A stall is the gap between liveness and progress: the agent is alive but not advancing. Two common detectors are "N identical tool calls in a row" and "no new artifact written in T seconds while the agent is still consuming tokens."
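
Both stall rules described above fit in a small class. This is a sketch under stated assumptions: the `StallDetector` name, its methods, and the default thresholds (N = 3, T = 60 seconds) are hypothetical, and timestamps are passed in explicitly so the logic stays easy to test.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class StallDetector:
    def __init__(self, start: float, max_identical_calls: int = 3,
                 max_idle_seconds: float = 60.0) -> None:
        # Keep only the last N tool calls; older ones fall off automatically.
        self._recent_calls: Deque[Tuple[str, str]] = deque(maxlen=max_identical_calls)
        self._max_idle = max_idle_seconds
        self._last_artifact_at = start
        self._tokens_since_artifact = 0

    def record_tool_call(self, name: str, args: str) -> None:
        self._recent_calls.append((name, args))

    def record_tokens(self, count: int) -> None:
        self._tokens_since_artifact += count

    def record_artifact(self, now: float) -> None:
        self._last_artifact_at = now
        self._tokens_since_artifact = 0

    def stall_reason(self, now: float) -> Optional[str]:
        calls = self._recent_calls
        # Rule 1: the last N tool calls are identical (same name, same arguments).
        if len(calls) == calls.maxlen and len(set(calls)) == 1:
            return "repeated identical tool calls"
        # Rule 2: no artifact for T seconds while tokens are still being spent.
        idle = now - self._last_artifact_at
        if idle > self._max_idle and self._tokens_since_artifact > 0:
            return "no new artifact while consuming tokens"
        return None
```

Because the second rule requires token consumption, an agent that is merely waiting on one long-running tool call does not trip it, which a plain wall-clock timeout cannot distinguish.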

Exam tip: The single highest-leverage detector is outcome validation. Liveness/progress/budget catch the loud failures; only an external truth signal catches silent ones.
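
To make the tip concrete, here is a minimal sketch of outcome validation: the orchestrator runs its own named checks and compares the result with what the agent claimed. `Verdict`, `validate_outcome`, and the check names in the usage example are hypothetical; real checks would query an eval harness, a downstream metric, or the produced artifact itself.

```python
from typing import Callable, Dict, List, NamedTuple


class Verdict(NamedTuple):
    agent_claimed_success: bool
    failed_checks: List[str]

    @property
    def silent_failure(self) -> bool:
        # Silent failure: the agent says "done" but an external check disagrees.
        return self.agent_claimed_success and bool(self.failed_checks)


def validate_outcome(agent_claimed_success: bool,
                     checks: Dict[str, Callable[[], bool]]) -> Verdict:
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that cannot run is treated as failing
        if not ok:
            failed.append(name)
    return Verdict(agent_claimed_success, failed)


# Usage: the agent reported success, but one independent check fails.
verdict = validate_outcome(True, {
    "report_has_rows": lambda: False,   # stand-in for inspecting the real artifact
    "schema_matches": lambda: True,
})
print(verdict.silent_failure, verdict.failed_checks)
```

The key design choice is that the checks never read the agent's own logs or status message; they look only at the outcome.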

Where this shows up on the exam

Questions on failure detection usually present a log excerpt or a metrics description and ask which failure mode is in play and what the orchestrator should do next. If the run looks clean but the outcome looks wrong, the answer involves an external eval. If the run looks alive but flat, the answer involves a stall detector. Hard failures and partial successes are the easier two, so focus your prep on stalls and silent failures.

Anchor concepts

Key terms

Hard failure: The agent terminates abnormally — exception, non-zero exit, infrastructure timeout. The orchestrator sees a clear error.
Partial success: Some steps completed, but the goal was not reached. The agent stopped cleanly, leaving the state somewhere between the start and the desired end.
Stalled agent: The agent is still running but is making no observable progress — repeated identical tool calls, no new artifacts, heartbeat-only logs.
Silent failure: The agent claims success and its logs look clean, but an eval or downstream metric shows the task was not actually completed. The most dangerous mode.

Watch out

Common pitfalls

Treating 'no errors logged' as success — silent failures by definition produce clean logs.
Using wall-clock timeout as the only stall detector; a long-running tool call looks identical to a loop.
Counting tool-call success as task success without an eval or downstream observation.
Letting an agent retry the same failing tool call indefinitely; with no backoff or attempt cap, a 5xx becomes a budget incident.
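
The last pitfall has a well-known remedy: cap the attempts and back off between them. The sketch below uses exponential backoff with full jitter; `TransientToolError` is a hypothetical exception standing in for a retryable failure such as a 5xx, and the default limits are illustrative rather than recommended values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientToolError(Exception):
    """Hypothetical error standing in for a retryable failure such as a 5xx."""


def call_with_retries(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientToolError:
            if attempt == max_attempts:
                raise  # attempt cap reached: surface it as a hard failure
            # Exponential backoff, capped, with full jitter.
            ceiling = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(random.uniform(0.0, ceiling))
    raise AssertionError("unreachable")
```

Once the cap is reached the error propagates, so the orchestrator sees one clear hard failure instead of an open-ended loop that quietly burns budget.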