Weight 18%·7 topics

Evaluation, Error Analysis & Tuning

Define success signals, analyze failures by category, and tune instructions, workflows and tool usage.

1
Define Success Criteria and Signals
Before you can evaluate an agent, you have to decide what 'good' looks like. This topic teaches you to translate a business goal into measurable success criteria and the concrete signals (task completion, tool-call accuracy, groundedness, safety) that show whether the agent met them.
⏱ 8 min·+40 XP·medium
2
Quantitative vs Qualitative Signals
Healthy agent programmes combine cheap quantitative signals that run on every build with slower qualitative signals that catch what numbers can't see. This topic shows how to pair them so neither side ships blind.
⏱ 7 min·+40 XP·medium
3
Automated Scanning and Regression Detection
Automated scanning is how you catch the bad change before users do. This topic covers running evaluators in CI/CD, gating releases on thresholds, scheduling production evaluation, and using continuous evaluation plus red teaming to detect regressions and drift.
⏱ 9 min·+45 XP·medium
4
Failure Analysis from Traces and Artifacts
When an agent run goes wrong, the trace is your crime scene. This topic shows how to read a Foundry / OpenTelemetry-style trace, identify the failing span, and use the surrounding artifacts (instructions, tool args, retrieved context, model output) to assign root cause.
⏱ 10 min·+50 XP·hard
5
Classify Failure Modes: Reasoning, Tool, Context
A useful failure taxonomy turns a vague 'the agent broke' into an actionable fix. This topic teaches the three core agent failure modes — reasoning, tool, and context — and how to match observable signals to each so you fix the right layer.
⏱ 9 min·+50 XP·hard
6
Tune Instructions, Workflows and Constraints
Tuning an agent is rarely about changing the model — it is about tightening the instructions, the workflow steps, and the constraints around tool use. This topic walks through how to iterate on those levers safely using evaluators as a guard.
⏱ 10 min·+50 XP·medium
7
Refine Memory and Tool Usage
Once instructions and workflow are tight, the next levers are the agent's memory layer and its tool catalog. This topic shows how to refine what the agent remembers and which tools it can pick from — using evaluator signals to decide what to add, prune, or scope.
⏱ 9 min·+45 XP·medium