Define Success Criteria and Signals
Before you can evaluate an agent you have to decide what 'good' looks like. This topic teaches you to translate a business goal into measurable success criteria and the concrete signals (task completion, tool-call accuracy, groundedness, safety) that show the agent met them.
You cannot evaluate what you have not defined. The first move in any agent evaluation programme is to translate the business goal ("help customers resolve billing issues") into success criteria that name a dimension, a threshold, and a dataset, then bind each criterion to one or more signals the system can actually observe.
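As a minimal sketch of what "a dimension, a threshold, and a dataset, bound to a signal" might look like in practice, the billing goal could be written down as plain data. Everything named here (`SuccessCriterion`, the signal names, the thresholds, and the `billing-golden-set` dataset) is a hypothetical illustration, not part of any Microsoft Foundry API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessCriterion:
    """One measurable criterion: a dimension, a signal, a threshold, a dataset."""
    dimension: str            # which bucket the criterion belongs to
    signal: str               # the observable value that is measured
    threshold: float          # the pass/fail boundary (assumed value)
    dataset: str              # the evaluation dataset the score is computed on
    higher_is_better: bool = True

    def is_met(self, observed: float) -> bool:
        """Compare an observed signal value against the threshold."""
        if self.higher_is_better:
            return observed >= self.threshold
        return observed <= self.threshold


# Hypothetical criteria for "help customers resolve billing issues".
# All names and numbers below are assumptions chosen for illustration.
billing_criteria = [
    SuccessCriterion("task completion", "task_completion_rate", 0.90,
                     "billing-golden-set"),
    SuccessCriterion("tool-call accuracy", "tool_call_accuracy", 0.95,
                     "billing-golden-set"),
    SuccessCriterion("latency", "p95_latency_seconds", 5.0,
                     "billing-golden-set", higher_is_better=False),
]
```

Writing criteria as data rather than prose makes the gap visible: a criterion that cannot fill in all four fields is not yet precise enough to evaluate.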
From goal to criterion to signal
Microsoft Foundry's evaluator taxonomy is a useful checklist because every criterion you write should land in one of these buckets:
| Bucket | Example criterion | Signal |
| --- | --- | --- |
| Quality | Coherent, fluent responses | Coherence / fluency evaluators |
| RAG-specific | Answer is supported by retrieved context | Groundedness, retrieval relevance |
| Agent-specific | Uses the right tool with the right args | Tool-call accuracy, task completion |
| Safety | No harmful or unsafe content | Hate/unfairness, violence, protected-material evaluators |
| Operational | Response within SLA | Latency, token cost, error rate |
If a criterion doesn't fit one of these buckets, you usually haven't defined it precisely enough yet.
Exam framing: a "good" criterion is measurable, attributable to a fix, and tied to a dataset. If a question gives you a fluffy goal like "users are happier", the right answer almost always restates it as a measurable criterion + signal.
Why per-dimension signals matter
When the agent regresses, the on-call engineer needs to know which dimension fell. A single aggregate score collapses the diagnosis into "something broke". Per-dimension evaluators let you read the regression as "tool-call accuracy dropped 8 points after the new MCP server shipped", which points directly at the fix.
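The difference can be sketched with a small comparison of two evaluation runs. The score dictionaries, the `find_regressions` helper, and the 0.02 tolerance are all hypothetical; real scores would come from whatever evaluators you run.

```python
from statistics import mean


def find_regressions(baseline, current, tolerance=0.02):
    """Return (dimension, delta) pairs that fell by more than `tolerance`,
    sorted so the largest drop comes first."""
    drops = []
    for dimension, base_score in baseline.items():
        if dimension not in current:
            continue  # dimension not measured in the current run
        delta = round(current[dimension] - base_score, 4)
        if delta < -tolerance:
            drops.append((dimension, delta))
    return sorted(drops, key=lambda pair: pair[1])


# Hypothetical per-dimension scores before and after a release.
baseline = {"task_completion": 0.91, "tool_call_accuracy": 0.94,
            "groundedness": 0.88, "safety": 0.99}
current = {"task_completion": 0.90, "tool_call_accuracy": 0.86,
           "groundedness": 0.88, "safety": 0.99}

# The aggregate only says that something moved.
aggregate_change = mean(current.values()) - mean(baseline.values())

# The per-dimension view names the dimension that fell.
regressions = find_regressions(baseline, current)
```

With these made-up numbers the aggregate shifts by only a couple of points, while `regressions` singles out `tool_call_accuracy` with a drop of 0.08, which is the kind of reading an on-call engineer can act on.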
Quick check
Your team ships a customer-support agent. Which set of success criteria is well-formed for evaluation?
Where this shows up on the exam
GH-600 will hand you a scenario with a vague goal and ask which evaluation plan is correct. The wrong answers will be operational-only (latency, cost) or single-aggregate ("overall quality score"). The right answer always names a dimension, a threshold, and a dataset that the on-call engineer could act on tomorrow.
Key terms
- Success criterion: A measurable, agent-specific statement of what 'done well' means for one task (for example, 'PR opens, CI passes, reviewer leaves no blocking comment').
- Signal: An observable value (metric, log line, evaluator score, human label) that maps back to a success criterion.
- Evaluator: A built-in or custom function in Microsoft Foundry that scores an agent response on a specific dimension such as groundedness, task adherence, or tool-call accuracy.
- Quality gate: An automated check in CI/CD that blocks a release when one or more evaluator scores drop below a defined threshold.
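A quality gate of the kind defined above can be sketched in a few lines. This is an illustrative outline only: `failing_dimensions`, `run_gate`, and the threshold values are assumptions, and the scores are hard-coded where a real pipeline would load them from an evaluation run.

```python
def failing_dimensions(scores, thresholds):
    """Return the dimensions whose score is missing or below its minimum."""
    failures = []
    for dimension, minimum in thresholds.items():
        score = scores.get(dimension)
        if score is None or score < minimum:
            failures.append(dimension)
    return failures


def run_gate(scores, thresholds):
    """Print each failure and return a process-style exit code (0 = pass)."""
    failures = failing_dimensions(scores, thresholds)
    for dimension in failures:
        print(f"GATE FAIL: {dimension} = {scores.get(dimension)} "
              f"(minimum {thresholds[dimension]})")
    return 1 if failures else 0


# Hypothetical thresholds and scores; none of these values come from the source.
thresholds = {"task_completion": 0.90, "tool_call_accuracy": 0.95,
              "groundedness": 0.85}
scores = {"task_completion": 0.92, "tool_call_accuracy": 0.86,
          "groundedness": 0.88}

exit_code = run_gate(scores, thresholds)
```

In a CI job the returned code would typically be passed to `sys.exit` so that a non-zero value blocks the release. Treating a missing score as a failure is a deliberate choice here: a dimension that silently stops being measured should not pass the gate.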
Common pitfalls
- Defining success only as 'the agent responded': that measures uptime, not quality, and lets silent regressions ship.
- Picking metrics first and goals second: you end up optimising token cost while task completion silently drops.
- Using a single overall score with no per-dimension breakdown, so when quality drops you can't tell whether it was reasoning, tool use, grounding, or safety.