Define Success Criteria and Signals

Before you can evaluate an agent you have to decide what 'good' looks like. This topic teaches you to translate a business goal into measurable success criteria and the concrete signals — task completion, tool-call accuracy, groundedness, safety — that prove the agent met them.

You cannot evaluate what you have not defined. The first move in any agent evaluation programme is to translate the business goal ("help customers resolve billing issues") into success criteria that name a dimension, a threshold, and a dataset, then bind each criterion to one or more signals the system can actually observe.
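
As a minimal sketch of that translation, the record below holds one criterion with its dimension, threshold, dataset, and signals. The class name, field names, dataset name, and threshold values are all hypothetical and chosen for illustration; they are not part of any Microsoft Foundry API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessCriterion:
    """One measurable statement of what 'good' means (hypothetical shape)."""
    dimension: str            # e.g. "task_completion"
    threshold: float          # minimum acceptable score, assumed to lie in [0, 1]
    dataset: str              # evaluation dataset the score is computed on
    signals: tuple[str, ...]  # observable values that report on this criterion

    def is_well_formed(self) -> bool:
        # A criterion needs a dimension, a dataset, and at least one signal,
        # plus a threshold inside the assumed score range.
        has_parts = bool(self.dimension and self.dataset and self.signals)
        return has_parts and 0.0 <= self.threshold <= 1.0


# Goal: "help customers resolve billing issues" (names and numbers are illustrative).
billing_criteria = [
    SuccessCriterion("task_completion", 0.90, "billing-tickets-v1", ("task_completion_score",)),
    SuccessCriterion("groundedness", 0.85, "billing-tickets-v1", ("groundedness_score",)),
]

# A vague goal with no dataset and no signal fails the check.
vague = SuccessCriterion("users are happier", 0.0, "", ())

print([c.is_well_formed() for c in billing_criteria], vague.is_well_formed())
```

Writing criteria down as data rather than prose makes a missing dataset or signal visible before any evaluation run starts.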

From goal to criterion to signal

Microsoft Foundry's evaluator taxonomy is a useful checklist because every criterion you write should land in one of these buckets:

Bucket	Example criterion	Signal
Quality	Coherent, fluent responses	Coherence / fluency evaluators
RAG-specific	Answer is supported by retrieved context	Groundedness, retrieval relevance
Agent-specific	Uses the right tool with the right args	Tool-call accuracy, task completion
Safety	No harmful or unsafe content	Hate/unfairness, violence, protected-material evaluators
Operational	Response within SLA	Latency, token cost, error rate

If a criterion doesn't fit one of these buckets, you usually haven't defined it precisely enough yet.
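
One way to apply this checklist mechanically is a lookup from signal to bucket. The sketch below simply restates the table above as a dictionary; the snake_case signal identifiers and the function name are assumptions made for the example, not official evaluator names.

```python
from typing import Optional

# The table above, restated as data. Identifiers are illustrative.
BUCKETS = {
    "quality": {"coherence", "fluency"},
    "rag": {"groundedness", "retrieval_relevance"},
    "agent": {"tool_call_accuracy", "task_completion"},
    "safety": {"hate_unfairness", "violence", "protected_material"},
    "operational": {"latency", "token_cost", "error_rate"},
}


def bucket_for(signal: str) -> Optional[str]:
    """Return the bucket a signal belongs to, or None if it fits nowhere."""
    for bucket, signals in BUCKETS.items():
        if signal in signals:
            return bucket
    return None


print(bucket_for("tool_call_accuracy"))
print(bucket_for("user_happiness"))
```

A `None` result is the cue to go back and restate the criterion more precisely.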

Exam framing: a "good" criterion is measurable, attributable to a fix, and tied to a dataset. If a question gives you a vague goal like "users are happier", the right answer almost always restates it as a measurable criterion plus the signal that tracks it.

Why per-dimension signals matter

When the agent regresses, the on-call engineer needs to know which dimension fell. A single aggregate score collapses the diagnosis into "something broke". Per-dimension evaluators let you read the regression as "tool-call accuracy dropped 8 points after the new MCP server shipped" — which directly points at the fix.
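
The difference can be illustrated with a small comparison of two evaluation runs. This is a sketch with made-up scores on an assumed 0 to 1 scale; the function name and the tolerance value are hypothetical.

```python
from statistics import mean


def regressions(baseline: dict, current: dict, tolerance: float = 0.02) -> dict:
    """Return the score change for every dimension that fell by more than `tolerance`."""
    return {
        dim: round(current[dim] - baseline[dim], 4)
        for dim in baseline
        if dim in current and baseline[dim] - current[dim] > tolerance
    }


# Illustrative scores before and after a release.
baseline = {"task_completion": 0.90, "tool_call_accuracy": 0.92, "groundedness": 0.88, "safety": 0.99}
current = {"task_completion": 0.90, "tool_call_accuracy": 0.84, "groundedness": 0.88, "safety": 0.99}

# The aggregate moves by about 2 points and names no cause.
aggregate_drop = mean(baseline.values()) - mean(current.values())

# The per-dimension view names the dimension that fell.
per_dimension = regressions(baseline, current)

print(round(aggregate_drop, 4), per_dimension)
```

With these numbers the mean falls by only 0.02, while the per-dimension view isolates an 8-point drop in tool-call accuracy.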

Where this shows up on the exam

GH-600 will hand you a scenario with a vague goal and ask which evaluation plan is correct. The wrong answers will be operational-only (latency, cost) or single-aggregate ("overall quality score"). The right answer always names a dimension, a threshold, and a dataset that the on-call engineer could act on tomorrow.

Anchor concepts

Key terms

Success criterion: A measurable, agent-specific statement of what 'done well' means for one task (for example, 'PR opens, CI passes, reviewer leaves no blocking comment').
Signal: An observable value — metric, log line, evaluator score, human label — that maps back to a success criterion.
Evaluator: A built-in or custom function in Microsoft Foundry that scores an agent response on a specific dimension such as groundedness, task adherence, or tool-call accuracy.
Quality gate: An automated check in CI/CD that blocks a release when one or more evaluator scores drop below a defined threshold.
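
A quality gate of the kind defined above can be sketched in a few lines. The threshold values, dimension names, and function names below are assumptions for illustration; in a real pipeline the scores would come from an evaluation run rather than a literal dictionary.

```python
def failed_dimensions(scores: dict, thresholds: dict) -> list:
    """List every dimension whose score is missing or below its threshold."""
    failures = []
    for dimension, minimum in thresholds.items():
        score = scores.get(dimension)
        # A missing score counts as a failure, so a skipped evaluator cannot pass silently.
        if score is None or score < minimum:
            failures.append(dimension)
    return failures


def gate_exit_code(scores: dict, thresholds: dict) -> int:
    """Return 1 (block the release) if any dimension fails, otherwise 0."""
    return 1 if failed_dimensions(scores, thresholds) else 0


# Illustrative thresholds and scores on an assumed 0 to 1 scale.
thresholds = {"task_completion": 0.90, "tool_call_accuracy": 0.90, "groundedness": 0.85}
scores = {"task_completion": 0.93, "tool_call_accuracy": 0.84, "groundedness": 0.88}

print(failed_dimensions(scores, thresholds), gate_exit_code(scores, thresholds))
```

A CI job could pass the returned value to `sys.exit` so that a non-zero code fails the pipeline step, and the list of failed dimensions tells the engineer where to look.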

Watch out

Common pitfalls

Defining success only as 'the agent responded' — that measures uptime, not quality, and lets silent regressions ship.
Picking metrics first and goals second: you end up optimising token cost while task completion silently drops.
Using a single overall score with no per-dimension breakdown, so when quality drops you can't tell whether it was reasoning, tool use, grounding, or safety.