Quantitative vs Qualitative Signals
Healthy agent programmes combine cheap quantitative signals that run on every build with slower qualitative signals that catch what numbers can't see. This topic shows how to pair them so neither side ships blind.
Healthy agent evaluation runs on two legs: cheap quantitative signals that fire on every build, and slower qualitative signals that catch what numbers miss. Neither leg is sufficient alone, and the exam will punish you for picking just one.
The two legs side by side
| Dimension | Quantitative | Qualitative |
| --- | --- | --- |
| Speed | Seconds | Minutes to days |
| Scale | Every request, every build | Sampled subset |
| Examples | Latency, token cost, groundedness score, tool-call accuracy, error rate | Helpfulness rubric, tone, partial credit, edge-case behaviour |
| Producer | Automated evaluator, runtime telemetry | Human reviewer, LLM-as-judge against a rubric |
| Gates CI? | Yes | Usually not directly; feeds back into thresholds |
| Failure mode | Misses tone, sycophancy, subtle hallucinations | Doesn't scale, can't gate every PR |
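To make the "Gates CI?" row concrete, here is a minimal sketch of a quantitative gate, assuming the metrics have already been computed and collected into a plain dictionary. The metric names, the `Threshold` type, the `gate` function, and every limit value are hypothetical placeholders for illustration; none of them come from a Foundry API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    higher_is_better: bool


# Hypothetical limits; real values would come from your own baselines.
THRESHOLDS = [
    Threshold("groundedness", 0.85, higher_is_better=True),
    Threshold("tool_call_accuracy", 0.90, higher_is_better=True),
    Threshold("p95_latency_s", 4.0, higher_is_better=False),
    Threshold("error_rate", 0.02, higher_is_better=False),
]


def gate(metrics: dict[str, float], thresholds=THRESHOLDS) -> list[str]:
    """Return a list of failure messages; an empty list means the build passes."""
    failures = []
    for t in thresholds:
        value = metrics.get(t.metric)
        if value is None:
            # A metric that was never reported is treated as a failure, not a pass.
            failures.append(f"{t.metric}: missing")
            continue
        ok = value >= t.limit if t.higher_is_better else value <= t.limit
        if not ok:
            failures.append(f"{t.metric}: {value} violates limit {t.limit}")
    return failures
```

A CI step could call `gate(...)` and exit non-zero when the returned list is non-empty. Treating a missing metric as a failure is a design choice in this sketch, so that a silently broken evaluator cannot let a build through.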
How they cooperate
The Microsoft Foundry pattern is explicit: run automated evaluators continuously, then sample-and-review a slice of traffic with humans (or an LLM-as-judge calibrated against humans) for the things numbers cannot see. The qualitative pass does two jobs:
- Catch the failure modes that automated metrics under-weight (tone, sincerity, partial helpfulness).
- Recalibrate the quantitative evaluators so they keep tracking ground truth as models and data drift.
Exam framing: when a scenario gives you only one type of signal, the right answer is almost always to add the other type, and to specify how the two will be kept aligned.
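The sample-and-review step described above can be sketched as a deterministic sampler that decides which requests go to the qualitative queue. This is one possible approach under stated assumptions, not the Foundry implementation: the function names, the `request_id` strings, and the 5% default rate are all hypothetical.

```python
import hashlib


def in_review_sample(request_id: str, rate: float) -> bool:
    """Decide deterministically whether a request belongs to the review sample.

    Hashing the ID (rather than calling a random generator) means the same
    request always gets the same decision, which keeps reruns reproducible.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def review_queue(request_ids: list[str], rate: float = 0.05) -> list[str]:
    """Return the subset of request IDs to route to human or LLM-as-judge review."""
    return [rid for rid in request_ids if in_review_sample(rid, rate)]
```

Because the decision depends only on the ID and the rate, raising the rate later only adds requests to the sample; everything reviewed at the lower rate stays in it.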
Quick check
Which split between quantitative and qualitative signals fits the Foundry evaluation lifecycle best?
Where this shows up on the exam
Look for questions that ask you to design an evaluation plan or pick the failure most likely to slip past a given setup. If the setup is "100% automated metrics", the slip is tone or subtle hallucination. If the setup is "weekly human review", the slip is a regression that shipped on Tuesday and went unseen until Friday.
Key terms
- Quantitative signal
- A numeric measurement an automated evaluator or pipeline can produce at scale: latency, token cost, groundedness score, tool-call accuracy, error rate.
- Qualitative signal
- A judgement-based observation produced by a human reviewer, a labelled dataset, or an LLM-as-judge rubric: tone, helpfulness, edge-case behaviour.
- LLM-as-judge
- An evaluator pattern where one model scores another model's output against a rubric; sits between fully automated metrics and human review.
- Sample-and-review
- Production technique that runs automated evaluators on a sampled fraction of traffic and routes a subset to humans for qualitative review.
Common pitfalls
- Treating an LLM-as-judge score as ground truth. It is a calibrated proxy: recalibrate against human labels regularly or it drifts.
- Relying only on qualitative review: it doesn't scale, can't gate a release in CI, and slows the iteration loop to a crawl.
- Mixing the two but never aligning them: your dashboards show one truth and your weekly review meeting argues a different one.
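One way to act on the first and third pitfalls is to measure how often the LLM-as-judge agrees with human reviewers on the same sampled items. The sketch below uses Cohen's kappa on pass/fail labels; kappa is a technique swapped in here for illustration, not something the source prescribes, and the `needs_recalibration` helper and its 0.6 cut-off are hypothetical.

```python
def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Chance-corrected agreement between two pass/fail raters on the same items."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n  # share of items the judge passed
    p_human = sum(human) / n  # share of items the humans passed
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        # Both raters gave every item the same single label.
        return 1.0
    return (observed - expected) / (1 - expected)


def needs_recalibration(kappa: float, min_kappa: float = 0.6) -> bool:
    """Flag the judge for recalibration when agreement drops below the cut-off.

    The 0.6 default is an assumed placeholder, not a recommended value.
    """
    return kappa < min_kappa
```

Tracking this number alongside the automated metrics gives the dashboard and the review meeting a shared figure to argue about.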