Quantitative vs Qualitative Signals

Healthy agent programmes combine cheap quantitative signals that run on every build with slower qualitative signals that catch what numbers can't see. This topic shows how to pair them so neither side ships blind.

Healthy agent evaluation runs on two legs: cheap quantitative signals that fire on every build, and slower qualitative signals that catch what numbers miss. Neither leg is sufficient alone, and the exam will punish you for picking just one.

The two legs side by side

Dimension	Quantitative	Qualitative
Speed	Seconds	Minutes to days
Scale	Every request, every build	Sampled subset
Examples	Latency, token cost, groundedness score, tool-call accuracy, error rate	Helpfulness rubric, tone, partial credit, edge-case behaviour
Producer	Automated evaluator, runtime telemetry	Human reviewer, LLM-as-judge against a rubric
Gates CI?	Yes	Usually not directly — feeds back into thresholds
Failure mode	Misses tone, sycophancy, subtle hallucinations	Doesn't scale, can't gate every PR
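The "Gates CI?" row can be made concrete with a small threshold check. This is a minimal sketch, not a Foundry API: the metric names, the limits and the gate function are all hypothetical and only illustrate how quantitative signals can block a build.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    limit: float
    higher_is_better: bool


# Hypothetical thresholds for illustration; tune them to your own baseline.
THRESHOLDS = {
    "groundedness": Threshold(limit=0.80, higher_is_better=True),
    "tool_call_accuracy": Threshold(limit=0.90, higher_is_better=True),
    "error_rate": Threshold(limit=0.02, higher_is_better=False),
    "p95_latency_s": Threshold(limit=3.0, higher_is_better=False),
}


def gate(metrics, thresholds=THRESHOLDS):
    """Return one message per failed or missing metric; an empty list means the build passes."""
    failures = []
    for name, t in thresholds.items():
        if name not in metrics:
            # A metric that was never reported should not silently pass the gate.
            failures.append(f"{name}: no value reported")
            continue
        value = metrics[name]
        ok = value >= t.limit if t.higher_is_better else value <= t.limit
        if not ok:
            bound = "min" if t.higher_is_better else "max"
            failures.append(f"{name}: {value} ({bound} {t.limit})")
    return failures
```

In a CI job, a non-empty result from gate would typically fail the build. Qualitative review does not appear here on purpose: its findings feed back into the limits rather than blocking each PR.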

How they cooperate

The Microsoft Foundry pattern is explicit: run automated evaluators continuously, then sample-and-review a slice of traffic with humans (or an LLM-as-judge calibrated against humans) for the things numbers cannot see. The qualitative pass does two jobs:

Catch the failure modes that automated metrics under-weight (tone, sincerity, partial helpfulness).
Recalibrate the quantitative evaluators so they keep tracking ground truth as models and data drift.
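The two jobs above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the record fields (auto_score, reviewer_ok), the function names and the hash-based sampling scheme are hypothetical and are not taken from Foundry.

```python
import hashlib


def in_review_sample(request_id, rate):
    """Deterministically pick roughly `rate` of requests by hashing the request ID.

    Hashing (instead of random.random) means the same request always gets
    the same decision, which makes the review sample reproducible.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


def missed_by_metrics(reviewed, pass_threshold):
    """Job 1: reviewed records the automated score passed but the reviewer rejected."""
    return [
        r for r in reviewed
        if r["auto_score"] >= pass_threshold and not r["reviewer_ok"]
    ]


def false_pass_rate(reviewed, pass_threshold):
    """Job 2: share of automated passes that reviewers rejected.

    A rising value suggests the evaluator or its threshold needs recalibrating.
    """
    passed = [r for r in reviewed if r["auto_score"] >= pass_threshold]
    if not passed:
        return 0.0
    return len(missed_by_metrics(reviewed, pass_threshold)) / len(passed)
```

Tracking false_pass_rate over time is one simple way to notice that an automated evaluator has stopped agreeing with reviewers. The right sampling rate and the point at which to recalibrate depend on traffic volume and reviewer capacity.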

Exam framing: when a scenario gives you only one type of signal, the right answer is almost always to add the other type — and to specify how they will be kept aligned.

Where this shows up on the exam

Look for questions that ask you to design an evaluation plan or pick the failure most likely to slip past a given setup. If the setup is "100% automated metrics", the slip is tone or subtle hallucination. If the setup is "weekly human review", the slip is a regression that shipped on Tuesday and went unseen until the next review on Friday.

Anchor concepts

Key terms

Quantitative signal: A numeric measurement an automated evaluator or pipeline can produce at scale — latency, token cost, groundedness score, tool-call accuracy, error rate.
Qualitative signal: A judgement-based observation produced by a human reviewer, a labelled dataset, or an LLM-as-judge rubric — tone, helpfulness, edge-case behaviour.
LLM-as-judge: An evaluator pattern where one model scores another model's output against a rubric; sits between fully automated metrics and human review.
Sample-and-review: Production technique that runs automated evaluators on a sampled fraction of traffic and routes a subset to humans for qualitative review.

Watch out

Common pitfalls

Treating an LLM-as-judge score as ground truth. It is a calibrated proxy — recalibrate against human labels regularly or it drifts.
Relying only on qualitative review: it doesn't scale, can't gate a release in CI, and slows the iteration loop to a crawl.
Mixing the two but never aligning them — your dashboards show one truth and your weekly review meeting argues a different one.
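One way to guard against the first and third pitfalls is to measure how well the LLM-as-judge agrees with human labels on the same samples. The sketch below uses Cohen's kappa, a standard chance-corrected agreement statistic. The choice of kappa, the pass/fail labels and the 0.6 cut-off are assumptions for illustration, not something the Foundry pattern prescribes.

```python
from collections import Counter

# Hypothetical cut-off; pick a value that matches your own risk tolerance.
MIN_KAPPA = 0.6


def cohens_kappa(human, judge):
    """Chance-corrected agreement between two label lists of equal length."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement if both raters labelled independently at their own base rates.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / n**2
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def judge_needs_recalibration(human, judge, min_kappa=MIN_KAPPA):
    """True when agreement with human labels falls below the chosen cut-off."""
    return cohens_kappa(human, judge) < min_kappa
```

Running a check like this on each review batch gives the dashboard and the review meeting a shared number to argue about, which is the alignment step the third pitfall says is often skipped.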