Automated Scanning and Regression Detection

Automated scanning is how you catch the bad change before users do. This topic covers running evaluators in CI/CD, gating releases on thresholds, scheduling production evaluation, and using continuous evaluation plus red teaming to detect regressions and drift.

After this topic, you should be comfortable with pre-production evaluation, continuous evaluation, scheduled evaluation, quality gates, and AI red teaming.

A regression in an agent is rarely a stack trace — it is a silent drop in groundedness, task completion, or safety after a prompt change, a model swap, or a quietly updated upstream dataset. Automated scanning is the only way to see it before users do.

The Foundry detection stack

Layer	When it runs	What it catches
Pre-production evaluation	Every PR / release candidate	Regressions introduced by code, prompt, or tool changes
Continuous evaluation	Sampled live traffic, ongoing	Real-world regressions a labelled dataset missed
Scheduled evaluation	Recurring against a fixed test set	Drift over time (model updates, upstream data changes)
Scheduled red teaming	Recurring adversarial probes	Safety / security regressions, prompt-injection paths

A regression-detection plan is missing a layer if any of those rows has no corresponding check in your setup.

Wiring it into CI/CD

Pre-production evaluation belongs in the pipeline that builds your agent. The Foundry SDKs let you run an evaluator suite against a labelled dataset and emit per-dimension scores; the CI step that consumes them is the quality gate. A gate is well-formed when:

Every dimension you care about (groundedness, tool-call accuracy, task completion, safety) has its own threshold.
The gate fails when any dimension breaches, not only when the average does.
The gate output links to the evaluation run so the author can see which prompts regressed.
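
As a minimal sketch of such a gate, assume the evaluator suite has already produced one score per dimension on a 0 to 1 scale. The threshold values, the `check_gate` helper, and the run URL below are hypothetical placeholders for illustration and are not part of any Foundry SDK.

```python
# Hypothetical per-dimension minimums; tune these for your own agent.
THRESHOLDS = {
    "groundedness": 0.80,
    "tool_call_accuracy": 0.85,
    "task_completion": 0.80,
    "safety": 0.95,
}


def check_gate(scores, thresholds, run_url):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for dimension, minimum in thresholds.items():
        score = scores.get(dimension)
        if score is None:
            # A missing score is treated as a failure, not silently skipped.
            failures.append(f"{dimension}: no score reported (see {run_url})")
        elif score < minimum:
            failures.append(
                f"{dimension}: {score:.2f} is below {minimum:.2f} (see {run_url})"
            )
    return failures


# Illustrative scores: the average is about 0.88, yet groundedness breaches.
scores = {
    "groundedness": 0.62,
    "tool_call_accuracy": 0.97,
    "task_completion": 0.95,
    "safety": 0.99,
}
failures = check_gate(scores, THRESHOLDS, "https://example.invalid/eval-runs/123")
for message in failures:
    print("GATE FAILED:", message)

# A CI wrapper would pass this value to sys.exit() so the build fails.
exit_code = 1 if failures else 0
```

Each failure message carries the run link, so the author can go straight to the prompts that regressed. Checking dimensions one by one is what lets this gate fail even though the average of the four scores looks healthy.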

Exam framing: when you see "composite score is fine but X dropped", the right answer is to gate on X directly, because a healthy composite can hide a failing dimension.

Don't confuse operational telemetry with quality

Application Insights dashboards (latency, token usage, error rate) are necessary but they do not measure quality. An agent can return a 200 OK with a perfectly hallucinated answer; Azure Monitor will tell you everything is healthy. Quality regressions live in evaluator scores, not in HTTP status codes.
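
To make the distinction concrete, here is a small sketch built around a hypothetical response record. The field names, the latency budget, and the groundedness threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ResponseRecord:
    # Operational telemetry (hypothetical field names).
    status_code: int
    latency_ms: float
    # Evaluator output on a 0-1 scale, produced by a separate evaluation step.
    groundedness: float


def operationally_healthy(record, max_latency_ms=2000.0):
    """True when the request did not error and stayed within the latency budget."""
    return record.status_code < 500 and record.latency_ms <= max_latency_ms


def quality_ok(record, min_groundedness=0.8):
    """True when the evaluator score meets the assumed threshold."""
    return record.groundedness >= min_groundedness


# A fast 200 OK whose answer was not supported by the retrieved context.
hallucinated = ResponseRecord(status_code=200, latency_ms=340.0, groundedness=0.15)

print(operationally_healthy(hallucinated))  # True
print(quality_ok(hallucinated))             # False
```

The two checks read different fields, which is the point: no threshold on status code or latency can stand in for the evaluator score.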

Quick check

Which Foundry capability is designed to catch quality regressions in live traffic on an ongoing basis?

Where this shows up on the exam

Expect a scenario that lists existing monitoring (Azure Monitor, alerts on 5xx, latency dashboards) and asks why regressions still slip through. The answer is that quality lives in evaluator scores, and the missing piece is a layer from the Foundry detection stack — usually continuous evaluation or scheduled red teaming.

Anchor concepts

Key terms

Pre-production evaluation: A run of evaluators against a labelled dataset before deployment to catch regressions in quality, safety, or task completion.
Continuous evaluation: A Foundry pattern that runs quality and safety evaluators on a sampled fraction of production traffic to detect regressions under live conditions.
Scheduled evaluation: Recurring evaluation against a fixed test dataset that detects drift over time as models, prompts, or upstream data change.
Quality gate: A CI/CD check that fails the build when one or more evaluator scores fall below a threshold for the changed artifact.
AI red teaming: Automated adversarial testing (Foundry uses the PyRIT framework) that probes safety and security vulnerabilities before and after deployment.
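
The "sampled fraction of production traffic" in the continuous evaluation definition can be selected in many ways. The sketch below shows one generic approach, deterministic hash-based sampling on a trace identifier. It is an illustration only and not a description of how Foundry chooses samples; `should_evaluate` and the trace ID format are hypothetical.

```python
import hashlib


def should_evaluate(trace_id, sample_rate):
    """Deterministically select roughly `sample_rate` of traces for evaluation."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Hypothetical trace IDs; in practice these would come from request tracing.
trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled = [t for t in trace_ids if should_evaluate(t, 0.05)]

# Close to 0.05; the exact value depends on the hash of each ID.
print(len(sampled) / len(trace_ids))
```

Hashing the ID instead of drawing a random number means the same trace is always in or out of the sample, which keeps evaluation runs reproducible when a batch is reprocessed.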

Watch out

Common pitfalls

Setting a single global threshold and ignoring per-dimension drops — overall score holds while groundedness silently collapses.
Running evaluations only at release time. By then the regression is in main and the diff that caused it is buried.
Treating production telemetry (latency, 5xx) as the regression signal — quality regressions are invisible in operational metrics.
Never scheduling re-evaluation, so model or upstream-data drift accumulates without anyone noticing.
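
For the last pitfall, a scheduled job only helps if something compares its results over time. Below is a minimal sketch of such a comparison, assuming each scheduled run against the fixed test set yields one score per dimension. The `detect_drift` helper, the window size, and the allowed drop are hypothetical choices.

```python
from statistics import mean


def detect_drift(history, window=5, max_drop=0.05):
    """Flag drift when the latest score falls more than `max_drop` below
    the mean of the preceding `window` scheduled runs."""
    if len(history) < window + 1:
        return False  # not enough runs to form a baseline
    baseline = mean(history[-(window + 1):-1])
    return baseline - history[-1] > max_drop


# Hypothetical weekly groundedness scores on the same fixed test set.
weekly_groundedness = [0.91, 0.90, 0.92, 0.91, 0.90, 0.82]
print(detect_drift(weekly_groundedness))  # True
```

A rolling baseline like this catches sudden drops but can follow a slow decline downward without ever firing, so it is worth also comparing against a pinned baseline score recorded when the agent was last known to be good.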