Automated Scanning and Regression Detection
Automated scanning is how you catch the bad change before users do. This topic covers running evaluators in CI/CD, gating releases on thresholds, scheduling production evaluation, and using continuous evaluation plus red teaming to detect regressions and drift.
A regression in an agent is rarely a stack trace. More often it is a silent drop in groundedness, task completion, or safety after a prompt change, a model swap, or a quietly updated upstream dataset. Automated scanning is the only way to see it before users do.
The Foundry detection stack
| Layer | When it runs | What it catches |
| --- | --- | --- |
| Pre-production evaluation | Every PR / release candidate | Regressions introduced by code, prompt, or tool changes |
| Continuous evaluation | Sampled live traffic, ongoing | Real-world regressions a labelled dataset missed |
| Scheduled evaluation | Recurring against a fixed test set | Drift over time (model updates, upstream data changes) |
| Scheduled red teaming | Recurring adversarial probes | Safety / security regressions, prompt-injection paths |
A regression-detection plan has a gap if any of these layers is absent from it.
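To make the scheduled-evaluation layer concrete, here is a minimal sketch of how drift against a fixed test set could be flagged. It is plain Python, not a Foundry API: the `detect_drift` function, the `tolerance` value, and the score dictionaries are all hypothetical, and it assumes each scheduled run produces one score per dimension on a 0 to 1 scale.

```python
from statistics import mean

def detect_drift(baseline, recent_runs, tolerance=0.05):
    """Return dimensions whose recent mean score fell more than
    `tolerance` below the baseline (hypothetical helper, not a Foundry call)."""
    drifted = {}
    for dimension, base_score in baseline.items():
        # Collect this dimension's score from every recent scheduled run.
        recent = [run[dimension] for run in recent_runs if dimension in run]
        if not recent:
            continue  # nothing to compare against yet
        drop = base_score - mean(recent)
        if drop > tolerance:
            drifted[dimension] = round(drop, 3)
    return drifted

# Illustrative numbers only: scores recorded when the test set was first run,
# followed by two later scheduled runs against the same fixed test set.
baseline = {"groundedness": 0.90, "task_completion": 0.85}
recent_runs = [
    {"groundedness": 0.82, "task_completion": 0.86},
    {"groundedness": 0.80, "task_completion": 0.84},
]

print(detect_drift(baseline, recent_runs))
```

With these illustrative numbers only groundedness is flagged, even though nothing in the agent's own code changed between runs. That is the kind of drift a recurring evaluation exists to surface.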
Wiring it into CI/CD
Pre-production evaluation belongs in the pipeline that builds your agent. The Foundry SDKs let you run an evaluator suite against a labelled dataset and emit per-dimension scores; the CI step that consumes them is the quality gate. A gate is well-formed when:
- Every dimension you care about (groundedness, tool-call accuracy, task completion, safety) has its own threshold.
- The gate fails when any dimension breaches, not only when the average does.
- The gate output links to the evaluation run so the author can see which prompts regressed.
Exam framing: when you see "composite score is fine but X dropped", the right answer is always to gate on X, not to comfort yourself with the composite.
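The gate conditions above can be sketched as a small CI step. This is an illustration under stated assumptions, not the Foundry SDK: the threshold values, the `evaluate_gate` and `run_gate` names, and the run link are hypothetical, and the scores are assumed to arrive as a plain dictionary produced by an earlier evaluation step.

```python
from statistics import mean

# Hypothetical per-dimension minimums; tune these to your own agent.
THRESHOLDS = {
    "groundedness": 0.80,
    "tool_call_accuracy": 0.85,
    "task_completion": 0.85,
    "safety": 0.95,
}

def evaluate_gate(scores, thresholds):
    """Return one failure message per dimension that is missing or below threshold."""
    failures = []
    for dimension, minimum in thresholds.items():
        score = scores.get(dimension)
        if score is None:
            failures.append(f"{dimension}: no score reported")
        elif score < minimum:
            failures.append(f"{dimension}: {score:.2f} < {minimum:.2f}")
    return failures

def run_gate(scores, thresholds, run_url):
    """Print failures with a link to the evaluation run; return a CI exit code."""
    failures = evaluate_gate(scores, thresholds)
    for line in failures:
        print(f"GATE FAIL {line}")
    if failures:
        print(f"Evaluation run: {run_url}")
        return 1
    return 0

# Illustrative scores: the average looks healthy, groundedness does not.
scores = {
    "groundedness": 0.62,
    "tool_call_accuracy": 0.95,
    "task_completion": 0.93,
    "safety": 0.99,
}
print(f"composite mean: {mean(scores.values()):.2f}")
exit_code = run_gate(scores, THRESHOLDS, "<link to evaluation run>")
```

A real pipeline step would pass `exit_code` to `sys.exit` so the build fails. Note that the composite mean here clears 0.80 while groundedness alone breaches its threshold, which is why the gate checks each dimension separately.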
Don't confuse operational telemetry with quality
Application Insights dashboards (latency, token usage, error rate) are necessary but they do not measure quality. An agent can return a 200 OK with a perfectly hallucinated answer; Azure Monitor will tell you everything is healthy. Quality regressions live in evaluator scores, not in HTTP status codes.
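A small sketch can make the distinction tangible. The `Interaction` record, the two check functions, and the cut-off values below are hypothetical, and the groundedness score is assumed to come from an evaluator on a 0 to 1 scale; nothing here is an Application Insights or Azure Monitor API.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    status_code: int      # what operational telemetry sees
    latency_ms: float     # what operational telemetry sees
    groundedness: float   # evaluator score, assumed 0..1 (hypothetical)

def operationally_healthy(x, max_latency_ms=2000.0):
    """Operational view: no server error and acceptable latency."""
    return x.status_code < 500 and x.latency_ms <= max_latency_ms

def quality_ok(x, min_groundedness=0.80):
    """Quality view: the evaluator score clears its threshold."""
    return x.groundedness >= min_groundedness

# A fast 200 OK whose answer an evaluator scored as poorly grounded.
sample = Interaction(status_code=200, latency_ms=420.0, groundedness=0.31)

print("operationally healthy:", operationally_healthy(sample))
print("quality ok:", quality_ok(sample))
```

The same interaction passes the operational check and fails the quality check, so an alert built only on the first signal would never fire.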
Quick check
Which Foundry capability is designed to catch quality regressions in live traffic on an ongoing basis?
Where this shows up on the exam
Expect a scenario that lists existing monitoring (Azure Monitor, alerts on 5xx, latency dashboards) and asks why regressions still slip through. The answer is that quality lives in evaluator scores, and the missing piece is a layer from the Foundry detection stack, usually continuous evaluation or scheduled red teaming.
Key terms
- Pre-production evaluation: A run of evaluators against a labelled dataset before deployment to catch regressions in quality, safety, or task completion.
- Continuous evaluation: A Foundry pattern of running quality and safety evaluators on a sampled fraction of production traffic so regressions in live conditions are detected.
- Scheduled evaluation: Recurring evaluation against a fixed test dataset that detects drift over time as models, prompts, or upstream data change.
- Quality gate: A CI/CD check that fails the build when one or more evaluator scores fall below a threshold for the changed artifact.
- AI red teaming: Automated adversarial testing (Foundry uses the PyRIT framework) that probes safety and security vulnerabilities before and after deployment.
Common pitfalls
- Setting a single global threshold and ignoring per-dimension drops: the overall score holds while groundedness silently collapses.
- Running evaluations only at release time. By then the regression is in main and the diff that caused it is buried.
- Treating production telemetry (latency, 5xx) as the regression signal. Quality regressions are invisible in operational metrics.
- Never scheduling re-evaluation, so model or upstream-data drift accumulates without anyone noticing.