Failure Analysis from Traces and Artifacts

When an agent run goes wrong, the trace is your crime scene. This topic shows how to read a Foundry / OpenTelemetry-style trace, identify the failing span, and use the surrounding artifacts (instructions, tool args, retrieved context, model output) to assign root cause.

Microsoft Foundry builds tracing on OpenTelemetry, so every LLM call, tool invocation and agent decision becomes a span you can inspect, with the instructions, retrieved context, tool args and model outputs available as artifacts.

The discipline is to read top-down: what was the goal, where did the run diverge, which span carried the bad value forward.

A reading order that works

1. Goal: re-read the user request and the agent instructions in the run.
2. Plan span: what did the agent decide to do, and is the plan reasonable?
3. Tool spans: did each call succeed, and were the arguments sensible?
4. Retrieval / context spans: did the right material arrive in the prompt?
5. Model span: given the inputs it received, did the model produce a defensible output?
6. Final answer: only judge the answer in light of steps 2-5.
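The reading order above can be sketched as a small triage helper. This is a minimal illustration, not Foundry's or OpenTelemetry's data model: the `Span` record, its `notes` field and the `first_suspect` function are hypothetical names, and the sample spans are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical span record; real OpenTelemetry spans carry many more fields.
@dataclass
class Span:
    kind: str                  # "plan", "tool", "retrieval", or "model"
    name: str
    ok: bool = True            # did the call itself succeed?
    notes: list[str] = field(default_factory=list)  # reviewer findings

# Steps 2-5 of the reading order: plan, tools, retrieval, model.
READING_ORDER = ["plan", "tool", "retrieval", "model"]

def first_suspect(spans: list[Span]) -> Optional[Span]:
    """Return the earliest span, in reading order, that failed or was flagged."""
    for kind in READING_ORDER:
        for span in spans:
            if span.kind == kind and (not span.ok or span.notes):
                return span
    return None

# Made-up run: three spans are flagged, but the tool span comes first in reading order.
spans = [
    Span("plan", "agent.plan"),
    Span("tool", "tool.search", notes=["query dropped the date filter"]),
    Span("retrieval", "retrieval.fetch", notes=["returned stale documents"]),
    Span("model", "model.answer", notes=["cited a stale document"]),
]

suspect = first_suspect(spans)
print(suspect.name if suspect else "no suspect found")  # tool.search
```

The model span is flagged too, but it is read last, so the helper reports the tool span whose bad arguments it inherited.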

Most "the model hallucinated" tickets are actually retrieval bugs or tool-arg bugs. The trace tells you which.

Hunt the anti-patterns in this trace

Find the lines in this trace summary that point at the actual failure or anti-pattern.

[span 0] user.request goal='Refund order #1234, send confirmation email'
[span 1] agent.plan steps=['lookup_order','issue_refund','send_email']
[span 2] tool.lookup_order(args={ order_id: '1243' }) status=200
[span 3] tool.issue_refund(args={ order_id: '1243', amount: 89.00 }) status=200
[span 4] tool.send_email(args={ to: 'customer@example.com' }) status=200
[span 5] model.summarise response='Refund processed successfully.'
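
One way to check a trace like this mechanically is to compare the identifier in the goal with the identifier each tool actually received. The sketch below transcribes the trace summary into plain Python tuples. The `mismatched_order_spans` helper and the tuple layout are assumptions made for illustration, not a Foundry export format.

```python
import re

# The trace summary above, transcribed as (span name, attributes) pairs.
trace = [
    ("user.request", {"goal": "Refund order #1234, send confirmation email"}),
    ("agent.plan", {"steps": ["lookup_order", "issue_refund", "send_email"]}),
    ("tool.lookup_order", {"order_id": "1243", "status": 200}),
    ("tool.issue_refund", {"order_id": "1243", "amount": 89.00, "status": 200}),
    ("tool.send_email", {"to": "customer@example.com", "status": 200}),
    ("model.summarise", {"response": "Refund processed successfully."}),
]

def mismatched_order_spans(trace):
    """Return indexes of spans whose order_id differs from the one in the goal."""
    goal = trace[0][1]["goal"]
    match = re.search(r"#(\d+)", goal)
    if match is None:
        return []  # no order number in the goal, nothing to compare against
    requested = match.group(1)
    return [
        i for i, (_, attrs) in enumerate(trace)
        if "order_id" in attrs and attrs["order_id"] != requested
    ]

bad = mismatched_order_spans(trace)
print(bad)  # [2, 3]
```

Every span reports status=200, so status alone flags nothing. The first mismatching index, span 2, is where the transposed order ID entered the chain; span 3 only carried it forward.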

Then go from one failure to a pattern

Single-trace analysis tells you what happened in one run. Cluster analysis in Foundry groups failing evaluation runs by similarity so you can see that twelve of last week's failures share the same retrieval bug — and that the right fix is upstream, not per ticket.
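
The idea of moving from single failures to a pattern can be illustrated with a much simpler stand-in. Foundry's cluster analysis groups runs by similarity; the sketch below only counts exact matches on hand-assigned labels, and the record fields (`run`, `layer`, `signature`) and sample data are hypothetical.

```python
from collections import Counter

# Hypothetical triage records: one per failed run, labelled during trace review.
failures = [
    {"run": "r1", "layer": "retrieval", "signature": "stale index"},
    {"run": "r2", "layer": "tool", "signature": "wrong order_id"},
    {"run": "r3", "layer": "retrieval", "signature": "stale index"},
    {"run": "r4", "layer": "retrieval", "signature": "stale index"},
    {"run": "r5", "layer": "model", "signature": "ignored context"},
]

def group_failures(failures):
    """Count failed runs per (layer, signature) pair."""
    return Counter((f["layer"], f["signature"]) for f in failures)

groups = group_failures(failures)
(layer, signature), count = groups.most_common(1)[0]
print(f"{count} of {len(failures)} failures: {layer} / {signature}")
# 3 of 5 failures: retrieval / stale index
```

Even this crude grouping shows why per-ticket fixes waste effort: three of the five runs would be resolved by one upstream repair.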

Exam framing: when the question asks "what is the root cause", the correct answer almost always points to a layer upstream of where the bad output appeared. The trace is what lets you find it.

Where this shows up on the exam

Expect a short trace excerpt and a question about which span is the root cause. The wrong answers will point at the model or the final answer. The right answer points at the span where a bad value first entered the chain — tool args, retrieved context, or an instruction that omitted a verification step.

Anchor concepts

Key terms

Trace: Distributed-tracing record of a single agent run capturing LLM calls, tool invocations, and decisions; built on OpenTelemetry in Microsoft Foundry.
Span: One unit of work inside a trace — for example, a single tool call or model invocation — with its inputs, outputs, duration, and status.
Artifact: Durable evidence attached to or referenced by a trace: the instructions used, the retrieved context, the tool arguments and responses, the final answer.
Root-cause attribution: The act of assigning a failure to a specific layer — model, retrieval, tool, instruction, or upstream data — using the trace and its artifacts.
Cluster analysis: Foundry capability that groups failing evaluation runs by similarity so you can find systemic failure modes instead of triaging single cases.

Watch out

Common pitfalls

Reading only the final model output. Without spans you cannot tell whether the bug was the model, the retrieval, or the tool.
Triaging failures one at a time when cluster analysis would surface the common root cause across dozens of runs.
Treating a tool error as the root cause without checking the args the agent passed — the tool may have failed because the agent called it wrong.
Throwing away traces after 24h; you lose the ability to correlate a regression to the run that caused it.
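
The third pitfall above, blaming the tool before checking its arguments, can be turned into a small check. This is a sketch under assumptions: the `attribute_tool_failure` function, its parameters and the rule "HTTP-style status of 400 or more means failure" are illustrative choices, not part of any Foundry API.

```python
def attribute_tool_failure(status: int, args: dict, required: set) -> str:
    """Blame the caller when required arguments are missing or empty, else the tool."""
    # Assumption: an argument counts as missing if it is absent, None, or "".
    missing = sorted(k for k in required if args.get(k) in (None, ""))
    if status >= 400 and missing:
        return f"agent: missing args {missing}"
    if status >= 400:
        return "tool: failed with well-formed args"
    return "no tool error"

print(attribute_tool_failure(400, {"order_id": ""}, {"order_id"}))
# agent: missing args ['order_id']
print(attribute_tool_failure(500, {"order_id": "1234"}, {"order_id"}))
# tool: failed with well-formed args
```

A check like this only catches malformed calls. A well-formed but wrong argument still needs to be compared against the goal.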