Persist State and Detect Drift
Persisting agent state is half the job โ the other half is noticing when that persisted state has drifted from reality. This topic covers durable state stores, version stamps, and the failure signals that tell you the agent is acting on a stale snapshot.
Persist State and Detect Drift
Durable state lets an agent survive restarts, scaling events, and human handoffs. Drift detection is what keeps that durable state honest. Both are required โ persistence without drift detection just means the agent confidently acts on yesterday's world.
The persistence stack
| Layer | What you persist | Where |
| --- | --- | --- |
| Conversation thread | Messages, tool calls, plan | Managed thread store (e.g. Foundry threads, Cosmos DB) |
| Task state | Hypotheses, intermediate artifacts | Task-scoped store keyed by taskId |
| Cached external data | File contents, ticket bodies | Cache with version stamp (SHA, ETag, revision) |
| Long-term memory | Preferences, conventions | Memory store keyed by scope |
The thing that turns a cache into a drift detector is the version stamp stored next to the value. Without it, you cannot tell stale from fresh.
The drift signals you must recognise
A well-instrumented agent listens for drift instead of pretending it can't happen.
| Signal | What it means | Right reaction |
| --- | --- | --- |
| 412 Precondition Failed / ETag mismatch | Source moved since you cached | Re-fetch, re-plan, retry |
| 409 Conflict on write | Concurrent writer beat you to it | Merge or re-plan against new state |
| Re-fetched value differs from cache | Cache invalidation was missed | Invalidate cache, log a drift event |
| Tool returns unexpected schema | Source schema migrated | Stop, surface to a human |
Match each drift signal to its root cause
Match signals to failure modes
Drag each observed signal onto the root cause that best explains it.
Don't swallow drift signals
The tempting fix is to wrap every drift error in a retry. Resist. Each drift signal is operational information: it tells you the agent's state model is wrong. Log it, surface it, and only then re-plan. Silent retries hide the problem and let it compound.
Where this shows up on the exam
Expect a scenario where the agent "did the right thing but the world had moved". The correct answer always involves a version stamp on persisted state plus a named drift signal (ETag, 409, 412) wired into the agent's error path.
Key terms
- Persisted state
- Agent state written to a durable store (database, blob, conversation thread) so it survives process restarts, scaling events, and handoffs.
- State drift
- The gap between the agent's persisted view of the world and the actual current state of the system it acts on.
- Version stamp / ETag
- A monotonically changing identifier (commit SHA, file ETag, ticket revision) stored alongside cached state so you can detect when the source has moved.
- Drift signal
- An observable cue โ a 412/409 from a tool, an unexpected diff, a re-fetched value that disagrees with the cache โ that the agent's state is stale.
Common pitfalls
- Persisting state without a version stamp: you can't tell whether the cached file is current, and the agent applies edits to a stale base.
- Treating a tool-call success as proof of freshness: a 200 from one tool doesn't mean a different system's cached state is still valid.
- Catching drift signals and silently retrying: you lose the operational signal that the state model is wrong and just paper over the inconsistency.