Refine Memory and Tool Usage
Once instructions and workflow are tight, the next levers are the agent's memory layer and its tool catalog. This topic shows how to refine what the agent remembers and which tools it can pick from, using evaluator signals to decide what to add, prune, or scope.
After instructions and workflow, the next two levers worth pulling are the memory layer and the tool catalog. Both are easy to expand thoughtlessly, and both quietly degrade agent quality when that happens. The refinement discipline is to add little, scope tight, and let evaluator signals tell you when to prune.
Refining the tool catalog
Microsoft Foundry's tool catalog includes built-ins (web search, file search, memory, code interpreter), custom functions, and MCP servers; a Toolbox lets you curate a single MCP-compatible endpoint of approved tools per agent. The signals that drive refinement:
| Signal | Refinement |
| --- | --- |
| Tool-call accuracy drops after adding tools | Scope: hide tools the agent doesn't need, or split into a smaller per-agent Toolbox |
| Agent often picks the wrong tool from two similar ones | Improve tool descriptions and parameter names; consider merging |
| Tool returns errors the agent ignores | Document expected errors in the tool description; add handling in instructions |
| Tool succeeds with wrong args | Tighten the schema; add input validation in the wrapper |
Rule of thumb: every tool the agent never uses on a real task is a tool that lowers selection accuracy on the ones it does use.
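One way to act on this rule of thumb is to compare the catalog against a log of real tool calls and curate a smaller per-agent list from the result. The sketch below is plain Python, not a Foundry API: the function names, the log shape (`{"tool": name}`), and the tool names are all assumptions made for illustration.

```python
from collections import Counter


def unused_tools(catalog, call_log):
    """Return catalog tools that never appear in the call log (catalog order kept)."""
    used = Counter(call["tool"] for call in call_log)
    return [name for name in catalog if used[name] == 0]


def curated_toolbox(catalog, call_log, keep=()):
    """Keep tools that were actually called, plus any explicitly pinned in `keep`."""
    drop = set(unused_tools(catalog, call_log)) - set(keep)
    return [name for name in catalog if name not in drop]


# Hypothetical catalog and call log, for illustration only.
catalog = ["web_search", "file_search", "code_interpreter", "crm_lookup", "crm_search"]
call_log = [
    {"tool": "file_search"},
    {"tool": "crm_lookup"},
    {"tool": "file_search"},
]

print(unused_tools(catalog, call_log))
print(curated_toolbox(catalog, call_log, keep=["web_search"]))
```

The `keep` parameter exists because some tools are rarely called but still required (for example, a fallback search); pruning should be a reviewed decision, not an automatic one.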
Refining the memory layer
Memory turns into a context-failure machine when nobody owns it. Memory hygiene rules:
- Write less: only persist facts that have to outlive the conversation.
- Update on write: if a fact has a canonical owner (account record, profile), write changes through to that owner and refresh from it on read instead of caching forever.
- Expire on time: every memory entry has a TTL, even if the TTL is "until the next session".
- Never persist secrets: tokens, credentials, and PII that the agent saw in passing should not live in durable memory.
- Prefer authoritative source: when memory disagrees with a system of record, the system of record wins.
When a failure is "agent used a stale fact", it is almost always a memory hygiene problem, not a model problem.
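The hygiene rules above can be sketched as a small in-process store. This is a minimal illustration under stated assumptions, not Foundry's memory implementation: the class name `HygienicMemory`, the `source_of_record` callback, and the keyword-based secret check are all hypothetical, and a real deployment would need a proper secret and PII detector.

```python
import re
import time

# Naive heuristic (assumption): refuse anything that looks like a credential.
SECRET_PATTERN = re.compile(r"(password|token|api[_-]?key|secret)", re.IGNORECASE)


class HygienicMemory:
    """Toy durable-memory store: every entry has a TTL and secrets are refused."""

    def __init__(self, default_ttl_seconds=3600.0, clock=time.monotonic):
        self._entries = {}  # key -> (value, expires_at)
        self._default_ttl = default_ttl_seconds
        self._clock = clock

    def write(self, key, value, ttl_seconds=None):
        """Persist a fact; return False if it looks like a secret."""
        if SECRET_PATTERN.search(key) or SECRET_PATTERN.search(str(value)):
            return False
        ttl = self._default_ttl if ttl_seconds is None else ttl_seconds
        self._entries[key] = (value, self._clock() + ttl)
        return True

    def read(self, key, source_of_record=None):
        """Prefer the system of record when one is supplied; otherwise honor the TTL."""
        if source_of_record is not None:
            fresh = source_of_record(key)
            if fresh is not None:
                return fresh  # the system of record wins over memory
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it rather than serve a stale fact
            return None
        return value


# Usage with a fake clock so expiry is easy to observe.
now = [0.0]
memory = HygienicMemory(default_ttl_seconds=10, clock=lambda: now[0])
memory.write("plan_tier", "basic")
print(memory.read("plan_tier"))
now[0] = 11.0
print(memory.read("plan_tier"))
```

Passing the clock in as a parameter keeps expiry testable without sleeping, which makes it practical to add a regression test for each "agent used a stale fact" failure.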
Scope is a quality lever, not only a security lever
Granting a tool the full surface ("the whole org", "any file path") inflates the agent's choice space and the blast radius of mistakes. Scoping down to the minimum surface needed for the task improves tool-call accuracy and limits damage when something goes wrong, which is why Foundry exposes scope as a first-class configuration on connected tools and MCP servers.
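To make the "any file path" case concrete, here is a minimal sketch of scoping a file-reading tool to a single directory inside the tool wrapper. It does not use Foundry's scope configuration; `make_scoped_reader`, `ScopeError`, and the example paths are hypothetical names chosen for illustration.

```python
from pathlib import PurePosixPath


class ScopeError(PermissionError):
    """Raised when a tool call falls outside the agent's permitted scope."""


def make_scoped_reader(read_file, allowed_root):
    """Wrap `read_file` so it only serves paths under `allowed_root`."""
    root = PurePosixPath(allowed_root)

    def scoped_read(path):
        candidate = PurePosixPath(path)
        # Reject traversal segments and anything not located under the root.
        inside = candidate == root or root in candidate.parents
        if ".." in candidate.parts or not inside:
            raise ScopeError(f"{path!r} is outside the permitted scope {allowed_root!r}")
        return read_file(str(candidate))

    return scoped_read


# Usage with an in-memory stand-in for the real file tool.
fake_files = {"/repos/docs/readme.md": "hello"}
read_docs = make_scoped_reader(fake_files.__getitem__, "/repos/docs")

print(read_docs("/repos/docs/readme.md"))
try:
    read_docs("/repos/docs/../secrets/key.txt")
except ScopeError as err:
    print("blocked:", err)
```

Enforcing scope in the wrapper means a wrong tool call fails loudly instead of succeeding against the wrong resource, which also gives the evaluator a clear error signal to score.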
Quick check
Tool-call accuracy drops 10 points after a release. The change log shows the team added 5 new MCP tools to the catalog. What is the most likely cause?
Where this shows up on the exam
Two recurring shapes: (1) tool-call accuracy drops after a catalog expansion, where the answer is to scope down or curate a Toolbox; and (2) the agent answers using a stale value, where the answer is memory hygiene (expiry, write-through, source-of-truth precedence), not a model swap.
Key terms
- Memory layer
- Storage the agent uses across turns or sessions (short-term scratchpad, conversation history, durable user/profile memory), separate from the prompt.
- Tool catalog
- The set of tools available to an agent. In Foundry, tools include built-ins, custom functions, and MCP servers; a Toolbox curates a single MCP-compatible endpoint of approved tools.
- Tool-call accuracy
- Foundry agent-specific evaluator that measures whether the agent selected the right tool with the right arguments for a given task.
- Scope
- The subset of a tool's surface (specific actions, repos, paths) that an agent is permitted to invoke. Reducing scope reduces the agent's search space and the blast radius of a mistake.
- Memory hygiene
- Discipline of writing only what is needed, expiring stale entries, and never persisting secrets, so that memory does not become a source of subtle context failures.
Common pitfalls
- Adding more tools to 'help' the agent. A larger catalog raises selection ambiguity and lowers tool-call accuracy.
- Writing free-form chat history to durable memory. Old chatter becomes the wrong context tomorrow and triggers context failures.
- Never expiring memory. The agent answers today using a fact that changed last week.
- Granting a tool full scope (e.g., a whole org's repos) when the task only ever touches one repo. Mistakes get bigger.