5 Ways to Reduce Your AI Agent API Costs
I ran the same query through 6 different configurations to figure out which cost optimizations actually work. Here's the full data:
| Configuration | Cost | Calls | Answer Quality | Savings |
|---|---|---|---|---|
| Naive (no guards) | $0.0171 | 6 | Real answer | — |
| Budget cap | $0.0099 | 4 | Garbage answer | 42% |
| Smart truncation | $0.0105 | 4 | Real answer | 39% |
| Prompt caching | $0.0107 | 4 | Real answer | 37% |
| All combined | $0.0099 | 4 | Partial answer | 42% |
| Fuzzy loop detect | $0.0086 | 3 | Real answer | 49% |
The budget cap saved the most on paper but returned a useless answer. The agent hit its limit during a string of failed tool calls and gave up. Cost savings don't mean anything if the agent stops working correctly.
Here's what actually works.
1. Fuzzy loop detection
This was the clearest winner at 49% savings.
When a tool fails, the agent retries. That's reasonable. But when the same tool keeps failing the same way (same query, same error), the agent is stuck. Without detection it keeps retrying until it hits a token limit or your budget runs out.
Fuzzy loop detection tracks what tools the agent has called and what came back. After a set number of consecutive failures on the same tool, it blocks that tool for the rest of the session and forces the agent to answer from what it knows.
In my test, web search was returning nothing useful. The naive agent made 6 API calls, 4 of them failed searches, each one adding 15-20% to the context window. The loop detection version made 3 calls, caught the pattern after the second failed search, and got a real answer on the third call.
Every other optimization I tested tried to make wasted iterations cheaper. Loop detection just eliminated them. That's why it won.
2. Don't send tool definitions to agents that don't need them
Each tool definition costs roughly 500 tokens, and those tokens are billed on every single request to that agent.
In a multi-agent system, most agents only need 1 or 2 tools. A writer agent doesn't need web search. A critic agent doesn't need a calculator. But if you send all tool definitions to all agents, you pay 500 tokens per unused tool per call.
On a 4-agent system with 3 tools, the difference looks like this:
- Selective routing: 4 agents x 500 tokens = 2,000 tokens overhead per round
- All tools to all agents: 4 agents x (3 x 500) tokens = 6,000 tokens overhead per round
3x the tokens for the same result. In my multi-agent experiment, switching the non-tool-using agents from a tool-enabled API call to a plain completion call saved 2,500 input tokens across 5 calls.
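Selective routing can be a simple lookup table. This sketch uses hypothetical agent and tool names; the point is that agents with an empty tool list never get a `tools` parameter at all:

```python
# Hypothetical tool registry; in practice each definition is ~500 tokens of JSON schema.
TOOLS = {
    "web_search": {"name": "web_search", "description": "Search the web", "input_schema": {"type": "object"}},
    "calculator": {"name": "calculator", "description": "Evaluate math", "input_schema": {"type": "object"}},
    "read_file":  {"name": "read_file",  "description": "Read a file",   "input_schema": {"type": "object"}},
}

# Route only the tools each agent actually uses (agent names are examples).
AGENT_TOOLS = {
    "researcher": ["web_search"],
    "analyst":    ["calculator"],
    "writer":     [],   # plain completion call, no tool overhead
    "critic":     [],
}

def tools_for(agent: str) -> list[dict]:
    return [TOOLS[name] for name in AGENT_TOOLS.get(agent, [])]

def request_kwargs(agent: str) -> dict:
    """Build per-agent request kwargs; omit `tools` entirely when empty."""
    selected = tools_for(agent)
    return {"tools": selected} if selected else {}
```

Merging `request_kwargs(agent)` into each API call keeps the writer and critic on the cheap path without any per-call conditionals scattered through the code.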
3. Choose a memory strategy deliberately
Most frameworks default to buffer memory, which means the full conversation history is resent every turn. This works fine for short conversations. For anything longer it becomes the biggest cost driver.
From my Day 5 experiments across 10-turn conversations on the same task:
| Memory Strategy | Token Usage | Extra API Calls Per Turn |
|---|---|---|
| Buffer (full history) | 1.0x baseline | 0 |
| Summary memory | 0.71x | ~0.3 |
| Entity extraction | 0.45x | 1 |
| No memory (stateless) | ~0x | 0 |
Entity extraction saved 55% of tokens vs buffer, but it adds an extra API call every turn for the extraction step. Whether that tradeoff makes sense depends on your model cost. On Opus at $15/MTok input, the token savings easily justify the extra calls. On Haiku at around $1/MTok, it's a closer call.
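The tradeoff is easy to compute for your own numbers. A sketch, using the 0.45x ratio from the table above; the history size, price, and extraction-call overhead are illustrative assumptions:

```python
def input_cost_usd(tokens: float, price_per_mtok: float) -> float:
    return tokens * price_per_mtok / 1_000_000

def turn_cost(history_tokens: float, price_per_mtok: float,
              token_ratio: float = 1.0, extra_call_tokens: float = 0.0) -> float:
    """Input cost of one turn: (compressed) history plus any extraction call.
    `extra_call_tokens` for the extraction step is an assumed overhead."""
    return (input_cost_usd(history_tokens * token_ratio, price_per_mtok)
            + input_cost_usd(extra_call_tokens, price_per_mtok))

# A turn that would resend 8,000 tokens of history (illustrative numbers)
PRICE = 15.0  # $/MTok input, e.g. an Opus-class model
buffer_cost = turn_cost(8_000, PRICE)                                   # 1.0x baseline
entity_cost = turn_cost(8_000, PRICE, token_ratio=0.45,
                        extra_call_tokens=600)                          # 0.45x + extraction
```

Rerun the comparison with your own price and history size before committing to a strategy; the extraction call's overhead is fixed, so the compression wins bigger as conversations grow.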
The point is to pick intentionally. The default is rarely optimal.
4. Cap revision loops and give the critic specific criteria
For evaluator-optimizer patterns (one agent generates, another evaluates), an uncapped revision loop is expensive. If the critic keeps rejecting and the writer keeps revising, you can end up with 5-6 revision cycles. At 2 API calls per round, that's 10-12 calls for one output.
Two things fix this:
Set a hard revision cap. I use 1 in production setups. After one revision, accept whatever came out.
Give the critic specific acceptance criteria, not vague ones. "Review for quality" produces inconsistent verdicts. "Verify: are all claims supported by the research? Is the word count between 300-500? Does it directly answer the original question?" produces consistent PASS/FAIL decisions.
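Both fixes fit in a few lines. A sketch of a capped evaluator-optimizer loop; `write` and `critique` stand in for your model-call wrappers, and the criteria echo the example above:

```python
MAX_REVISIONS = 1  # after one revision, accept whatever came out

CRITIC_PROMPT = (
    "Verify each item and answer PASS or FAIL with reasons:\n"
    "1. Are all claims supported by the research?\n"
    "2. Is the word count between 300 and 500?\n"
    "3. Does it directly answer the original question?"
)

def generate_with_cap(write, critique, task, max_revisions=MAX_REVISIONS):
    """write(task, feedback=None) and critique(draft, prompt) wrap your
    model calls (assumed signatures)."""
    draft = write(task)
    for _ in range(max_revisions):
        verdict = critique(draft, CRITIC_PROMPT)
        if verdict == "PASS":
            break
        draft = write(task, feedback=verdict)   # one revision with the critic's reasons
    return draft                                # hard cap: never loop past max_revisions
```

The worst case is now 3 calls (draft, critique, revision) instead of an open-ended 10-12.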
In my experiment, the critic flagged the draft as incomplete and triggered a revision. After the revision, the supervisor still marked the output incomplete. Revision loops that don't converge are expensive and produce bad output anyway.
5. Prompt caching on large static content
Cache reads cost about 10% of the base price, so the savings are real once caching activates. The catch is Anthropic's minimum cacheable-prompt size: 1,024 tokens on Sonnet and Opus (2,048 on Haiku). Smaller content won't create a cache at all.
Caching makes a real difference when:
- System prompts are large (5,000+ tokens of instructions, examples, reference content)
- There are many tool definitions (10+ tools = 5,000+ tokens just in definitions)
- You're on an expensive model (Sonnet/Opus, where input is $3-15/MTok)
- There are many calls per session (10+)
Caching doesn't help much when:
- System prompts are short
- Queries are one-off with no repeated context
- You're on a cheap model like Haiku
On a well-cached Opus agent with a 5,000-token system prompt making 20 calls per session, caching saves a meaningful amount. On a Haiku chatbot with a 200-token system prompt, it saves close to nothing.
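Enabling it is one field on the request. A sketch of a Messages API payload using Anthropic's documented `cache_control` format; the model id and prompt content are placeholders:

```python
# Assumed placeholder; in practice this is 5,000+ tokens of instructions,
# examples, and reference content.
LARGE_SYSTEM_PROMPT = "<large static system prompt>"

request = {
    "model": "claude-sonnet-4-20250514",  # example model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,
            # Everything up to and including this block is cached; subsequent
            # calls within the cache TTL read it at ~10% of the base price.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "the actual query goes here"}],
}
```

Pass the same kwargs to `client.messages.create(**request)`; because only the content after the cache breakpoint varies between calls, every call in the session after the first hits the cache.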
What not to do: hard budget caps without fallback logic
Budget caps feel safe. Set a limit, stay under it. The problem is what happens when the agent hits the cap mid-task with nothing useful to show for it.
In my test, the $0.008 budget cap returned garbage because the agent burned its budget on failed searches and got cut off before it could fall back to its training knowledge. The naive version, with no cap, eventually gave up on search and wrote a real answer from training data.
If you use budget caps, pair them with explicit fallback instructions in the system prompt: something like "if you cannot find information via tools, answer from your training knowledge and note that the answer is unverified." Without that, a hard cap just means your agent stops working at the worst possible moment.
AgentQuote estimates AI agent running costs and surfaces which optimizations will actually move the needle for your architecture. Try it here.
Ready to estimate your agent costs?
Describe your system, get a cost breakdown in 60 seconds. Free, no signup required.
Estimate Your System →