
How to Reduce LLM Costs with End-to-End Observability¶
A team ships an agentic AI travel assistant. Users love it — they ask it to find flights, compare hotels, suggest itineraries. The demo worked great on 50 test queries. In production, the agent serves 10,000 conversations a day.
The first monthly bill arrives: $47,000 in LLM API costs.
Nobody expected it. Nobody can explain it. The agent works, the users are happy, but the unit economics don't close. The team starts guessing: Is it the system prompt? Too many tool calls? Context windows getting too long? Users asking weird follow-ups?
They don't know, because they can't see.
This is the LLM cost visibility gap — and it's burning through AI budgets at companies of every size. The fix isn't cheaper models or shorter prompts. It's observability: knowing exactly where every token goes, why, and what it costs — in development and in production.
Why LLM Costs Surprise Everyone¶
LLM pricing is deceptively simple: you pay per token. But the total cost of an AI feature is a function of dozens of variables that interact in unpredictable ways:
| Cost Factor | Why It's Hard to Predict |
|---|---|
| System prompt length | A 2,000-token system prompt costs pennies in testing — but at 10,000 conversations/day it adds up to ~600M tokens/month, roughly $1,500/month in prompt overhead alone at GPT-4o input rates |
| Conversation history | Each follow-up message resends the full context. A 10-turn conversation costs ~10x what a single-turn one does |
| Tool call chains | Agents that use tools (web search, APIs, databases) generate intermediate reasoning tokens that never reach the user but still cost money |
| Retries and loops | A bug that causes the agent to retry a failed tool call 5 times multiplies cost 5x — silently |
| User behavior patterns | Some user segments ask short questions; others paste entire documents and ask for analysis. The cost per conversation can vary 100x |
| Model selection | Routing every query to GPT-4o when 70% could be handled by GPT-4o-mini pays roughly 17x more per token than necessary on those queries |
The core problem: you don't know what you don't know. Without granular visibility into token usage — broken down by conversation, agent step, tool call, and user segment — cost optimization is guesswork.
The Two Phases Where Costs Go Wrong¶
LLM cost problems emerge at two distinct stages, and each requires different observability capabilities.
```mermaid
%%{init: {"flowchart": {"curve": "linear", "rankSpacing": 60}}}%%
graph LR
    subgraph "Development & Iteration"
        D1["Prompt Engineering"]
        D2["Agent Architecture"]
        D3["Tool Integration"]
        D4["Model Selection"]
    end
    subgraph "Production"
        P1["User Traffic Patterns"]
        P2["Conversation Dynamics"]
        P3["Agent Behavior Drift"]
        P4["Scaling Effects"]
    end
    D1 --> P1
    D2 --> P2
    D3 --> P3
    D4 --> P4
    style D1 stroke:#7c4dff,color:#fff
    style D2 stroke:#7c4dff,color:#fff
    style D3 stroke:#7c4dff,color:#fff
    style D4 stroke:#7c4dff,color:#fff
    style P1 stroke:#e53935,color:#fff
    style P2 stroke:#e53935,color:#fff
    style P3 stroke:#e53935,color:#fff
    style P4 stroke:#e53935,color:#fff
```
Phase 1: Development — Building Features You Don't Know Will Be Expensive¶
During development, teams iterate on prompts, agent architectures, and tool integrations. They test with synthetic queries or small user groups. Everything seems efficient. Then production reality hits:
The hidden cost multipliers that development doesn't reveal:
- Prompt bloat — You add instructions to handle edge cases. Each instruction adds tokens. By the time the prompt covers all the edge cases QA found, it's 4x longer than the original — and 4x more expensive on every single call.
- Unnecessary context — The agent passes the full conversation history to every tool call, even when the tool only needs the current query. That's thousands of wasted tokens per interaction.
- Over-powered models — During development, you use the most capable model because you want the best outputs. Nobody goes back to test which steps could use a smaller, cheaper model.
- Redundant reasoning — The agent's chain-of-thought produces 500 tokens of reasoning to answer a question that could be handled with a simple lookup. Those reasoning tokens are invisible to the user but fully visible on the bill.
Without observability during development, you ship cost problems to production. By the time you discover them, they've already burned through your budget at scale.
Phase 2: Production — Users and Agents Consuming More Than Expected¶
Even if you optimize during development, production introduces cost dynamics that no test suite can predict:
Real-world patterns that inflate costs:
- Long-tail conversations — 5% of users generate 40% of your token spend because they treat the agent as a conversational partner, sending 20+ messages per session. Your average-based cost projections never accounted for them.
- Copy-paste power users — Users paste entire emails, documents, or logs into the chat. A single message can contain 10,000 tokens. Your agent dutifully processes all of it.
- Agent retry storms — A third-party API goes intermittently slow. The agent retries, each retry adding to the context window. A single conversation burns through $5 in tokens before timing out.
- Seasonal spikes — Your ticket-booking agent handles 3x normal volume during holiday season. Costs scale linearly with volume, but your budget doesn't.
- Feature interaction effects — A new feature that adds "memory" to the agent (recalling past conversations) also adds 2,000 tokens of context to every request. Nobody calculated the cost impact before shipping.
Scenario: An Agentic Ticket-Booking Service¶
Let's make this concrete. Consider an AI-powered ticket-booking service where users interact with an agent to search, compare, and book flights and events.
What the Agent Does¶
```mermaid
%%{init: {"flowchart": {"curve": "linear"}}}%%
graph LR
    U["User: 'Find me a round-trip to Tokyo in April'"] --> A["Agent Reasoning"]
    A --> T1["Tool: Flight Search API"]
    A --> T2["Tool: Price Comparison"]
    A --> T3["Tool: Seat Availability"]
    T1 --> R["Response Generation"]
    T2 --> R
    T3 --> R
    R --> F["'Here are 5 options for Tokyo in April...'"]
    style A stroke:#7c4dff,color:#fff
    style R stroke:#4caf50,color:#fff
```
Each conversation involves:
- User message — the request (50–500 tokens)
- System prompt — agent instructions, persona, tool definitions (1,500–3,000 tokens)
- Tool calls — search APIs, price lookups, availability checks (200–800 tokens each, including tool descriptions)
- Agent reasoning — intermediate chain-of-thought (300–1,000 tokens, not shown to user)
- Response — the answer shown to the user (200–600 tokens)
- Conversation history — all of the above, resent with every follow-up message
A single booking conversation that takes 6 turns can easily consume 30,000–50,000 tokens. At GPT-4o pricing ($2.50/M input, $10/M output), that's roughly $0.15–$0.30 per conversation. At 10,000 conversations per day, you're looking at $1,500–$3,000/day — or $45,000–$90,000/month.
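As a sanity check, that per-conversation arithmetic fits in a few lines. A minimal sketch using the GPT-4o rates quoted above (the split between input and output tokens is an illustrative assumption, not measured data):

```python
# Per-conversation cost estimate at GPT-4o rates ($2.50/M input, $10/M output).
INPUT_RATE = 2.50 / 1_000_000    # dollars per input token
OUTPUT_RATE = 10.00 / 1_000_000  # dollars per output token

def conversation_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one conversation from raw token counts."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A 6-turn booking conversation: assume ~45,000 input tokens (prompt +
# history resends + tool context) and ~5,000 output tokens (reasoning
# + responses).
per_conversation = conversation_cost(45_000, 5_000)
daily = per_conversation * 10_000   # 10,000 conversations/day
monthly = daily * 30

print(f"per conversation: ${per_conversation:.2f}")  # → $0.16
print(f"monthly:          ${monthly:,.0f}")          # → $48,750
```

Note that almost all of the cost is input tokens — which is why the optimizations later in this article focus on prompt and history overhead rather than response length.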
Where the Money Actually Goes (Without Observability)¶
Without granular token tracking, the team's cost breakdown looks like this:
"We're spending $60K/month on OpenAI. The agent handles 10K conversations/day. That's roughly $0.20 per conversation."
That average hides everything. With Anosys, the actual breakdown reveals:
| Segment | % of Conversations | % of Token Spend | Cost per Conversation |
|---|---|---|---|
| Simple one-shot queries | 35% | 8% | $0.05 |
| Standard booking flows (3–5 turns) | 40% | 30% | $0.15 |
| Complex multi-destination searches | 15% | 32% | $0.43 |
| Long exploratory sessions (10+ turns) | 8% | 22% | $0.55 |
| Retry/error loops (agent failures) | 2% | 8% | $0.80 |
Now the team knows: 10% of conversations (long exploratory sessions + error loops) consume 30% of the budget. That's where optimization effort should go — not blanket prompt shortening or model downgrades.
How Anosys Gives You Cost Visibility¶
Anosys provides the end-to-end observability needed to see, understand, and reduce LLM token costs — across both development and production.
1. Token Usage Attribution per Conversation¶
Every conversation traced through Anosys captures token counts at each step: prompt tokens, completion tokens, tool call tokens, and reasoning tokens. This is available through native SDKs for OpenAI Agents, Anthropic, and any OpenTelemetry-compatible framework.
| Step | Type | Tokens | Flag |
|---|---|---|---|
| System Prompt | Input | 2,100 | 🔴 Cost hotspot — resent every turn |
| User Message 1 | Input | 85 | |
| Agent Reasoning | Output | 340 | 🟠 Hidden — not shown to user, still billed |
| Tool: Flight Search | Input | 620 | |
| Tool: Price Compare | Input | 410 | |
| Response 1 | Output | 280 | |
| User Message 2 | Input | 120 | |
| History Context Resend | Input | 3,835 | 🔴 Cost hotspot — grows every turn |
| Agent Reasoning | Output | 510 | 🟠 Hidden — not shown to user, still billed |
| Response 2 | Output | 350 |
The biggest cost targets are the system prompt (resent every turn) and conversation history (growing with every turn). Reasoning tokens are invisible to users but fully billed. Anosys makes all of this visible automatically.
2. Agent Step-Level Cost Breakdown¶
Not all agent actions cost the same. Anosys traces every step — tool call, reasoning chain, handoff — and attributes token cost to each:
| Agent Step | Avg Tokens | Avg Cost | Frequency |
|---|---|---|---|
| System prompt (per turn) | 2,100 | $0.005 | Every message |
| Flight search tool call | 620 | $0.002 | 1.8x per conversation |
| Price comparison tool call | 410 | $0.001 | 1.2x per conversation |
| Chain-of-thought reasoning | 450 | $0.004 | Every message |
| History context resend | 3,200 | $0.008 | Every follow-up |
| Response generation | 310 | $0.003 | Every message |
This immediately reveals that history resend is the most expensive per-occurrence step — and it grows with every turn. A 6-turn conversation resends the history 5 times, at escalating cost each time.
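The growth is easy to see in a sketch: if each turn adds roughly the same number of new tokens, the cumulative cost of resending history grows quadratically with conversation length, not linearly. A small illustration using the GPT-4o input rate and the step sizes from the tables above (the per-turn token figure is illustrative):

```python
INPUT_RATE = 2.50 / 1_000_000  # dollars per input token (GPT-4o)
SYSTEM_PROMPT = 2_100          # tokens, resent every turn
TOKENS_PER_TURN = 1_200        # new context added each turn (illustrative)

def history_cost(turns: int) -> float:
    """Input-token cost of resending the system prompt plus all
    accumulated history on every turn of the conversation."""
    total = 0.0
    for turn in range(1, turns + 1):
        history = TOKENS_PER_TURN * (turn - 1)  # everything from earlier turns
        total += (SYSTEM_PROMPT + history) * INPUT_RATE
    return total

# Prompt + history overhead alone, by conversation length:
for turns in (1, 3, 6, 10):
    print(f"{turns:>2} turns: ${history_cost(turns):.4f}")
```

Doubling the number of turns more than doubles the overhead, which is exactly why long exploratory sessions dominate the cost table above.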
3. Cost Anomaly Detection¶
Anosys runs ML-based anomaly detection on token usage metrics just like it does on latency, error rates, and user behavior. This catches cost spikes before they become budget crises:
- Per-conversation cost anomalies — a sudden jump in average tokens per conversation (e.g., a prompt change that doubled the system prompt length)
- Per-agent anomalies — one agent workflow suddenly consuming 5x more tokens than baseline (e.g., a retry loop introduced by a code change)
- User segment anomalies — a new user cohort generating disproportionate token spend (e.g., enterprise users pasting large documents)
- Cost rate anomalies — total spend per hour deviating from learned daily/weekly patterns
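Anosys's detection is ML-based, but the intuition behind cost-rate anomalies is simple to sketch: compare each hour's spend against a trailing baseline and flag large deviations. A minimal rolling z-score version (window size, threshold, and the sample data are all illustrative — this is the idea, not the product's algorithm):

```python
from statistics import mean, stdev

def spend_anomalies(hourly_spend: list[float],
                    window: int = 24,
                    z_threshold: float = 3.0) -> list[int]:
    """Return indices of hours whose spend deviates more than
    z_threshold standard deviations from the trailing window."""
    flagged = []
    for i in range(window, len(hourly_spend)):
        baseline = hourly_spend[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(hourly_spend[i] - mu) / sigma > z_threshold:
            flagged.append(i)
    return flagged

# 48 hours of ~$60/hour spend with one retry-storm spike at hour 36:
series = [60.0 + (i % 5) for i in range(48)]
series[36] = 240.0
print(spend_anomalies(series))  # → [36]
```

A real system layers learned daily/weekly seasonality on top of this so that a holiday traffic spike isn't misread as a retry storm.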
4. User Behavior → Token Cost Correlation¶
This is what no other observability tool provides. Anosys tracks both user behavior (via its JavaScript tag) and agent performance (via SDK traces) in the same platform. This lets you correlate how users interact with what it costs:
| User Behavior Signal | Token Cost Insight |
|---|---|
| Session length (number of messages) | Which conversation lengths are cost-efficient vs. wasteful |
| Message length (characters pasted) | Which users are sending oversized inputs that inflate context |
| Feature usage (which agent tools triggered) | Which features are disproportionately expensive |
| User satisfaction (thumbs up/down) | Whether expensive conversations actually produce better outcomes |
| Abandonment point | Where users give up — often after expensive retry loops that produced nothing useful |
Scenario: Developers Building Features That Burn Tokens¶
The cost problem isn't limited to end-user conversations. Development teams building agentic AI features are some of the biggest silent token consumers.
The Pattern¶
A developer is building a document analysis agent. During development, they:
- Iterate on prompts — each iteration sends the full document to the LLM. A 50-page contract is ~25,000 tokens. Testing 20 prompt variations = 500,000 tokens just in prompt engineering.
- Test tool integrations — the agent calls a summarization tool, a classification tool, and an extraction tool. Each tool call sends context. Testing the full pipeline 100 times = 3M+ tokens.
- Debug with verbose output — they enable chain-of-thought logging, which generates 2x more tokens. They forget to disable it before merging. Now production generates 2x more reasoning tokens on every request.
- Add features without cost analysis — a "conversation memory" feature that retrieves the last 5 conversations and includes them in the context. Nobody calculated that this adds 15,000 tokens to every request.
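That last item is pure arithmetic once the token delta is visible. A sketch of the projection, using the 15,000-token delta from the scenario and GPT-4o input rates (the request volume is an assumed figure for illustration):

```python
INPUT_RATE = 2.50 / 1_000_000  # dollars per input token (GPT-4o)

def monthly_feature_cost(extra_tokens_per_request: int,
                         requests_per_day: int,
                         days: int = 30) -> float:
    """Projected monthly input-token cost added by a new feature."""
    return extra_tokens_per_request * requests_per_day * days * INPUT_RATE

# "Conversation memory" adds 15,000 input tokens per request.
# At an assumed 4,000 requests/day:
print(f"${monthly_feature_cost(15_000, 4_000):,.0f}/month")  # → $4,500/month
```

The calculation takes seconds; the hard part is knowing the token delta in the first place, which is what instrumentation provides.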
What Anosys Shows in Development¶
With Anosys instrumented during development, the team sees:
```mermaid
%%{init: {"flowchart": {"curve": "linear"}}}%%
graph LR
    subgraph "AI Agent Cost"
        A["Prompt Iterations — $12.50 / 500K tokens"]
        B["Tool Chain Testing — $45.00 / 3.2M tokens"]
        C["Debug Logging Overhead — +$8.00 / extra 1.1M tokens"]
        D["Memory Feature Impact — +15,000 tokens/request, Projected: +$4,500/month in prod"]
    end
    style A stroke:#4caf50,color:#fff
    style B stroke:#ff9800,color:#fff
    style C stroke:#e53935,color:#fff
    style D stroke:#e53935,color:#fff
```
Green items are expected development costs. Orange and red items are the problems: debug logging that shouldn't ship to production, and a new feature whose cost impact hasn't been evaluated.
Without Anosys, these issues ship silently. With Anosys, the developer sees the projected production cost of their changes before merging — and can make informed trade-offs.
The Optimization Playbook¶
Once you have observability, cost reduction becomes systematic rather than guesswork. Here are the most impactful optimizations Anosys helps you identify and measure:
1. Prompt Compression¶
Problem: System prompts grow over time as teams add instructions for edge cases.
What Anosys shows: The exact token count of your system prompt on every call, and the percentage of total cost it represents.
Optimization: Compress the prompt using structured formatting, remove redundant instructions, move static examples to few-shot retrieval. Anosys lets you A/B test compressed vs. original prompts and measure both cost savings and quality impact.
2. Context Window Management¶
Problem: Conversation history grows linearly with each turn, and the full history is resent every time.
What Anosys shows: Token count per history resend, cost growth curve as conversations get longer, and the point at which marginal cost per message exceeds marginal value.
Optimization: Implement history summarization (condense older turns into a summary), sliding window (only keep the last N turns), or selective context (only include relevant past turns). Anosys measures the cost impact of each strategy.
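A sliding window is the simplest of the three strategies. A minimal sketch (the message format follows the common chat-completions shape; the window size is something you would tune against the measured cost curve):

```python
def sliding_window(messages: list[dict], keep_last: int = 6) -> list[dict]:
    """Keep the system prompt plus only the last `keep_last`
    non-system messages of the conversation history."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]

history = (
    [{"role": "system", "content": "You are a booking agent."}]
    + [{"role": "user", "content": f"message {i}"} for i in range(20)]
)
trimmed = sliding_window(history, keep_last=6)
print(len(trimmed))  # → 7 (system prompt + last 6 messages)
```

Summarization and selective context are more involved (they require an extra LLM call or a relevance filter), but they plug into the same place: a function that transforms the history before it is resent.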
3. Model Routing¶
Problem: Every request goes to the most expensive model, even when a cheaper one would produce identical results.
What Anosys shows: Per-step model usage, quality scores by model, and cost per model.
Optimization: Route simple requests (greetings, clarifications, factual lookups) to a smaller model. Use the large model only for complex reasoning, multi-step planning, or high-stakes decisions. Anosys tracks quality and cost by model so you can verify the routing rules don't degrade the experience.
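A routing rule can start as something very simple and be refined against per-model quality data. A hedged sketch (the heuristic and the cutoffs here are illustrative placeholders, not a recommended policy):

```python
def pick_model(query: str, tool_calls_expected: int) -> str:
    """Route cheap/simple requests to a small model, complex ones to
    the large one. The heuristic (query length + expected tool use) is
    illustrative; in practice you tune it against quality scores."""
    simple = len(query) < 200 and tool_calls_expected <= 1
    return "gpt-4o-mini" if simple else "gpt-4o"

print(pick_model("What's your cancellation policy?", 0))                  # → gpt-4o-mini
print(pick_model("Plan a 3-city trip under $2,000 with flexible dates", 3))  # → gpt-4o
```

The key discipline is measuring, not guessing: without quality scores per model, a router like this silently trades user experience for cost.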
4. Tool Call Optimization¶
Problem: Agents call tools redundantly or pass excessive context to tool calls.
What Anosys shows: Token cost per tool call, tool call frequency per conversation, and which tool calls are redundant (same input, same output).
Optimization: Cache tool results within a conversation, reduce context passed to tools (they rarely need the full history), and add early-exit conditions to prevent retry loops. Anosys shows you exactly which tool calls are costing the most and whether reducing them affects output quality.
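Within-conversation caching can be as small as a dictionary keyed on the tool name and its arguments. A sketch (the `flight_search` function is a stand-in for a real tool):

```python
import json

class ToolCache:
    """Memoize tool results within a single conversation so identical
    calls (same tool, same input) are not paid for twice."""
    def __init__(self):
        self._cache = {}
        self.hits = 0

    def call(self, name, fn, **kwargs):
        key = (name, json.dumps(kwargs, sort_keys=True))
        if key in self._cache:
            self.hits += 1
        else:
            self._cache[key] = fn(**kwargs)
        return self._cache[key]

def flight_search(origin, dest):  # stand-in for the real tool
    return [f"{origin}->{dest} option 1"]

cache = ToolCache()
cache.call("flight_search", flight_search, origin="SFO", dest="NRT")
cache.call("flight_search", flight_search, origin="SFO", dest="NRT")  # served from cache
print(cache.hits)  # → 1
```

Scope the cache to one conversation: search results go stale, so a per-conversation lifetime keeps correctness simple while still eliminating the redundant calls agents tend to make within a single session.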
5. Conversation Cost Caps¶
Problem: A small percentage of conversations consume disproportionate token spend with diminishing returns for the user.
What Anosys shows: Cost distribution across conversations, user satisfaction by conversation cost, and the cost threshold beyond which user outcomes don't improve.
Optimization: Implement soft caps (suggest the user start a new conversation), model downgrades after N turns (switch to a cheaper model for long conversations), or context pruning (aggressively summarize history beyond a cost threshold). Anosys helps you set the right thresholds by showing you the cost-quality trade-off curve.
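All three mitigations key off the conversation's running cost and length. A sketch of the decision logic (the thresholds are illustrative placeholders; in practice you would set them from the measured cost-quality trade-off curve):

```python
def next_action(running_cost: float, turns: int) -> str:
    """Decide how to handle the next turn of a long or expensive
    conversation. Thresholds are illustrative placeholders."""
    if running_cost >= 1.00:
        return "suggest_new_conversation"  # soft cap
    if turns >= 10:
        return "downgrade_model"           # cheaper model for long sessions
    if running_cost >= 0.50:
        return "prune_context"             # aggressive history summarization
    return "continue"

print(next_action(0.12, 3))   # → continue
print(next_action(0.60, 8))   # → prune_context
print(next_action(0.60, 12))  # → downgrade_model
print(next_action(1.20, 15))  # → suggest_new_conversation
```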
Why You Need End-to-End Observability (Not Just Token Counters)¶
Several tools offer basic token counting. The difference with Anosys is end-to-end visibility — tokens in context of everything else:
| Capability | Basic Token Counters | Anosys |
|---|---|---|
| Total tokens per API call | ✅ | ✅ |
| Tokens per agent step / tool call | ❌ | ✅ |
| Cost attribution per conversation | ❌ | ✅ |
| Cost attribution per user segment | ❌ | ✅ |
| Cost anomaly detection (ML-based) | ❌ | ✅ |
| Correlation with user behavior | ❌ | ✅ |
| Correlation with application performance | ❌ | ✅ |
| Cross-layer root cause analysis | ❌ | ✅ |
| Development-phase cost projections | ❌ | ✅ |
| Production cost dashboards with alerting | ❌ | ✅ |
The reason end-to-end matters: token cost is never just a token problem. A cost spike might be caused by:
- A prompt change (AI layer)
- A retry loop triggered by an API timeout (application layer)
- A slow database making tool calls take longer (infrastructure layer)
- A new user segment with different interaction patterns (user behavior layer)
If your observability only covers one layer, you'll see the symptom (more tokens) but not the cause. Anosys covers all four layers in a single platform, so root cause analysis takes minutes, not days.
Vendor Comparison: Cost Observability¶
How does Anosys compare to alternatives for LLM cost management?
| Capability | Anosys | Arize AI | Langfuse | Helicone | Datadog LLM Obs |
|---|---|---|---|---|---|
| Token counting per call | ✅ | ✅ | ✅ | ✅ | ✅ |
| Per-conversation cost rollup | ✅ | ⚠️ Limited | ⚠️ Limited | ✅ | ⚠️ Limited |
| Per-agent-step cost attribution | ✅ | ⚠️ | ⚠️ | ❌ | ⚠️ |
| User segment cost breakdown | ✅ | ❌ | ❌ | ❌ | ❌ |
| ML-based cost anomaly detection | ✅ (included) | ❌ | ❌ | ❌ | ✅ (extra $) |
| User behavior correlation | ✅ (native) | ❌ | ❌ | ❌ | ❌ |
| Application + infra correlation | ✅ | ❌ | ❌ | ❌ | ✅ |
| Development cost projections | ✅ | ❌ | ❌ | ⚠️ | ❌ |
| Cost alerting with root cause | ✅ | ❌ | ❌ | ⚠️ Basic | ✅ (extra $) |
| Self-hosted option | ✅ | ❌ | ✅ | ❌ | ❌ |
Helicone and Langfuse offer token logging, but both operate at the LLM-call layer — they see API requests but not user behavior, infrastructure, or application context. When costs spike, they can tell you which calls cost more, but not why.
Arize monitors model quality but has limited cost tooling and no user behavior layer.
Datadog LLM Observability has strong infrastructure coverage but charges extra for anomaly detection and has no native user behavior tracking to correlate with costs.
Anosys is the only platform that connects token costs to user behavior, agent logic, application health, and infrastructure state — giving you complete context for every dollar spent.
Getting Started¶
Step 1: Instrument Your Agents¶
Add Anosys tracing to your LLM agents using our native SDKs — OpenAI, Anthropic, or any OpenTelemetry-compatible framework. This takes 2 lines of code and immediately captures token usage per step.
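To make concrete what "token usage per step" means, here is the shape of the data a trace captures. This is a dependency-free illustration, not the Anosys SDK API — in a real integration the SDK or OpenTelemetry exporter records and ships this for you:

```python
import functools

TRACE = []  # in a real setup, spans are exported to Anosys automatically

def traced_step(name):
    """Record per-step token usage. Illustrates the shape of the data
    an SDK/OpenTelemetry integration captures; not the actual SDK API."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            TRACE.append({
                "step": name,
                "prompt_tokens": result.get("prompt_tokens", 0),
                "completion_tokens": result.get("completion_tokens", 0),
            })
            return result
        return wrapper
    return deco

@traced_step("flight_search")
def flight_search(query):  # stand-in tool; returns usage like an LLM response
    return {"prompt_tokens": 620, "completion_tokens": 45, "options": ["..."]}

flight_search("SFO to NRT in April")
print(TRACE[0]["step"], TRACE[0]["prompt_tokens"])  # → flight_search 620
```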
Step 2: Add Cost Tracking Events¶
Use the REST API to report per-conversation and per-step token costs. This gives you the granular attribution needed to identify cost hotspots.
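A cost event is just a small JSON payload per step. The sketch below shows the general shape with Python's standard library — the endpoint URL and field names are placeholders, so consult the Anosys REST API reference for the actual schema and authentication details:

```python
import json
import urllib.request

ANOSYS_ENDPOINT = "https://api.anosys.example/v1/events"  # placeholder URL
API_KEY = "YOUR_API_KEY"

def report_step_cost(conversation_id: str, step: str,
                     prompt_tokens: int, completion_tokens: int,
                     cost_usd: float) -> bytes:
    """Build (and optionally send) a per-step cost event.
    Field names are illustrative — check the API reference."""
    payload = json.dumps({
        "type": "llm_cost",
        "conversation_id": conversation_id,
        "step": step,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost_usd": cost_usd,
    }).encode()
    req = urllib.request.Request(
        ANOSYS_ENDPOINT, data=payload, method="POST",
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {API_KEY}"},
    )
    # urllib.request.urlopen(req)  # uncomment to actually send
    return payload

event = report_step_cost("conv-123", "flight_search", 620, 45, 0.002)
print(json.loads(event)["step"])  # → flight_search
```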
Step 3: Add User Behavior Tracking¶
Drop the Anosys JavaScript tag into your frontend to capture how users interact with your AI features. This is what connects token costs to user value.
Step 4: Build Cost Dashboards¶
Open the Anosys Console and build dashboards that show:
- Total token spend by day/week/month
- Cost per conversation by user segment
- Cost breakdown by agent step
- Cost anomaly alerts
Step 5: Set Up Alerts¶
Configure Slack or email alerts for:
- Daily spend exceeding threshold
- Per-conversation cost anomalies
- New cost patterns (e.g., a deployment that changed average token usage)
The platform starts learning your cost baselines within hours. By day two, you'll know exactly where your LLM budget is going — and where it's being wasted.
Next Steps¶
- Getting Started Guide — Create your account and send your first data in under 5 minutes
- Data Ingestion Options — Complete reference for JavaScript, image pixel, REST API, and OpenTelemetry
- OpenAI Agents Integration — Instrument your agents with two lines of code
- What Is AI Observability — Understand the full-stack observability approach that powers cost optimization
- Observability for Monetizable AI — How Anosys helps track ad revenue and agent-mediated conversions
- Website Analytics Tutorial — Deep dive into behavioral tracking with Anosys
- Schedule a Demo — See LLM cost observability in action with your own data