How to Reduce LLM Costs with End-to-End Observability

A team ships an agentic AI travel assistant. Users love it — they ask it to find flights, compare hotels, suggest itineraries. The demo worked great on 50 test queries. In production, the agent serves 10,000 conversations a day.

The first monthly bill arrives: $47,000 in LLM API costs.

Nobody expected it. Nobody can explain it. The agent works, the users are happy, but the unit economics don't close. The team starts guessing: Is it the system prompt? Too many tool calls? Context windows getting too long? Users asking weird follow-ups?

They don't know, because they can't see.

This is the LLM cost visibility gap — and it's burning through AI budgets at companies of every size. The fix isn't cheaper models or shorter prompts. It's observability: knowing exactly where every token goes, why, and what it costs — in development and in production.


Why LLM Costs Surprise Everyone

LLM pricing is deceptively simple: you pay per token. But the total cost of an AI feature is a function of dozens of variables that interact in unpredictable ways:

| Cost Factor | Why It's Hard to Predict |
| --- | --- |
| System prompt length | A 2,000-token system prompt costs nothing in testing, but at 10,000 conversations/day it is roughly 600M input tokens a month — about $1,500/month in prompt overhead at GPT-4o input rates |
| Conversation history | Each follow-up message resends the full context, so tokens grow quadratically with turn count; a 10-turn conversation can cost far more than 10x a single-turn one |
| Tool call chains | Agents that use tools (web search, APIs, databases) generate intermediate reasoning tokens that never reach the user but still cost money |
| Retries and loops | A bug that causes the agent to retry a failed tool call 5 times silently multiplies cost 5x |
| User behavior patterns | Some user segments ask short questions; others paste entire documents and ask for analysis. Cost per conversation can vary 100x |
| Model selection | Routing every query to GPT-4o when 70% could be handled by GPT-4o-mini overpays by an order of magnitude on every query that didn't need the larger model |

The core problem: you don't know what you don't know. Without granular visibility into token usage — broken down by conversation, agent step, tool call, and user segment — cost optimization is guesswork.


The Two Phases Where Costs Go Wrong

LLM cost problems emerge at two distinct stages, and each requires different observability capabilities.

```mermaid
%%{init: {"flowchart": {"curve": "linear", "rankSpacing": 60}}}%%
graph LR
    subgraph "Development & Iteration"
        D1["Prompt Engineering"]
        D2["Agent Architecture"]
        D3["Tool Integration"]
        D4["Model Selection"]
    end

    subgraph "Production"
        P1["User Traffic Patterns"]
        P2["Conversation Dynamics"]
        P3["Agent Behavior Drift"]
        P4["Scaling Effects"]
    end

    D1 --> P1
    D2 --> P2
    D3 --> P3
    D4 --> P4

    style D1 stroke:#7c4dff,color:#fff
    style D2 stroke:#7c4dff,color:#fff
    style D3 stroke:#7c4dff,color:#fff
    style D4 stroke:#7c4dff,color:#fff
    style P1 stroke:#e53935,color:#fff
    style P2 stroke:#e53935,color:#fff
    style P3 stroke:#e53935,color:#fff
    style P4 stroke:#e53935,color:#fff
```

Phase 1: Development — Building Features You Don't Know Will Be Expensive

During development, teams iterate on prompts, agent architectures, and tool integrations. They test with synthetic queries or small user groups. Everything seems efficient. Then production reality hits:

The hidden cost multipliers that development doesn't reveal:

  • Prompt bloat — You add instructions to handle edge cases. Each instruction adds tokens. By the time the prompt covers all the edge cases QA found, it's 4x longer than the original — and 4x more expensive on every single call.

  • Unnecessary context — The agent passes the full conversation history to every tool call, even when the tool only needs the current query. That's thousands of wasted tokens per interaction.

  • Over-powered models — During development, you use the most capable model because you want the best outputs. Nobody goes back to test which steps could use a smaller, cheaper model.

  • Redundant reasoning — The agent's chain-of-thought produces 500 tokens of reasoning to answer a question that could be handled with a simple lookup. Those reasoning tokens are invisible to the user but fully visible on the bill.

Without observability during development, you ship cost problems to production. By the time you discover them, they've already burned through your budget at scale.

Phase 2: Production — Users and Agents Consuming More Than Expected

Even if you optimize during development, production introduces cost dynamics that no test suite can predict:

Real-world patterns that inflate costs:

  • Long-tail conversations — 5% of users generate 40% of your token spend because they treat the agent as a conversational partner, sending 20+ messages per session. Your average-based cost projections never accounted for them.

  • Copy-paste power users — Users paste entire emails, documents, or logs into the chat. A single message can contain 10,000 tokens. Your agent dutifully processes all of it.

  • Agent retry storms — A third-party API goes intermittently slow. The agent retries, each retry adding to the context window. A single conversation burns through $5 in tokens before timing out.

  • Seasonal spikes — Your ticket-booking agent handles 3x normal volume during holiday season. Costs scale linearly with volume, but your budget doesn't.

  • Feature interaction effects — A new feature that adds "memory" to the agent (recalling past conversations) also adds 2,000 tokens of context to every request. Nobody calculated the cost impact before shipping.


Scenario: An Agentic Ticket-Booking Service

Let's make this concrete. Consider an AI-powered ticket-booking service where users interact with an agent to search, compare, and book flights and events.

What the Agent Does

```mermaid
%%{init: {"flowchart": {"curve": "linear"}}}%%
graph LR
    U["User: 'Find me a round-trip to Tokyo in April'"] --> A["Agent Reasoning"]
    A --> T1["Tool: Flight Search API"]
    A --> T2["Tool: Price Comparison"]
    A --> T3["Tool: Seat Availability"]
    T1 --> R["Response Generation"]
    T2 --> R
    T3 --> R
    R --> F["'Here are 5 options for Tokyo in April...'"]

    style A stroke:#7c4dff,color:#fff
    style R stroke:#4caf50,color:#fff
```

Each conversation involves:

  1. User message — the request (50–500 tokens)
  2. System prompt — agent instructions, persona, tool definitions (1,500–3,000 tokens)
  3. Tool calls — search APIs, price lookups, availability checks (200–800 tokens each, including tool descriptions)
  4. Agent reasoning — intermediate chain-of-thought (300–1,000 tokens, not shown to user)
  5. Response — the answer shown to the user (200–600 tokens)
  6. Conversation history — all of the above, resent with every follow-up message

A single booking conversation that takes 6 turns can easily consume 30,000–50,000 tokens. At GPT-4o pricing ($2.50/M input, $10/M output), that's roughly $0.15–$0.30 per conversation. At 10,000 conversations per day, you're looking at $1,500–$3,000/day — or $45,000–$90,000/month.
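That estimate is easy to reproduce in a few lines. A minimal sketch, assuming GPT-4o list prices ($2.50/M input, $10/M output) and token counts in the ranges above:

```python
# Rough per-conversation cost estimate at GPT-4o list pricing.
# Prices are assumptions; substitute your model's actual rates.
INPUT_PRICE_PER_M = 2.50    # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 10.00  # USD per 1M output tokens

def conversation_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the API cost in USD for one conversation."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A 6-turn booking conversation: ~40,000 input tokens (prompt, tools,
# history resends) and ~3,000 output tokens (reasoning + responses).
cost = conversation_cost(40_000, 3_000)
daily = cost * 10_000  # at 10,000 conversations/day
```

Plugging in the heavier end of the ranges pushes the per-conversation figure toward the $0.30 mark; the point is that the arithmetic is mechanical once the token counts are visible.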

Where the Money Actually Goes (Without Observability)

Without granular token tracking, the team's cost breakdown looks like this:

"We're spending $60K/month on OpenAI. The agent handles 10K conversations/day. That's roughly $0.20 per conversation."

That average hides everything. With Anosys, the actual breakdown reveals:

| Segment | % of Conversations | % of Token Spend | Cost per Conversation |
| --- | --- | --- | --- |
| Simple one-shot queries | 35% | 8% | $0.05 |
| Standard booking flows (3–5 turns) | 40% | 30% | $0.15 |
| Complex multi-destination searches | 15% | 32% | $0.43 |
| Long exploratory sessions (10+ turns) | 8% | 22% | $0.55 |
| Retry/error loops (agent failures) | 2% | 8% | $0.80 |

Now the team knows: 10% of conversations (long exploratory sessions plus retry/error loops) consume 30% of the budget. That's where optimization effort should go — not blanket prompt shortening or model downgrades.


How Anosys Gives You Cost Visibility

Anosys provides the end-to-end observability needed to see, understand, and reduce LLM token costs — across both development and production.

1. Token Usage Attribution per Conversation

Every conversation traced through Anosys captures token counts at each step: prompt tokens, completion tokens, tool call tokens, and reasoning tokens. This is available through native SDKs for OpenAI Agents, Anthropic, and any OpenTelemetry-compatible framework.

| Step | Type | Tokens | Flag |
| --- | --- | --- | --- |
| System Prompt | Input | 2,100 | 🔴 Cost hotspot — resent every turn |
| User Message 1 | Input | 85 | |
| Agent Reasoning | Output | 340 | 🟠 Hidden — not shown to user, still billed |
| Tool: Flight Search | Input | 620 | |
| Tool: Price Compare | Input | 410 | |
| Response 1 | Output | 280 | |
| User Message 2 | Input | 120 | |
| History Context Resend | Input | 3,855 | 🔴 Cost hotspot — grows every turn |
| Agent Reasoning | Output | 510 | 🟠 Hidden — not shown to user, still billed |
| Response 2 | Output | 350 | |

The biggest cost targets are the system prompt (resent every turn) and conversation history (growing with every turn). Reasoning tokens are invisible to users but fully billed. Anosys makes all of this visible automatically.
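Producing a per-step table like this starts with attributing each LLM call's token usage to a named agent step. A minimal sketch of that bookkeeping, assuming usage fields shaped like the OpenAI Chat Completions `usage` object (the exporter wiring a real SDK would add is omitted):

```python
# Fold per-call token usage into a running per-step trace, the way an
# observability SDK would. Usage field names mirror the OpenAI Chat
# Completions `usage` object (prompt_tokens / completion_tokens).
from collections import defaultdict

def record_step(trace: dict, step: str, usage: dict) -> None:
    """Attribute one LLM call's tokens to a named agent step."""
    trace[step]["input"] += usage.get("prompt_tokens", 0)
    trace[step]["output"] += usage.get("completion_tokens", 0)

trace = defaultdict(lambda: {"input": 0, "output": 0})
record_step(trace, "system_prompt", {"prompt_tokens": 2_100})
record_step(trace, "flight_search_tool", {"prompt_tokens": 620})
record_step(trace, "reasoning", {"completion_tokens": 340})
```

Once every call is tagged this way, rollups per conversation, per step type, and per user segment are just aggregations over the trace.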

2. Agent Step-Level Cost Breakdown

Not all agent actions cost the same. Anosys traces every step — tool call, reasoning chain, handoff — and attributes token cost to each:

| Agent Step | Avg Tokens | Avg Cost | Frequency |
| --- | --- | --- | --- |
| System prompt (per turn) | 2,100 | $0.005 | Every message |
| Flight search tool call | 620 | $0.002 | 1.8x per conversation |
| Price comparison tool call | 410 | $0.001 | 1.2x per conversation |
| Chain-of-thought reasoning | 450 | $0.004 | Every message |
| History context resend | 3,200 | $0.008 | Every follow-up |
| Response generation | 310 | $0.003 | Every message |

This immediately reveals that history resend is the most expensive per-occurrence step — and it grows with every turn. A 6-turn conversation resends the history 5 times, at escalating cost each time.
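The escalation is worth making explicit: because turn k resends the k - 1 turns before it, resend tokens grow quadratically with conversation length. A small illustration, assuming roughly 700 tokens per turn:

```python
# Why history resend dominates: each turn resends everything so far,
# so cumulative resend tokens grow quadratically with turn count.
def total_resend_tokens(turns: int, tokens_per_turn: int = 700) -> int:
    """Tokens spent resending history across a conversation.

    Turn k resends the previous k - 1 turns, so the total is
    tokens_per_turn * (0 + 1 + ... + (turns - 1)).
    """
    return tokens_per_turn * turns * (turns - 1) // 2

# A 6-turn conversation resends history 5 times at escalating size:
# 700 * (1 + 2 + 3 + 4 + 5) = 10,500 tokens of pure resend overhead.
```

Doubling the conversation length roughly quadruples the resend overhead, which is why long exploratory sessions show up so heavily in the segment breakdown above.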

3. Cost Anomaly Detection

Anosys runs ML-based anomaly detection on token usage metrics just like it does on latency, error rates, and user behavior. This catches cost spikes before they become budget crises:

  • Per-conversation cost anomalies — a sudden jump in average tokens per conversation (e.g., a prompt change that doubled the system prompt length)
  • Per-agent anomalies — one agent workflow suddenly consuming 5x more tokens than baseline (e.g., a retry loop introduced by a code change)
  • User segment anomalies — a new user cohort generating disproportionate token spend (e.g., enterprise users pasting large documents)
  • Cost rate anomalies — total spend per hour deviating from learned daily/weekly patterns
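Anosys's detectors are ML-based; as a rough illustration of the idea, here is the simplest possible per-conversation check, a z-score against a recent baseline:

```python
# Illustrative only: a z-score check against a rolling baseline of
# tokens-per-conversation. Real detectors learn daily/weekly patterns.
from statistics import mean, stdev

def is_cost_anomaly(history: list, current: float, threshold: float = 3.0) -> bool:
    """Flag `current` if it deviates more than `threshold` standard
    deviations from the recent baseline in `history`."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold

baseline = [4100, 3900, 4050, 4000, 3950, 4080]  # tokens per conversation
assert not is_cost_anomaly(baseline, 4200)
assert is_cost_anomaly(baseline, 9000)  # e.g. a doubled system prompt
```

The same check applies at any aggregation level: per conversation, per agent workflow, per user segment, or total spend per hour.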

4. User Behavior → Token Cost Correlation

This is what no other observability tool provides. Anosys tracks both user behavior (via its JavaScript tag) and agent performance (via SDK traces) in the same platform. This lets you correlate how users interact with what it costs:

| User Behavior Signal | Token Cost Insight |
| --- | --- |
| Session length (number of messages) | Which conversation lengths are cost-efficient vs. wasteful |
| Message length (characters pasted) | Which users are sending oversized inputs that inflate context |
| Feature usage (which agent tools triggered) | Which features are disproportionately expensive |
| User satisfaction (thumbs up/down) | Whether expensive conversations actually produce better outcomes |
| Abandonment point | Where users give up — often after expensive retry loops that produced nothing useful |

Scenario: Developers Building Features That Burn Tokens

The cost problem isn't limited to end-user conversations. Development teams building agentic AI features are some of the biggest silent token consumers.

The Pattern

A developer is building a document analysis agent. During development, they:

  1. Iterate on prompts — each iteration sends the full document to the LLM. A 50-page contract is ~25,000 tokens. Testing 20 prompt variations = 500,000 tokens just in prompt engineering.

  2. Test tool integrations — the agent calls a summarization tool, a classification tool, and an extraction tool. Each tool call sends context. Testing the full pipeline 100 times = 3M+ tokens.

  3. Debug with verbose output — they enable chain-of-thought logging, which generates 2x more tokens. They forget to disable it before merging. Now production generates 2x more reasoning tokens on every request.

  4. Add features without cost analysis — a "conversation memory" feature that retrieves the last 5 conversations and includes them in the context. Nobody calculated that this adds 15,000 tokens to every request.
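The missing cost analysis in step 4 is a one-line calculation. A sketch, with volume and pricing as assumptions (around 4,000 requests/day at GPT-4o input rates reproduces the $4,500/month projection in the next section):

```python
# Back-of-envelope projection for a context-adding feature: the check
# the memory feature above skipped. Volume and price are assumptions.
def monthly_feature_cost(extra_tokens_per_request: int,
                         requests_per_day: int,
                         price_per_m_input: float = 2.50) -> float:
    """Projected monthly USD cost of the extra input tokens (30-day month)."""
    return (extra_tokens_per_request * requests_per_day * 30
            * price_per_m_input / 1_000_000)

# 15,000 extra tokens per request at 4,000 requests/day:
projected = monthly_feature_cost(15_000, 4_000)
```

Running this once per feature branch, with the traffic numbers from production dashboards, is what turns "nobody calculated the cost impact" into a routine pre-merge check.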

What Anosys Shows in Development

With Anosys instrumented during development, the team sees:

```mermaid
%%{init: {"flowchart": {"curve": "linear"}}}%%
graph LR
    subgraph "AI Agent Cost"
        A["Prompt Iterations — $12.50 / 500K tokens"]
        B["Tool Chain Testing — $45.00 / 3.2M tokens"]
        C["Debug Logging Overhead — +$8.00 / extra 1.1M tokens"]
        D["Memory Feature Impact — +15,000 tokens/request, Projected: +$4,500/month in prod"]
    end

    style A stroke:#4caf50,color:#fff
    style B stroke:#ff9800,color:#fff
    style C stroke:#e53935,color:#fff
    style D stroke:#e53935,color:#fff
```

Green items are expected development costs. Orange and red items are the problems: debug logging that shouldn't ship to production, and a new feature whose cost impact hasn't been evaluated.

Without Anosys, these issues ship silently. With Anosys, the developer sees the projected production cost of their changes before merging — and can make informed trade-offs.


The Optimization Playbook

Once you have observability, cost reduction becomes systematic rather than guesswork. Here are the most impactful optimizations Anosys helps you identify and measure:

1. Prompt Compression

Problem: System prompts grow over time as teams add instructions for edge cases.

What Anosys shows: The exact token count of your system prompt on every call, and the percentage of total cost it represents.

Optimization: Compress the prompt using structured formatting, remove redundant instructions, move static examples to few-shot retrieval. Anosys lets you A/B test compressed vs. original prompts and measure both cost savings and quality impact.
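A quick way to quantify what a compression pass saves before running the A/B test. The 4-characters-per-token ratio is a rough English-text heuristic (use your model's tokenizer for exact counts), and the prompts here are stand-ins:

```python
# Estimate the monthly saving of a prompt edit before shipping it.
# approx_tokens uses a rough ~4 chars/token heuristic for English text;
# swap in the model's real tokenizer for exact numbers.
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def monthly_prompt_cost(prompt: str, calls_per_day: int,
                        price_per_m: float = 2.50) -> float:
    """Monthly USD cost of sending `prompt` on every call (30-day month)."""
    return approx_tokens(prompt) * calls_per_day * 30 * price_per_m / 1_000_000

original = "You are a travel agent. " * 300    # bloated, edge-case-laden prompt
compressed = "You are a travel agent. " * 100  # after removing redundancy
savings = (monthly_prompt_cost(original, 10_000)
           - monthly_prompt_cost(compressed, 10_000))
```

The number this produces is only half the decision; the quality impact measured by the A/B test is the other half.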

2. Context Window Management

Problem: Conversation history grows linearly with each turn, and the full history is resent every time.

What Anosys shows: Token count per history resend, cost growth curve as conversations get longer, and the point at which marginal cost per message exceeds marginal value.

Optimization: Implement history summarization (condense older turns into a summary), sliding window (only keep the last N turns), or selective context (only include relevant past turns). Anosys measures the cost impact of each strategy.
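Of the three strategies, the sliding window is the simplest to sketch. This version keeps the system prompt plus the last N turns; a summarization variant would replace the dropped turns with a condensed summary (message shapes follow the common chat-API convention):

```python
# Sliding-window context management: always keep the system prompt,
# keep only the most recent `last_n` turns of history.
def windowed_context(system_prompt: dict, history: list, last_n: int = 4) -> list:
    """Build the message list for the next LLM call, dropping older turns."""
    return [system_prompt] + history[-last_n:]

system = {"role": "system", "content": "You are a booking agent."}
history = [{"role": "user", "content": f"msg {i}"} for i in range(10)]
context = windowed_context(system, history, last_n=4)
# 5 messages are sent instead of 11; resend cost stays flat per turn.
```

The trade-off is recall: the agent forgets anything outside the window, which is exactly the cost-quality curve the measurements above are meant to expose.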

3. Model Routing

Problem: Every request goes to the most expensive model, even when a cheaper one would produce identical results.

What Anosys shows: Per-step model usage, quality scores by model, and cost per model.

Optimization: Route simple requests (greetings, clarifications, factual lookups) to a smaller model. Use the large model only for complex reasoning, multi-step planning, or high-stakes decisions. Anosys tracks quality and cost by model so you can verify the routing rules don't degrade the experience.
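A routing rule can start as simply as this sketch. The intent labels and turn threshold are illustrative stand-ins; production routing might use a trained classifier informed by the quality scores above:

```python
# Minimal model router: cheap model for simple intents, large model for
# complex work, with a downgrade for very long sessions. Intent labels
# and thresholds are illustrative.
CHEAP, LARGE = "gpt-4o-mini", "gpt-4o"
SIMPLE_INTENTS = {"greeting", "clarification", "faq_lookup"}

def pick_model(intent: str, turn_count: int, max_turns_on_large: int = 8) -> str:
    if intent in SIMPLE_INTENTS:
        return CHEAP
    if turn_count > max_turns_on_large:  # long sessions: downgrade
        return CHEAP
    return LARGE

assert pick_model("greeting", 1) == CHEAP
assert pick_model("multi_city_search", 3) == LARGE
```

Whatever the rule, the key is instrumenting both branches so quality and cost per model stay comparable over time.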

4. Tool Call Optimization

Problem: Agents call tools redundantly or pass excessive context to tool calls.

What Anosys shows: Token cost per tool call, tool call frequency per conversation, and which tool calls are redundant (same input, same output).

Optimization: Cache tool results within a conversation, reduce context passed to tools (they rarely need the full history), and add early-exit conditions to prevent retry loops. Anosys shows you exactly which tool calls are costing the most and whether reducing them affects output quality.
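Caching within a conversation can be a thin wrapper around the tool function. A minimal sketch (the tool itself is a stand-in):

```python
# Per-conversation tool cache: identical tool calls return the stored
# result instead of re-invoking the tool (and re-spending tokens).
def make_cached_tool(tool_fn):
    cache = {}
    calls = {"count": 0}  # how many real invocations happened
    def wrapper(**kwargs):
        key = tuple(sorted(kwargs.items()))
        if key not in cache:
            calls["count"] += 1
            cache[key] = tool_fn(**kwargs)
        return cache[key]
    wrapper.calls = calls
    return wrapper

search = make_cached_tool(lambda **kw: f"results for {kw}")
search(route="SFO-NRT", month="April")
search(route="SFO-NRT", month="April")  # served from cache, no new call
```

Scope the cache to a single conversation so stale results (seat availability, prices) don't leak across sessions.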

5. Conversation Cost Caps

Problem: A small percentage of conversations consume disproportionate token spend with diminishing returns for the user.

What Anosys shows: Cost distribution across conversations, user satisfaction by conversation cost, and the cost threshold beyond which user outcomes don't improve.

Optimization: Implement soft caps (suggest the user start a new conversation), model downgrades after N turns (switch to a cheaper model for long conversations), or context pruning (aggressively summarize history beyond a cost threshold). Anosys helps you set the right thresholds by showing you the cost-quality trade-off curve.
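A soft-cap policy reduces to a threshold check at each turn. The dollar thresholds here are illustrative; the cost-quality curve from your own data is what should set them:

```python
# Graduated conversation cost caps: degrade gracefully as spend grows
# rather than cutting the user off. Thresholds are illustrative.
def cap_action(conversation_cost_usd: float,
               soft_cap: float = 0.40, hard_cap: float = 0.80) -> str:
    if conversation_cost_usd >= hard_cap:
        return "summarize_history"  # aggressive context pruning
    if conversation_cost_usd >= soft_cap:
        return "downgrade_model"    # switch to the cheaper model
    return "continue"

assert cap_action(0.10) == "continue"
assert cap_action(0.55) == "downgrade_model"
assert cap_action(0.90) == "summarize_history"
```

Evaluating the policy per turn keeps the logic out of the prompt and makes the thresholds easy to tune as the observed cost distribution shifts.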


Why You Need End-to-End Observability (Not Just Token Counters)

Several tools offer basic token counting. The difference with Anosys is end-to-end visibility — tokens in context of everything else:

Basic token counters stop at total tokens per API call. Anosys adds the context needed to act on those numbers:

  • Tokens per agent step / tool call
  • Cost attribution per conversation
  • Cost attribution per user segment
  • Cost anomaly detection (ML-based)
  • Correlation with user behavior
  • Correlation with application performance
  • Cross-layer root cause analysis
  • Development-phase cost projections
  • Production cost dashboards with alerting

The reason end-to-end matters: token cost is never just a token problem. A cost spike might be caused by:

  • A prompt change (AI layer)
  • A retry loop triggered by an API timeout (application layer)
  • A slow database making tool calls take longer (infrastructure layer)
  • A new user segment with different interaction patterns (user behavior layer)

If your observability only covers one layer, you'll see the symptom (more tokens) but not the cause. Anosys covers all four layers in a single platform, so root cause analysis takes minutes, not days.


Vendor Comparison: Cost Observability

How does Anosys compare to alternatives for LLM cost management?

Across the capabilities that matter for cost management — token counting per call, per-conversation cost rollup, per-agent-step cost attribution, user segment cost breakdown, ML-based anomaly detection, user behavior correlation, application and infrastructure correlation, development cost projections, cost alerting with root cause, and a self-hosted option — Anosys covers the full list. The alternatives each cover a subset:

Helicone and Langfuse offer token logging, but they're proxy-layer tools — they see API calls but not user behavior, infrastructure, or application context. When costs spike, they can tell you which API calls cost more, but not why.

Arize monitors model quality but has limited cost tooling and no user behavior layer.

Datadog LLM Observability has strong infrastructure coverage but charges extra for anomaly detection and has no native user behavior tracking to correlate with costs.

Anosys is the only platform that connects token costs to user behavior, agent logic, application health, and infrastructure state — giving you complete context for every dollar spent.


Getting Started

Step 1: Instrument Your Agents

Add Anosys tracing to your LLM agents using our native SDKs — OpenAI, Anthropic, or any OpenTelemetry-compatible framework. This takes 2 lines of code and immediately captures token usage per step.

Step 2: Add Cost Tracking Events

Use the REST API to report per-conversation and per-step token costs. This gives you the granular attribution needed to identify cost hotspots.

Step 3: Add User Behavior Tracking

Drop the Anosys JavaScript tag into your frontend to capture how users interact with your AI features. This is what connects token costs to user value.

Step 4: Build Cost Dashboards

Open the Anosys Console and build dashboards that show:

  • Total token spend by day/week/month
  • Cost per conversation by user segment
  • Cost breakdown by agent step
  • Cost anomaly alerts

Step 5: Set Up Alerts

Configure Slack or email alerts for:

  • Daily spend exceeding threshold
  • Per-conversation cost anomalies
  • New cost patterns (e.g., a deployment that changed average token usage)

The platform starts learning your cost baselines within hours. By day two, you'll know exactly where your LLM budget is going — and where it's being wasted.


Next Steps