How to Reduce LLM Costs with End-to-End Observability

A team ships an agentic AI travel assistant. Users love it — they ask it to find flights, compare hotels, suggest itineraries. The demo worked great on 50 test queries. In production, the agent serves 10,000 conversations a day.

The first monthly bill arrives: $47,000 in LLM API costs.

Nobody expected it. Nobody can explain it. The agent works, the users are happy, but the unit economics don't close. The team starts guessing: Is it the system prompt? Too many tool calls? Context windows getting too long? Users asking weird follow-ups?

They don't know, because they can't see.

This is the LLM cost visibility gap — and it's burning through AI budgets at companies of every size. The fix isn't cheaper models or shorter prompts. It's observability: knowing exactly where every token goes, why, and what it costs — in development and in production.


Why LLM Costs Surprise Everyone

LLM pricing is deceptively simple: you pay per token. But the total cost of an AI feature is a function of dozens of variables that interact in unpredictable ways:

| Cost Factor | Why It's Hard to Predict |
| --- | --- |
| System prompt length | A 2,000-token system prompt costs nothing in testing, but at 10,000 conversations/day it is roughly 600M input tokens a month — about $1,500/month in prompt overhead at GPT-4o input rates |
| Conversation history | Each follow-up message resends the full context, so tokens grow quadratically with turn count; a 10-turn conversation can cost far more than 10x a single-turn one |
| Tool call chains | Agents that use tools (web search, APIs, databases) generate intermediate reasoning tokens that never reach the user but still cost money |
| Retries and loops | A bug that causes the agent to retry a failed tool call 5 times silently multiplies cost 5x |
| User behavior patterns | Some user segments ask short questions; others paste entire documents and ask for analysis. Cost per conversation can vary 100x |
| Model selection | Routing every query to GPT-4o when 70% could be handled by GPT-4o-mini overpays by an order of magnitude on every query that didn't need the larger model |

The core problem: you don't know what you don't know. Without granular visibility into token usage — broken down by conversation, agent step, tool call, and user segment — cost optimization is guesswork.


The Two Phases Where Costs Go Wrong

LLM cost problems emerge at two distinct stages, and each requires different observability capabilities.

```mermaid
%%{init: {"flowchart": {"curve": "linear", "rankSpacing": 60}}}%%
graph LR
    subgraph "Development & Iteration"
        D1["Prompt Engineering"]
        D2["Agent Architecture"]
        D3["Tool Integration"]
        D4["Model Selection"]
    end

    subgraph "Production"
        P1["User Traffic Patterns"]
        P2["Conversation Dynamics"]
        P3["Agent Behavior Drift"]
        P4["Scaling Effects"]
    end

    D1 --> P1
    D2 --> P2
    D3 --> P3
    D4 --> P4

    style D1 stroke:#7c4dff,color:#fff
    style D2 stroke:#7c4dff,color:#fff
    style D3 stroke:#7c4dff,color:#fff
    style D4 stroke:#7c4dff,color:#fff
    style P1 stroke:#e53935,color:#fff
    style P2 stroke:#e53935,color:#fff
    style P3 stroke:#e53935,color:#fff
    style P4 stroke:#e53935,color:#fff
```

Phase 1: Development — Building Features You Don't Know Will Be Expensive

During development, teams iterate on prompts, agent architectures, and tool integrations. They test with synthetic queries or small user groups. Everything seems efficient. Then production reality hits:

The hidden cost multipliers that development doesn't reveal:

  • Prompt bloat — You add instructions to handle edge cases. Each instruction adds tokens. By the time the prompt covers all the edge cases QA found, it's 4x longer than the original — and 4x more expensive on every single call.

  • Unnecessary context — The agent passes the full conversation history to every tool call, even when the tool only needs the current query. That's thousands of wasted tokens per interaction.

  • Over-powered models — During development, you use the most capable model because you want the best outputs. Nobody goes back to test which steps could use a smaller, cheaper model.

  • Redundant reasoning — The agent's chain-of-thought produces 500 tokens of reasoning to answer a question that could be handled with a simple lookup. Those reasoning tokens are invisible to the user but fully visible on the bill.

Without observability during development, you ship cost problems to production. By the time you discover them, they've already burned through your budget at scale.

Phase 2: Production — Users and Agents Consuming More Than Expected

Even if you optimize during development, production introduces cost dynamics that no test suite can predict:

Real-world patterns that inflate costs:

  • Long-tail conversations — 5% of users generate 40% of your token spend because they treat the agent as a conversational partner, sending 20+ messages per session. Your average-based cost projections never accounted for them.

  • Copy-paste power users — Users paste entire emails, documents, or logs into the chat. A single message can contain 10,000 tokens. Your agent dutifully processes all of it.

  • Agent retry storms — A third-party API goes intermittently slow. The agent retries, each retry adding to the context window. A single conversation burns through $5 in tokens before timing out.

  • Seasonal spikes — Your ticket-booking agent handles 3x normal volume during holiday season. Costs scale linearly with volume, but your budget doesn't.

  • Feature interaction effects — A new feature that adds "memory" to the agent (recalling past conversations) also adds 2,000 tokens of context to every request. Nobody calculated the cost impact before shipping.


Scenario: An Agentic Ticket-Booking Service

Let's make this concrete. Consider an AI-powered ticket-booking service where users interact with an agent to search, compare, and book flights and events.

What the Agent Does

```mermaid
%%{init: {"flowchart": {"curve": "linear"}}}%%
graph LR
    U["User: 'Find me a round-trip to Tokyo in April'"] --> A["Agent Reasoning"]
    A --> T1["Tool: Flight Search API"]
    A --> T2["Tool: Price Comparison"]
    A --> T3["Tool: Seat Availability"]
    T1 --> R["Response Generation"]
    T2 --> R
    T3 --> R
    R --> F["'Here are 5 options for Tokyo in April...'"]

    style A stroke:#7c4dff,color:#fff
    style R stroke:#4caf50,color:#fff
```

Each conversation involves:

  1. User message — the request (50–500 tokens)
  2. System prompt — agent instructions, persona, tool definitions (1,500–3,000 tokens)
  3. Tool calls — search APIs, price lookups, availability checks (200–800 tokens each, including tool descriptions)
  4. Agent reasoning — intermediate chain-of-thought (300–1,000 tokens, not shown to user)
  5. Response — the answer shown to the user (200–600 tokens)
  6. Conversation history — all of the above, resent with every follow-up message

A single booking conversation that takes 6 turns can easily consume 30,000–50,000 tokens. At GPT-4o pricing ($2.50/M input, $10/M output), that's roughly $0.15–$0.30 per conversation. At 10,000 conversations per day, you're looking at $1,500–$3,000/day — or $45,000–$90,000/month.
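That estimate is easy to reproduce in a few lines. A minimal sketch, assuming GPT-4o list prices ($2.50/M input, $10/M output) and token counts in the ranges above:

```python
# Rough per-conversation cost estimate at GPT-4o list pricing.
# Prices are assumptions; substitute your model's actual rates.
INPUT_PRICE_PER_M = 2.50    # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 10.00  # USD per 1M output tokens

def conversation_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the API cost in USD for one conversation."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A 6-turn booking conversation: ~40,000 input tokens (prompt, tools,
# history resends) and ~3,000 output tokens (reasoning + responses).
cost = conversation_cost(40_000, 3_000)
daily = cost * 10_000  # at 10,000 conversations/day
```

Plugging in the heavier end of the ranges pushes the per-conversation figure toward the $0.30 mark; the point is that the arithmetic is mechanical once the token counts are visible.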

Where the Money Actually Goes (Without Observability)

Without granular token tracking, the team's cost breakdown looks like this:

"We're spending $60K/month on OpenAI. The agent handles 10K conversations/day. That's roughly $0.20 per conversation."

That average hides everything. With Anosys, the actual breakdown reveals:

| Segment | % of Conversations | % of Token Spend | Cost per Conversation |
| --- | --- | --- | --- |
| Simple one-shot queries | 35% | 8% | $0.05 |
| Standard booking flows (3–5 turns) | 40% | 30% | $0.15 |
| Complex multi-destination searches | 15% | 32% | $0.43 |
| Long exploratory sessions (10+ turns) | 8% | 22% | $0.55 |
| Retry/error loops (agent failures) | 2% | 8% | $0.80 |

Now the team knows: 10% of conversations (long exploratory sessions plus retry/error loops) consume 30% of the budget. That's where optimization effort should go — not blanket prompt shortening or model downgrades.


How Anosys Gives You Cost Visibility

Anosys provides the end-to-end observability needed to see, understand, and reduce LLM token costs — across both development and production.

1. Token Usage Attribution per Conversation

Every conversation traced through Anosys captures token counts at each step: prompt tokens, completion tokens, tool call tokens, and reasoning tokens. This is available through native SDKs for OpenAI Agents, Anthropic, and any OpenTelemetry-compatible framework.

| Step | Type | Tokens | Flag |
| --- | --- | --- | --- |
| System Prompt | Input | 2,100 | 🔴 Cost hotspot — resent every turn |
| User Message 1 | Input | 85 | |
| Agent Reasoning | Output | 340 | 🟠 Hidden — not shown to user, still billed |
| Tool: Flight Search | Input | 620 | |
| Tool: Price Compare | Input | 410 | |
| Response 1 | Output | 280 | |
| User Message 2 | Input | 120 | |
| History Context Resend | Input | 3,855 | 🔴 Cost hotspot — grows every turn |
| Agent Reasoning | Output | 510 | 🟠 Hidden — not shown to user, still billed |
| Response 2 | Output | 350 | |

The biggest cost targets are the system prompt (resent every turn) and conversation history (growing with every turn). Reasoning tokens are invisible to users but fully billed. Anosys makes all of this visible automatically.
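Producing a per-step table like this starts with attributing each LLM call's token usage to a named agent step. A minimal sketch of that bookkeeping, assuming usage fields shaped like the OpenAI Chat Completions `usage` object (the exporter wiring a real SDK would add is omitted):

```python
# Fold per-call token usage into a running per-step trace, the way an
# observability SDK would. Usage field names mirror the OpenAI Chat
# Completions `usage` object (prompt_tokens / completion_tokens).
from collections import defaultdict

def record_step(trace: dict, step: str, usage: dict) -> None:
    """Attribute one LLM call's tokens to a named agent step."""
    trace[step]["input"] += usage.get("prompt_tokens", 0)
    trace[step]["output"] += usage.get("completion_tokens", 0)

trace = defaultdict(lambda: {"input": 0, "output": 0})
record_step(trace, "system_prompt", {"prompt_tokens": 2_100})
record_step(trace, "flight_search_tool", {"prompt_tokens": 620})
record_step(trace, "reasoning", {"completion_tokens": 340})
```

Once every call is tagged this way, rollups per conversation, per step type, and per user segment are just aggregations over the trace.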

2. Agent Step-Level Cost Breakdown

Not all agent actions cost the same. Anosys traces every step — tool call, reasoning chain, handoff — and attributes token cost to each:

| Agent Step | Avg Tokens | Avg Cost | Frequency |
| --- | --- | --- | --- |
| System prompt (per turn) | 2,100 | $0.005 | Every message |
| Flight search tool call | 620 | $0.002 | 1.8x per conversation |
| Price comparison tool call | 410 | $0.001 | 1.2x per conversation |
| Chain-of-thought reasoning | 450 | $0.004 | Every message |
| History context resend | 3,200 | $0.008 | Every follow-up |
| Response generation | 310 | $0.003 | Every message |

This immediately reveals that history resend is the most expensive per-occurrence step — and it grows with every turn. A 6-turn conversation resends the history 5 times, at escalating cost each time.
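The escalation is worth making explicit: because turn k resends the k - 1 turns before it, resend tokens grow quadratically with conversation length. A small illustration, assuming roughly 700 tokens per turn:

```python
# Why history resend dominates: each turn resends everything so far,
# so cumulative resend tokens grow quadratically with turn count.
def total_resend_tokens(turns: int, tokens_per_turn: int = 700) -> int:
    """Tokens spent resending history across a conversation.

    Turn k resends the previous k - 1 turns, so the total is
    tokens_per_turn * (0 + 1 + ... + (turns - 1)).
    """
    return tokens_per_turn * turns * (turns - 1) // 2

# A 6-turn conversation resends history 5 times at escalating size:
# 700 * (1 + 2 + 3 + 4 + 5) = 10,500 tokens of pure resend overhead.
```

Doubling the conversation length roughly quadruples the resend overhead, which is why long exploratory sessions show up so heavily in the segment breakdown above.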

3. Cost Anomaly Detection

Anosys runs ML-based anomaly detection on token usage metrics just like it does on latency, error rates, and user behavior. This catches cost spikes before they become budget crises:

  • Per-conversation cost anomalies — a sudden jump in average tokens per conversation (e.g., a prompt change that doubled the system prompt length)
  • Per-agent anomalies — one agent workflow suddenly consuming 5x more tokens than baseline (e.g., a retry loop introduced by a code change)
  • User segment anomalies — a new user cohort generating disproportionate token spend (e.g., enterprise users pasting large documents)
  • Cost rate anomalies — total spend per hour deviating from learned daily/weekly patterns
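Anosys's detectors are ML-based; as a rough illustration of the idea, here is the simplest possible per-conversation check, a z-score against a recent baseline:

```python
# Illustrative only: a z-score check against a rolling baseline of
# tokens-per-conversation. Real detectors learn daily/weekly patterns.
from statistics import mean, stdev

def is_cost_anomaly(history: list, current: float, threshold: float = 3.0) -> bool:
    """Flag `current` if it deviates more than `threshold` standard
    deviations from the recent baseline in `history`."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold

baseline = [4100, 3900, 4050, 4000, 3950, 4080]  # tokens per conversation
assert not is_cost_anomaly(baseline, 4200)
assert is_cost_anomaly(baseline, 9000)  # e.g. a doubled system prompt
```

The same check applies at any aggregation level: per conversation, per agent workflow, per user segment, or total spend per hour.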

4. User Behavior → Token Cost Correlation

This is what no other observability tool provides. Anosys tracks both user behavior (via its JavaScript tag) and agent performance (via SDK traces) in the same platform. This lets you correlate how users interact with what it costs:

| User Behavior Signal | Token Cost Insight |
| --- | --- |
| Session length (number of messages) | Which conversation lengths are cost-efficient vs. wasteful |
| Message length (characters pasted) | Which users are sending oversized inputs that inflate context |
| Feature usage (which agent tools triggered) | Which features are disproportionately expensive |
| User satisfaction (thumbs up/down) | Whether expensive conversations actually produce better outcomes |
| Abandonment point | Where users give up — often after expensive retry loops that produced nothing useful |

Scenario: Developers Building Features That Burn Tokens

The cost problem isn't limited to end-user conversations. Development teams building agentic AI features are some of the biggest silent token consumers.

The Pattern

A developer is building a document analysis agent. During development, they:

  1. Iterate on prompts — each iteration sends the full document to the LLM. A 50-page contract is ~25,000 tokens. Testing 20 prompt variations = 500,000 tokens just in prompt engineering.

  2. Test tool integrations — the agent calls a summarization tool, a classification tool, and an extraction tool. Each tool call sends context. Testing the full pipeline 100 times = 3M+ tokens.

  3. Debug with verbose output — they enable chain-of-thought logging, which generates 2x more tokens. They forget to disable it before merging. Now production generates 2x more reasoning tokens on every request.

  4. Add features without cost analysis — a "conversation memory" feature that retrieves the last 5 conversations and includes them in the context. Nobody calculated that this adds 15,000 tokens to every request.
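The missing cost analysis in step 4 is a one-line calculation. A sketch, with volume and pricing as assumptions (around 4,000 requests/day at GPT-4o input rates reproduces the $4,500/month projection in the next section):

```python
# Back-of-envelope projection for a context-adding feature: the check
# the memory feature above skipped. Volume and price are assumptions.
def monthly_feature_cost(extra_tokens_per_request: int,
                         requests_per_day: int,
                         price_per_m_input: float = 2.50) -> float:
    """Projected monthly USD cost of the extra input tokens (30-day month)."""
    return (extra_tokens_per_request * requests_per_day * 30
            * price_per_m_input / 1_000_000)

# 15,000 extra tokens per request at 4,000 requests/day:
projected = monthly_feature_cost(15_000, 4_000)
```

Running this once per feature branch, with the traffic numbers from production dashboards, is what turns "nobody calculated the cost impact" into a routine pre-merge check.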

What Anosys Shows in Development

With Anosys instrumented during development, the team sees:

```mermaid
%%{init: {"flowchart": {"curve": "linear"}}}%%
graph LR
    subgraph "AI Agent Cost"
        A["Prompt Iterations — $12.50 / 500K tokens"]
        B["Tool Chain Testing — $45.00 / 3.2M tokens"]
        C["Debug Logging Overhead — +$8.00 / extra 1.1M tokens"]
        D["Memory Feature Impact — +15,000 tokens/request, Projected: +$4,500/month in prod"]
    end

    style A stroke:#4caf50,color:#fff
    style B stroke:#ff9800,color:#fff
    style C stroke:#e53935,color:#fff
    style D stroke:#e53935,color:#fff
```

Green items are expected development costs. Orange and red items are the problems: debug logging that shouldn't ship to production, and a new feature whose cost impact hasn't been evaluated.

Without Anosys, these issues ship silently. With Anosys, the developer sees the projected production cost of their changes before merging — and can make informed trade-offs.


The Optimization Playbook

Once you have observability, cost reduction becomes systematic rather than guesswork. Here are the most impactful optimizations Anosys helps you identify and measure:

1. Prompt Compression

Problem: System prompts grow over time as teams add instructions for edge cases.

What Anosys shows: The exact token count of your system prompt on every call, and the percentage of total cost it represents.

Optimization: Compress the prompt using structured formatting, remove redundant instructions, move static examples to few-shot retrieval. Anosys lets you A/B test compressed vs. original prompts and measure both cost savings and quality impact.
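A quick way to quantify what a compression pass saves before running the A/B test. The 4-characters-per-token ratio is a rough English-text heuristic (use your model's tokenizer for exact counts), and the prompts here are stand-ins:

```python
# Estimate the monthly saving of a prompt edit before shipping it.
# approx_tokens uses a rough ~4 chars/token heuristic for English text;
# swap in the model's real tokenizer for exact numbers.
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def monthly_prompt_cost(prompt: str, calls_per_day: int,
                        price_per_m: float = 2.50) -> float:
    """Monthly USD cost of sending `prompt` on every call (30-day month)."""
    return approx_tokens(prompt) * calls_per_day * 30 * price_per_m / 1_000_000

original = "You are a travel agent. " * 300    # bloated, edge-case-laden prompt
compressed = "You are a travel agent. " * 100  # after removing redundancy
savings = (monthly_prompt_cost(original, 10_000)
           - monthly_prompt_cost(compressed, 10_000))
```

The number this produces is only half the decision; the quality impact measured by the A/B test is the other half.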

2. Context Window Management

Problem: Conversation history grows linearly with each turn, and the full history is resent every time.

What Anosys shows: Token count per history resend, cost growth curve as conversations get longer, and the point at which marginal cost per message exceeds marginal value.

Optimization: Implement history summarization (condense older turns into a summary), sliding window (only keep the last N turns), or selective context (only include relevant past turns). Anosys measures the cost impact of each strategy.
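Of the three strategies, the sliding window is the simplest to sketch. This version keeps the system prompt plus the last N turns; a summarization variant would replace the dropped turns with a condensed summary (message shapes follow the common chat-API convention):

```python
# Sliding-window context management: always keep the system prompt,
# keep only the most recent `last_n` turns of history.
def windowed_context(system_prompt: dict, history: list, last_n: int = 4) -> list:
    """Build the message list for the next LLM call, dropping older turns."""
    return [system_prompt] + history[-last_n:]

system = {"role": "system", "content": "You are a booking agent."}
history = [{"role": "user", "content": f"msg {i}"} for i in range(10)]
context = windowed_context(system, history, last_n=4)
# 5 messages are sent instead of 11; resend cost stays flat per turn.
```

The trade-off is recall: the agent forgets anything outside the window, which is exactly the cost-quality curve the measurements above are meant to expose.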

3. Model Routing

Problem: Every request goes to the most expensive model, even when a cheaper one would produce identical results.

What Anosys shows: Per-step model usage, quality scores by model, and cost per model.

Optimization: Route simple requests (greetings, clarifications, factual lookups) to a smaller model. Use the large model only for complex reasoning, multi-step planning, or high-stakes decisions. Anosys tracks quality and cost by model so you can verify the routing rules don't degrade the experience.
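A routing rule can start as simply as this sketch. The intent labels and turn threshold are illustrative stand-ins; production routing might use a trained classifier informed by the quality scores above:

```python
# Minimal model router: cheap model for simple intents, large model for
# complex work, with a downgrade for very long sessions. Intent labels
# and thresholds are illustrative.
CHEAP, LARGE = "gpt-4o-mini", "gpt-4o"
SIMPLE_INTENTS = {"greeting", "clarification", "faq_lookup"}

def pick_model(intent: str, turn_count: int, max_turns_on_large: int = 8) -> str:
    if intent in SIMPLE_INTENTS:
        return CHEAP
    if turn_count > max_turns_on_large:  # long sessions: downgrade
        return CHEAP
    return LARGE

assert pick_model("greeting", 1) == CHEAP
assert pick_model("multi_city_search", 3) == LARGE
```

Whatever the rule, the key is instrumenting both branches so quality and cost per model stay comparable over time.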

4. Tool Call Optimization

Problem: Agents call tools redundantly or pass excessive context to tool calls.

What Anosys shows: Token cost per tool call, tool call frequency per conversation, and which tool calls are redundant (same input, same output).

Optimization: Cache tool results within a conversation, reduce context passed to tools (they rarely need the full history), and add early-exit conditions to prevent retry loops. Anosys shows you exactly which tool calls are costing the most and whether reducing them affects output quality.
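Caching within a conversation can be a thin wrapper around the tool function. A minimal sketch (the tool itself is a stand-in):

```python
# Per-conversation tool cache: identical tool calls return the stored
# result instead of re-invoking the tool (and re-spending tokens).
def make_cached_tool(tool_fn):
    cache = {}
    calls = {"count": 0}  # how many real invocations happened
    def wrapper(**kwargs):
        key = tuple(sorted(kwargs.items()))
        if key not in cache:
            calls["count"] += 1
            cache[key] = tool_fn(**kwargs)
        return cache[key]
    wrapper.calls = calls
    return wrapper

search = make_cached_tool(lambda **kw: f"results for {kw}")
search(route="SFO-NRT", month="April")
search(route="SFO-NRT", month="April")  # served from cache, no new call
```

Scope the cache to a single conversation so stale results (seat availability, prices) don't leak across sessions.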

5. Conversation Cost Caps

Problem: A small percentage of conversations consume disproportionate token spend with diminishing returns for the user.

What Anosys shows: Cost distribution across conversations, user satisfaction by conversation cost, and the cost threshold beyond which user outcomes don't improve.

Optimization: Implement soft caps (suggest the user start a new conversation), model downgrades after N turns (switch to a cheaper model for long conversations), or context pruning (aggressively summarize history beyond a cost threshold). Anosys helps you set the right thresholds by showing you the cost-quality trade-off curve.
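A soft-cap policy reduces to a threshold check at each turn. The dollar thresholds here are illustrative; the cost-quality curve from your own data is what should set them:

```python
# Graduated conversation cost caps: degrade gracefully as spend grows
# rather than cutting the user off. Thresholds are illustrative.
def cap_action(conversation_cost_usd: float,
               soft_cap: float = 0.40, hard_cap: float = 0.80) -> str:
    if conversation_cost_usd >= hard_cap:
        return "summarize_history"  # aggressive context pruning
    if conversation_cost_usd >= soft_cap:
        return "downgrade_model"    # switch to the cheaper model
    return "continue"

assert cap_action(0.10) == "continue"
assert cap_action(0.55) == "downgrade_model"
assert cap_action(0.90) == "summarize_history"
```

Evaluating the policy per turn keeps the logic out of the prompt and makes the thresholds easy to tune as the observed cost distribution shifts.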


Why You Need End-to-End Observability (Not Just Token Counters)

Several tools offer basic token counting. The difference with Anosys is end-to-end visibility — tokens in context of everything else:

Basic token counters stop at total tokens per API call. Anosys adds the context needed to act on those numbers:

  • Tokens per agent step / tool call
  • Cost attribution per conversation
  • Cost attribution per user segment
  • Cost anomaly detection (ML-based)
  • Correlation with user behavior
  • Correlation with application performance
  • Cross-layer root cause analysis
  • Development-phase cost projections
  • Production cost dashboards with alerting

The reason end-to-end matters: token cost is never just a token problem. A cost spike might be caused by:

  • A prompt change (AI layer)
  • A retry loop triggered by an API timeout (application layer)
  • A slow database making tool calls take longer (infrastructure layer)
  • A new user segment with different interaction patterns (user behavior layer)

If your observability only covers one layer, you'll see the symptom (more tokens) but not the cause. Anosys covers all four layers in a single platform, so root cause analysis takes minutes, not days.


Vendor Comparison: Cost Observability

How does Anosys compare to alternatives for LLM cost management?

Across the capabilities that matter for cost management — token counting per call, per-conversation cost rollup, per-agent-step cost attribution, user segment cost breakdown, ML-based anomaly detection, user behavior correlation, application and infrastructure correlation, development cost projections, cost alerting with root cause, and a self-hosted option — Anosys covers the full list. The alternatives each cover a subset:

Helicone and Langfuse offer token logging, but they're proxy-layer tools — they see API calls but not user behavior, infrastructure, or application context. When costs spike, they can tell you which API calls cost more, but not why.

Arize monitors model quality but has limited cost tooling and no user behavior layer.

Datadog LLM Observability has strong infrastructure coverage but charges extra for anomaly detection and has no native user behavior tracking to correlate with costs.

Anosys is the only platform that connects token costs to user behavior, agent logic, application health, and infrastructure state — giving you complete context for every dollar spent.


Getting Started

Step 1: Instrument Your Agents

Add Anosys tracing to your LLM agents using our native SDKs — OpenAI, Anthropic, or any OpenTelemetry-compatible framework. This takes 2 lines of code and immediately captures token usage per step.

Step 2: Add Cost Tracking Events

Use the REST API to report per-conversation and per-step token costs. This gives you the granular attribution needed to identify cost hotspots.

Step 3: Add User Behavior Tracking

Drop the Anosys JavaScript tag into your frontend to capture how users interact with your AI features. This is what connects token costs to user value.

Step 4: Build Cost Dashboards

Open the Anosys Console and build dashboards that show:

  • Total token spend by day/week/month
  • Cost per conversation by user segment
  • Cost breakdown by agent step
  • Cost anomaly alerts

Step 5: Set Up Alerts

Configure Slack or email alerts for:

  • Daily spend exceeding threshold
  • Per-conversation cost anomalies
  • New cost patterns (e.g., a deployment that changed average token usage)

The platform starts learning your cost baselines within hours. By day two, you'll know exactly where your LLM budget is going — and where it's being wasted.


Next Steps