Reference

Cost Management

Token counting, budget enforcement, cost attribution, end-to-end cost calculations, cloud vs local break-even analysis, and optimization strategies.

When running AI agents at scale, costs can spiral rapidly. A single runaway agent or inefficient prompt can waste hundreds of dollars. This document covers token accounting, budget enforcement, cost tracking, rate limiting, and optimization strategies to keep costs under control.

In simple terms: How much does this actually cost? How do I prevent overspending? Where did the money go?


Part 1: Understanding Token Costs

How Tokens Are Counted

Tokens are the currency of LLM APIs. Understanding token counting is critical to predicting and controlling costs.

Input vs Output Tokens

User input: "Summarize this 100-page PDF"
Processed as: [t1, t2, t3, ..., t500]  ← 500 input tokens

Model response: "The document discusses..."
Generated as: [t1, t2, t3, ..., t85]   ← 85 output tokens

Cost calculation: (input_tokens * input_price) + (output_tokens * output_price)

Key insight: Input tokens are usually 3–5× cheaper than output tokens. Long prompts are cheap; long responses are expensive.
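The formula is worth keeping as a one-line helper. A minimal sketch — the default prices are illustrative (Claude 3.5 Sonnet list rates); pass your model's actual rates:

```python
def token_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1m: float = 3.0,    # illustrative: Claude 3.5 Sonnet input
    output_price_per_1m: float = 15.0,  # illustrative: Claude 3.5 Sonnet output
) -> float:
    """USD cost of one call: tokens × per-token price for each direction."""
    return (input_tokens * input_price_per_1m
            + output_tokens * output_price_per_1m) / 1e6

# The same 2000 total tokens cost over 3× more when they are mostly output:
token_cost(1800, 200)  # $0.0084 (prompt-heavy)
token_cost(200, 1800)  # $0.0276 (generation-heavy)
```

This asymmetry is why capping max_tokens usually saves more than shortening prompts.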

Approximating Token Count

Before running a request, estimate token usage:

def estimate_tokens(text: str) -> int:
    """
    Rough estimate: 1 token ≈ 4 characters (English)
    This is NOT exact, but good for budgeting
    
    For precise counts, use tiktoken (OpenAI) or anthropic.messages.count_tokens()
    """
    return len(text) // 4

# Examples
estimate_tokens("Hello world")                  # 11 chars → ~2 tokens
estimate_tokens("What is machine learning?")    # 25 chars → ~6 tokens
estimate_tokens("Write a 1000-word essay")      # tiny request — but the ~1000-word answer is ~1300 output tokens

# Precise counting with Anthropic SDK:
from anthropic import Anthropic

client = Anthropic()
response = client.messages.count_tokens(
    model="claude-3-5-sonnet-latest",
    messages=[{"role": "user", "content": "Your prompt here"}]
)
print(f"Input tokens: {response.input_tokens}")

Practical rule of thumb:

  • Typical sentence: 20–30 tokens
  • Typical paragraph: 100–150 tokens
  • Typical webpage: 1000–3000 tokens
  • Typical Python file: 500–2000 tokens (depending on length)
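Those rules of thumb are enough for a pre-flight budget check. A quick ballparking helper — the per-item numbers are just the midpoints of the ranges above, and `ballpark_context_tokens` is our own name, not a library function:

```python
# Midpoints of the rule-of-thumb ranges above
TOKEN_ESTIMATES = {
    "sentence": 25,
    "paragraph": 125,
    "webpage": 2000,
    "python_file": 1000,
}

def ballpark_context_tokens(items: dict) -> int:
    """Rough token total for a prompt assembled from counted items,
    e.g. {"webpage": 3, "python_file": 2}."""
    return sum(TOKEN_ESTIMATES[kind] * count for kind, count in items.items())

# A RAG prompt stuffing in 3 webpages and 2 source files:
ballpark_context_tokens({"webpage": 3, "python_file": 2})  # 8000 tokens ≈ $0.024 of Sonnet input
```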

Cost Per Model (Approximate Early-2025 Pricing — Verify Current Rates)

Cloud Models (API-based)

Model               Input      Output    Best For                         Cost Ratio
Claude 3.5 Sonnet   $3/1M      $15/1M    General-purpose agent loop       1.0× (baseline)
Claude 3 Opus       $15/1M     $75/1M    Complex reasoning, verification  5.0×
GPT-4o              $5/1M      $15/1M    Fast, competitive                1.4×
Gemini 2.0 Flash    $0.075/1M  $0.30/1M  Ultra-cheap (copilot-class)      0.06×
Mistral Large 2     $2/1M      $6/1M     European, competitive            0.8×
Qwen 2.5 Max        $0.8/1M    $2.4/1M   Budget option                    0.5×

Local Models (Self-hosted)

GPU Hardware Costs (amortized):
  RTX 4090:      $1.65 per hour (depreciation + electricity)
  RTX 4070 Ti:   $0.82 per hour
  NVIDIA H100:   $3.50 per hour

Throughput (tokens/second for 13B model on RTX 4090):
  Quantized (4-bit AWQ):  50 tokens/sec
  Full precision (16-bit): 25 tokens/sec

Cost per 1M tokens (local, RTX 4090 at $1.65/hr):
  50 tokens/sec × 3600 sec = 180K tokens/hour
  1M tokens = 5.56 hours
  Cost = 5.56 × $1.65 = $9.17 per 1M tokens

Interesting insight: at ~$9.17 per 1M tokens, local costs MORE than many cloud models!
Local only pays off at sustained high volume, once the hardware is amortized (see Part 8).
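The $9.17 figure falls out of just two inputs — hourly GPU cost and throughput — so it is easy to re-run for your own hardware (a sketch using the assumptions above; the 4070 Ti throughput is an assumed what-if):

```python
def local_cost_per_1m_tokens(gpu_cost_per_hour: float,
                             tokens_per_second: float) -> float:
    """All-in self-hosting cost per 1M tokens at a given throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return (1_000_000 / tokens_per_hour) * gpu_cost_per_hour

local_cost_per_1m_tokens(1.65, 50)  # ≈ $9.17, quantized 13B on an RTX 4090
local_cost_per_1m_tokens(1.65, 25)  # ≈ $18.33 at full precision
local_cost_per_1m_tokens(0.82, 30)  # ≈ $7.59, RTX 4070 Ti at an assumed 30 tok/s
```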

Real-World Cost Example

Scenario: Running a web search agent for 100 customer queries/day

Setup 1: Claude 3.5 Sonnet (cloud)
  Per query: 1500 input tokens (query + context) + 500 output tokens
  Cost per query: (1500 × 3 / 1M) + (500 × 15 / 1M) = $0.0045 + $0.0075 = $0.0120
  Daily cost: 100 × $0.0120 = $1.20
  Monthly cost: 100 × 30 × $0.0120 = $36.00

Setup 2: Gemini 2.0 Flash (cloud, ultra-cheap)
  Same usage: (1500 × 0.075 / 1M) + (500 × 0.30 / 1M) = $0.000113 + $0.00015 = $0.00026
  Daily cost: 100 × $0.00026 = $0.026
  Monthly cost: 100 × 30 × $0.00026 = $0.78

Setup 3: Local Llama 3.1 13B (self-hosted)
  GPU time: 100 queries × 2000 tokens = 200K tokens ÷ 180K tokens/hour ≈ 1.1 hours/day
  Daily cost: 1.1 hours × $1.65/hr ≈ $1.83
  Monthly cost: $1.83 × 30 ≈ $55.00
  (Electricity is already included in the $1.65/hr estimate)

Verdict: For 100 queries/day, cloud beats local — modestly for Sonnet (~1.5× cheaper),
dramatically for Gemini Flash (~70× cheaper). Local only starts to pay off at much
higher sustained volume (see the break-even analysis in Part 8).

Part 2: Token Reduction Techniques

Before buying more tokens, reduce the ones you use.

1. Prompt Compression

# BEFORE: a verbose ~5000-token system prompt (abbreviated here; imagine full examples and policies)
system_prompt = """
You are an expert software engineer with 20 years of experience in 
building distributed systems, cloud infrastructure, and machine learning 
pipelines. You have deep knowledge of Python, Go, Rust, and TypeScript. 
You are familiar with all major cloud providers (AWS, GCP, Azure) and 
can recommend best practices for scalability, security, and cost optimization.

When answering questions, provide detailed explanations with code examples.
Think step-by-step and consider edge cases. Always prioritize security and 
performance. Use industry best practices and cite relevant papers or documentation.
"""

# AFTER: ~350 tokens (≈14× compression)
system_prompt = """You are a senior software engineer. Answer with code examples, \
think step-by-step, prioritize security/performance."""

# Savings: (5000 - 350) × $3 / 1M = $0.014 per request
# Over 10,000 requests: $140 saved!

2. Context Trimming

Problem: Context grows over long sessions. After 50 agent steps, the session log alone is 50K tokens.

Solution: Keep only what’s needed

class ContextManager:
    """Trim old context to stay under token limit"""
    
    def __init__(self, max_tokens: int = 100000):
        self.max_tokens = max_tokens
        self.current_tokens = 0
        self.history = []
    
    def add_step(self, step: dict) -> None:
        """Add agent step, trim old context if needed"""
        step_tokens = self.estimate_tokens(step)
        self.current_tokens += step_tokens
        self.history.append(step)
        
        # Keep trimming until under limit
        while self.current_tokens > self.max_tokens:
            old_step = self.history.pop(0)
            self.current_tokens -= self.estimate_tokens(old_step)
    
    def estimate_tokens(self, obj: dict) -> int:
        """Estimate tokens for a step"""
        return len(str(obj)) // 4
    
    def get_context(self) -> list:
        """Return history that fits in context window"""
        return self.history

# Usage
ctx = ContextManager(max_tokens=100000)
for i in range(100):
    ctx.add_step({"iteration": i, "action": "tool_call", "result": "..."})
    # Automatically keeps only recent ~100K tokens

3. Batch Processing

Instead of making 100 separate API calls, batch them together:

# BEFORE: 100 separate calls
costs = 0
for query in queries:
    response = client.messages.create(
        model="claude-3-5-sonnet",
        max_tokens=200,
        messages=[{"role": "user", "content": query}]
    )
    costs += calculate_cost(response)
# Problem: pays per-call prompt overhead 100 times, with sequential latency

# AFTER: Batch 10 queries per call
costs = 0
for i in range(0, len(queries), 10):
    batch = queries[i:i+10]
    batch_prompt = "\n".join([f"Query {j}: {q}" for j, q in enumerate(batch)])
    response = client.messages.create(
        model="claude-3-5-sonnet",
        max_tokens=2000,
        messages=[{"role": "user", "content": batch_prompt}]
    )
    costs += calculate_cost(response)
# Benefit: 20–30% token savings (shared context overhead)
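One thing the batched version glosses over: the reply comes back as a single blob that you must split into per-query answers. A minimal parser, assuming the batch prompt also instructs the model to label each answer `Answer 0:`, `Answer 1:`, … (that labeling convention is ours, not an API feature):

```python
import re

def split_batch_response(text: str, batch_size: int) -> list:
    """Split one batched completion into per-query answers.

    Relies on the model labeling answers 'Answer 0:', 'Answer 1:', ...
    Pads with '' if the model skipped any, so callers can detect gaps.
    """
    parts = re.split(r"Answer \d+:", text)[1:]  # Drop any preamble
    answers = [p.strip() for p in parts]
    answers += [""] * (batch_size - len(answers))  # Pad missing answers
    return answers[:batch_size]

split_batch_response("Answer 0: Paris\nAnswer 1: Berlin", 3)
# Returns ['Paris', 'Berlin', ''] — the empty slot flags a missing answer
```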

Part 3: Token Counter Implementation

from typing import Dict, Optional
from datetime import datetime
import json

class TokenCounter:
    """Track token usage and costs across models"""
    
    # Pricing per 1M tokens (input, output)
    # Prices approximate as of early 2025. Check provider websites for current rates.
    PRICING = {
        "claude-3-5-sonnet": {"input": 3.0, "output": 15.0},
        "claude-3-opus": {"input": 15.0, "output": 75.0},
        "gpt-4o": {"input": 5.0, "output": 15.0},
        "gemini-2-flash": {"input": 0.075, "output": 0.30},
        "mistral-large": {"input": 2.0, "output": 6.0},
        "llama-3.1-70b": {"input": 0.8, "output": 2.4},  # Local equivalent
    }
    
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.events = []
        self.totals = {
            "input_tokens": 0,
            "output_tokens": 0,
            "cost_usd": 0.0,
        }
        self.by_model = {}
    
    def record(
        self,
        model: str,
        input_tokens: int,
        output_tokens: int,
        tool_name: Optional[str] = None,
        step: Optional[int] = None,
    ) -> Dict[str, float]:
        """Record token usage for a single API call"""
        
        if model not in self.PRICING:
            raise ValueError(f"Unknown model: {model}")
        
        prices = self.PRICING[model]
        input_cost = (input_tokens / 1e6) * prices["input"]
        output_cost = (output_tokens / 1e6) * prices["output"]
        total_cost = input_cost + output_cost
        
        # Update totals
        self.totals["input_tokens"] += input_tokens
        self.totals["output_tokens"] += output_tokens
        self.totals["cost_usd"] += total_cost
        
        # Track per-model
        if model not in self.by_model:
            self.by_model[model] = {
                "input_tokens": 0,
                "output_tokens": 0,
                "cost_usd": 0.0,
                "call_count": 0,
            }
        self.by_model[model]["input_tokens"] += input_tokens
        self.by_model[model]["output_tokens"] += output_tokens
        self.by_model[model]["cost_usd"] += total_cost
        self.by_model[model]["call_count"] += 1
        
        # Log event
        event = {
            "timestamp": datetime.utcnow().isoformat(),
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost_usd": round(total_cost, 6),
            "tool_name": tool_name,
            "step": step,
        }
        self.events.append(event)
        
        return {
            "input_cost": round(input_cost, 6),
            "output_cost": round(output_cost, 6),
            "total_cost": round(total_cost, 6),
        }
    
    def get_summary(self) -> Dict:
        """Get total usage and cost summary"""
        total_tokens = self.totals["input_tokens"] + self.totals["output_tokens"]
        
        return {
            "session_id": self.session_id,
            "total_input_tokens": self.totals["input_tokens"],
            "total_output_tokens": self.totals["output_tokens"],
            "total_tokens": total_tokens,
            "total_cost_usd": round(self.totals["cost_usd"], 4),
            "cost_per_token_usd": round(self.totals["cost_usd"] / total_tokens, 8) if total_tokens > 0 else 0,
            "by_model": self.by_model,
            "event_count": len(self.events),
        }
    
    def cost_per_model(self) -> Dict[str, float]:
        """Which models cost the most?"""
        return {
            model: round(data["cost_usd"], 4)
            for model, data in self.by_model.items()
        }
    
    def get_events_json(self) -> str:
        """Export all events as JSON for logging"""
        return json.dumps(self.events, indent=2)

# Usage
counter = TokenCounter(session_id="sess-abc123")

# After each API call
counter.record(
    model="claude-3-5-sonnet",
    input_tokens=1500,
    output_tokens=500,
    tool_name="web_search",
    step=1,
)

# Later, check costs
summary = counter.get_summary()
print(f"Total cost: ${summary['total_cost_usd']}")
print(f"By model: {counter.cost_per_model()}")

Part 4: Budget Enforcement

Hard Limits vs Soft Limits

class BudgetEnforcer:
    """Enforce spending limits with alerts"""
    
    def __init__(self, daily_budget_usd: float, monthly_budget_usd: float):
        self.daily_budget = daily_budget_usd
        self.monthly_budget = monthly_budget_usd
        self.daily_spent = 0.0
        self.monthly_spent = 0.0
    
    def check_budget(self, cost: float) -> bool:
        """Can we spend this amount? Hard limit enforcement"""
        if self.daily_spent + cost > self.daily_budget:
            return False  # Refuse the request
        if self.monthly_spent + cost > self.monthly_budget:
            return False  # Refuse the request
        return True
    
    def get_alert_level(self, cost: float) -> str:
        """Soft alerts before hitting hard limit"""
        new_daily = self.daily_spent + cost
        new_monthly = self.monthly_spent + cost
        
        # Check daily budget
        daily_pct = (new_daily / self.daily_budget) * 100
        if daily_pct >= 95:
            return "CRITICAL"  # 95% of daily budget
        elif daily_pct >= 75:
            return "WARNING"    # 75% of daily budget
        
        # Check monthly budget
        monthly_pct = (new_monthly / self.monthly_budget) * 100
        if monthly_pct >= 95:
            return "CRITICAL"
        elif monthly_pct >= 75:
            return "WARNING"
        
        return "OK"
    
    def record_spend(self, cost: float) -> None:
        """Record a spend event"""
        self.daily_spent += cost
        self.monthly_spent += cost

# Usage
enforcer = BudgetEnforcer(daily_budget_usd=10.0, monthly_budget_usd=300.0)

# Before making a request
estimated_cost = 0.05
alert = enforcer.get_alert_level(estimated_cost)

if alert == "CRITICAL":
    print("WARNING: 95% of budget spent!")
    # Could: log, alert ops team, pause non-critical agents

if not enforcer.check_budget(estimated_cost):
    print("ERROR: Budget exceeded! Request refused.")
    raise RuntimeError("Budget exceeded")

# After a successful request, record what it actually cost
enforcer.record_spend(actual_cost)  # actual_cost comes from the response's usage data
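Tying it together: a small gate that runs the budget check before every call and records the real spend afterwards. It is duck-typed, so it works with the BudgetEnforcer above; `guarded_call`, `make_request`, and its `(result, actual_cost)` return shape are our own conventions, not SDK features:

```python
class BudgetExceeded(Exception):
    """Raised when a request would push spend past a hard limit."""

def guarded_call(enforcer, estimated_cost: float, make_request):
    """Run make_request() only if it fits the budget, then record spend.

    `enforcer` needs check_budget(cost) and record_spend(cost);
    make_request() must return (result, actual_cost_usd).
    """
    if not enforcer.check_budget(estimated_cost):
        raise BudgetExceeded(f"estimated ${estimated_cost:.4f} would exceed a limit")
    result, actual_cost = make_request()
    enforcer.record_spend(actual_cost)  # Record the real cost, not the estimate
    return result
```

Estimate conservatively (e.g. assume the model uses its full max_tokens): refusing a cheap request is recoverable, blowing the budget is not.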

Budget Alerts & Escalation

import logging
from enum import Enum

class AlertLevel(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"
    EXCEEDED = "exceeded"

class BudgetAlerter:
    """Alert on budget issues"""
    
    def __init__(self, logger=None):
        self.logger = logger or logging.getLogger(__name__)
        self.last_alert_level = AlertLevel.OK
    
    def check_and_alert(
        self,
        spent: float,
        budget: float,
        budget_name: str = "budget"
    ) -> AlertLevel:
        """Check budget and emit alerts if needed"""
        pct = (spent / budget) * 100
        
        if spent >= budget:
            level = AlertLevel.EXCEEDED
            msg = f"EXCEEDED {budget_name}: ${spent:.2f} / ${budget:.2f} (100%+)"
        elif pct >= 95:
            level = AlertLevel.CRITICAL
            msg = f"CRITICAL {budget_name}: ${spent:.2f} / ${budget:.2f} ({pct:.1f}%)"
        elif pct >= 75:
            level = AlertLevel.WARNING
            msg = f"WARNING {budget_name}: ${spent:.2f} / ${budget:.2f} ({pct:.1f}%)"
        else:
            level = AlertLevel.OK
            msg = f"OK {budget_name}: ${spent:.2f} / ${budget:.2f} ({pct:.1f}%)"
        
        # Only alert if level changed
        if level != self.last_alert_level:
            self.logger.warning(msg)
            self.last_alert_level = level
        
        return level

# Usage
alerter = BudgetAlerter()
alerter.check_and_alert(spent=7.5, budget=10.0, budget_name="daily")
# Logs: WARNING daily: $7.50 / $10.00 (75.0%)

Part 5: Cost Tracking Per Agent & Task

Attribution: Which Agent/Task Cost How Much?

from typing import Dict, List
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CostRecord:
    """Single cost record"""
    timestamp: datetime
    agent_id: str
    task_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float

class CostAttributor:
    """Track costs by agent, task, and model"""
    
    def __init__(self):
        self.records: List[CostRecord] = []
    
    def record(
        self,
        agent_id: str,
        task_id: str,
        model: str,
        input_tokens: int,
        output_tokens: int,
        cost_usd: float
    ) -> None:
        """Record a cost event"""
        record = CostRecord(
            timestamp=datetime.utcnow(),
            agent_id=agent_id,
            task_id=task_id,
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost_usd=cost_usd,
        )
        self.records.append(record)
    
    def cost_by_agent(self) -> Dict[str, float]:
        """Total cost per agent"""
        costs = {}
        for record in self.records:
            costs[record.agent_id] = costs.get(record.agent_id, 0) + record.cost_usd
        return {k: round(v, 4) for k, v in sorted(costs.items(), key=lambda x: x[1], reverse=True)}
    
    def cost_by_task(self) -> Dict[str, float]:
        """Total cost per task"""
        costs = {}
        for record in self.records:
            costs[record.task_id] = costs.get(record.task_id, 0) + record.cost_usd
        return {k: round(v, 4) for k, v in sorted(costs.items(), key=lambda x: x[1], reverse=True)}
    
    def cost_by_model(self) -> Dict[str, float]:
        """Total cost per model"""
        costs = {}
        for record in self.records:
            costs[record.model] = costs.get(record.model, 0) + record.cost_usd
        return {k: round(v, 4) for k, v in sorted(costs.items(), key=lambda x: x[1], reverse=True)}
    
    def cost_trend_by_day(self) -> Dict[str, float]:
        """Cost per day (for anomaly detection)"""
        by_day = {}
        for record in self.records:
            day = record.timestamp.date().isoformat()
            by_day[day] = by_day.get(day, 0) + record.cost_usd
        return {k: round(v, 4) for k, v in sorted(by_day.items())}
    
    def anomaly_detection(self, threshold_std_devs: float = 2.0) -> List[Dict]:
        """Detect unusually high-cost days"""
        daily_costs = self.cost_trend_by_day()
        values = list(daily_costs.values())
        
        if len(values) < 3:
            return []  # Need at least 3 data points
        
        mean = sum(values) / len(values)
        variance = sum((x - mean) ** 2 for x in values) / len(values)
        std_dev = variance ** 0.5
        
        threshold = mean + (std_dev * threshold_std_devs)
        
        anomalies = []
        for day, cost in daily_costs.items():
            if cost > threshold:
                anomalies.append({
                    "day": day,
                    "cost": round(cost, 4),
                    "threshold": round(threshold, 4),
                    "deviation": round((cost - mean) / std_dev, 2),
                })
        
        return anomalies

# Usage
attributor = CostAttributor()

# Record costs as agents run
attributor.record(
    agent_id="scraper-agent-1",
    task_id="task-collect-articles",
    model="claude-3-5-sonnet",
    input_tokens=1500,
    output_tokens=500,
    cost_usd=0.012,
)

# Later, analyze
print("Cost by agent:", attributor.cost_by_agent())
print("Cost by task:", attributor.cost_by_task())
print("Cost by model:", attributor.cost_by_model())
print("Daily trend:", attributor.cost_trend_by_day())
print("Anomalies:", attributor.anomaly_detection())

Part 6: Rate Limiting for Cost Control

Rate limiting prevents cost explosion from runaway agents or DoS attacks.

Request Rate Limiting

import time
from collections import deque

class RequestRateLimiter:
    """Limit requests per minute"""
    
    def __init__(self, max_requests_per_minute: int):
        self.max_per_minute = max_requests_per_minute
        self.requests = deque()  # Timestamps of recent requests
    
    def allow_request(self) -> bool:
        """Check if request is allowed"""
        now = time.time()
        
        # Remove requests older than 60 seconds
        while self.requests and (now - self.requests[0]) > 60:
            self.requests.popleft()
        
        # Check if under limit
        if len(self.requests) < self.max_per_minute:
            self.requests.append(now)
            return True
        return False
    
    def wait_until_available(self) -> float:
        """Wait until next slot is available, return wait time"""
        if self.allow_request():
            return 0.0
        
        # Oldest request was this many seconds ago
        oldest = self.requests[0]
        wait_time = (oldest + 60) - time.time()
        
        if wait_time > 0:
            time.sleep(wait_time)
        
        self.requests.append(time.time())
        return wait_time

# Usage
limiter = RequestRateLimiter(max_requests_per_minute=60)

for i in range(100):
    wait = 0
    if not limiter.allow_request():
        wait = limiter.wait_until_available()
    print(f"Request {i+1} (waited {wait:.2f}s)")

Token Rate Limiting

class TokenRateLimiter:
    """Limit tokens per hour (for long-running agents)"""
    
    def __init__(self, max_tokens_per_hour: int):
        self.max_per_hour = max_tokens_per_hour
        self.hour_start = time.time()
        self.tokens_used_this_hour = 0
    
    def allow_tokens(self, token_count: int) -> bool:
        """Check if we can use this many tokens"""
        now = time.time()
        
        # Reset if hour has passed
        if (now - self.hour_start) > 3600:
            self.hour_start = now
            self.tokens_used_this_hour = 0
        
        if self.tokens_used_this_hour + token_count <= self.max_per_hour:
            self.tokens_used_this_hour += token_count
            return True
        return False
    
    def get_capacity(self) -> Dict[str, int]:
        """Get token budget status"""
        return {
            "used_this_hour": self.tokens_used_this_hour,
            "remaining": max(0, self.max_per_hour - self.tokens_used_this_hour),
            "limit": self.max_per_hour,
        }

# Usage
token_limiter = TokenRateLimiter(max_tokens_per_hour=100000)

if token_limiter.allow_tokens(5000):
    print("Allowed")
else:
    print("Over token limit:", token_limiter.get_capacity())

Cost-Based Rate Limiting

class CostRateLimiter:
    """Limit spending per hour"""
    
    def __init__(self, max_cost_per_hour_usd: float):
        self.max_cost_per_hour = max_cost_per_hour_usd
        self.hour_start = time.time()
        self.cost_this_hour = 0.0
    
    def allow_cost(self, estimated_cost: float) -> bool:
        """Check if request fits in hourly budget"""
        now = time.time()
        
        # Reset if hour has passed
        if (now - self.hour_start) > 3600:
            self.hour_start = now
            self.cost_this_hour = 0.0
        
        if self.cost_this_hour + estimated_cost <= self.max_cost_per_hour:
            return True
        return False
    
    def record_cost(self, cost: float) -> None:
        """Record actual spend"""
        now = time.time()
        if (now - self.hour_start) > 3600:
            self.hour_start = now
            self.cost_this_hour = 0.0
        self.cost_this_hour += cost

# Usage
cost_limiter = CostRateLimiter(max_cost_per_hour_usd=5.0)

estimated = 0.05
if cost_limiter.allow_cost(estimated):
    result = run_agent()
    cost_limiter.record_cost(actual_cost)
else:
    print("Over hourly budget limit")
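In production all three gates usually sit in front of the same call. A small composite check, duck-typed against the three limiter classes above (the ordering rationale is in the docstring):

```python
def admit(request_limiter, token_limiter, cost_limiter,
          tokens: int, estimated_cost: float) -> bool:
    """Admit a call only if every gate agrees.

    allow_cost() is a pure check, so it goes first. allow_tokens() and
    allow_request() record usage on success, so a later refusal
    'wastes' a little reserved capacity — a conservative trade-off.
    """
    if not cost_limiter.allow_cost(estimated_cost):
        return False
    if not token_limiter.allow_tokens(tokens):
        return False
    return request_limiter.allow_request()
```

If refusals are common, consider reserving tokens only after every pure check has passed.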

Part 7: Model Selection & Cost Optimization

The key insight: Use the cheapest model that solves your problem.

Task-Based Model Selection

class ModelRouter:
    """Choose model based on task complexity"""
    
    # Tasks and their recommended models
    ROUTING = {
        "classification": {
            "model": "gemini-2-flash",  # Ultra-cheap
            "reason": "Simple binary/multi-class decision",
            "cost_estimate": 0.0001,
        },
        "extraction": {
            "model": "gemini-2-flash",
            "reason": "Extract fields from structured text",
            "cost_estimate": 0.0002,
        },
        "summarization": {
            "model": "claude-3-5-sonnet",
            "reason": "Needs semantic understanding",
            "cost_estimate": 0.010,
        },
        "code_review": {
            "model": "claude-3-5-sonnet",
            "reason": "Needs reasoning and explanation",
            "cost_estimate": 0.015,
        },
        "complex_reasoning": {
            "model": "claude-3-opus",
            "reason": "Multi-step reasoning, edge cases",
            "cost_estimate": 0.050,
        },
        "verification": {
            "model": "claude-3-opus",
            "reason": "Safety-critical, needs deep reasoning",
            "cost_estimate": 0.050,
        },
    }
    
    def select_model(self, task_type: str) -> Dict:
        """Select best model for task"""
        if task_type not in self.ROUTING:
            # Default to mid-tier, with a mid-tier cost estimate
            return {"model": "claude-3-5-sonnet", "reason": "default", "estimated_cost": 0.010}
        
        routing = self.ROUTING[task_type]
        return {
            "model": routing["model"],
            "reason": routing["reason"],
            "estimated_cost": routing["cost_estimate"],
        }

# Usage
router = ModelRouter()

# For each task, pick the right model
selection = router.select_model("classification")
print(f"Use {selection['model']}: {selection['reason']}")
# Output: Use gemini-2-flash: Simple binary/multi-class decision
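Static routing picks a model per task type; a complementary pattern is a cascade that tries the cheap model first and escalates only when it signals low confidence. A sketch — `cascade`, `call_model`, and the confidence convention are placeholders you would wire to your own client and self-evaluation prompt:

```python
from typing import Callable, List, Tuple

CHEAP_TO_EXPENSIVE = ["gemini-2-flash", "claude-3-5-sonnet", "claude-3-opus"]

def cascade(
    prompt: str,
    call_model: Callable[[str, str], Tuple[str, float]],
    tiers: List[str] = CHEAP_TO_EXPENSIVE,
    min_confidence: float = 0.8,
) -> dict:
    """Try models cheapest-first; stop at the first confident answer.

    call_model(model, prompt) -> (answer, confidence in [0, 1]).
    """
    answer, confidence = "", 0.0
    for model in tiers:
        answer, confidence = call_model(model, prompt)
        if confidence >= min_confidence:
            return {"model": model, "answer": answer, "confidence": confidence}
    # No tier was confident — return the strongest model's attempt anyway
    return {"model": tiers[-1], "answer": answer, "confidence": confidence}
```

If most traffic is easy, the expensive model only ever sees the hard tail — which is where the routing savings come from anyway.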

Hybrid Cloud/Local Routing

The ultimate cost optimization: Use cloud for complex tasks, local for simple ones.

class HybridRouter:
    """Route tasks to cloud or local models based on complexity"""
    
    def __init__(self):
        self.local_throughput = 50  # tokens/sec (4090)
        self.local_cost_per_hour = 1.65  # USD, all-in (depreciation + electricity)
        self.cloud_base_cost = 3.0  # USD per 1M tokens (Claude 3.5 Sonnet input)
    
    def estimate_cloud_cost(self, tokens: int) -> float:
        """Estimate cost for cloud model"""
        return (tokens / 1e6) * self.cloud_base_cost
    
    def estimate_local_cost(self, tokens: int) -> float:
        """Estimate cost for local model"""
        seconds = tokens / self.local_throughput
        hours = seconds / 3600
        return hours * self.local_cost_per_hour
    
    def should_use_cloud(self, task_tokens: int) -> bool:
        """Cloud vs local decision"""
        cloud_cost = self.estimate_cloud_cost(task_tokens)
        local_cost = self.estimate_local_cost(task_tokens)
        
        # Use cloud when it is clearly cheaper than local
        return cloud_cost < local_cost * 0.8
    
    def route(self, task_type: str, estimated_tokens: int) -> Dict[str, str]:
        """Route task to cloud or local"""
        if task_type in ["classification", "extraction"]:
            # Always local for simple tasks
            return {"target": "local", "model": "llama-3.1-8b", "reason": "simple task"}
        
        if self.should_use_cloud(estimated_tokens):
            return {"target": "cloud", "model": "claude-3-5-sonnet", "reason": "cheaper or complex"}
        else:
            return {"target": "local", "model": "llama-3.1-70b", "reason": "local cheaper"}

# Usage
router = HybridRouter()

# Simple task types always run locally
routing = router.route("classification", 500)
print(f"Route to {routing['target']}: {routing['model']}")
# Output: Route to local: llama-3.1-8b

# Everything else compares per-token costs. With these defaults
# ($3/1M cloud vs ~$9.17/1M all-in local), cloud wins at any size:
routing = router.route("code_review", 2000)
print(f"Route to {routing['target']}: {routing['model']}")
# Output: Route to cloud: claude-3-5-sonnet

# Local routing starts winning once the GPU is a sunk cost — set
# local_cost_per_hour to marginal electricity (~$0.06) to model that.

Cost Savings Calculation

Real-world scenario: Web scraper with 1000 tasks/day

BEFORE (always Claude 3.5 Sonnet):
  Per task: 2000 input + 500 output tokens = (2000 × $3 + 500 × $15) / 1M = $0.0135
  Daily: 1000 × $0.0135 = $13.50
  Monthly: $405.00

AFTER (hybrid routing):
  80% simple tasks (classification) → local model
  20% complex tasks → Claude 3.5 Sonnet

  Simple (800 tasks): 800 × $0.0003 (local) = $0.24/day
  Complex (200 tasks): 200 × $0.0135 (cloud) = $2.70/day
  Total daily: $2.94/day
  Monthly: $88.20

  Savings: $405 → $88 = 78% reduction!
  Payback on ~$2,000 of GPU hardware: roughly 6–7 months at this volume
  ($2,000 ÷ ~$10.56 saved per day ≈ 190 days)

Part 8: Local vs Cloud Economics

Break-Even Analysis

class LocalVsCloudAnalysis:
    """Determine when local models become cost-effective.

    Unlike the all-in $1.65/hr figure used earlier, this separates the
    one-time hardware outlay from the marginal (electricity-only) cost
    of running it, and assumes the GPU only runs while generating.
    """
    
    def __init__(self):
        # Equipment costs (one-time)
        self.gpu_cost = 1500  # RTX 4090
        self.setup_cost = 500  # Storage, power supply, cooling
        self.total_hardware = self.gpu_cost + self.setup_cost
        
        # Marginal costs (electricity only; depreciation is the hardware above)
        self.electricity_per_hour = 0.06  # ~400W at $0.15/kWh
        self.tokens_per_hour = 180_000    # 50 tokens/sec on the 4090
        
        # Blended cloud price per token (Claude 3.5 Sonnet, assuming a
        # 3:1 input:output mix: 0.75 × $3 + 0.25 × $15 = $6 per 1M)
        self.cloud_cost_per_token = 6.0 / 1e6
    
    def local_cost_per_token(self) -> float:
        """Marginal local cost per token (electricity only)."""
        return self.electricity_per_hour / self.tokens_per_hour
    
    def break_even_tokens_per_day(self, amortization_days: int = 365) -> int:
        """Daily usage at which local savings cover the hardware
        over the amortization window."""
        daily_hardware_cost = self.total_hardware / amortization_days
        marginal_savings = self.cloud_cost_per_token - self.local_cost_per_token()
        return int(daily_hardware_cost / marginal_savings)
    
    def payback_period_days(self, tokens_per_day: int) -> float:
        """How many days until the GPU investment pays for itself?"""
        daily_savings = tokens_per_day * (
            self.cloud_cost_per_token - self.local_cost_per_token()
        )
        if daily_savings <= 0:
            return float('inf')  # Never pays back
        return self.total_hardware / daily_savings
    
    def cost_projection(self, tokens_per_day: int, days: int) -> Dict[str, float]:
        """Project cumulative costs over time"""
        daily_cloud = tokens_per_day * self.cloud_cost_per_token
        daily_local = tokens_per_day * self.local_cost_per_token()
        
        cloud_total = daily_cloud * days
        local_total = self.total_hardware + (daily_local * days)
        
        return {
            "cloud_total": round(cloud_total, 2),
            "local_total": round(local_total, 2),
            "savings": round(cloud_total - local_total, 2),
        }

# Usage
analysis = LocalVsCloudAnalysis()

# What's the daily token usage for break-even (1-year amortization)?
tokens = analysis.break_even_tokens_per_day()
print(f"Break-even daily tokens: {tokens:,}")
# Output: Break-even daily tokens: 966,962  (~1M tokens/day)

# If we do 2M tokens/day, payback period?
payback = analysis.payback_period_days(2_000_000)
print(f"Payback period: {payback:.1f} days")
# Output: Payback period: 176.5 days

# Project 30 days at 500K tokens/day
projection = analysis.cost_projection(tokens_per_day=500_000, days=30)
print(f"Cloud 30 days: ${projection['cloud_total']}")
print(f"Local 30 days: ${projection['local_total']}")
print(f"Savings: ${projection['savings']}")
# Output:
# Cloud 30 days: $90.0
# Local 30 days: $2005.0 (dominated by the GPU purchase)
# Savings: $-1915.0 (local is more expensive until the hardware is paid off)

# But over a full year at 2M tokens/day?
projection = analysis.cost_projection(tokens_per_day=2_000_000, days=365)
print(f"Savings at 2M tokens/day: ${projection['savings']}")
# Output: Savings at 2M tokens/day: $2136.67

Part 9: Cost Reduction Techniques

1. Prompt Caching (Amortize Expensive Prompts)

For APIs that support caching (Claude, some OpenAI models):

class CachedPromptRouter:
    """Reuse expensive prompts with prompt caching"""
    
    def __init__(self, client):
        self.client = client
        # System prompts that we'll cache
        self.cached_prompts = {
            "code_reviewer": {
                "prompt": """You are a world-class code reviewer...[1000 tokens]...""",
                "cache_tokens": 0,
            },
            "data_analyst": {
                "prompt": """You are a data analysis expert...[1200 tokens]...""",
                "cache_tokens": 0,
            },
        }
    
    def call_with_cache(
        self,
        prompt_name: str,
        user_request: str
    ) -> Dict:
        """Call model with cached system prompt"""
        
        if prompt_name not in self.cached_prompts:
            raise ValueError(f"Unknown prompt: {prompt_name}")
        
        cached = self.cached_prompts[prompt_name]
        system_prompt = cached["prompt"]
        
        # First call: writes the cache (billed at ~1.25× the input price)
        # Subsequent calls within the TTL: cache reads at ~0.10× (≈90% discount)
        response = self.client.messages.create(
            model="claude-3-5-sonnet",
            max_tokens=500,
            system=[
                {
                    "type": "text",
                    "text": system_prompt,
                    "cache_control": {"type": "ephemeral"}  # Enable caching
                }
            ],
            messages=[
                {"role": "user", "content": user_request}
            ]
        )
        
        # Track cache hits
        usage = response.usage
        return {
            "result": response.content[0].text,
            "input_tokens": usage.input_tokens,
            "cache_creation_tokens": getattr(usage, 'cache_creation_input_tokens', 0),
            "cache_read_tokens": getattr(usage, 'cache_read_input_tokens', 0),
            "effective_cost_saved": getattr(usage, 'cache_read_input_tokens', 0) * 0.9 * (3.0 / 1e6),
        }

# Usage: First call (cache creation)
# response1 = router.call_with_cache("code_reviewer", "Review this code...")
# Cost: Full 1000 tokens (system) + user tokens

# Second call (cache reuse)
# response2 = router.call_with_cache("code_reviewer", "Review this other code...")
# Cost: 1000 × 0.1 = 100 tokens (90% discount on cached system prompt)

2. Context Window Optimization

You pay per token sent, not per window size, so the real savings come from trimming what goes into the context. Choosing the smallest window that fits also guards against accidentally stuffing oversized context into a request:

class ContextWindowOptimizer:
    """Choose context window based on actual needs"""
    
    MODELS = {
        "claude-3-5-sonnet-8k": {
            "context": 8192,
            "cost_per_1m": 3.0,  # Cheaper
            "best_for": "Short conversations, classification",
        },
        "claude-3-5-sonnet-200k": {
            "context": 200000,
            "cost_per_1m": 3.0,  # Same price!
            "best_for": "Long documents, RAG",
        },
    }
    
    def choose_model(self, required_context_tokens: int) -> str:
        """Pick smallest context window that fits"""
        for model_name, config in self.MODELS.items():
            if required_context_tokens <= config["context"]:
                return model_name
        
        # Fallback to largest
        return "claude-3-5-sonnet-200k"

# Usage
optimizer = ContextWindowOptimizer()
model = optimizer.choose_model(required_context_tokens=5000)
print(f"Use {model}")
# Output: Use claude-3-5-sonnet-8k (sufficient for 5K tokens)

3. Batch Processing

Group multiple requests to amortize overhead:

class BatchProcessor:
    """Batch multiple tasks into single API call"""
    
    def __init__(self, client, batch_size: int = 10):
        self.client = client
        self.batch_size = batch_size
        self.batch = []
    
    def add_task(self, task_id: str, content: str) -> None:
        """Add task to batch"""
        self.batch.append({"id": task_id, "content": content})
    
    def process_batch(self) -> Dict[str, str]:
        """Process entire batch in one API call"""
        if not self.batch:
            return {}
        
        # Combine all tasks into single prompt
        batch_prompt = "\n\n".join([
            f"Task {i+1} (ID: {task['id']}):\n{task['content']}"
            for i, task in enumerate(self.batch)
        ])
        
        response = self.client.messages.create(
            model="claude-3-5-sonnet",
            max_tokens=2000,
            messages=[
                {
                    "role": "user",
                    "content": f"Process these {len(self.batch)} tasks:\n{batch_prompt}"
                }
            ]
        )
        
        # Parse response and extract results per task
        results = self._parse_batch_response(response.content[0].text)
        self.batch = []  # Clear batch
        
        return results
    
    def _parse_batch_response(self, response_text: str) -> Dict[str, str]:
        """Extract per-task results from batch response"""
        # This would parse the structured response
        # For brevity, simplified here
        return {"task_1": "result_1"}

# Usage
processor = BatchProcessor(client, batch_size=10)

# Add 100 tasks
for i in range(100):
    processor.add_task(f"task_{i}", f"Classify this text...")
    
    if (i + 1) % 10 == 0:
        results = processor.process_batch()
        print(f"Processed 10 tasks, cost: ~0.05$")
# Total cost: ~0.50$ for 100 tasks
# vs. 100 individual calls: ~5.00$ (10× savings)

4. Temperature Tuning for Cost

Higher temperature = more diverse responses = more retries needed:

class TemperatureTuner:
    """Optimize temperature for cost vs quality"""
    
    @staticmethod
    def cost_vs_quality_recommendation(
        task_type: str
    ) -> Dict[str, float]:
        """Recommended temperature by task"""
        
        recommendations = {
            "classification": {"temperature": 0.0, "reason": "Deterministic needed"},
            "extraction": {"temperature": 0.1, "reason": "High accuracy needed"},
            "summarization": {"temperature": 0.5, "reason": "Natural variation OK"},
            "brainstorming": {"temperature": 1.0, "reason": "Diversity important"},
            "creative": {"temperature": 1.2, "reason": "Maximum creativity"},
        }
        
        return recommendations.get(task_type, {"temperature": 0.7, "reason": "default"})
    
    def estimate_retry_rate(self, temperature: float) -> float:
        """Higher temp = more likely to need retry"""
        # Empirical relationship: retry rate increases with temperature
        if temperature < 0.3:
            return 0.02  # 2% retry rate
        elif temperature < 0.7:
            return 0.05  # 5% retry rate
        elif temperature < 1.0:
            return 0.10  # 10% retry rate
        else:
            return 0.20  # 20% retry rate
    
    def cost_with_retries(
        self,
        base_cost: float,
        temperature: float
    ) -> float:
        """True cost including expected retries"""
        retry_rate = self.estimate_retry_rate(temperature)
        expected_calls = 1.0 + retry_rate
        return base_cost * expected_calls

# Usage
tuner = TemperatureTuner()

# For classification, what temperature?
rec = tuner.cost_vs_quality_recommendation("classification")
print(f"Temperature: {rec['temperature']} ({rec['reason']})")

# What's the true cost with retries?
true_cost = tuner.cost_with_retries(base_cost=0.05, temperature=0.0)
print(f"True cost (with retries): ${true_cost:.4f}")
# Output: True cost (with retries): $0.0510 (very stable)

true_cost = tuner.cost_with_retries(base_cost=0.05, temperature=1.2)
print(f"True cost at high temp: ${true_cost:.4f}")
# Output: True cost at high temp: $0.0600 (20% overhead from retries)

Part 10: Cost Dashboards & Reporting

Dashboard Template 1: Daily Cost Trend

{
  "title": "Daily Cost Trend",
  "metric": "cost_trend_usd_by_day",
  "display": "line_chart",
  "data": {
    "2026-04-18": 12.45,
    "2026-04-19": 14.23,
    "2026-04-20": 13.87,
    "2026-04-21": 18.92,
    "2026-04-22": 11.34
  },
  "alerts": {
    "yellow_threshold": 20.0,
    "red_threshold": 30.0,
    "message": "Daily cost spiked on 4/21, investigate why"
  }
}
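The alert thresholds in this template map to a simple check. A minimal sketch (function name and defaults are illustrative):

```python
def alert_level(daily_cost_usd: float, yellow: float = 20.0, red: float = 30.0) -> str:
    """Map a day's spend to an alert level using the template's thresholds."""
    if daily_cost_usd >= red:
        return "red"
    if daily_cost_usd >= yellow:
        return "yellow"
    return "green"

# The 2026-04-21 spike ($18.92) is still below the yellow threshold
print(alert_level(18.92))
```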

Dashboard Template 2: Cost Per Agent

{
  "title": "Cost by Agent",
  "metric": "cost_by_agent_usd",
  "display": "bar_chart",
  "data": {
    "web-scraper-agent": 245.67,
    "data-processor-agent": 89.23,
    "code-reviewer-agent": 156.44,
    "qa-agent": 67.89,
    "archive-agent": 12.34
  },
  "insights": {
    "highest_cost": "web-scraper-agent ($245.67)",
    "action": "Review scraper efficiency, consider batching"
  }
}

Dashboard Template 3: Cost by Model

{
  "title": "Spend by Model",
  "metric": "cost_by_model_usd",
  "display": "pie_chart",
  "data": {
    "claude-3-5-sonnet": 412.56,
    "gpt-4o": 89.23,
    "local-llama": 34.12,
    "gemini-2-flash": 12.09
  },
  "total": 548.00,
  "insights": {
    "percentage_sonnet": 75.3,
    "recommendation": "Consider routing simple tasks to Gemini 2.0 Flash"
  }
}

Dashboard Template 4: Cost vs Budget

{
  "title": "Budget Status",
  "metrics": {
    "monthly_budget": 1000.0,
    "spent_so_far": 452.34,
    "remaining": 547.66,
    "percentage_used": 45.2,
    "days_into_month": 15,
    "daily_average": 30.16,
    "projected_end_of_month": 905.8
  },
  "status": "on_track",
  "alert_level": "green"
}
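The projection above is just a run-rate extrapolation. A small helper reproduces it (a sketch; field names mirror the JSON template):

```python
import calendar
from datetime import date

def budget_status(monthly_budget: float, spent_so_far: float, today: date) -> dict:
    """Project end-of-month spend from the month-to-date run rate."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_average = spent_so_far / today.day
    projected = daily_average * days_in_month
    return {
        "monthly_budget": monthly_budget,
        "spent_so_far": spent_so_far,
        "remaining": round(monthly_budget - spent_so_far, 2),
        "percentage_used": round(spent_so_far / monthly_budget * 100, 1),
        "daily_average": round(daily_average, 2),
        "projected_end_of_month": round(projected, 2),
        "alert_level": "red" if projected > monthly_budget else "green",
    }

print(budget_status(1000.0, 452.34, date(2026, 4, 15)))
```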

Cost Dashboard Implementation

class CostDashboard:
    """Generate cost dashboards"""
    
    def __init__(self, attributor: CostAttributor):
        self.attributor = attributor
    
    def daily_trend(self) -> Dict:
        """Daily cost trend for chart"""
        daily = self.attributor.cost_trend_by_day()
        if not daily:  # Guard: no data yet, avoid division by zero / min() on empty
            return {"title": "Daily Cost Trend", "data": {}, "summary": {}}
        return {
            "title": "Daily Cost Trend",
            "data": daily,
            "summary": {
                "average": round(sum(daily.values()) / len(daily), 2),
                "min": round(min(daily.values()), 2),
                "max": round(max(daily.values()), 2),
            }
        }
    
    def cost_by_agent(self) -> Dict:
        """Cost breakdown by agent"""
        by_agent = self.attributor.cost_by_agent()
        total = sum(by_agent.values())
        
        return {
            "title": "Cost by Agent",
            "data": by_agent,
            "total": round(total, 2),
            "percentages": {
                agent: round((cost / total) * 100, 1)
                for agent, cost in by_agent.items()
            }
        }
    
    def cost_by_model(self) -> Dict:
        """Cost breakdown by model"""
        by_model = self.attributor.cost_by_model()
        total = sum(by_model.values())
        
        return {
            "title": "Cost by Model",
            "data": by_model,
            "total": round(total, 2),
            "percentages": {
                model: round((cost / total) * 100, 1)
                for model, cost in by_model.items()
            }
        }
    
    def anomalies(self) -> Dict:
        """Anomaly report"""
        anomalies = self.attributor.anomaly_detection()
        
        return {
            "title": "Cost Anomalies",
            "anomalies": anomalies,
            "count": len(anomalies),
            "recommendation": "Investigate high-cost days for efficiency improvements"
        }
    
    def full_report(self) -> Dict:
        """Complete cost dashboard"""
        return {
            "generated_at": datetime.utcnow().isoformat(),
            "daily_trend": self.daily_trend(),
            "by_agent": self.cost_by_agent(),
            "by_model": self.cost_by_model(),
            "anomalies": self.anomalies(),
        }

Part 11: ROI Analysis

Cost Per Successful Task

class ROIAnalyzer:
    """Calculate return on investment"""
    
    def __init__(self):
        self.tasks = []  # List of {cost, success, revenue}
    
    def record_task(
        self,
        cost_usd: float,
        success: bool,
        revenue_usd: float = 0.0
    ) -> None:
        """Record task completion"""
        self.tasks.append({
            "cost": cost_usd,
            "success": success,
            "revenue": revenue_usd,
        })
    
    def cost_per_successful_task(self) -> float:
        """Average cost per successful completion"""
        successful = [t for t in self.tasks if t["success"]]
        if not successful:
            return float('inf')
        
        total_cost = sum(t["cost"] for t in successful)
        return total_cost / len(successful)
    
    def success_rate(self) -> float:
        """Percentage of tasks that succeeded"""
        if not self.tasks:
            return 0.0
        successful = sum(1 for t in self.tasks if t["success"])
        return (successful / len(self.tasks)) * 100
    
    def revenue_per_task(self) -> float:
        """Average revenue per task (if applicable)"""
        if not self.tasks:
            return 0.0
        total_revenue = sum(t["revenue"] for t in self.tasks)
        return total_revenue / len(self.tasks)
    
    def net_profit_per_task(self) -> float:
        """Revenue - Cost per task"""
        if not self.tasks:
            return 0.0
        revenue = self.revenue_per_task()
        cost = sum(t["cost"] for t in self.tasks) / len(self.tasks)
        return revenue - cost
    
    def payback_period_for_gpu(self, gpu_cost: int = 1500) -> float:
        """How many tasks to break even on GPU investment?"""
        profit_per_task = self.net_profit_per_task()
        if profit_per_task <= 0:
            return float('inf')
        return gpu_cost / profit_per_task
    
    def roi_percentage(self) -> float:
        """Return on investment percentage"""
        total_cost = sum(t["cost"] for t in self.tasks)
        total_revenue = sum(t["revenue"] for t in self.tasks)
        
        if total_cost == 0:
            return 0.0
        
        profit = total_revenue - total_cost
        return (profit / total_cost) * 100

# Usage
analyzer = ROIAnalyzer()

# Record task outcomes
for i in range(100):
    cost = 0.05 if i % 10 == 0 else 0.02  # Some expensive, some cheap
    success = i % 5 != 0  # 80% success rate
    revenue = 1.0 if success else 0.0  # $1 per successful task
    
    analyzer.record_task(cost, success, revenue)

# Get metrics
print(f"Success rate: {analyzer.success_rate():.1f}%")
print(f"Cost per successful task: ${analyzer.cost_per_successful_task():.4f}")
print(f"Revenue per task: ${analyzer.revenue_per_task():.2f}")
print(f"Net profit per task: ${analyzer.net_profit_per_task():.2f}")
print(f"ROI: {analyzer.roi_percentage():.1f}%")
print(f"Tasks to break even on $1500 GPU: {analyzer.payback_period_for_gpu():.0f}")

Part 12: Implementation Checklist

Phase 1: Token Counting (Week 1)

  • Implement TokenCounter class
  • Add pricing for your models
  • Integrate token counting into agent loop
  • Log all costs to structured JSON
  • Validate token estimates vs actual usage

Phase 2: Budget Enforcement (Week 2)

  • Implement BudgetEnforcer with hard limits
  • Set daily budget (start conservative)
  • Set monthly budget
  • Add soft alerts (75%, 95%)
  • Test budget rejection logic

Phase 3: Rate Limiting (Week 2)

  • Implement RequestRateLimiter
  • Implement TokenRateLimiter
  • Implement CostRateLimiter
  • Set appropriate limits for your scale
  • Test rate limiting under load

Phase 4: Cost Attribution (Week 3)

  • Implement CostAttributor
  • Track costs by agent
  • Track costs by task
  • Track costs by model
  • Set up daily cost trending

Phase 5: Cost Dashboards (Week 3)

  • Create CostDashboard class
  • Generate daily trend report
  • Generate agent breakdown
  • Generate model breakdown
  • Set up anomaly detection
  • Export to visualization tool (Grafana, Datadog, etc.)

Phase 6: Model Optimization (Week 4)

  • Create ModelRouter for task-based selection
  • Implement HybridRouter for cloud/local
  • Measure actual savings
  • Document cost per task type
  • Train team on router rules

Phase 7: Monitoring & Alerting (Week 4)

  • Set up budget alerts (email/Slack)
  • Set up anomaly alerts
  • Create runbook: “What to do if costs spike”
  • Set up automated cost reports
  • Configure escalation policies

Pre-Deployment Checklist

  • Cost counter tested and validated
  • Budget limits in place and tested
  • Rate limiters active
  • Dashboard shows real data
  • Anomaly detection working
  • Team understands cost model
  • Cost thresholds documented
  • Emergency shutdown procedure ready

Part 13: Real-World Scenarios

Scenario 1: Runaway Agent (From $100/day to $10/day)

Problem: Web scraper agent costs $100/day after launch.

Root causes:

  • Always using Claude 3 Opus (expensive verification model)
  • Fetching full webpage content (unnecessary tokens)
  • No batching of requests

Solutions:

  1. Switch to Claude 3.5 Sonnet for scraping (-50%)
  2. Use context trimming (keep only relevant paragraphs) (-40%)
  3. Batch 10 URLs per request (-20%)
  4. Local routing for classification (-90% for simple tasks)

Result:

  • Original: 100 × $1.00 = $100/day
  • After optimization:
    • 80 simple tasks (local): 80 × $0.003 = $0.24
    • 20 complex tasks: 20 × $0.10 = $2.00
    • Total: $2.24/day
  • Savings: $97.76/day (98% reduction!)
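The before/after arithmetic can be reproduced directly (the per-task costs are the scenario's assumptions):

```python
def daily_cost(task_buckets: list[tuple[int, float]]) -> float:
    """Sum (count × cost_per_task) across buckets of tasks."""
    return sum(count * cost for count, cost in task_buckets)

before = daily_cost([(100, 1.00)])              # everything on an expensive model
after = daily_cost([(80, 0.003), (20, 0.10)])   # 80 simple tasks local, 20 complex on cloud
print(f"Before: ${before:.2f}/day  After: ${after:.2f}/day")
print(f"Savings: ${before - after:.2f}/day ({(1 - after / before) * 100:.0f}% reduction)")
```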

Scenario 2: When Local GPU Becomes Cost-Effective

Problem: Do we need a GPU for our agent?

Setup:

  • RTX 4090: $1500 hardware + $1.65/hour operating cost
  • Claude 3.5 Sonnet: $3/1M input tokens

Analysis:

GPU amortized over 1 year:
  Annual cost: $1500 + ($1.65 × 24 × 365) = $1500 + $14,454 = $15,954
  Daily cost: $15,954 / 365 = $43.71

Cloud model for same workload:
  Claude 3.5 Sonnet at $3/1M input, $15/1M output
  Blended rate (3:1 input:output mix): (3 × $3 + 1 × $15) / 4 = $6 per 1M tokens

Break-even: $43.71 / ($6 / 1M tokens) ≈ 7.3M tokens/day

Throughput check: at a 3:1 mix, 7.3M tokens/day is ~1.8M output tokens,
or ~10 hours of generation at 50 tokens/sec, feasible on one GPU.

Conclusion:
  <1M tokens/day: Use cloud (cheaper)
  >7.3M tokens/day: GPU pays for itself
  1M–7.3M tokens/day: Hybrid (simple tasks local, complex cloud)
  (Part 15 reaches a much lower break-even because it assumes electricity-only
  operating costs of ~$3.35/day rather than $1.65/hour.)

Scenario 3: Detecting Cost Spike Early

Problem: Cost spike from $20 to $80/day. Find root cause.

Detection strategy:

# Check 1: Which agent(s) cost more?
by_agent = cost_dashboard.cost_by_agent()
print(by_agent)  # spike in "data-processor-agent"

# Check 2: What changed in that agent?
# (Check git log, deployment notes)
# Found: New feature added, processes 10× more data

# Check 3: Which model is expensive?
by_model = cost_dashboard.cost_by_model()
print(by_model)  # Mostly Claude Opus (expensive)

# Check 4: What's the cost per token?
cost_per_token = total_cost / total_tokens
print(cost_per_token)  # Higher than before

# Solution: Route expensive tasks to Sonnet, cheap to local
# Expected result: Back to $20/day

Part 14: Cross-Reference

This document complements:

  • 01_foundation_models.md: Model selection strategies
  • 08_claw_code_python.md: Cost tracking in Python implementation
  • 09_operations_and_observability.md: Monitoring and alerting
  • 10_security_and_safety.md: Rate limiting for DoS prevention
  • 11_testing_and_qa.md: Cost validation in test suite

Summary: Cost Control Principles

  1. Measure everything: Token counts, costs, by agent/task/model
  2. Enforce budgets: Hard limits prevent overspending, soft alerts provide warning
  3. Route intelligently: Simple tasks → cheap models, complex → expensive
  4. Optimize ruthlessly: Compress prompts, trim context, batch requests
  5. Hybrid approach: Cloud for complex reasoning, local for volume
  6. Monitor continuously: Daily trends, anomalies, per-agent breakdown
  7. Plan for scale: Break-even analysis, payback periods, ROI metrics

The goal: up to 80-90% cost reduction through smart routing and optimization, achievable when the majority of requests can be served by local or cheap models, without sacrificing quality.


Part 15: End-to-End Cost Calculation Example

Scenario: 10,000 Requests/Day with Mistral 7B (Self-Hosted)

You are running a customer support triage agent. It classifies incoming tickets, extracts key fields, and routes them to the right team. The workload is 10,000 requests per day. You are considering self-hosting Mistral 7B on an RTX 4090 with AWQ 4-bit quantization.

Token Estimation

Average request:
  System prompt:       200 tokens  (fixed, cached)
  User input (ticket): 300 tokens  (average customer message)
  Few-shot examples:   400 tokens  (3 examples for classification)
  Total input:         900 tokens per request

  Model output:        150 tokens  (classification + extracted fields)
  Total output:        150 tokens per request

Daily totals:
  Input tokens:   10,000 × 900  =  9,000,000 tokens (9M)
  Output tokens:  10,000 × 150  =  1,500,000 tokens (1.5M)
  Total tokens:   10,500,000 tokens/day (10.5M)
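These daily totals can be computed with a small helper (the token counts are the scenario's estimates):

```python
def daily_token_totals(requests_per_day: int, input_tokens: int, output_tokens: int) -> dict:
    """Scale per-request token counts to daily totals."""
    return {
        "input": requests_per_day * input_tokens,
        "output": requests_per_day * output_tokens,
        "total": requests_per_day * (input_tokens + output_tokens),
    }

totals = daily_token_totals(10_000, input_tokens=900, output_tokens=150)
print(totals)  # {'input': 9000000, 'output': 1500000, 'total': 10500000}
```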

Infrastructure Cost (Self-Hosted)

Hardware: RTX 4090 ($1,500 amortized over 2 years)
  Daily hardware cost:  $1,500 / 730 days = $2.05/day

Electricity: 450W TDP × 24 hours × $0.12/kWh
  Daily electricity:    0.45 × 24 × 0.12 = $1.30/day

Throughput: Mistral 7B AWQ on RTX 4090
  Generation speed:     80 tokens/sec (output)
  Time for 1.5M output: 1,500,000 / 80 = 18,750 seconds = 5.2 hours
  GPU utilization:      5.2 / 24 = 22% (plenty of headroom)

Total infrastructure:   $2.05 + $1.30 = $3.35/day

Model Cost (If Using Cloud Instead)

Mistral Large 2 (API): $2/1M input, $6/1M output
  Input cost:   9M × $2 / 1M  = $18.00/day
  Output cost:  1.5M × $6 / 1M = $9.00/day
  Total cloud:  $27.00/day

Alternative — Gemini 2.0 Flash: $0.075/1M input, $0.30/1M output
  Input cost:   9M × $0.075 / 1M = $0.675/day
  Output cost:  1.5M × $0.30 / 1M = $0.45/day
  Total cloud:  $1.13/day (cheaper than self-hosted!)

Monitoring Overhead

Prometheus + Grafana (self-hosted):
  Small VM for monitoring:        $0.50/day (t3.micro or equivalent)
  Log storage (10GB/month):       $0.10/day
  Total monitoring:               $0.60/day

Monthly Projection Formula

Monthly cost = (daily_model_cost + daily_infra_cost + daily_monitoring_cost) × 30

Self-hosted Mistral 7B:
  ($0.00 model + $3.35 infra + $0.60 monitoring) × 30 = $118.50/month

Cloud Mistral Large 2:
  ($27.00 model + $0.00 infra + $0.60 monitoring) × 30 = $828.00/month

Cloud Gemini 2.0 Flash:
  ($1.13 model + $0.00 infra + $0.60 monitoring) × 30 = $51.90/month
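The projection formula translates directly into code (the daily figures come from the scenario above):

```python
def monthly_cost(daily_model: float, daily_infra: float, daily_monitoring: float, days: int = 30) -> float:
    """Monthly cost = (model + infra + monitoring per day) × days."""
    return round((daily_model + daily_infra + daily_monitoring) * days, 2)

print(monthly_cost(0.00, 3.35, 0.60))    # self-hosted Mistral 7B
print(monthly_cost(27.00, 0.00, 0.60))   # cloud Mistral Large 2
print(monthly_cost(1.13, 0.00, 0.60))    # cloud Gemini 2.0 Flash
```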

Verdict for This Scenario

Option                    Daily     Monthly    Notes
Self-hosted Mistral 7B    $3.95     $118.50    Requires GPU hardware, ops overhead
Cloud Mistral Large 2     $27.60    $828.00    Zero ops, but 7x more expensive
Cloud Gemini 2.0 Flash    $1.73     $51.90     Cheapest option, if quality is sufficient

For a classification/extraction task, Gemini 2.0 Flash likely has sufficient quality and is the cheapest option. Self-hosted only wins when you need the model to run on-premises or process sensitive data that cannot leave your network.

When Does Cloud vs Local Break Even?

The break-even depends on daily token volume and which cloud model you are comparing against:

Break-even formula:
  cloud_daily_cost = (input_tokens × cloud_input_price / 1M) + (output_tokens × cloud_output_price / 1M)
  local_daily_cost = GPU_amortized_daily + electricity_daily

  Break-even tokens/day = local_daily_cost / cloud_cost_per_token
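
The formula above as a function (a sketch; plug in your own amortization and blended rate):

```python
def break_even_tokens_per_day(
    gpu_amortized_daily: float,
    electricity_daily: float,
    cloud_blended_per_1m: float,
) -> float:
    """Daily token volume at which local cost equals cloud cost."""
    local_daily = gpu_amortized_daily + electricity_daily
    return local_daily / (cloud_blended_per_1m / 1_000_000)

# This document's figures: $2.05/day amortized GPU + $1.30/day electricity,
# against Claude 3.5 Sonnet at a ~$6/1M blended rate
print(f"{break_even_tokens_per_day(2.05, 1.30, 6.0):,.0f} tokens/day")
```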

Specific numbers (vs Claude 3.5 Sonnet at $3/$15 per 1M):
  Local daily cost: $3.35
  Blended cloud rate: ~$6 per 1M tokens (assuming a 3:1 input:output ratio)
  Break-even: $3.35 / ($6 / 1M) = 558,000 tokens/day

Specific numbers (vs Gemini 2.0 Flash at $0.075/$0.30 per 1M):
  Break-even: $3.35 / ($0.13 / 1M) = 25,800,000 tokens/day (25.8M)
  You would need 25.8M tokens/day before self-hosting beats Gemini Flash.

Summary:
  vs Claude 3.5 Sonnet:  Self-host above ~560K tokens/day
  vs Mistral Large 2:    Self-host above ~1.1M tokens/day
  vs GPT-4o:             Self-host above ~450K tokens/day
  vs Gemini 2.0 Flash:   Self-host above ~25.8M tokens/day (almost never worth it)

Quick Reference: Cost Per 1M Tokens (April 2026 Pricing)

Model                Input / 1M    Output / 1M    Best For
Gemini 2.0 Flash     $0.075        $0.30          High-volume, simple tasks (classification, extraction)
Mistral Large 2      $2.00         $6.00          General-purpose, European data residency
Claude 3.5 Sonnet    $3.00         $15.00         Agent loops, code generation, complex reasoning
GPT-4o               $5.00         $15.00         Multimodal, fast responses
Claude 3 Opus        $15.00        $75.00         Verification, safety-critical, deep analysis

Cost hierarchy (cheapest to most expensive, blended 3:1 rates relative to Sonnet): Gemini Flash (0.02x) < Mistral Large (0.5x) < Sonnet (1.0x baseline) < GPT-4o (1.25x) < Opus (5.0x)

Use hybrid routing (Doc 13, Part 7) to send simple tasks to cheap models and complex tasks to expensive ones — this typically reduces costs by 70-90%.
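The savings from hybrid routing follow from a simple blend of per-model rates. A sketch, with an assumed 80/20 split and the blended rates above:

```python
def hybrid_blended_rate(cheap_rate: float, expensive_rate: float, cheap_fraction: float) -> float:
    """Effective $/1M tokens when a fraction of traffic goes to the cheap model."""
    return cheap_fraction * cheap_rate + (1 - cheap_fraction) * expensive_rate

# Assumed split: 80% of requests to Gemini Flash (~$0.13/1M blended),
# 20% to Claude 3.5 Sonnet (~$6/1M blended)
rate = hybrid_blended_rate(0.13, 6.0, 0.80)
print(f"${rate:.2f}/1M, {(1 - rate / 6.0) * 100:.0f}% cheaper than all-Sonnet")
```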


Local Inference vs Cloud API: When Does Local Pay Off?

The GPU-focused break-even analysis in Part 8 assumes dedicated NVIDIA hardware. But if you already own an Apple Silicon Mac, the economics are radically different — there is no additional hardware cost, and electricity is negligible.

Real Measured Numbers: M4 MacBook Pro 32GB Running Qwen 2.5 7B

These numbers come from actual benchmarking on consumer hardware, not theoretical estimates:

  • ~1,000 tokens per agent call (system prompt + input + output)
  • ~25ms per token generation (~40 tokens/sec)
  • ~20W power consumption during inference
  • At scale (1,000 calls): ~1M tokens total, ~7 hours wall time, ~0.14 kWh electricity

At UK electricity rates (~£0.28/kWh): 1M tokens costs approximately £0.04 in electricity.
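These measurements reduce to a short calculation (the throughput and wattage are the figures above):

```python
def local_inference_cost(tokens: int, tokens_per_sec: float, watts: float, price_per_kwh: float) -> tuple:
    """Return (wall-time hours, electricity cost) for a local inference run."""
    hours = tokens / tokens_per_sec / 3600
    cost = watts / 1000 * hours * price_per_kwh
    return hours, cost

hours, cost = local_inference_cost(1_000_000, tokens_per_sec=40, watts=20, price_per_kwh=0.28)
print(f"{hours:.1f} hours, £{cost:.2f} electricity")  # 6.9 hours, £0.04 electricity
```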

Cost Comparison Table

Approach                    Cost for 1M tokens               Time       Notes
Local M4 (Qwen 2.5 7B)      ~£0.04 electricity               ~7 hours   Free after hardware purchase
Claude Sonnet API           ~$7.80                           ~2 hours   Pay per use
Claude Opus API             ~$39.00                          ~2 hours   Pay per use
Claude Code subscription    Included but burns allocation    ~2 hours   Tokens unavailable for coding

The Allocation Insight

The most important number in this table is not a price — it is opportunity cost. If you are using Claude Code for repetitive agent tasks, those tokens are not available for software development. Running the same work locally costs pennies and preserves your full Claude Code allocation for coding.

This matters most for iterative agent workloads (research, data processing, classification) where the same prompt runs hundreds or thousands of times. Each run on Claude Code consumes tokens from a finite subscription allocation. Each run locally costs a fraction of a penny.

When Local Wins

  • You already own Apple Silicon hardware (no capital expenditure)
  • Workload is repetitive (same prompt, many inputs)
  • Quality requirements are met by a 7B model (classification, extraction, structured reasoning)
  • You want to preserve cloud API credits or subscription allocation for higher-value work

When Cloud Wins

  • You need the reasoning quality of a frontier model (Opus, GPT-4o)
  • Latency matters more than cost (cloud parallelises, local is sequential)
  • Workload is low volume (<100 calls/day — the cost is negligible either way)
  • You do not have local hardware with sufficient RAM

See Also

  • Doc 12 (Deployment Patterns) — Cost infrastructure (containers, scaling, resource limits) is configured during deployment
  • Doc 02 (KV Cache Optimization) — KV cache quantization (GQA, INT8/INT4) is a major cost lever (memory savings = higher throughput on same hardware)
  • Doc 01 (Foundation Models) — Model selection (SLM vs LLM) is the first cost decision; hybrid routing saves 80–90%
  • Doc 03 (Hugging Face Ecosystem) — Quantization (AWQ, GPTQ) reduces memory and compute cost; evaluation affects cost/quality trade-off

Changelog & Attribution

  • April 2026: Initial document
    • Token pricing based on April 2026 public rates
    • KV cache optimization techniques from 02_kv_cache_optimization.md
    • Rate limiting strategies from 10_security_and_safety.md
    • Local vs cloud analysis based on 2026 GPU/electricity costs

For implementation help, see Part 12 (Implementation Checklist).