AI Infrastructure Cost Management: Controlling the LLM Bill
Reviewed: June 4, 2026
AI compute costs can spiral from thousands to millions of dollars per month. Without deliberate cost management, your AI infrastructure bill will be the line item that keeps you up at night. Here’s how to stay in control.
Understanding the Cost Stack
The Three Pillars of AI Compute Cost
Training costs: One-time but massive
- GPT-4 scale: $50-100M for a single training run
- Llama 3.1 405B: about 30.8M H100 GPU-hours of pre-training per Meta's model card, which is roughly $60M or more at ~$2 per GPU-hour
- Fine-tuning 70B: $500-5,000 per run depending on data size
Inference costs: Recurring and scaling
- ChatGPT at scale: roughly $700K/day in compute, according to a widely cited 2023 third-party estimate
- Per-user cost: $0.01-0.10 per session depending on model and length
- Annual inference budget for a mid-size company: $2-10M
Storage and data pipeline costs: Often forgotten
- Training data storage: about $0.02/GB/month (S3) → roughly $20/month per TB, or $20K/month for a 1 PB corpus
- Vector database: $500-5,000/month for RAG pipelines
- Data transfer: Often the hidden cost at scale
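The recurring parts of the stack are simple multiplications, so a rough estimator can help sanity-check the figures above. This is a minimal sketch: the function names, the 30-day month, and the sample inputs are illustrative assumptions, not numbers from any provider.

```python
def monthly_inference_cost(sessions_per_day: int,
                           cost_per_session: float,
                           days: int = 30) -> float:
    """Recurring inference spend: sessions x per-session cost x days."""
    return sessions_per_day * cost_per_session * days


def monthly_storage_cost(corpus_gb: float,
                         price_per_gb_month: float = 0.02) -> float:
    """Object-storage spend at a flat per-GB monthly rate."""
    return corpus_gb * price_per_gb_month


if __name__ == "__main__":
    # Hypothetical workload: 100K sessions/day at $0.05 each, 1 TB of data.
    print(f"inference: ${monthly_inference_cost(100_000, 0.05):,.0f}/month")
    print(f"storage:   ${monthly_storage_cost(1_000):,.0f}/month")
```

At these assumed inputs, inference dominates storage by several orders of magnitude, which is why the optimizations that follow focus on it.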
Model Routing: The Highest-ROI Cost Optimization
Not every request needs the most expensive model:
| Query complexity | Model selection | Price |
|---|---|---|
| Simple FAQ | GPT-3.5 / Llama 8B | $0.002/1K tokens |
| Standard task | GPT-4o-mini / Llama 70B | $0.01/1K tokens |
| Complex reasoning | GPT-4o / Claude 3.5 | $0.06/1K tokens |
| Critical/difficult | GPT-4 / Claude 3 Opus | $0.15/1K tokens |
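This tiered approach can reduce costs by 60-80% compared with sending every request to a top-tier model, while maintaining quality for most requests. The prices above are illustrative; actual per-token rates vary by provider and change often.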
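To see where a figure in that range can come from, here is a small worked calculation. It uses the illustrative per-1K-token prices from the tiers above; the traffic mix and the choice of the "complex" tier as the single-model baseline are hypothetical assumptions.

```python
# Illustrative per-1K-token prices, copied from the tiers listed above.
TIER_PRICE_PER_1K = {
    "simple": 0.002,
    "standard": 0.01,
    "complex": 0.06,
    "critical": 0.15,
}


def blended_price(traffic_mix: dict) -> float:
    """Average price per 1K tokens when traffic is split across tiers."""
    if abs(sum(traffic_mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(TIER_PRICE_PER_1K[tier] * share
               for tier, share in traffic_mix.items())


def savings_vs_single_model(traffic_mix: dict,
                            baseline_tier: str = "complex") -> float:
    """Fractional savings relative to routing everything to one tier."""
    return 1.0 - blended_price(traffic_mix) / TIER_PRICE_PER_1K[baseline_tier]


if __name__ == "__main__":
    # Hypothetical mix: half of all traffic is simple FAQ-style queries.
    mix = {"simple": 0.5, "standard": 0.3, "complex": 0.15, "critical": 0.05}
    print(f"blended: ${blended_price(mix):.4f} per 1K tokens")
    print(f"savings: {savings_vs_single_model(mix):.0%}")
```

The result is sensitive to the mix: the more traffic that genuinely needs the top tiers, the smaller the savings.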
Implementing Model Routing
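import re

def route_model(query: str, conversation_history: list) -> str:
    """Pick the cheapest model tier that covers the query's complexity."""
    complexity = classify_complexity(query, conversation_history)
    if complexity < 0.3:
        return "gpt-3.5-turbo"  # $0.002/1K tokens
    elif complexity < 0.6:
        return "gpt-4o-mini"    # $0.01/1K tokens
    elif complexity < 0.85:
        return "gpt-4o"         # $0.06/1K tokens
    else:
        return "o1"             # $0.15/1K tokens

def requires_code(query: str) -> bool:
    """Placeholder keyword check (an assumption; swap in a real classifier)."""
    return bool(re.search(r"\b(code|function|bug|refactor|stack trace)\b", query, re.I))

def requires_math(query: str) -> bool:
    """Placeholder check for arithmetic or math keywords (an assumption)."""
    return bool(re.search(r"\d\s*[-+*/^=]\s*\d|\b(prove|integral|derivative|equation)\b", query, re.I))

def classify_complexity(query: str, history: list) -> float:
    """Heuristic complexity score in [0, 1]; higher means a stronger model."""
    score = 0.0
    if len(query) > 500:
        score += 0.2
    if len(history) > 10:
        score += 0.2
    if requires_code(query):
        score += 0.3
    if requires_math(query):
        score += 0.3
    return min(score, 1.0)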
Caching Strategies
Semantic Caching
Cache responses for semantically similar queries:
- **Exact match cache**: Hash the query, return cached response
- **Semantic cache**: Embed the query, use vector similarity to find cached responses
- **Expected hit rate**: 30-50% for support applications
- **Cost savings**: Proportional to hit rate × average response cost
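The two cache layers above can be combined in one lookup: try the exact hash first, then fall back to vector similarity. The sketch below is a toy under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the 0.8 threshold is arbitrary, and the linear scan would be replaced by a vector index in practice.

```python
import hashlib
import math
import re
from collections import Counter
from typing import Optional


def toy_embed(text: str) -> Counter:
    """Stand-in embedding (word counts); use a real embedding model in practice."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self.exact = {}    # normalized-query hash -> response
        self.entries = []  # (embedding, response) pairs

    @staticmethod
    def _key(query: str) -> str:
        return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

    def put(self, query: str, response: str) -> None:
        self.exact[self._key(query)] = response
        self.entries.append((toy_embed(query), response))

    def get(self, query: str) -> Optional[str]:
        # Layer 1: exact match on the normalized query.
        hit = self.exact.get(self._key(query))
        if hit is not None:
            return hit
        # Layer 2: most similar cached query, if it clears the threshold.
        vec = toy_embed(query)
        best_score, best_response = 0.0, None
        for cached_vec, response in self.entries:
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

The threshold is the main design knob: set it too low and users get answers to questions they did not ask; set it too high and the hit rate collapses toward the exact-match layer.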
Prompt Caching
Cache the system prompt and shared context:
# Without caching: full context sent every request
$0.02 (prompt) + $0.005 (completion) = $0.025 per call
$0.025 × 10,000 calls = $250
# With prompt caching: cached prefix at 50% discount
$0.01 (cached prompt) + $0.005 (completion) = $0.015 per call
$0.015 × 10,000 calls = $150 (40% savings)
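Both Anthropic and OpenAI support prompt caching natively. OpenAI applies it automatically to sufficiently long prompts, while Anthropic requires marking the cacheable prefix with `cache_control`. Discounts, cache lifetimes, and any surcharge for writing to the cache differ by provider.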
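The savings depend on how much of each call is cacheable prefix versus fresh completion. Here is a minimal sketch of that arithmetic, assuming a prefix that costs $0.02 uncached, a $0.005 completion, and a 50% discount on the cached prefix. It deliberately ignores cache misses and the cache-write surcharge that some providers bill.

```python
def cost_per_call(prefix_cost: float,
                  completion_cost: float,
                  cached_discount: float = 0.0) -> float:
    """Cost of one call when the shared prefix is billed at a discount."""
    return prefix_cost * (1.0 - cached_discount) + completion_cost


def caching_savings(prefix_cost: float,
                    completion_cost: float,
                    cached_discount: float) -> float:
    """Fractional savings of cached calls relative to uncached calls."""
    baseline = cost_per_call(prefix_cost, completion_cost)
    cached = cost_per_call(prefix_cost, completion_cost, cached_discount)
    return 1.0 - cached / baseline


if __name__ == "__main__":
    calls = 10_000
    uncached = cost_per_call(0.02, 0.005) * calls
    cached = cost_per_call(0.02, 0.005, cached_discount=0.5) * calls
    print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")
    print(f"savings:  {caching_savings(0.02, 0.005, 0.5):.0%}")
```

Because only the prefix is discounted, the overall savings are always smaller than the headline discount, and they shrink as completions get longer relative to the prompt.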
KV-Cache Reuse
For repeated conversations with the same context:
- Cache the key-value tensors for the system prompt
- Only compute new tokens for the actual user query
- 5-10x faster time to first token for long system prompts, since prefill is skipped for the cached prefix
Spot Instances and Preemptible Compute
Training on Spot Instances
- **60-80% cheaper** than on-demand instances
- Risk: instance can be reclaimed on short notice, from about 30 seconds (GCP, Azure) to 2 minutes (AWS)
- Mitigation: checkpoint every 10-15 minutes
- Best for: hyperparameter sweeps, large-scale fine-tuning, batch inference
Spot Instance Strategy
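# Pseudo-code for fault-tolerant training. A reclaimed instance kills the
# process, so recovery happens on restart rather than in an except block.
# load_latest_checkpoint / save_checkpoint are placeholder helpers.
# Restores model and optimizer state; returns the saved step count (0 if none).
steps = load_latest_checkpoint(from_="s3://bucket/checkpoints/")
for epoch in range(num_epochs):
    for batch in dataloader:
        optimizer.zero_grad()
        loss = model(batch)
        loss.backward()
        optimizer.step()
        steps += 1
        if steps % checkpoint_interval == 0:
            save_checkpoint(f"checkpoint_step_{steps}.pt",
                            to="s3://bucket/checkpoints/")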
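Periodic checkpoints bound the lost work, and a final checkpoint taken when the reclamation notice arrives can reduce it further. The sketch below assumes the notice reaches the training process as `SIGTERM`, which is common under container orchestrators but not universal; some clouds expose the notice only through a metadata endpoint that has to be polled. `train_step` and `save_checkpoint` are hypothetical callables supplied by the caller.

```python
import signal


class GracefulStop:
    """Records that SIGTERM arrived so the training loop can exit cleanly."""

    def __init__(self) -> None:
        self.requested = False
        # Must be installed from the main thread.
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame) -> None:
        self.requested = True


def train(total_steps, train_step, save_checkpoint, stop,
          checkpoint_interval=100, start_step=0):
    """Run steps until done or until a stop is requested; return the last step."""
    step = start_step
    while step < total_steps:
        train_step(step)
        step += 1
        stopping = stop.requested  # read once so save and break agree
        if stopping or step % checkpoint_interval == 0:
            save_checkpoint(step)
        if stopping:
            break
    return step
```

On restart, passing the last saved step as `start_step` resumes where the previous instance left off. The final checkpoint has to finish inside the notice window, so large models may need asynchronous or sharded checkpoint writes.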
Usage Tracking and Budgets
Per-Team Cost Attribution
Tag every API call with team/project metadata:
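import openai

# `messages` is the chat history built elsewhere in the application.
# Assumption: depending on the API version, `metadata` may only be
# accepted together with store=True; check the current API reference.
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    metadata={
        "team": "engineering",
        "project": "code-review-bot",
        "environment": "production",
    },
)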
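Tags only pay off once they are joined with token usage and rolled up. The following is a minimal sketch of that aggregation, assuming usage records are logged per call (for example from the `usage` field of each API response). The `UsageRecord` type and the price table are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    team: str
    project: str
    model: str
    prompt_tokens: int
    completion_tokens: int


# Hypothetical flat per-1K-token prices; real pricing usually differs
# between prompt and completion tokens.
PRICE_PER_1K = {"gpt-4o": 0.06, "gpt-4o-mini": 0.01}


def cost_by_team(records) -> dict:
    """Sum estimated spend per team from logged usage records."""
    totals = defaultdict(float)
    for r in records:
        tokens = r.prompt_tokens + r.completion_tokens
        totals[r.team] += tokens / 1000 * PRICE_PER_1K[r.model]
    return dict(totals)


if __name__ == "__main__":
    log = [
        UsageRecord("engineering", "code-review-bot", "gpt-4o", 1500, 500),
        UsageRecord("support", "faq-bot", "gpt-4o-mini", 800, 200),
    ]
    for team, cost in cost_by_team(log).items():
        print(f"{team}: ${cost:.2f}")
```

The same roll-up keyed on `project` or `model` gives the per-feature and per-model views.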
Budget Alerts
Set up automated alerts:
| Threshold | Action |
|---|---|
| 50% of monthly budget | Slack notification to team leads |
| 75% of monthly budget | Email to team + manager |
| 90% of monthly budget | Automatic model downgrade to cheaper tier |
| 100% of monthly budget | Hard stop, read-only mode |
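The escalation ladder in the table maps directly to a threshold lookup. This is a sketch only: the enum and function names are illustrative, and wiring each action to Slack, email, or the router is left out.

```python
from enum import Enum


class BudgetAction(Enum):
    NONE = "no action"
    SLACK = "Slack notification to team leads"
    EMAIL = "Email to team + manager"
    DOWNGRADE = "Automatic model downgrade to cheaper tier"
    HARD_STOP = "Hard stop, read-only mode"


# Highest threshold first so the most severe matching action wins.
THRESHOLDS = [
    (1.00, BudgetAction.HARD_STOP),
    (0.90, BudgetAction.DOWNGRADE),
    (0.75, BudgetAction.EMAIL),
    (0.50, BudgetAction.SLACK),
]


def budget_action(spend: float, monthly_budget: float) -> BudgetAction:
    """Return the action for the current month-to-date spend."""
    if monthly_budget <= 0:
        raise ValueError("monthly_budget must be positive")
    used = spend / monthly_budget
    for threshold, action in THRESHOLDS:
        if used >= threshold:
            return action
    return BudgetAction.NONE
```

Running this check on every billing update, rather than once a day, keeps the 90% downgrade from firing after the budget is already exhausted.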
Right-Sizing: The Forgotten Optimization
At the infrastructure level, the most impactful cost optimization is choosing the right resource for the workload:
- **Don’t use H100 for 7B model inference** — A10G or T4 is 5-10x cheaper
- **Don’t train on a single GPU** when data parallelism across 4 cheaper GPUs is faster AND cheaper
- **Don’t run 24/7** for services with bursty traffic — use auto-scaling to zero
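Hourly price alone does not settle a right-sizing decision; what matters is cost per unit of work. Here is a minimal sketch that converts an hourly GPU price and measured throughput into cost per million tokens. The prices, throughputs, and utilization figures in the example are hypothetical placeholders to be replaced with your own benchmarks.

```python
def cost_per_million_tokens(hourly_price_usd: float,
                            tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Serving cost per 1M tokens for one instance.

    utilization is the fraction of each paid hour spent on real traffic.
    """
    if tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("throughput must be > 0 and utilization in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_price_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical numbers for a small model on two instance types.
    big = cost_per_million_tokens(4.00, tokens_per_second=2000, utilization=0.3)
    small = cost_per_million_tokens(1.00, tokens_per_second=1000, utilization=0.6)
    print(f"large GPU: ${big:.2f} per 1M tokens")
    print(f"small GPU: ${small:.2f} per 1M tokens")
```

The utilization term is where bursty traffic shows up: an instance that sits idle most of the day multiplies the effective cost per token, which is the case for scaling to zero.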
The Cost-Conscious Engineering Culture
Sustainable AI cost management requires cultural change:
1. Make costs visible: Dashboard showing per-team, per-feature, per-model costs
2. Optimize as a review criterion: Include token efficiency in code review
3. Experiment with cheaper alternatives: A/B test smaller models before defaulting to largest
4. Set hard budgets: Unlimited spending leads to unlimited waste
Key Takeaways
- Model routing (tiered model selection) delivers the biggest cost savings
- Semantic caching can reduce inference costs roughly in proportion to its hit rate (30-50% in support applications)
- Spot instances cut training costs by 60-80% with proper checkpointing
- Prompt caching is a near-free optimization: enable it wherever requests share a long prefix
- Right-sizing resources is more impactful than any micro-optimization
The companies that manage AI costs effectively will have a structural advantage. Those that don’t will either burn through their runway or price themselves out of the market.
