Large Language Models deliver transformative capabilities, but their compute cost can be prohibitive at scale. This guide covers practical strategies to dramatically reduce LLM inference costs without sacrificing quality.
The Cost Problem
A production LLM serving 100,000 queries per day using GPT-4 Turbo ($0.01/1K input tokens, $0.03/1K output tokens) can cost $90,000/month. The same workload handled with a fine-tuned open-source model on self-hosted GPUs? Under $5,000/month. That is an 18x difference.
1. Model Selection and Tiered Routing
Not every query needs the most expensive model. Implement intelligent request routing:
- Simple queries to small models: Classification, extraction, formatting — handle with 1-7B parameter models (cost: $0.0001/query).
- Medium complexity to mid-range: Summarization, Q&A — 13-70B quantized models strike the right balance.
- Complex reasoning to frontier: Multi-step reasoning, creative coding, nuanced analysis — reserve for GPT-4, Claude 3.5 Sonnet, or Llama 3.1 405B.
Tools like RouteLLM, AI Gateway (LiteLLM), and OpenRouter implement this automatically with learned routing classifiers.
2. Prompt Caching
Most production prompts share a long system prompt or context. API providers increasingly support prompt caching:
- Anthropic Claude: Cache-aware API caches prompt prefixes at reduced cost ($0.0008/1K cached input tokens vs $0.003 standard).
- OpenAI: Automatic prompt caching for prompts over 1024 tokens, with 50% discount on cached input tokens.
- Implementation: Use deterministic prompt ordering (system message first, then context, then variable content) to maximize cache hit rates.
3. Batching and Dynamic Batching
Process multiple requests together to amortize GPU memory overhead:
- Online batching: Accumulate requests for 10-50ms, then process as a single batch. Trade slight latency for 2-4x throughput.
- Offline batching: For non-interactive workloads (data processing, evaluation), accumulate requests for hours. OpenAI Batch API offers 50% discount for 24-hour turnaround jobs.
- Throughput gains: A single H100 GPU can serve ~200 req/s for Llama 3 8B but only ~5 req/s for Llama 70B. Smaller models dramatically improve throughput per dollar.
4. Fine-tuning vs Prompting
For task-specific use cases, a fine-tuned small model often outperforms a prompted large model at 1/100th the cost:
- Fine-tuned Llama 3 8B: Via QLoRA on a single A100 for ~$50, achieves task-specific accuracy matching GPT-3.5. Inference cost: ~$0.001/query vs GPT-3.5s $0.002/query.
- Distilled models: Microsofts Phi-3 mini (3.8B) achieves 80% of GPT-3.5 performance on reasoning benchmarks at consumer-grade inference cost.
- Domain adaptation: For specialized domains (legal, medical, finance), a fine-tuned 7B model trained on domain data significantly outperforms general-purpose large models.
5. Infrastructure Optimization
Hardware choices directly impact cost per token:
- Spot/preemptible instances: AWS spot instances offer 60-90% discount for interruptible workloads. Use checkpointing for fault tolerance.
- Multi-model serving: NVIDIA Triton and vLLM support running multiple models on one GPU, improving utilization from typical 15-30% to 60-80%.
- Right-sizing clusters: Use autoscaling (KEDA + HPA) to scale GPU pods based on queue depth. Scale to zero during off-peak hours.
Practical Cost Comparison
| Approach | Cost per 1M tokens | Quality |
|---|---|---|
| GPT-4 Turbo API | ~$30 | Highest |
| Claude 3.5 Sonnet API | ~$15 | Very High |
| Fine-tuned 7B on self-hosted A100 | ~$0.50 | Task-specific high |
| Quantized 13B on spot instances | ~$0.15 | Good |
| Phi-3 on CPU inference | ~$0.02 | Moderate |
Conclusion
LLM cost optimization is a system-level challenge spanning model selection, prompt engineering, infrastructure, and deployment architecture. The biggest wins come from tiered routing (match model size to task complexity) and infrastructure optimization (batching, fine-tuning, and right-sizing). Start by auditing your query complexity distribution — most production workloads have a long tail of simple queries that can be handled by a fraction of the cost of your current approach.
