Model Serving at Scale: Production LLM Inference in 2026
Reviewed: June 4, 2026
Serving large language models in production is one of the hardest infrastructure challenges in AI. The gap between „it works on my laptop“ and „it serves 10,000 requests per second“ is enormous — and bridging it requires deep understanding of both ML and distributed systems.
The Production Serving Problem
LLM inference is fundamentally different from traditional web serving:
- **Memory-bound, not compute-bound**: GPUs are limited by VRAM, not FLOPs
- **Variable input/output length**: Request sizes vary by 10-100x
- **Autoregressive generation**: Each token depends on all previous tokens
- **Model size**: 70B+ parameter models require multiple GPUs or heavy quantization
The Serving Framework Landscape
vLLM: The Industry Standard
vLLM’s PagedAttention (inspired by OS virtual memory) revolutionized LLM serving by eliminating memory waste from padding and fragmentation. Key features:
- **PagedAttention**: 90%+ GPU memory utilization vs. 30-40% in naive serving
- **Continuous batching**: Interleave requests for maximum throughput
- **Speculative decoding**: Use small draft models for faster verification
- **OpenAI-compatible API**: Drop-in replacement compatibility
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct",
tensor_parallel_size=4,
gpu_memory_utilization=0.95,
max_model_len=32768)
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=512))
TensorRT-LLM: NVIDIA’s Performance King
TensorRT-LLM delivers the highest raw throughput on NVIDIA hardware through aggressive kernel fusion, quantization, and optimization:
- **FP4/FP8 inference**: Near-lossless 4-8 bit quantization on Hopper GPUs
- **In-flight batching**: Dynamic request scheduling without static batch sizes
- **Multi-GPU serving**: Optimized all-reduce and pipeline parallelism
llama.cpp: CPU and Edge Serving
Not every deployment needs GPUs. llama.cpp makes CPU inference practical:
- **GGUF format**: Unified quantized model format
- **Metal/CUDA/Vulkan backends**: Unified code across Apple, NVIDIA, and AMD
- **Speculative decoding**: Works on CPU with small draft models
Throughput Optimization Strategies
1. Continuous Batching
Static batching (waiting for N requests before processing) wastes GPU cycles. Continuous batching inserts new requests as soon as running requests complete:
Static batching: [req1 req2 req3] → wait → [req4 req5 req6] → wait → ...
Continuous batching: [req1 req2 req3] → [req2 req3 req4] → [req3 req4 req5] → ...
2. KV-Cache Optimization
The key-value cache dominates memory usage in long-sequence generation. Techniques:
- **PagedAttention** (vLLM): Virtual memory for KV-cache
- **Multi-Query Attention**: Share KV heads across query heads
- **Sliding window attention**: Limit KV-cache to recent tokens
- **Cross-layer KV sharing**: Reuse KV pairs across transformer layers
3. Speculative Decoding
A small „draft“ model generates candidate tokens, and the large model verifies them in parallel:
Draft model (7B): token_1, token_2, token_3, token_4
Target model (70B): verify [token_1-4] → accept 3, reject 4, generate new_4
Accept rates of 60-80% are typical, yielding 2-3x speedups.
4. Model Quantization
Quantization reduces model precision to save memory and increase throughput:
| Format | Bits | Quality | Speedup | Use Case |
|---|---|---|---|---|
| FP16 | 16 | Baseline | 1x | Maximum quality |
| INT8 | 8 | ~99% | 1.5-2x | Production default |
| INT4 | 4 | ~97% | 2-3x | Cost-sensitive |
| FP4 | 4 | ~98% | 2-4x | Hopper GPUs |
Scaling Architecture
┌─────────────────────┐
│ Load Balancer │
│ (Round-robin or │
│ least-loaded) │
└──────────┬──────────┘
│
┌───────────────┼───────────────┐
│ │ │
┌──────▼─────┐ ┌─────▼──────┐ ┌─────▼──────┐
│ vLLM │ │ vLLM │ │ vLLM │
│ Replica 1 │ │ Replica 2 │ │ Replica N │
│ 4x A100 │ │ 4x A100 │ │ 4x A100 │
└────────────┘ └────────────┘ └────────────┘
│ │ │
┌──────▼───────────────▼───────────────▼──────┐
│ Shared Storage (Model Weights) │
│ /models/llama-3.1-70b-int4/ │
└─────────────────────────────────────────────┘
Cost Management
At scale, serving costs dominate AI budgets:
- **A100 80GB**: ~$2/hr = ~$0.015 per 1K tokens (continuous batching)
- **H100 80GB**: ~$3/hr = ~0.008 per 1K tokens (FP8 + speculative)
- **Spot/preemptible**: 60-80% savings for fault-tolerant workloads
- **Model routing**: Send simple queries to smaller, cheaper models
Key Takeaways
- vLLM is the safest default for production serving; TensorRT-LLM for max throughput
- Continuous batching and PagedAttention are non-negotiable for production
- Speculative decoding delivers 2-3x speedups with minimal quality loss
- Quantization (INT4/FP4) halves costs with <3% quality degradation
- Design for horizontal scaling from day one
The teams that master LLM serving infrastructure will have an insurmountable advantage. Those that don’t will be paying 5-10x more for the same capability.
