MLOps Inference Optimization: Production Patterns for 2026
Reviewed: June 4, 2026
Reference Guide | Updated: May 2026
This reference page captures the latest best practices for deploying and optimizing LLMs in production, covering vLLM, llama.cpp, quantization, and structured output patterns. Updated for the current state of the ecosystem as of May 2026.
vLLM Deployment Patterns
Basic Production Serving
# Install vLLM
pip install vllm
# Serve with OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server
--model meta-llama/Llama-4-Scout-17B-16E-Instruct
--tensor-parallel-size 1
--max-model-len 32768
--gpu-memory-utilization 0.90
--port 8000
Key vLLM Features (2026)
- PagedAttention: Automatic memory management — eliminates OOM for long sequences.
- Continuous Batching: Groups incoming requests dynamically for maximum throughput.
- Speculative Decoding: 2-3x speedup using draft models. Enable with
--speculative-model. - Prefix Caching: Reuse KV cache for shared prompt prefixes — essential for RAG and few-shot workloads.
- Structured Output: Built-in guided decoding via
--guided-decoding-backend outlines. - Multi-LoRA: Serve multiple LoRA adapters from a single base model. Use
--lora-adapters.
Quantized Model Serving
# Serve AWQ-quantized model
python -m vllm.entrypoints.openai.api_server
--model TheBloke/Llama-4-Scout-AWQ
--quantization awq
--dtype float16
# Serve GPTQ-quantized model
python -m vllm.entrypoints.openai.api_server
--model Qwen/Qwen3-32B-GPTQ-Int4
--quantization gptq
--dtype float16
llama.cpp and GGUF Patterns
Building llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)
Serving GGUF Models
./build/bin/llama-server
--model models/llama-4-scout-17b-q4_k_m.gguf
--ctx-size 32768
--parallel 4
--host 0.0.0.0 --port 8080
--cont-batching
Recommended Quantization Levels
| Quant | Size (7B) | Quality | Use Case |
|---|---|---|---|
| Q2_K | 2.8GB | Basic | Edge, low-resource |
| Q4_K_M | 4.3GB | Good | Default for most tasks |
| Q5_K_M | 5.0GB | Very Good | Quality-critical |
| Q8_0 | 7.2GB | Excellent | Near-full quality |
| F16 | 14GB | Full | Fine-tuning, benchmarks |
Quantizing Your Own Models
# Convert HuggingFace model to GGUF
python convert_hf_to_gguf.py
/path/to/model
--outfile model-f16.gguf
--outtype f16
# Quantize to Q4_K_M
./build/bin/llama-quantize
model-f16.gguf
model-q4_k_m.gguf
Q4_K_M
Structured Output
Outlines Integration
from outlines.models import llamacpp
from outlines import generate
model = llamacpp(
"models/llama-4-scout-q4_k_m.gguf",
n_ctx=32768
)
# Generate valid JSON
generator = generate.json(model, '{"name": "string", "age": "int"}')
result = generator("Generate a person")
# Result: {"name": "Alice", "age": 30}
# Generate matching a regex
generator = generate.regex(model, r"[A-Z]{3}-d{4}")
result = generator("Product code:")
# Result: "ABC-1234"
Guidance (Microsoft Research)
from guidance import models, gen
lm = models.LlamaCpp(
model="models/llama-4-scout-q4_k_m.gguf",
n_ctx=32768,
n_gpu_layers=35
)
lm += f"Extract entities from: {text}n"
lm += "Entities:" + gen(
name="entities",
stop="n",
max_tokens=200
)
Cost Optimization
Model Right-Sizing Guide
| Task | Recommended Size | Notes |
|---|---|---|
| Classification | 0.5B–3B | Fast, accurate for simple labels |
| RAG Retrieval | 7B–13B | Good balance of quality and speed |
| Code Generation | 14B–34B | CodeLlama, Qwen-Coder, DeepSeek Coder |
| Summarization | 7B–13B | Sufficient for most summarization |
| Complex Reasoning | 34B–70B+ | DeepSeek, Qwen-72B, Llama-4 Maverick |
Multi-Model Routing Pattern
# Route by complexity to optimize cost
def route_and_generate(prompt):
complexity = estimate_complexity(prompt)
if complexity == "low":
return small_model.generate(prompt) # 7B, $0.02/1M tokens
elif complexity == "medium":
return medium_model.generate(prompt) # 30B, $0.10/1M tokens
else:
return large_model.generate(prompt) # 70B+, $0.40/1M tokens
# Savings: 60-80% cost reduction vs. always using large model
Monitoring Checklist
- GPU Utilization: Target 70-85%. Below 50% → increase batch size or reduce model.
- TTFT (Time to First Token): Target <500ms for chat, <2s for complex.
- Throughput: Monitor tokens/second per GPU. Benchmark monthly.
- Error Rate: Track timeouts, OOM, and malformed responses. Target <0.1%.
- Cost per Request: (GPU hourly cost × hours) / requests. Optimize weekly.
