GPU Optimization for AI Workloads: Memory, Speed & Cost
Reviewed: June 4, 2026
GPU resources are the most expensive and constrained component in AI systems. Optimizing GPU utilization isn’t just an engineering challenge — it’s a business imperative. Every percentage point of wasted GPU memory is dollars burned.
Understanding GPU Memory Architecture
The Memory Hierarchy
┌─────────────────────────────────────────────┐
│ HBM (High Bandwidth Memory) │
│ A100: 80GB @ 2TB/s │
│ H100: 80GB @ 3.35TB/s │
│ B100: 192GB @ 8TB/s │
├─────────────────────────────────────────────┤
│ L2 Cache │
│ A100: 40MB │
│ H100: 50MB │
├─────────────────────────────────────────────┤
│ Shared Memory / L1 Cache │
│ Per SM: 256KB combined, up to 228KB as shared memory (H100) │
├─────────────────────────────────────────────┤
│ Registers │
│ Per SM: 65,536 (H100) │
└─────────────────────────────────────────────┘
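The key insight: HBM bandwidth is usually the bottleneck for inference, especially autoregressive decoding at small batch sizes, where every generated token requires streaming the model weights from HBM. Training with large batches reuses each weight many times per read, so compute (Tensor Cores) is more often the bottleneck there.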
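To make the bandwidth claim concrete, here is a rough lower bound on per-token decode latency. It is a sketch under stated assumptions: batch size 1, every weight read from HBM exactly once per generated token, and KV-cache traffic and kernel overheads ignored. The function name and the 7B/FP16 example model are illustrative; the bandwidth figures come from the hierarchy above.

```python
def min_decode_latency_ms(params_billion: float, bytes_per_param: float,
                          hbm_tb_per_s: float) -> float:
    """Lower bound on per-token latency when decoding is bandwidth-bound."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    seconds = weight_bytes / (hbm_tb_per_s * 1e12)
    return seconds * 1e3


# Hypothetical 7B-parameter model stored in FP16 (2 bytes per parameter)
a100_ms = min_decode_latency_ms(7, 2, 2.0)    # A100: 2 TB/s
h100_ms = min_decode_latency_ms(7, 2, 3.35)   # H100: 3.35 TB/s
print(f"A100: {a100_ms:.2f} ms/token, H100: {h100_ms:.2f} ms/token")
```

Under these assumptions the bound is about 7 ms per token on an A100 and about 4.2 ms on an H100, regardless of how many FLOPs the GPU can deliver.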
Memory Optimization Techniques
1. Gradient Checkpointing (Training)
Trade compute for memory by recomputing activations during the backward pass instead of storing them:
- **Standard training**: Store all activations → O(n) memory
- **Gradient checkpointing**: Store every √n-th activation → O(√n) memory
- **Cost**: Roughly one extra forward pass (~33% more compute time); the freed memory can allow much larger batch sizes (5-10x in favorable cases)
from torch.utils.checkpoint import checkpoint
# Instead of: output = layer(input)
output = checkpoint(layer, input, use_reentrant=False)
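The O(n) versus O(√n) trade-off can be checked with a small counting sketch. It assumes every layer produces an activation of the same size and that checkpoints are placed at evenly spaced segment boundaries; `activations_stored` is a hypothetical helper, not part of PyTorch.

```python
import math


def activations_stored(n_layers: int, checkpointing: bool) -> int:
    """Peak number of layer activations held in memory at once."""
    if not checkpointing:
        return n_layers                      # keep every layer's output
    segment = math.isqrt(n_layers) or 1      # about sqrt(n) layers per segment
    n_checkpoints = math.ceil(n_layers / segment)
    # Saved segment boundaries plus the one segment being recomputed in backward
    return n_checkpoints + segment


for n in (16, 64, 100):
    print(n, activations_stored(n, False), activations_stored(n, True))
```

For a 64-layer model this gives 16 stored activations instead of 64, which is where the memory headroom for larger batches comes from.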
2. Mixed Precision Training
Use FP16/BF16 for most operations, FP32 for critical accumulations:
- **BF16**: Preferred for training (same exponent range as FP32, no loss scaling needed)
- **FP16**: Requires gradient scaling to prevent underflow
- **FP8**: Emerging standard on Hopper GPUs for both training and inference
import torch

# BF16 autocast: no GradScaler is needed because BF16 keeps FP32's exponent range.
# With dtype=torch.float16, wrap backward()/step() in a GradScaler instead.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = model(input)
    loss = criterion(output, target)
loss.backward()
optimizer.step()
optimizer.zero_grad()
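Why FP16 needs gradient scaling can be shown without a GPU. The sketch below uses Python's `struct` half-precision format to round values to FP16; the gradient value and the scale factor of 2^16 are illustrative assumptions, and this is not how a real loss scaler is implemented.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8        # hypothetical small gradient value
loss_scale = 65536.0    # 2**16, an example scale factor

# Stored directly, the gradient is below FP16's smallest subnormal and becomes 0
unscaled = to_fp16(tiny_grad)

# Scaled before the FP16 round-trip, then unscaled in full precision
recovered = to_fp16(tiny_grad * loss_scale) / loss_scale

print(unscaled, recovered)
```

The unscaled value is flushed to zero, while the scaled path recovers the gradient to within FP16's relative precision. BF16 avoids the problem because its exponent range matches FP32.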
3. Activation Offloading
Move activations to CPU RAM during forward pass, transfer back for backward pass:
- Enables training models 2-3x larger than GPU memory
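- PCIe bandwidth is the bottleneck (about 64 GB/s per direction on a PCIe Gen5 x16 link)
- Standard NVLink/NVSwitch connects GPUs to each other, not to host RAM, so it does not speed up CPU offloading; CPU-GPU links such as NVLink-C2C on Grace Hopper systems offer much higher bandwidth than PCIe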
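A quick estimate shows when offloading is practical. This sketch assumes the 64 GB/s PCIe figure above is fully achievable and that transfers can be overlapped with compute in the ideal case; the function names and the 32 GB example are hypothetical.

```python
def offload_round_trip_s(activation_gb: float, link_gb_per_s: float = 64.0) -> float:
    """Time to copy activations to host RAM and back (one transfer each way)."""
    return 2 * activation_gb / link_gb_per_s


def offload_hidden(activation_gb: float, step_compute_s: float,
                   link_gb_per_s: float = 64.0) -> bool:
    """True if the transfers could, in principle, be fully overlapped with compute."""
    return offload_round_trip_s(activation_gb, link_gb_per_s) <= step_compute_s


transfer_s = offload_round_trip_s(32)               # 32 GB of activations
hidden = offload_hidden(32, step_compute_s=0.5)     # hypothetical 0.5 s step
print(transfer_s, hidden)
```

With 32 GB of activations the round trip takes about one second, so a training step that computes for only 0.5 seconds would stall on the link.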
4. Flash Attention
Flash Attention reduces memory from O(n²) to O(n) for sequence length n:
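- **Standard attention**: Materializes the n×n score matrix → at 100K tokens that is 10^10 entries, about 40GB per head in FP32 (20GB in FP16)
- **Flash Attention**: Tiled computation in on-chip SRAM, never materializes the full matrix → extra memory grows linearly with n (on the order of 400MB at 100K tokens, depending on head count and dimension)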
- **Speedup**: 2-4x faster due to reduced HBM reads/writes
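The idea that makes tiling possible is the online softmax: a running maximum and running sum let each block of scores be processed once and discarded. The sketch below shows this for a single query row with scalar values, in plain Python. It illustrates the numerical trick only and is not the actual fused GPU kernel; the function names and block size are assumptions.

```python
import math
import random


def naive_attention_row(scores, values):
    """Softmax-weighted sum that needs the whole score row at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def tiled_attention_row(scores, values, block=4):
    """Same result, computed block by block with O(block) working memory."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale earlier partial results to the new maximum (exp(-inf) == 0.0)
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc += w * v
        running_max = new_max
    return acc / running_sum


rng = random.Random(0)
scores = [rng.uniform(-5, 5) for _ in range(37)]
values = [rng.uniform(-1, 1) for _ in range(37)]
diff = abs(naive_attention_row(scores, values)
           - tiled_attention_row(scores, values, block=8))
print(diff)
```

Both functions agree up to floating-point rounding, but the tiled version never holds more than one block of scores at a time.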
Multi-GPU Strategies
Data Parallelism
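Each GPU holds a full copy of the model and processes a different data batch. This is the simplest strategy to implement, but it replicates weights, gradients, and optimizer state on every GPU, so it does nothing to help a model that does not fit on one device.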
Tensor Parallelism
Split individual layers across GPUs. Each GPU holds a fraction of each layer. Best for intra-node communication (NVLink).
Pipeline Parallelism
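Assign different layers to different GPUs. Each GPU holds a contiguous chunk of the model. Only the activations at stage boundaries cross GPUs, so this strategy tolerates slower inter-node links better than tensor parallelism.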
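The memory effect of each strategy on the weights alone can be estimated as follows. This is a simplified sketch that assumes weights split evenly across tensor-parallel and pipeline-parallel ranks and ignores gradients, optimizer state, and activations; the 70B model and the function name are illustrative.

```python
def weights_per_gpu_gb(params_billion: float, bytes_per_param: float,
                       tp: int = 1, pp: int = 1) -> float:
    """Weight memory per GPU; data parallelism alone never reduces it."""
    total_gb = params_billion * bytes_per_param   # 1e9 params * bytes / 1e9
    return total_gb / (tp * pp)


# Hypothetical 70B-parameter model in BF16 (2 bytes per parameter)
dp_only = weights_per_gpu_gb(70, 2)               # data parallel only
tp4 = weights_per_gpu_gb(70, 2, tp=4)             # 4-way tensor parallel
tp4_pp4 = weights_per_gpu_gb(70, 2, tp=4, pp=4)   # tensor + pipeline parallel
print(dp_only, tp4, tp4_pp4)
```

The weights drop from 140 GB per GPU to 35 GB with 4-way tensor parallelism and to 8.75 GB when 4 pipeline stages are added.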
The Winning Combination: 3D Parallelism
Production training uses all three simultaneously:
Node 1: [TP group: GPU 0-3] → Pipeline Stage 1
Node 2: [TP group: GPU 0-3] → Pipeline Stage 2
Node 3: [TP group: GPU 0-3] → Pipeline Stage 3
Node 4: [TP group: GPU 0-3] → Pipeline Stage 4
DP: replicate this 4-node pipeline on additional node groups (e.g., Nodes 5-8), each processing different data
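One way to read the layout above is as a mapping from a global rank to three coordinates. The sketch below assumes tensor-parallel ranks vary fastest (so a TP group stays inside one node), then pipeline stages, then data-parallel replicas. This ordering is an assumption for illustration; real frameworks define their own rank layouts.

```python
def rank_to_coords(rank: int, tp: int, pp: int, dp: int) -> dict:
    """Map a global rank to data, pipeline, and tensor parallel indices."""
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank out of range")
    return {
        "dp": rank // (tp * pp),   # which pipeline replica
        "pp": (rank // tp) % pp,   # which pipeline stage (a node in the diagram)
        "tp": rank % tp,           # which GPU inside the tensor-parallel group
    }


# 4 GPUs per TP group, 4 pipeline stages, 2 data-parallel replicas = 32 GPUs
for r in (0, 5, 16, 31):
    print(r, rank_to_coords(r, tp=4, pp=4, dp=2))
```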
Quantization for Inference
Post-Training Quantization (PTQ)
Quantize a trained model without retraining:
- **GPTQ**: Layer-wise quantization with second-order information
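- **AWQ**: Activation-aware weight quantization; protects the small set of weights that matter most according to activation statistics
- **GGUF**: llama.cpp file format that stores block-wise quantized weights, with the quantization type selectable per tensor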
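For orientation, the simplest form of post-training quantization is plain round-to-nearest with one scale per block of weights. The sketch below shows that baseline only; it is not GPTQ or AWQ, which add error compensation and activation-aware scaling on top. The function names and sample weights are hypothetical.

```python
def quantize_block(weights, bits=4):
    """Symmetric round-to-nearest quantization with one scale per block."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for INT4
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_block(q, scale):
    """Recover approximate weights from integers and the block scale."""
    return [qi * scale for qi in q]


block = [0.12, -0.5, 0.33, 0.07, -0.21, 0.44, -0.02, 0.5]
q, scale = quantize_block(block)
restored = dequantize_block(q, scale)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(q, scale, max_err)
```

The worst-case error per weight is half the scale, which is why a single outlier in a block hurts every other weight in it and why smaller blocks or smarter methods improve quality.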
Quantization-Aware Training (QAT)
Fine-tune with quantization simulated during training:
- Better quality than PTQ at the same bit width
- Requires training infrastructure and time
- Often necessary to keep quality acceptable at sub-4-bit precision
Cost Optimization Strategies
1. Right-Size Your GPUs
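- **Inference**: A10G (24GB) is often sufficient for 7B models in FP16, or 13B models with 8-bit or 4-bit quantization
- **Fine-tuning**: A single A100 (80GB) can fine-tune a 70B model with LoRA only if the base weights are quantized (e.g., 4-bit QLoRA); 16-bit LoRA on 70B needs multiple GPUs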
- **Training**: H100 clusters for 100B+ models
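A first-pass sizing check only needs parameter count and precision. The sketch below assumes a flat 20% allowance on top of the weights for KV cache, activations, and runtime overhead; that factor is a placeholder assumption, and real headroom depends on batch size and context length.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB."""
    return params_billion * bits_per_param / 8


def fits(params_billion: float, bits_per_param: int, gpu_gb: float,
         overhead: float = 1.2) -> bool:
    """overhead is an assumed allowance for KV cache, activations, and runtime."""
    return weight_memory_gb(params_billion, bits_per_param) * overhead <= gpu_gb


print(fits(7, 16, 24))    # 7B in FP16 on a 24GB GPU
print(fits(13, 16, 24))   # 13B in FP16 on a 24GB GPU
print(fits(13, 4, 24))    # 13B at 4 bits on a 24GB GPU
print(fits(70, 4, 80))    # 70B at 4 bits on an 80GB GPU
```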
2. Spot/Preemptible Instances
- 60-80% cheaper than on-demand
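- Save model and optimizer state to durable storage at regular intervals so a preempted job can resume (this is separate from gradient checkpointing)
- Best suited to training jobs that can afford to lose the work done since the last saved checkpoint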
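The resume pattern can be sketched in a few lines. This toy version stores only a step counter in a JSON file and simulates preemption with a parameter; a real job would save model and optimizer state to durable storage. All names here are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}


def train(path, total_steps, ckpt_every, preempt_at=None):
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if preempt_at is not None and state["step"] == preempt_at:
            return state["step"]     # simulated preemption: unsaved progress is lost
        state["step"] += 1           # stand-in for one training step
        if state["step"] % ckpt_every == 0:
            save_checkpoint(path, state)
    return state["step"]


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    reached = train(ckpt, total_steps=100, ckpt_every=10, preempt_at=37)
    resumed_from = load_checkpoint(ckpt)["step"]
    finished = train(ckpt, total_steps=100, ckpt_every=10)
    print(reached, resumed_from, finished)
```

The checkpoint interval sets the trade-off: shorter intervals lose less work per preemption but spend more time writing state.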
3. Multi-Tenant Serving
Share GPU resources across multiple models using:
- NVIDIA MPS (Multi-Process Service)
- GPU time-slicing in Kubernetes (for example, through the NVIDIA device plugin)
- Model switching with warm pools
4. Dynamic Batching
Group variable-length requests to maximize GPU utilization:
- Naive (static) batching: typically 30-50% of compute wasted on padding
- Dynamic batching: 5-10% waste
- Continuous batching: <2% waste
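Padding waste is easy to measure for a given request mix. The sketch below compares batching requests in arrival order against grouping them by length, which is one simple way to reduce padding and not a model of what a production serving framework does. The request lengths are made up for illustration.

```python
def make_batches(lengths, batch_size, sort_by_length=False):
    """Split request lengths into batches, optionally grouping similar lengths."""
    seq = sorted(lengths) if sort_by_length else list(lengths)
    return [seq[i:i + batch_size] for i in range(0, len(seq), batch_size)]


def padding_waste(batches):
    """Fraction of processed tokens that are padding."""
    padded = sum(len(b) * max(b) for b in batches)   # every row padded to the longest
    real = sum(sum(b) for b in batches)
    return 1 - real / padded


# Hypothetical mix of short and long requests, in tokens
lengths = [10, 500, 20, 480, 15, 510, 12, 490]
naive = padding_waste(make_batches(lengths, batch_size=4))
bucketed = padding_waste(make_batches(lengths, batch_size=4, sort_by_length=True))
print(f"naive: {naive:.1%}, length-bucketed: {bucketed:.1%}")
```

On this mix, arrival-order batching wastes roughly half the tokens on padding while length grouping wastes a few percent. Continuous batching goes further by admitting and retiring requests at each decoding step.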
Key Takeaways
- Memory capacity and bandwidth, not raw compute, are usually the first constraints you hit
- Flash Attention is essential for long-context workloads
- 3D parallelism (data + tensor + pipeline) is the standard for large model training
- Quantization (INT4/FP8) can roughly halve inference costs, usually with minimal quality loss
- Right-sizing GPU selection often saves more than any single optimization trick
The difference between optimized and unoptimized GPU usage is often 3-5x in cost. In a world where AI compute is the primary cost center, GPU optimization is one of the highest-ROI engineering investments.
