Production AI agents fail in unpredictable ways. LLMs hallucinate, tools time out, APIs change, and edge cases emerge that no amount of testing could have anticipated. The difference between a demo agent and a production-grade agent is error handling. This guide covers proven patterns for building resilient AI agents that gracefully recover from failures.
The AI Agent Failure Landscape
Before diving into solutions, let us understand what can go wrong:
- LLM Errors: Hallucinations, malformed outputs, context overflow, refusals
- Tool Failures: API timeouts, rate limits, schema changes, authentication expiry
- Logic Errors: Infinite loops, incorrect reasoning chains, state corruption
- Environmental Errors: Network issues, resource exhaustion, permission errors
Pattern 1: Structured Output Validation
Always validate LLM outputs before using them. Use structured output formats (JSON mode, function calling) and validate against schemas.
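```python
import json
import logging
from typing import Optional

from pydantic import BaseModel, ValidationError

logger = logging.getLogger(__name__)

class ToolCall(BaseModel):
    tool_name: str
    arguments: dict

def validate_llm_output(raw_output: str) -> Optional[ToolCall]:
    """Parse raw LLM text into a ToolCall, or return None if it is invalid."""
    try:
        parsed = json.loads(raw_output)
        return ToolCall(**parsed)
    except (json.JSONDecodeError, ValidationError, TypeError) as e:
        # TypeError covers valid JSON that is not an object (e.g. a list).
        logger.error("Invalid LLM output: %s", e)
        return None
```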
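Returning `None` leaves the caller to decide what happens next. One common option is to re-prompt the model with a reminder of the expected format. The sketch below illustrates that loop using only the standard library; `ask_model` is a hypothetical callable standing in for whatever LLM client you use, and the reminder wording and the limit of three attempts are assumptions rather than recommendations.

```python
import json
from typing import Callable, Optional


def parse_tool_call(raw: str) -> Optional[dict]:
    """Return the parsed tool call, or None if the text is not a valid one."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if not isinstance(data.get("tool_name"), str):
        return None
    if not isinstance(data.get("arguments"), dict):
        return None
    return data


def get_valid_tool_call(
    ask_model: Callable[[str], str],  # hypothetical: prompt in, raw text out
    prompt: str,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Ask the model until it returns a valid tool call or attempts run out."""
    current_prompt = prompt
    for _ in range(max_attempts):
        parsed = parse_tool_call(ask_model(current_prompt))
        if parsed is not None:
            return parsed
        # Re-ask with an explicit reminder of the expected shape.
        current_prompt = (
            f"{prompt}\n\nYour previous reply was not valid. "
            'Reply only with JSON of the form {"tool_name": "...", "arguments": {...}}.'
        )
    return None


# Demo with a fake model that answers badly once, then correctly.
replies = iter(["not json", '{"tool_name": "search", "arguments": {"q": "weather"}}'])
result = get_valid_tool_call(lambda _prompt: next(replies), "Find the weather")
```

Capping the number of attempts matters here: an unbounded repair loop is itself one of the infinite-loop failures listed earlier.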
Pattern 2: Retry with Exponential Backoff
Transient failures are common. Implement retry logic with exponential backoff for tool calls and LLM invocations.
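```python
# Assumes the openai>=1.0 client, which raises its own timeout and rate-limit errors.
import openai
from openai import APITimeoutError, RateLimitError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    # Retry only transient errors; anything else propagates immediately.
    retry=retry_if_exception_type((APITimeoutError, RateLimitError)),
)
def call_llm_with_retry(prompt: str, model: str = "gpt-4"):
    return openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
```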
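To make the mechanics visible, here is a minimal standard-library sketch of the same idea. It is not how tenacity is implemented; the function name `retry_with_backoff`, its parameters, and the use of random jitter are assumptions chosen for illustration. The `sleep` argument is injectable so the loop can be exercised without actually waiting.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
    max_attempts: int = 3,
    base_delay: float = 2.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exceptions with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The cap doubles each attempt: base, 2*base, 4*base, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter: sleep a random amount up to the cap so that many
            # clients do not all retry at the same instant.
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")


# Demo: a call that times out twice before succeeding.
demo_attempts = {"count": 0}


def flaky_tool() -> str:
    demo_attempts["count"] += 1
    if demo_attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "done"


demo_delays = []
demo_result = retry_with_backoff(flaky_tool, sleep=demo_delays.append)
```

Errors outside `retry_on` are deliberately not retried: repeating a call that failed because of a bad argument or an expired credential only adds delay.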
Pattern 3: Graceful Degradation
When a primary capability fails, fall back to a simpler but functional alternative.
| Primary | Fallback | Use Case |
|---|---|---|
| GPT-4 | GPT-3.5-turbo | Cost/rate-limit issues |
| Real API | Cached data | API downtime |
| Full context | Summary | Context window overflow |
| External search | Pre-indexed docs | Search API failure |
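Each row of the table can be expressed as an ordered list of providers that are tried in turn. The following is a minimal sketch of that idea; `first_successful`, `live_api`, and `cached_data` are hypothetical names, and a real agent would likely catch narrower exception types than `Exception`.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def first_successful(providers: Sequence[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Try each (name, provider) in order and return the first result.

    The provider name is returned with the result so the caller can tell
    when it is working with degraded (fallback) data.
    """
    errors: List[str] = []
    for name, provider in providers:
        try:
            return name, provider()
        except Exception as exc:  # broad on purpose for this sketch
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("All providers failed: " + "; ".join(errors))


# Demo: the primary source is down, so the cached copy is used instead.
def live_api() -> dict:
    raise ConnectionError("simulated API downtime")


def cached_data() -> dict:
    return {"temperature_c": 21, "stale": True}


source, payload = first_successful([("live_api", live_api), ("cache", cached_data)])
```

Returning which provider answered is a deliberate choice: downstream steps, or the user, may need to know that the answer came from a cache or a smaller model.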
Pattern 4: Circuit Breaker
Prevent cascading failures by temporarily disabling failing components.
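```python
import time

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.timeout = recovery_timeout
        self.state = "CLOSED"
        self.last_failure = 0

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.time() - self.last_failure > self.timeout:
                # Recovery window elapsed: let one trial call through.
                self.state = "HALF_OPEN"
            else:
                raise CircuitOpenError("Circuit is OPEN")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.last_failure = time.time()
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self.state == "HALF_OPEN" or self.failures >= self.threshold:
                self.state = "OPEN"
            raise
        # Any success closes the circuit and resets the consecutive-failure count.
        self.state = "CLOSED"
        self.failures = 0
        return result
```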
Pattern 5: Human-in-the-Loop Escalation
When automated recovery fails, escalate to a human operator. Define clear escalation criteria:
- N consecutive failures on the same step
- Confidence score below threshold
- High-stakes decision (financial transaction, data deletion)
- User explicitly requests human review
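These criteria are easier to audit when they live in one function rather than being scattered through the agent loop. The sketch below is one possible encoding; `StepContext`, `escalation_reason`, and the default thresholds (3 failures, 0.7 confidence) are assumptions for illustration, not recommended values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepContext:
    """Signals collected for one agent step (hypothetical structure)."""
    consecutive_failures: int = 0
    confidence: Optional[float] = None  # None when the agent reports no score
    high_stakes: bool = False
    user_requested_review: bool = False


def escalation_reason(
    ctx: StepContext,
    max_failures: int = 3,
    min_confidence: float = 0.7,
) -> Optional[str]:
    """Return why this step should go to a human, or None to continue."""
    if ctx.user_requested_review:
        return "user requested human review"
    if ctx.high_stakes:
        return "high-stakes decision"
    if ctx.consecutive_failures >= max_failures:
        return f"{ctx.consecutive_failures} consecutive failures on the same step"
    if ctx.confidence is not None and ctx.confidence < min_confidence:
        return f"confidence {ctx.confidence:.2f} below {min_confidence:.2f}"
    return None


# Demo: a routine step proceeds, a data deletion is escalated.
routine = escalation_reason(StepContext(confidence=0.9))
deletion = escalation_reason(StepContext(confidence=0.9, high_stakes=True))
```

Returning a reason string instead of a boolean gives the human operator context for why the agent stopped.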
Pattern 6: State Checkpointing
Save agent state at key checkpoints so failures can be recovered from intermediate points, not just the beginning.
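```python
from datetime import datetime, timezone

class CheckpointedAgent:
    def __init__(self, agent_id, state_store):
        # Assumed constructor: state_store is any object that provides
        # save(record) and get_latest(agent_id).
        self.id = agent_id
        self.state_store = state_store

    def checkpoint(self, step, state):
        """Save state to persistent storage."""
        self.state_store.save({
            "agent_id": self.id,
            "step": step,
            "state": state,
            "timestamp": datetime.now(timezone.utc),
        })

    def recover(self):
        """Recover from last checkpoint."""
        last = self.state_store.get_latest(self.id)
        if last:
            return last["state"]
        return None
```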
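To show how recovery from an intermediate point might look end to end, here is a self-contained sketch with an in-memory store. `InMemoryStore` and `run_steps` are hypothetical names; a production store would persist to a database or disk, and the sketch assumes each step is a function that takes the current state and returns a new one.

```python
import copy
from typing import Callable, Dict, List, Optional


class InMemoryStore:
    """Stand-in for persistent storage; keeps only the latest checkpoint."""

    def __init__(self) -> None:
        self._latest: Dict[str, dict] = {}

    def save(self, record: dict) -> None:
        # Deep-copy so later mutation of the state cannot alter the checkpoint.
        self._latest[record["agent_id"]] = copy.deepcopy(record)

    def get_latest(self, agent_id: str) -> Optional[dict]:
        return copy.deepcopy(self._latest.get(agent_id))


def run_steps(agent_id: str, steps: List[Callable[[dict], dict]], store: InMemoryStore) -> dict:
    """Run steps in order, resuming after the last completed checkpoint."""
    last = store.get_latest(agent_id)
    start = last["step"] + 1 if last else 0
    state = last["state"] if last else {}
    for index in range(start, len(steps)):
        state = steps[index](state)
        # Checkpoint only after the step has finished successfully.
        store.save({"agent_id": agent_id, "step": index, "state": state})
    return state


# Demo: the second step fails once; the rerun does not repeat the first step.
calls: List[str] = []
flaky = {"fail_once": True}


def fetch(state: dict) -> dict:
    calls.append("fetch")
    return {**state, "data": [1, 2, 3]}


def summarize(state: dict) -> dict:
    calls.append("summarize")
    if flaky["fail_once"]:
        flaky["fail_once"] = False
        raise TimeoutError("simulated tool timeout")
    return {**state, "total": sum(state["data"])}


store = InMemoryStore()
try:
    run_steps("agent-1", [fetch, summarize], store)
except TimeoutError:
    pass  # in practice: log, back off, then resume
final_state = run_steps("agent-1", [fetch, summarize], store)
```

Resuming this way assumes that re-running the failed step is safe. Steps with side effects, such as sending an email or charging a card, need to be idempotent or guarded separately.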
Pattern 7: Timeout Guards
Never let an agent run indefinitely. Set timeouts at multiple levels:
- Per-step timeout: Maximum time for a single tool call or reasoning step
- Per-agent timeout: Maximum total time for the entire agent execution
- Per-workflow timeout: For multi-agent workflows, an overall deadline
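The first two levels can be combined in a single loop: each step gets its own limit, and the remaining overall budget shrinks as steps complete. The sketch below uses a worker thread from the standard library; `run_with_deadlines` and its parameters are hypothetical. One important caveat is that a Python thread cannot be forcibly stopped, so a timed-out step keeps running in the background; the guard only stops the agent from waiting on it.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, List


def run_with_deadlines(
    steps: List[Callable[[], Any]],
    step_timeout: float,
    total_timeout: float,
) -> List[Any]:
    """Run steps in order under a per-step limit and an overall deadline."""
    deadline = time.monotonic() + total_timeout
    results: List[Any] = []
    executor = ThreadPoolExecutor(max_workers=1)
    try:
        for step in steps:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise TimeoutError("agent deadline exceeded")
            future = executor.submit(step)
            try:
                # Wait no longer than the step limit or the remaining budget.
                results.append(future.result(timeout=min(step_timeout, remaining)))
            except FutureTimeout:
                name = getattr(step, "__name__", repr(step))
                raise TimeoutError(f"step {name} timed out") from None
    finally:
        # Do not block on a step that is still running in the background.
        executor.shutdown(wait=False)
    return results


# Demo: two quick steps succeed; a slow step trips the per-step guard.
def quick() -> str:
    return "ok"


def slow() -> str:
    time.sleep(0.3)
    return "late"


fast_results = run_with_deadlines([quick, quick], step_timeout=1.0, total_timeout=2.0)
try:
    run_with_deadlines([quick, slow], step_timeout=0.05, total_timeout=2.0)
    timed_out = False
except TimeoutError:
    timed_out = True
```

A per-workflow deadline can follow the same pattern by computing one deadline up front and passing the remaining time to each agent as its `total_timeout`. Where hard cancellation is required, running steps in a subprocess that can be killed is a more robust option than threads.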
Production Checklist
- [x] Output validation on all LLM responses
- [x] Retry logic on all external API calls
- [x] Circuit breakers on services with known reliability issues
- [x] Fallback models configured for primary LLM
- [x] State checkpointing at critical steps
- [x] Human escalation paths for high-stakes actions
- [x] Comprehensive error logging and alerting
- [x] Regular chaos engineering tests
Conclusion
Reliable AI agents are not built by preventing all failures; they are built by handling failures gracefully. Implement these seven patterns and your agents will recover from most of the inevitable surprises of production environments on their own, and escalate to a human when they cannot.
