Production AI agents fail in unpredictable ways. LLMs hallucinate, tools time out, APIs change, and edge cases emerge that no test suite anticipated. What separates a demo agent from a production-grade one is error handling. This guide covers seven patterns for building resilient AI agents that recover gracefully from failures.

The AI Agent Failure Landscape

Before diving into solutions, let us understand what can go wrong:

Pattern 1: Structured Output Validation

Always validate LLM outputs before using them. Use structured output formats (JSON mode, function calling) and validate against schemas.

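import json
import logging
from typing import Optional

from pydantic import BaseModel, ValidationError

logger = logging.getLogger(__name__)


class ToolCall(BaseModel):
    """Schema that the LLM's tool-call output must satisfy."""
    tool_name: str
    arguments: dict


def validate_llm_output(raw_output: str) -> Optional[ToolCall]:
    """Parse and validate raw LLM output; return None if it is malformed."""
    try:
        parsed = json.loads(raw_output)
        return ToolCall(**parsed)
    except (json.JSONDecodeError, ValidationError, TypeError) as e:
        # TypeError covers valid JSON that is not an object (e.g. a list)
        logger.error("Invalid LLM output: %s", e)
        return None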
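
Validation is most useful when a failure feeds back into the next attempt. The following is a minimal, standard-library-only sketch of that idea. The names `parse_tool_call` and `call_with_repair`, the `generate` callable (a stand-in for your own LLM client), and the wording of the repair instruction are all assumptions made for illustration.

```python
import json
from typing import Callable, Optional


def parse_tool_call(raw_output: str) -> Optional[dict]:
    """Return the parsed tool call, or None if it is malformed."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    # Require the same two fields as the ToolCall schema above.
    if not isinstance(parsed, dict):
        return None
    if not isinstance(parsed.get("tool_name"), str):
        return None
    if not isinstance(parsed.get("arguments"), dict):
        return None
    return parsed


def call_with_repair(generate: Callable[[str], str], prompt: str,
                     max_attempts: int = 3) -> Optional[dict]:
    """Re-prompt the model until its output validates or attempts run out."""
    current_prompt = prompt
    for _ in range(max_attempts):
        result = parse_tool_call(generate(current_prompt))
        if result is not None:
            return result
        # Hypothetical repair instruction appended to the original prompt.
        current_prompt = (
            prompt + "\n\nYour previous reply was not valid JSON with "
            "'tool_name' and 'arguments'. Reply with JSON only."
        )
    return None
```

Capping the number of attempts matters here: a model that keeps producing invalid output should fall through to another recovery path rather than loop.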

Pattern 2: Retry with Exponential Backoff

Transient failures are common. Implement retry logic with exponential backoff for tool calls and LLM invocations.

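import openai
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

# Assumes the openai>=1.0 SDK, where timeouts raise openai.APITimeoutError
# and rate limiting raises openai.RateLimitError.
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    retry=retry_if_exception_type(
        (TimeoutError, openai.APITimeoutError, openai.RateLimitError)
    ),
)
def call_llm_with_retry(prompt: str, model: str = "gpt-4"):
    return openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )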
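
To show what the decorator is doing, here is a dependency-free sketch of retry with exponential backoff. It is a hand-rolled illustration, not tenacity's implementation: `backoff_delay` and `retry_with_backoff` are hypothetical names, the doubling schedule and the 2-second base are assumptions, and the random jitter is an addition that the tenacity configuration above does not include.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 30.0) -> float:
    """Upper bound on the wait after failed attempt `attempt` (1-based)."""
    return min(cap, base * (2 ** (attempt - 1)))


def retry_with_backoff(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the listed exceptions with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return func()
        except retry_on:
            if attempt >= max_attempts:
                raise  # out of attempts: surface the last error
            # Jitter: sleep a random amount up to the computed bound, so many
            # clients failing together do not all retry at the same instant.
            sleep(random.uniform(0, backoff_delay(attempt)))
            attempt += 1
```

Only exceptions listed in `retry_on` are retried; anything else propagates immediately, which keeps permanent errors such as bad credentials from being retried pointlessly.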

Pattern 3: Graceful Degradation

When a primary capability fails, fall back to a simpler but functional alternative.

Primary          | Fallback          | Use Case
GPT-4            | GPT-3.5-turbo     | Cost/rate-limit issues
Real API         | Cached data       | API downtime
Full context     | Summary           | Context window overflow
External search  | Pre-indexed docs  | Search API failure
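
A fallback chain like the ones listed above can be sketched as an ordered list of options tried in turn. This is a minimal illustration; `run_with_fallbacks` and the option labels are hypothetical names, not part of any library.

```python
import logging
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def run_with_fallbacks(
    options: Sequence[Tuple[str, Callable[[], T]]]
) -> Tuple[str, T]:
    """Try each (label, callable) in order; return the first success with its label."""
    last_error = None
    for label, func in options:
        try:
            return label, func()
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            logger.warning("%s failed (%s); trying next option", label, exc)
            last_error = exc
    raise RuntimeError("all fallback options failed") from last_error
```

Returning the label alongside the result lets the caller mark a response as degraded, for example when an answer came from cached data rather than the live API.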

Pattern 4: Circuit Breaker

Prevent cascading failures by temporarily disabling failing components.

import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failures = 0                  # consecutive failures seen so far
        self.threshold = failure_threshold
        self.timeout = recovery_timeout    # seconds to wait before a trial call
        self.state = "CLOSED"
        self.last_failure = 0

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.time() - self.last_failure > self.timeout:
                self.state = "HALF_OPEN"   # let one trial call through
            else:
                raise RuntimeError("Circuit is OPEN")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.last_failure = time.time()
            if self.failures >= self.threshold:
                self.state = "OPEN"
            raise
        # Any success closes the circuit and resets the count, so only
        # consecutive failures can trip the breaker.
        self.state = "CLOSED"
        self.failures = 0
        return result

Pattern 5: Human-in-the-Loop Escalation

When automated recovery fails, escalate to a human operator. Define clear escalation criteria up front so the agent knows when to stop retrying and hand off.
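
The right criteria depend on the deployment. As one hedged illustration, they can be written as an explicit, testable function. The three criteria below (exhausted retries, low confidence, destructive actions), the thresholds, and the names `StepOutcome` and `escalation_reasons` are assumptions chosen for the example, not a prescribed list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepOutcome:
    retries_used: int      # automated recovery attempts already made
    confidence: float      # model- or heuristic-reported confidence, 0.0 to 1.0
    is_destructive: bool   # e.g. the step deletes data or spends money


def escalation_reasons(outcome: StepOutcome, max_retries: int = 3,
                       min_confidence: float = 0.6) -> List[str]:
    """Return the reasons this step should go to a human (empty list = proceed)."""
    reasons = []
    if outcome.retries_used >= max_retries:
        reasons.append("automated retries exhausted")
    if outcome.confidence < min_confidence:
        reasons.append("confidence below threshold")
    if outcome.is_destructive:
        reasons.append("destructive action requires approval")
    return reasons
```

Returning the reasons rather than a bare boolean gives the human operator context for why the agent stopped.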

Pattern 6: State Checkpointing

Save agent state at key checkpoints so failures can be recovered from intermediate points, not just the beginning.

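from datetime import datetime, timezone


class CheckpointedAgent:
    def __init__(self, agent_id, state_store):
        # state_store is assumed to expose save(record) and get_latest(agent_id)
        self.id = agent_id
        self.state_store = state_store

    def checkpoint(self, step, state):
        """Save state to persistent storage."""
        self.state_store.save({
            "agent_id": self.id,
            "step": step,
            "state": state,
            # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
            "timestamp": datetime.now(timezone.utc),
        })

    def recover(self):
        """Return the state from the last checkpoint, or None if there is none."""
        last = self.state_store.get_latest(self.id)
        if last:
            return last["state"]
        return None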
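
To show how recovery from an intermediate point might look, the sketch below pairs a store exposing the same `save` and `get_latest` methods with a resumable step loop. `InMemoryStateStore` and `run_steps` are hypothetical names, and the in-memory list merely stands in for real persistent storage such as a database.

```python
from typing import Any, Callable, Dict, List, Optional


class InMemoryStateStore:
    """Hypothetical stand-in for persistent storage; keeps records in a list."""

    def __init__(self) -> None:
        self._records: List[Dict[str, Any]] = []

    def save(self, record: Dict[str, Any]) -> None:
        self._records.append(dict(record))

    def get_latest(self, agent_id: str) -> Optional[Dict[str, Any]]:
        for record in reversed(self._records):
            if record["agent_id"] == agent_id:
                return record
        return None


def run_steps(agent_id: str, steps: List[Callable[[dict], dict]],
              store: InMemoryStateStore) -> dict:
    """Run steps in order, resuming after the last checkpointed step if one exists."""
    last = store.get_latest(agent_id)
    start = last["step"] + 1 if last else 0
    state = last["state"] if last else {}
    for index in range(start, len(steps)):
        state = steps[index](state)
        # Checkpoint only after the step succeeds, so a crash re-runs that step.
        store.save({"agent_id": agent_id, "step": index, "state": state})
    return state
```

Because a failed step is re-run from the previous checkpoint, this approach works best when each step is idempotent or its side effects can be safely repeated.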

Pattern 7: Timeout Guards

Never let an agent run indefinitely. Set timeouts at multiple levels, so that neither a single slow call nor a long chain of calls can stall the agent.
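
As a minimal sketch, assuming the agent's steps are async callables, two such levels can be combined using `asyncio.wait_for`: a per-step budget and an overall deadline for the run. The function name `run_agent` and the default values are illustrative assumptions.

```python
import asyncio
import time
from typing import Awaitable, Callable, List


async def run_agent(steps: List[Callable[[], Awaitable[str]]],
                    step_timeout: float = 10.0,
                    total_timeout: float = 60.0) -> List[str]:
    """Run async steps with a per-step timeout and an overall deadline."""
    deadline = time.monotonic() + total_timeout
    results = []
    for step in steps:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise asyncio.TimeoutError("agent run exceeded total_timeout")
        # Each step gets the smaller of its own budget and what is left overall.
        budget = min(step_timeout, remaining)
        results.append(await asyncio.wait_for(step(), timeout=budget))
    return results
```

`asyncio.wait_for` cancels the awaited step when its budget expires, so a timed-out step does not keep running in the background. Blocking, non-async calls need a different mechanism, since a running thread cannot be cancelled this way.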

Production Checklist

Conclusion

Reliable AI agents are not built by preventing all failures; they are built by handling failures gracefully. Implement these seven patterns and your agents will recover from most production surprises on their own, escalating to a human only when automated recovery fails.
