Production AI Monitoring: Observability for LLM Applications

Reviewed: June 4, 2026

May 2026 — Your AI agent is deployed and users are happy. Then at 3 AM, the model starts hallucinating, costs spike, and nobody notices until the weekly report. This is why production AI monitoring isn’t optional — it’s infrastructure.

Why AI Monitoring Is Different

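Traditional software monitoring tracks latency, errors, and throughput. AI applications need all of that plus signals that traditional tools do not capture: token usage and cost, response quality, hallucination rate, and user behavior. The five pillars below cover each in turn.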

The Five Pillars of AI Observability

Pillar 1: Latency Tracking (with Token Breakdown)

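from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

@dataclass
class LLMCallMetrics:
    model: str
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    time_to_first_token_ms: float
    total_latency_ms: float
    cost_usd: float
    
    @property
    def tokens_per_second(self) -> float:
        # Latency defaults to 0 when it was not measured; avoid dividing by zero
        if self.total_latency_ms <= 0:
            return 0.0
        return self.completion_tokens / (self.total_latency_ms / 1000)

class LLMMonitor:
    def __init__(self,
                 prices_per_1k: Optional[Dict[str, Tuple[float, float]]] = None,
                 alert_fn: Callable[[str], None] = print):
        self.calls = []
        # model -> (input, output) price in USD per 1K tokens.
        # Assumption: you supply your provider's current prices; unknown models cost 0.
        self.prices_per_1k = prices_per_1k or {}
        self.alert_fn = alert_fn  # Swap in a pager or chat webhook in production
    
    def _calculate_cost(self, model: str, usage: dict) -> float:
        input_price, output_price = self.prices_per_1k.get(model, (0.0, 0.0))
        return (usage['prompt_tokens'] / 1000 * input_price
                + usage['completion_tokens'] / 1000 * output_price)
    
    def _alert(self, message: str) -> None:
        self.alert_fn(message)
    
    def track(self, model: str, messages: list, response: dict):
        # ttft_ms and latency_ms are not part of a typical API response;
        # the caller is assumed to measure them and add them to the dict.
        metrics = LLMCallMetrics(
            model=model,
            prompt_tokens=response['usage']['prompt_tokens'],
            completion_tokens=response['usage']['completion_tokens'],
            total_tokens=response['usage']['total_tokens'],
            time_to_first_token_ms=response.get('ttft_ms', 0),
            total_latency_ms=response.get('latency_ms', 0),
            cost_usd=self._calculate_cost(model, response['usage'])
        )
        self.calls.append(metrics)
        
        # Alert on anomalies
        if metrics.time_to_first_token_ms > 2000:
            self._alert(f"Slow TTFT: {metrics.time_to_first_token_ms}ms for {model}")
        if metrics.cost_usd > 0.05:  # Per-call cost threshold
            self._alert(f"Expensive call: ${metrics.cost_usd:.4f} for {model}")
        
        return metrics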
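
The monitor reads `ttft_ms` and `latency_ms` from the response, so something has to measure them around the call. A minimal sketch of one way to do that, assuming the client exposes a streaming response as an iterable of text chunks. `timed_stream` and `fake_stream` are hypothetical names used only for illustration, not part of any client library:

```python
import time
from typing import Iterable, Iterator, Tuple


def timed_stream(chunks: Iterable[str]) -> Tuple[str, float, float]:
    """Consume a stream of text chunks and measure timing.

    Returns (full_text, ttft_ms, total_latency_ms). ttft_ms is the time
    until the first chunk arrives; it equals the total latency if the
    stream yields nothing.
    """
    start = time.perf_counter()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()
        parts.append(chunk)
    end = time.perf_counter()

    total_ms = (end - start) * 1000
    if first_chunk_at is None:
        ttft_ms = total_ms
    else:
        ttft_ms = (first_chunk_at - start) * 1000
    return "".join(parts), ttft_ms, total_ms


def fake_stream() -> Iterator[str]:
    """Stand-in for a real streaming response (hypothetical)."""
    for piece in ["Hello", ", ", "world"]:
        time.sleep(0.01)
        yield piece


text, ttft_ms, latency_ms = timed_stream(fake_stream())
print(text, round(ttft_ms), round(latency_ms))
```

The two measured values can then be added to the response dict under the `ttft_ms` and `latency_ms` keys before it is passed to `track`.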

Pillar 2: Quality Scoring

import json

def evaluate_response(query, response, context, evaluator_llm):
    """Use an LLM-as-judge to score response quality.

    Assumes evaluator_llm.generate() returns the judge's reply as a string
    and that trigger_alert() is provided by your alerting layer.
    """
    evaluation = evaluator_llm.generate(f"""
    Rate the following AI response on these dimensions (1-5 each):
    
    Query: {query}
    Context provided: {context}
    Response: {response}
    
    Faithfulness: Is the response supported by the context? (not hallucinating)
    Relevance: Does the response actually answer the query?
    Completeness: Does the response cover all important aspects?
    Conciseness: Is the response appropriately brief?
    
    Return JSON: {{"faithfulness": N, "relevance": N, "completeness": N, "conciseness": N}}
    """)
    
    try:
        scores = json.loads(evaluation)
    except json.JSONDecodeError:
        # The judge returned malformed JSON; report "no score" instead of crashing
        return None
    
    # Alert on low quality
    if scores['faithfulness'] < 3:
        trigger_alert("Low faithfulness detected", query, response, scores)
    
    return scores
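
Judge models often wrap their JSON in a sentence or two of prose, and a reply like that fails a direct `json.loads`. The sketch below shows one defensive way to pull out and validate the scores. The helper name `parse_judge_scores` and the `DIMENSIONS` tuple are assumptions for illustration; adjust the dimension names to match the keys your prompt asks for.

```python
import json
import re
from typing import Dict, Optional

# Assumed dimension names; keep these in sync with the judge prompt.
DIMENSIONS = ("faithfulness", "relevance", "completeness", "conciseness")


def parse_judge_scores(raw: str) -> Optional[Dict[str, int]]:
    """Extract and validate 1-5 integer scores from a judge's raw reply.

    Returns None when no JSON object with every dimension is found, so the
    caller can count parse failures instead of crashing.
    """
    # Grab everything from the first '{' to the last '}' in the reply.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

    scores = {}
    for name in DIMENSIONS:
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            return None
        if not 1 <= value <= 5:
            return None
        scores[name] = value
    return scores


reply = ('Sure, here are the scores: {"faithfulness": 4, "relevance": 5, '
         '"completeness": 3, "conciseness": 4} Hope that helps.')
print(parse_judge_scores(reply))
```

Tracking how often this returns `None` is itself a useful metric: a rising parse-failure rate usually means the judge prompt or judge model changed.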

Pillar 3: Cost Budgets and Throttling

class TokenBudgetManager:
    def __init__(self, daily_budget_usd=100, per_user_daily_limit=5):
        self.daily_budget = daily_budget_usd
        self.per_user_limit = per_user_daily_limit
        self.spend_today = 0
        self.user_spend = {}  # user_id -> spend_today
    
    def can_execute(self, user_id, estimated_cost):
        if self.spend_today + estimated_cost > self.daily_budget:
            return False, "Daily budget exceeded"
        if self.user_spend.get(user_id, 0) + estimated_cost > self.per_user_limit:
            return False, "User daily limit reached"
        return True, "OK"
    
    def record_spend(self, user_id, actual_cost):
        self.spend_today += actual_cost
        self.user_spend[user_id] = self.user_spend.get(user_id, 0) + actual_cost
        
        # Budget alerts at 80% (warning) and 95% (critical).
        # trigger_*_alert() are assumed to come from your alerting layer.
        usage_pct = self.spend_today / self.daily_budget
        if usage_pct >= 0.95:
            trigger_critical_alert(f"Budget at {usage_pct:.0%}")
        elif usage_pct >= 0.80:
            trigger_warning_alert(f"Budget at {usage_pct:.0%}")
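
`can_execute` needs an `estimated_cost` before the call is made. One simple way to get it is to price the known prompt tokens plus the maximum completion length, which gives an upper bound. This is a sketch under stated assumptions: the model names and prices in `PRICES_PER_1K` are made up for illustration and must be replaced with your provider's current price list.

```python
from typing import Dict, Tuple

# Hypothetical prices in USD per 1,000 tokens: (input, output).
# These numbers are placeholders, not real provider pricing.
PRICES_PER_1K: Dict[str, Tuple[float, float]] = {
    "small-model": (0.0005, 0.0015),
    "large-model": (0.0100, 0.0300),
}


def estimate_cost(model: str, prompt_tokens: int, max_completion_tokens: int) -> float:
    """Upper-bound estimate: assumes the model uses its full output budget."""
    input_price, output_price = PRICES_PER_1K[model]
    input_cost = prompt_tokens / 1000 * input_price
    output_cost = max_completion_tokens / 1000 * output_price
    return input_cost + output_cost


estimated = estimate_cost("large-model", prompt_tokens=2000, max_completion_tokens=500)
print(f"${estimated:.4f}")  # $0.0350
```

Because the estimate is an upper bound, the budget check errs on the side of refusing a request; the actual cost passed to `record_spend` afterward is usually lower.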

Pillar 4: Hallucination Detection Pipeline

class HallucinationDetector:
    def __init__(self, nli_model, fact_check_llm):
        self.nli = nli_model  # Natural Language Inference model
        self.fact_check = fact_check_llm
    
    def detect(self, response, retrieved_contexts):
        claims = self._extract_claims(response)
        results = []
        
        for claim in claims:
            # Check entailment against retrieved contexts
            # default=0.0 avoids a ValueError when no contexts were retrieved
            max_entailment = max(
                (self.nli.entailment_score(claim, ctx)
                 for ctx in retrieved_contexts),
                default=0.0,
            )
            
            if max_entailment < 0.5:
                # Likely hallucination — verify with fact-checking LLM
                verification = self.fact_check.verify(claim)
                results.append({
                    'claim': claim,
                    'entailment_score': max_entailment,
                    'verified': verification.is_factual,
                    'confidence': verification.confidence
                })
        
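        # Measure the rate over all extracted claims, not only the flagged ones;
        # otherwise a single unverified claim in a long answer reports as 100%.
        unverified = sum(1 for r in results if not r['verified'])
        hallucination_rate = unverified / max(len(claims), 1)
        return {
            'claims_checked': len(claims),
            'claims_flagged': len(results),
            'hallucination_rate': hallucination_rate,
            'details': results
        }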
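
The detector relies on an `_extract_claims` step that is not shown. As a rough illustration of what that step has to do, here is a deliberately naive version that treats each declarative sentence as one claim. It is a sketch, not the author's method: the function name, the `min_words` cutoff, and the regex sentence splitter are all assumptions, and the splitter will mis-split on abbreviations such as "e.g.". Production pipelines more often use an LLM or a proper sentence segmenter here.

```python
import re
from typing import List


def extract_claims(response: str, min_words: int = 4) -> List[str]:
    """Naive claim extraction: one candidate claim per declarative sentence."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    claims = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence or sentence.endswith("?"):
            continue  # Empty fragments and questions assert nothing to verify
        if len(sentence.split()) < min_words:
            continue  # Too short to carry a checkable claim
        claims.append(sentence)
    return claims


claims = extract_claims(
    "The Eiffel Tower is in Paris. It opened in 1889. Sure! Want more details?"
)
print(claims)  # ['The Eiffel Tower is in Paris.', 'It opened in 1889.']
```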

Pillar 5: User Behavior Signals

class UserBehaviorTracker:
    """Track implicit quality signals from user behavior."""
    
    def track_session(self, session):
        signals = {
            'rephrase_count': session.count_rephrases(),  # User re-asking
            'copy_to_new_chat': session.switched_to_new_chat(),
            'response_time': session.time_to_next_message(),
            'follow_up_sentiment': session.analyze_followup_sentiment(),
            'used_copied_text': session.did_user_copy_response(),
            'abandoned_after_response': session.user_left_without_reply()
        }
        
        # Compute implicit satisfaction score (0-1)
        satisfaction = 1.0
        if signals['rephrase_count'] > 2: satisfaction -= 0.3
        if signals['copy_to_new_chat']: satisfaction -= 0.4
        if signals['abandoned_after_response']: satisfaction -= 0.2
        if signals['follow_up_sentiment'] == 'negative': satisfaction -= 0.3
        if signals['used_copied_text']: satisfaction += 0.1
        
        # Clamp to [0, 1]: the +0.1 copy bonus can otherwise push the score above 1.0
        return min(max(satisfaction, 0.0), 1.0), signals
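
A single session's score is noisy, so it usually makes more sense to alert on a rolling average than on individual sessions. A minimal sketch, assuming per-session scores in the 0-1 range; the class name `RollingSatisfaction`, the window size, and the threshold are illustrative choices, not values from any tool:

```python
from collections import deque


class RollingSatisfaction:
    """Rolling mean of per-session satisfaction scores over a fixed window."""

    def __init__(self, window: int = 50, alert_below: float = 0.6):
        self.scores = deque(maxlen=window)  # Oldest scores drop off automatically
        self.alert_below = alert_below

    def mean(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    def add(self, score: float) -> bool:
        """Record a score; return True when a full window averages below the threshold."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and self.mean() < self.alert_below


tracker = RollingSatisfaction(window=3, alert_below=0.6)
alerts = [tracker.add(score) for score in [1.0, 0.5, 0.2, 0.3]]
print(alerts)  # [False, False, True, True]
```

Waiting for a full window before alerting avoids false alarms right after a deploy or restart, when only a handful of sessions have been scored.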

Architecture: Production Monitoring Stack

┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│  User Query  │────▶│  AI Agent    │────▶│  Response       │
└─────────────┘     └──────┬───────┘     └────────┬────────┘
                           │                       │
                    ┌──────▼───────┐        ┌──────▼────────┐
                    │  Middleware   │        │  Evaluator     │
                    │  (logging +   │        │  (quality +    │
                    │   cost calc)  │        │   hallucination│
                    └──────┬───────┘        └──────┬────────┘
                           │                       │
                    ┌──────▼───────────────────────▼──────┐
                    │          Metrics Store               │
                    │     (Prometheus / InfluxDB)          │
                    └──────────────────┬──────────────────┘
                                       │
                    ┌──────────────────▼──────────────────┐
                    │       Grafana / Custom Dashboard      │
                    │  - Latency percentiles (p50/p95/p99)  │
                    │  - Cost per model / user / endpoint   │
                    │  - Quality scores over time           │
                    │  - Hallucination rate trends          │
                    │  - User satisfaction signals          │
                    └──────────────────┬──────────────────┘
                                       │
                    ┌──────────────────▼──────────────────┐
                    │         Alert Manager                 │
                    │  - Budget threshold breaches          │
                    │  - Quality score drops                │
                    │  - Hallucination rate spikes           │
                    │  - Latency degradation                │
                    └─────────────────────────────────────┘
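
The dashboard's latency panel reports p50/p95/p99. In a real stack those would typically come from the metrics store's own histogram queries, but the calculation is easy to sketch by hand. The example below uses the nearest-rank definition of a percentile; the function names and the sample latencies are assumptions for illustration.

```python
import math
from typing import Dict, List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms: List[float]) -> Dict[str, float]:
    """Summarize latencies the way the dashboard panel would show them."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


latencies = [float(ms) for ms in range(100, 2100, 100)]  # 100, 200, ..., 2000
summary = latency_summary(latencies)
print(summary)  # {'p50': 1000.0, 'p95': 1900.0, 'p99': 2000.0}
```

Tail percentiles matter more than the mean for LLM calls, because a small share of very long generations can dominate what users experience.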

Recommended Tools (2026)

Tool                 | Focus                  | Pricing             | Best For
---------------------|------------------------|---------------------|---------------------------
LangFuse             | Full LLM observability | Open-source + Cloud | Most teams (start here)
Helicone             | Proxy-based monitoring | Free tier + paid    | Drop-in monitoring
Arize Phoenix        | Tracing + evaluation   | Open-source         | Deep debugging
Braintrust           | Evaluation + testing   | Per-test pricing    | Systematic evaluation
Weights & Biases     | Experiment tracking    | Per-seat            | ML teams with existing W&B
Grafana + Prometheus | Metrics + dashboards   | Open-source         | Custom dashboards
New Relic / Datadog  | APM + LLM              | Per-host pricing    | Existing infrastructure


Implementation Checklist

Conclusion

Production AI monitoring isn’t a nice-to-have — it’s the difference between catching a hallucination at 3 AM and discovering it in a customer complaint next month. Start with token tracking and cost budgets (they’re easy), add quality scoring next, and build toward full observability as your application matures.

