Production AI Monitoring: Observability for LLM Applications
Reviewed: June 4, 2026
May 2026 — Your AI agent is deployed and users are happy. Then at 3 AM, the model starts hallucinating, costs spike, and nobody notices until the weekly report. This is why production AI monitoring isn’t optional — it’s infrastructure.
Why AI Monitoring Is Different
Traditional software monitoring tracks latency, errors, and throughput. AI applications need all of that plus:
- Output quality: Is the model’s response actually correct?
- Hallucination detection: Is the model making things up?
- Token cost tracking: How many tokens does each call consume, and what does it cost?
- Prompt injection detection: Are users trying to break the model?
- Drift detection: Is the distribution of inputs or outputs changing over time?
- User satisfaction: Are users rephrasing or abandoning queries?
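Drift detection from the list above can start with a very small check. The following is a minimal sketch, assuming prompt length in tokens is a useful proxy for the input distribution; the function name `population_stability_index`, the bin edges, and the sample data are illustrative assumptions rather than part of any particular tool.

```python
import math
from bisect import bisect_right

def population_stability_index(baseline, current, edges):
    """Compare two samples of a numeric feature using PSI over fixed bins."""
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1
        total = max(len(values), 1)
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / total, 1e-6) for c in counts]

    base_p = proportions(baseline)
    cur_p = proportions(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base_p, cur_p))

# Hypothetical prompt lengths (tokens): last week vs. today.
last_week = [120, 150, 180, 200, 220, 250, 300, 310, 400, 450]
today = [900, 950, 1000, 1100, 1200, 1250, 1300, 1400, 1500, 1600]
edges = [200, 400, 800]  # illustrative bin boundaries

psi_same = population_stability_index(last_week, last_week, edges)
psi_shifted = population_stability_index(last_week, today, edges)
```

A commonly cited rule of thumb treats PSI values above roughly 0.25 as a significant shift, but the right threshold depends on the feature and should be tuned against your own traffic.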
The Five Pillars of AI Observability
Pillar 1: Latency Tracking (with Token Breakdown)
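import logging
from dataclasses import dataclass

logger = logging.getLogger("llm_monitor")

# Placeholder prices in USD per 1M tokens (not real rates);
# substitute your provider's actual pricing per model.
PRICES_PER_MILLION = {"default": {"prompt": 1.00, "completion": 3.00}}

@dataclass
class LLMCallMetrics:
    model: str
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    time_to_first_token_ms: float
    total_latency_ms: float
    cost_usd: float

    @property
    def tokens_per_second(self) -> float:
        # Guard against a missing latency measurement (recorded as 0).
        if self.total_latency_ms <= 0:
            return 0.0
        return self.completion_tokens / (self.total_latency_ms / 1000)

class LLMMonitor:
    def __init__(self):
        self.calls = []

    def track(self, model: str, messages: list, response: dict):
        metrics = LLMCallMetrics(
            model=model,
            prompt_tokens=response['usage']['prompt_tokens'],
            completion_tokens=response['usage']['completion_tokens'],
            total_tokens=response['usage']['total_tokens'],
            time_to_first_token_ms=response.get('ttft_ms', 0),
            total_latency_ms=response.get('latency_ms', 0),
            cost_usd=self._calculate_cost(model, response['usage'])
        )
        self.calls.append(metrics)

        # Alert on anomalies
        if metrics.time_to_first_token_ms > 2000:
            self._alert(f"Slow TTFT: {metrics.time_to_first_token_ms}ms for {model}")
        if metrics.cost_usd > 0.05:  # Per-call cost threshold
            self._alert(f"Expensive call: ${metrics.cost_usd:.4f} for {model}")
        return metrics

    def _calculate_cost(self, model: str, usage: dict) -> float:
        price = PRICES_PER_MILLION.get(model, PRICES_PER_MILLION["default"])
        return (usage['prompt_tokens'] * price['prompt']
                + usage['completion_tokens'] * price['completion']) / 1_000_000

    def _alert(self, message: str):
        # Stand-in for a pager or chat integration.
        logger.warning(message)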
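Individual alerts catch outliers, but dashboards usually show latency as percentiles. As a rough sketch using only the standard library, the helper below (a hypothetical name, `latency_percentiles`) summarizes a list of latencies such as the `total_latency_ms` values collected in `LLMMonitor.calls`.

```python
from statistics import quantiles

def latency_percentiles(latencies_ms):
    """Return p50/p95/p99 for a list of latencies in milliseconds."""
    if len(latencies_ms) < 2:
        # Fall back to the single value (or 0.0) when there is too little data.
        value = float(latencies_ms[0]) if latencies_ms else 0.0
        return {"p50": value, "p95": value, "p99": value}
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Example with synthetic latencies of 1..101 ms.
summary = latency_percentiles(list(range(1, 102)))
```

In practice a metrics backend would compute these from histograms. The in-memory version is mainly useful for tests and quick checks.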
Pillar 2: Quality Scoring
import json

def evaluate_response(query, response, context, evaluator_llm):
    """Use an LLM-as-judge to score response quality."""
    # evaluator_llm.generate is assumed to return the judge's reply as a JSON string.
    evaluation = evaluator_llm.generate(f"""
Rate the following AI response on these dimensions (1-5 each):

Query: {query}
Context provided: {context}
Response: {response}

Faithfulness: Is the response supported by the context? (not hallucinating)
Relevance: Does the response actually answer the query?
Completeness: Does the response cover all important aspects?
Conciseness: Is the response appropriately brief?

Return JSON: {{"faithfulness": N, "relevance": N, "completeness": N, "conciseness": N}}
""")
    scores = json.loads(evaluation)

    # Alert on low quality (trigger_alert is assumed to be defined elsewhere)
    if scores['faithfulness'] < 3:
        trigger_alert("Low faithfulness detected", query, response, scores)
    return scores
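Judge models do not always return clean JSON: replies may be wrapped in extra prose, or carry scores outside the 1-5 range. A defensive parser along the following lines can sit between the judge call and the alerting logic. This is a sketch with a hypothetical name, `parse_judge_scores`; it accepts the brevity score under either "conciseness" or "concision", since judges may use either spelling.

```python
import json
import re

EXPECTED_KEYS = ("faithfulness", "relevance", "completeness", "conciseness")

def parse_judge_scores(raw_text):
    """Extract and validate 1-5 scores from a judge reply; None if unusable."""
    match = re.search(r"\{.*\}", raw_text, re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Normalize the alternate spelling of the brevity dimension.
    if "conciseness" not in data and "concision" in data:
        data["conciseness"] = data["concision"]
    scores = {}
    for key in EXPECTED_KEYS:
        value = data.get(key)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
        if not 1 <= value <= 5:
            return None
        scores[key] = value
    return scores
```

Returning `None` instead of raising lets the caller count unparseable judgments as their own metric, which is often a useful signal that the judge prompt needs work.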
Pillar 3: Cost Budgets and Throttling
class TokenBudgetManager:
    def __init__(self, daily_budget_usd=100, per_user_daily_limit=5):
        self.daily_budget = daily_budget_usd
        self.per_user_limit = per_user_daily_limit
        self.spend_today = 0
        self.user_spend = {}  # user_id -> spend_today

    def can_execute(self, user_id, estimated_cost):
        if self.spend_today + estimated_cost > self.daily_budget:
            return False, "Daily budget exceeded"
        if self.user_spend.get(user_id, 0) + estimated_cost > self.per_user_limit:
            return False, "User daily limit reached"
        return True, "OK"

    def record_spend(self, user_id, actual_cost):
        self.spend_today += actual_cost
        self.user_spend[user_id] = self.user_spend.get(user_id, 0) + actual_cost

        # Budget alerts at 50%, 80%, 95%.
        # The trigger_* helpers are assumed to be defined elsewhere.
        usage_pct = self.spend_today / self.daily_budget
        if usage_pct >= 0.95:
            trigger_critical_alert(f"Budget at {usage_pct:.0%}")
        elif usage_pct >= 0.80:
            trigger_warning_alert(f"Budget at {usage_pct:.0%}")
        elif usage_pct >= 0.50:
            trigger_info_alert(f"Budget at {usage_pct:.0%}")
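`can_execute` needs an estimated cost before the call is made. One simple approach, sketched below, is to price the known prompt tokens plus the maximum completion length as an upper bound. The model names and prices here are hypothetical placeholders, not real provider rates.

```python
# Hypothetical prices in USD per 1M tokens; replace with your provider's rates.
HYPOTHETICAL_PRICES = {
    "example-small": {"input": 0.50, "output": 1.50},
    "example-large": {"input": 5.00, "output": 15.00},
}

def estimate_cost_usd(model, prompt_tokens, max_completion_tokens):
    """Upper-bound estimate: assumes the full completion budget is used."""
    price = HYPOTHETICAL_PRICES[model]
    return (prompt_tokens * price["input"]
            + max_completion_tokens * price["output"]) / 1_000_000

# 2,000 prompt tokens with up to 500 completion tokens on the larger model.
estimate = estimate_cost_usd("example-large", 2000, 500)
```

The estimate goes to `can_execute` before the request, and the actual cost from the response's usage data goes to `record_spend` afterwards, so over-estimates do not accumulate in the budget.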
Pillar 4: Hallucination Detection Pipeline
class HallucinationDetector:
    def __init__(self, nli_model, fact_check_llm):
        self.nli = nli_model  # Natural Language Inference model
        self.fact_check = fact_check_llm

    def detect(self, response, retrieved_contexts):
        claims = self._extract_claims(response)
        results = []
        for claim in claims:
            # Check entailment against retrieved contexts; default=0.0 covers
            # the case where retrieval returned nothing.
            max_entailment = max(
                (self.nli.entailment_score(claim, ctx) for ctx in retrieved_contexts),
                default=0.0,
            )
            if max_entailment < 0.5:
                # Likely hallucination: verify with the fact-checking LLM
                verification = self.fact_check.verify(claim)
                results.append({
                    'claim': claim,
                    'entailment_score': max_entailment,
                    'verified': verification.is_factual,
                    'confidence': verification.confidence
                })

        # results holds only claims that failed the entailment check, so divide
        # by the total claim count to get a rate over the whole response.
        unverified = sum(1 for r in results if not r['verified'])
        hallucination_rate = unverified / max(len(claims), 1)
        return {
            'claims_checked': len(claims),
            'claims_flagged': len(results),
            'hallucination_rate': hallucination_rate,
            'details': results
        }
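The detector relies on `_extract_claims`, which is not shown. Production systems often use an LLM or a dedicated model for this step. As a deliberately naive stand-in, the sketch below treats each sufficiently long declarative sentence as one claim. The function name and the `min_words` cutoff are assumptions for illustration.

```python
import re

def extract_claims(response_text, min_words=4):
    """Naive claim extraction: one candidate claim per declarative sentence."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", response_text.strip())
    claims = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence or sentence.endswith("?"):
            continue  # questions are not factual claims
        if len(sentence.split()) < min_words:
            continue  # too short to carry a checkable statement
        claims.append(sentence)
    return claims

text = "The Eiffel Tower is in Paris. Want more details? Sure. It was completed in 1889."
claims = extract_claims(text)
```

This splitter will mishandle abbreviations such as "e.g." and sentences containing several facts, so it is best seen as a baseline to replace once the rest of the pipeline works.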
Pillar 5: User Behavior Signals
class UserBehaviorTracker:
    """Track implicit quality signals from user behavior."""

    def track_session(self, session):
        signals = {
            'rephrase_count': session.count_rephrases(),  # User re-asking
            'copy_to_new_chat': session.switched_to_new_chat(),
            'response_time': session.time_to_next_message(),
            'follow_up_sentiment': session.analyze_followup_sentiment(),
            'used_copied_text': session.did_user_copy_response(),
            'abandoned_after_response': session.user_left_without_reply()
        }

        # Compute implicit satisfaction score (0-1)
        satisfaction = 1.0
        if signals['rephrase_count'] > 2: satisfaction -= 0.3
        if signals['copy_to_new_chat']: satisfaction -= 0.4
        if signals['abandoned_after_response']: satisfaction -= 0.2
        if signals['follow_up_sentiment'] == 'negative': satisfaction -= 0.3
        if signals['used_copied_text']: satisfaction += 0.1

        # Clamp to [0, 1]; the copy bonus could otherwise push the score above 1.
        return min(max(satisfaction, 0.0), 1.0), signals
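The tracker assumes the session object can count rephrases. One way such a counter might work, sketched here with the standard library's `difflib`, is to compare each user message with the one before it and count pairs that are near-duplicates. The 0.6 similarity threshold is an arbitrary starting point, not a validated value.

```python
from difflib import SequenceMatcher

def count_rephrases(user_messages, threshold=0.6):
    """Count consecutive user messages that look like rewordings of each other."""
    rephrases = 0
    for previous, current in zip(user_messages, user_messages[1:]):
        a, b = previous.lower().strip(), current.lower().strip()
        # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            rephrases += 1
    return rephrases

session_messages = [
    "how do i reset my password",
    "how can i reset my password",
    "thanks",
]
rephrase_count = count_rephrases(session_messages)
```

Character-level similarity misses rephrases that use entirely different wording. Embedding similarity is a common upgrade, at the cost of an extra model call per message.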
Architecture: Production Monitoring Stack
┌─────────────┐     ┌──────────────┐     ┌─────────────────┐
│ User Query  │────▶│   AI Agent   │────▶│    Response     │
└─────────────┘     └──────┬───────┘     └────────┬────────┘
                           │                      │
                    ┌──────▼───────┐       ┌──────▼────────┐
                    │  Middleware  │       │   Evaluator   │
                    │  (logging +  │       │  (quality +   │
                    │  cost calc)  │       │ hallucination)│
                    └──────┬───────┘       └──────┬────────┘
                           │                      │
                    ┌──────▼──────────────────────▼────────┐
                    │            Metrics Store             │
                    │       (Prometheus / InfluxDB)        │
                    └──────────────────┬───────────────────┘
                                       │
                    ┌──────────────────▼───────────────────┐
                    │      Grafana / Custom Dashboard      │
                    │ - Latency percentiles (p50/p95/p99)  │
                    │ - Cost per model / user / endpoint   │
                    │ - Quality scores over time           │
                    │ - Hallucination rate trends          │
                    │ - User satisfaction signals          │
                    └──────────────────┬───────────────────┘
                                       │
                    ┌──────────────────▼───────────────────┐
                    │            Alert Manager             │
                    │ - Budget threshold breaches          │
                    │ - Quality score drops                │
                    │ - Hallucination rate spikes          │
                    │ - Latency degradation                │
                    └──────────────────────────────────────┘
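To make the hand-off from middleware to the metrics store concrete, the sketch below renders samples in the Prometheus text exposition format using only the standard library. In a real deployment an official client library would normally handle this. The function name and the metric name `llm_cost_usd_total` are illustrative assumptions.

```python
def _escape(label_value):
    # Label values escape backslash, double quote, and newline.
    return (label_value.replace("\\", "\\\\")
                       .replace('"', '\\"')
                       .replace("\n", "\\n"))

def to_prometheus_text(name, help_text, metric_type, samples):
    """Render samples as Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            rendered = ",".join(
                f'{key}="{_escape(str(val))}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{rendered}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

payload = to_prometheus_text(
    "llm_cost_usd_total",
    "Cumulative LLM spend in USD",
    "counter",
    [({"model": "example-large"}, 1.25)],
)
```

Labels such as model, user tier, or endpoint are what make the per-model and per-endpoint dashboard panels possible. Per-user labels should be used with care, because high-cardinality labels are expensive in Prometheus.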
Recommended Tools (2026)
| Tool | Focus | Pricing | Best For |
|---|---|---|---|
| Langfuse | Full LLM observability | Open-source + Cloud | Most teams (start here) |
| Helicone | Proxy-based monitoring | Free tier + paid | Drop-in monitoring |
| Arize Phoenix | Tracing + evaluation | Open-source | Deep debugging |
| Braintrust | Evaluation + testing | Per-test pricing | Systematic evaluation |
| Weights & Biases | Experiment tracking | Per-seat | ML teams with existing W&B |
| Grafana + Prometheus | Metrics + dashboards | Open-source | Custom dashboards |
| New Relic / Datadog | APM + LLM | Per-host pricing | Existing infrastructure |
Implementation Checklist
- Instrument every LLM call with token counts and latency
- Set up cost budgets with 50/80/95% alerts
- Implement LLM-as-judge quality scoring on a sample of responses
- Build hallucination detection for high-stakes outputs
- Track implicit user satisfaction signals (rephrases, abandonment)
- Create a real-time dashboard with latency, cost, and quality tiles
- Set up PagerDuty/Opsgenie alerts for quality drops
- Implement weekly automated quality reports
- Run regression tests when switching models or prompts
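The last checklist item, regression testing when switching models or prompts, can reuse the judge scores from Pillar 2. A minimal sketch, assuming both configurations have been scored on the same evaluation set; the function name and the `max_drop` tolerance of 0.2 are illustrative choices.

```python
from statistics import mean

def regression_check(baseline_scores, candidate_scores, max_drop=0.2):
    """Flag dimensions where the candidate's mean score falls below baseline.

    Both arguments are non-empty lists of score dicts,
    e.g. {"faithfulness": 4, "relevance": 5}.
    """
    if not baseline_scores or not candidate_scores:
        raise ValueError("both score lists must be non-empty")
    regressions = {}
    for dim in baseline_scores[0]:
        base = mean(s[dim] for s in baseline_scores)
        cand = mean(s[dim] for s in candidate_scores)
        if base - cand > max_drop:
            regressions[dim] = {"baseline": base, "candidate": cand}
    return regressions

baseline = [{"faithfulness": 5, "relevance": 4}, {"faithfulness": 4, "relevance": 4}]
candidate = [{"faithfulness": 3, "relevance": 4}, {"faithfulness": 4, "relevance": 4}]
report = regression_check(baseline, candidate)
```

An empty result means no dimension dropped by more than the tolerance. With small evaluation sets, mean differences are noisy, so a larger sample or a significance test is worth adding before using this as a hard deployment gate.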
Conclusion
Production AI monitoring isn’t a nice-to-have — it’s the difference between catching a hallucination at 3 AM and discovering it in a customer complaint next month. Start with token tracking and cost budgets (they’re easy), add quality scoring next, and build toward full observability as your application matures.
