Prompt Injection Defense Strategies for AI Agents
Reviewed: June 4, 2026
Published: May 26, 2026 | Reading time: 12 min | Topics: AI Security, Prompt Injection, Agent Architecture
Prompt injection remains the #1 attack vector against AI agents in 2026. As agents gain more autonomy (calling tools, accessing files, sending emails), the blast radius of a single injection grows with it. This guide covers five defense-in-depth layers: input sanitization, output filtering, sandboxing, structured prompts, and anomaly detection.
Understanding the Threat Landscape
Prompt injection attacks against AI agents fall into three categories:
| Attack Type | Description | Severity |
|---|---|---|
| Direct Injection | Attacker embeds malicious instructions in user input | High |
| Indirect Injection | Malicious content in tool outputs, files, or web pages the agent reads | Critical |
| Multi-turn Injection | Attacker gradually manipulates context across multiple interactions | Medium-High |
Unlike chatbot-only systems, AI agents amplify injection risk because they act on instructions. A compromised agent can exfiltrate data, send unauthorized messages, or execute destructive commands.
Defense Layer 1: Input Sanitization
The first line of defense is treating all user input as untrusted. Here’s a practical sanitization pipeline:
import re
from dataclasses import dataclass
from typing import Optional


class SecurityError(Exception):
    """Raised when input or output fails a security check."""


@dataclass
class SanitizationResult:
    cleaned: str
    warnings: list[str]
    blocked: bool


class InputSanitizer:
    """Multi-layer input sanitization for AI agent prompts."""

    # Patterns commonly used in injection attempts
    INJECTION_PATTERNS = [
        r'ignore\s+(all\s+)?(previous|prior|above|earlier)\s+instructions?',
        r'forget\s+(everything|all|what)\s+(you|i)\s+(said|told|know)',
        r'you\s+are\s+now\s+a?\s*(different|new|other)',
        r'system\s*:\s*',
        r'</?system>',
        r'\[INST\]|\[/INST\]',
        r'###\s*(SYSTEM|HUMAN|ASSISTANT)',
    ]

    def __init__(self):
        self.compiled = [re.compile(p, re.IGNORECASE) for p in self.INJECTION_PATTERNS]

    def sanitize(self, user_input: str) -> SanitizationResult:
        warnings = []
        cleaned = user_input
        pattern_hits = 0

        # Check for injection patterns and redact every match
        for pattern in self.compiled:
            if pattern.search(cleaned):
                pattern_hits += 1
                warnings.append(f"Injection pattern detected: {pattern.pattern[:40]}...")
                cleaned = pattern.sub('[REDACTED]', cleaned)

        # Enforce length limits
        if len(cleaned) > 10000:
            warnings.append("Input truncated to 10000 chars")
            cleaned = cleaned[:10000]

        # Block if too many patterns triggered (truncation alone does not count)
        blocked = pattern_hits >= 3

        return SanitizationResult(
            cleaned=cleaned,
            warnings=warnings,
            blocked=blocked
        )


# Usage (user_message is the raw text received from the user)
sanitizer = InputSanitizer()
result = sanitizer.sanitize(user_message)
if result.blocked:
    raise SecurityError(f"Input blocked: {result.warnings}")
Key principles:
- Never concatenate user input directly into system prompts — use structured message formats
- Escape delimiters — attackers exploit XML tags, markdown, and special tokens
- Length limits — oversized inputs are often injection carriers
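The delimiter-escaping and length-limit principles can be sketched as a small helper. This is a minimal illustration, not a complete defense: the `SPECIAL_TOKENS` list, the `<user_data>` wrapper tag, and both function names are assumptions made for this example, and the token list would need to match the model actually in use.

```python
import re

# Hypothetical token list; extend it with the special tokens of the model in use.
SPECIAL_TOKENS = re.compile(r"\[/?INST\]|<\|[^|>]*\|>", re.IGNORECASE)


def escape_delimiters(text: str) -> str:
    """Neutralize markup an attacker could use to fake a role boundary."""
    # Replace special tokens first, then make any remaining angle brackets inert.
    text = SPECIAL_TOKENS.sub("[token removed]", text)
    return text.replace("<", "&lt;").replace(">", "&gt;")


def wrap_untrusted(text: str, max_len: int = 10000) -> str:
    """Truncate, escape, and enclose untrusted text in a fixed data tag."""
    return f"<user_data>{escape_delimiters(text[:max_len])}</user_data>"
```

Because the escaping removes every angle bracket from the input, the only `<user_data>` tags in the final prompt are the ones the application wrote itself, so the input cannot close the tag early.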
Defense Layer 2: Output Filtering
Even with clean inputs, agents can be manipulated through tool outputs. Output filtering checks the agent’s responses before they reach the user or trigger actions:
import re
from typing import Optional


class OutputFilter:
    """Filter agent outputs to prevent data exfiltration and action abuse."""

    # Patterns indicating potential data exfiltration
    SUSPICIOUS_OUTPUTS = [
        r'(?:password|secret|key|token|credential)\s*[:=]\s*\S+',
        r'\b[A-Za-z0-9+/]{40,}={0,2}\b',  # Base64 blobs
        r'(?:send|post|email|upload)\s+(?:to|at)\s+\S+',
    ]
    BLOCKED_DOMAINS = ['evil.com', 'attacker.net', 'pastebin.com']

    def filter_response(self, response: str, action: Optional[str] = None) -> dict:
        issues = []

        # Check for credential leakage (one hit is enough to flag the output)
        for pattern in self.SUSPICIOUS_OUTPUTS:
            if re.search(pattern, response, re.IGNORECASE):
                issues.append("Potential credential/sensitive data in output")
                break

        # Check actions for unauthorized destinations (plain substring match)
        if action:
            for domain in self.BLOCKED_DOMAINS:
                if domain in action:
                    issues.append(f"Action targets blocked domain: {domain}")

        return {
            "allowed": len(issues) == 0,
            "filtered_response": response if not issues else "[Output filtered for security]",
            "issues": issues
        }


# Wrap tool calls so every tool result passes through the filter.
# SecurityError is the application-defined Exception subclass used in Layer 1.
def safe_tool_call(tool_func, output_filter, *args, **kwargs):
    result = tool_func(*args, **kwargs)
    check = output_filter.filter_response(str(result))
    if not check["allowed"]:
        raise SecurityError(f"Tool output filtered: {check['issues']}")
    return result
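The domain check in the filter is a plain substring test, so it also fires when a blocked name appears only in a path or query string. One possible refinement, sketched here with the standard library's `urlparse`, compares the parsed hostname instead. The function name `is_blocked_url` is an assumption made for this example, and the domain list simply repeats the one from the filter.

```python
from urllib.parse import urlparse

# Example blocklist repeated from the filter; a real deployment would more
# likely use an allowlist of approved destinations.
BLOCKED_DOMAINS = {"evil.com", "attacker.net", "pastebin.com"}


def is_blocked_url(url: str, blocked: set[str] = BLOCKED_DOMAINS) -> bool:
    """Return True if the URL's host is a blocked domain or a subdomain of one."""
    # hostname is None when the URL has no scheme, so such inputs are not flagged here.
    host = (urlparse(url).hostname or "").lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in blocked)
```

Since scheme-less strings are not flagged, a caller would probably want to reject anything that is not a well-formed `http` or `https` URL before this check runs.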
Defense Layer 3: Sandboxing and Capability Restrictions
The most robust defense is limiting what an agent can do. Principle of least privilege applies:
from datetime import datetime, timezone


class AgentSandbox:
    """Capability-restricted execution environment for AI agents."""

    def __init__(self, allowed_tools: list[str], max_iterations: int = 10):
        self.allowed_tools = set(allowed_tools)
        self.max_iterations = max_iterations
        self.execution_count = 0
        self.audit_log = []

    def call_tool(self, tool_name: str, params: dict) -> dict:
        # Enforce tool whitelist
        if tool_name not in self.allowed_tools:
            raise SecurityError(
                f"Tool '{tool_name}' not in allowed set: {self.allowed_tools}"
            )

        # Enforce iteration limits (prevents infinite loops from injection)
        self.execution_count += 1
        if self.execution_count > self.max_iterations:
            raise SecurityError(f"Max iterations ({self.max_iterations}) exceeded")

        # Log every action for audit, leaving out the 'password' parameter
        self.audit_log.append({
            "tool": tool_name,
            "params": {k: v for k, v in params.items() if k != 'password'},
            "timestamp": datetime.now(timezone.utc).isoformat()
        })

        # execute_tool is the application's own tool dispatcher (not shown here)
        return execute_tool(tool_name, params)

    def require_approval(self, tool_name: str, params: dict) -> bool:
        """Require human approval for high-risk actions."""
        # call_tool does not run this check itself; callers gate on it first.
        high_risk_tools = {'send_email', 'execute_code', 'delete_file', 'http_post'}
        return tool_name in high_risk_tools


# Example: Restricted agent for customer support
sandbox = AgentSandbox(
    allowed_tools=['search_knowledge_base', 'lookup_order', 'send_ticket'],
    max_iterations=5
)
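In the sandbox, `require_approval` only reports whether a tool is high risk; nothing forces a caller to consult it. A minimal sketch of an approval gate follows, assuming the application supplies two callables: `run_tool` (for example the sandbox's `call_tool`) and `approve` (a hypothetical hook that asks a human and returns True or False). The name `gated_call` is also an assumption made for this example.

```python
from typing import Callable

# Same high-risk set as in require_approval.
HIGH_RISK_TOOLS = {"send_email", "execute_code", "delete_file", "http_post"}


def gated_call(
    tool_name: str,
    params: dict,
    run_tool: Callable[[str, dict], dict],
    approve: Callable[[str, dict], bool],
) -> dict:
    """Run a tool, asking a human approver first when the tool is high risk."""
    if tool_name in HIGH_RISK_TOOLS and not approve(tool_name, params):
        raise PermissionError(f"Human approval denied for '{tool_name}'")
    return run_tool(tool_name, params)
```

Low-risk tools skip the approver entirely, which keeps routine lookups fast while making a denied approval fail loudly instead of silently proceeding.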
Defense Layer 4: Structured Prompt Architecture
Instead of string concatenation, use structured message formats that separate instructions from data:
# ❌ VULNERABLE: String concatenation
prompt = f"""
You are a helpful assistant. Answer the user's question.
User says: {user_input}
"""

# ✅ BETTER: Structured messages with explicit role separation
# Note: "metadata" is application-side bookkeeping. Most chat APIs do not
# accept extra keys on a message, so strip it before sending the request.
messages = [
    {
        "role": "system",
        "content": "You are a helpful customer support agent. You may only answer questions about orders and products.",
        # System instructions are never built from user content
        "metadata": {"immutable": True}
    },
    {
        "role": "user",
        "content": user_input,  # Passed as data in its own message
        "metadata": {"source": "user", "trust_level": "untrusted"}
    }
]

# Even better: Use tool/function calling instead of free-text instructions
tools = [{
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Look up order status",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "pattern": "^ORD-[0-9]{6}$"}
            },
            "required": ["order_id"]
        }
    }
}]
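A schema constraint only helps if the application enforces it before the tool runs, since a model can still emit arguments that violate the declared pattern. The sketch below re-checks the `order_id` pattern from the `lookup_order` schema on the application side. The function name is an assumption made for this example, and rejecting extra keys is a deliberately stricter choice than the schema itself states.

```python
import re

# Mirrors the "pattern" constraint of the lookup_order schema. fullmatch is used
# because "$" in Python would also accept a trailing newline.
ORDER_ID_PATTERN = re.compile(r"ORD-[0-9]{6}")


def validate_lookup_order_args(args: dict) -> str:
    """Return the order ID if the arguments satisfy the schema, else raise ValueError."""
    # Stricter than the schema: any key other than order_id is rejected.
    if set(args) != {"order_id"}:
        raise ValueError(f"Unexpected arguments: {sorted(args)}")
    order_id = args["order_id"]
    if not isinstance(order_id, str) or not ORDER_ID_PATTERN.fullmatch(order_id):
        raise ValueError("order_id must look like ORD-123456")
    return order_id
```

With this check in place, injected text has no free-form field to travel in: the only value that reaches the order system is a string of a fixed, known shape.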
Defense Layer 5: Anomaly Detection
Monitor agent behavior for signs of compromise:
class AgentAnomalyDetector:
    """Detect anomalous agent behavior that may indicate injection."""

    BASELINE = {
        "avg_tools_per_request": 2.5,
        "avg_response_length": 500,
        "max_external_calls": 3,
        "max_tool_chain_depth": 4,
        # Tools the agent normally uses (here: the customer support example).
        # Without this entry every tool would be reported as unexpected.
        "expected_tools": ["search_knowledge_base", "lookup_order", "send_ticket"],
    }

    def check_session(self, session_metrics: dict) -> list[str]:
        alerts = []

        # Unusual tool usage
        if session_metrics["tools_called"] > self.BASELINE["avg_tools_per_request"] * 3:
            alerts.append("Excessive tool usage: possible injection loop")

        # Unexpected external calls
        if session_metrics["external_calls"] > self.BASELINE["max_external_calls"]:
            alerts.append(f"Too many external calls: {session_metrics['external_calls']}")

        # Unusual response patterns
        if session_metrics["response_length"] > self.BASELINE["avg_response_length"] * 5:
            alerts.append("Abnormally long response: possible data exfiltration")

        # Tool usage outside normal patterns
        unusual_tools = set(session_metrics["tools_used"]) - set(self.BASELINE.get("expected_tools", []))
        if unusual_tools:
            alerts.append(f"Unexpected tools called: {unusual_tools}")

        return alerts
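The detector compares each session against fixed baseline numbers. One way such numbers might be obtained is from metrics recorded during known-good operation, as sketched below. The function name `build_baseline` is an assumption made for this example; the metric keys are the ones `check_session` reads, and `max_tool_chain_depth` is left out because no matching session metric appears in the detector.

```python
from statistics import mean


def build_baseline(sessions: list[dict]) -> dict:
    """Derive baseline values from session metrics recorded during normal operation."""
    if not sessions:
        raise ValueError("Need at least one session to build a baseline")
    return {
        "avg_tools_per_request": mean(s["tools_called"] for s in sessions),
        "avg_response_length": mean(s["response_length"] for s in sessions),
        "max_external_calls": max(s["external_calls"] for s in sessions),
        # Every tool seen in normal traffic counts as expected.
        "expected_tools": sorted({t for s in sessions for t in s["tools_used"]}),
    }
```

A baseline built this way is only as trustworthy as the sessions fed into it, so the history should come from traffic that has already been reviewed.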
Putting It All Together: Defense-in-Depth
A production AI agent should implement all five layers:
- Input Sanitization — clean user inputs before they reach the LLM
- Structured Prompts — separate instructions from data using message roles
- Sandboxing — whitelist tools, limit iterations, require approval for sensitive actions
- Output Filtering — scan responses for data exfiltration and unauthorized actions
- Anomaly Detection — monitor for behavioral deviations that signal compromise
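The ordering of the layers can be sketched as a single request handler. This is an illustrative outline only: `handle_request`, `PipelineResult`, and the five callables are hypothetical names standing in for whatever implementations an application provides for each layer.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PipelineResult:
    response: str
    alerts: list[str] = field(default_factory=list)


def handle_request(
    user_input: str,
    sanitize: Callable[[str], str],                  # input sanitization
    build_messages: Callable[[str], list],           # structured prompt
    run_agent: Callable[[list], tuple[str, dict]],   # sandboxed agent loop
    filter_output: Callable[[str], str],             # output filtering
    check_session: Callable[[dict], list[str]],      # anomaly detection
) -> PipelineResult:
    """Pass one request through the five layers in order."""
    cleaned = sanitize(user_input)
    messages = build_messages(cleaned)
    response, metrics = run_agent(messages)  # returns the reply plus session metrics
    safe_response = filter_output(response)
    return PipelineResult(safe_response, check_session(metrics))
```

Passing the layers in as callables keeps each one independently testable and lets a failure in any layer raise before the next one runs.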
Quick Reference: Security Checklist
| Check | Status |
|---|---|
| User input never concatenated into system prompts | ☐ |
| Tool whitelist enforced at execution layer | ☐ |
| Iteration limits prevent infinite loops | ☐ |
| High-risk actions require human approval | ☐ |
| All tool calls logged for audit | ☐ |
| Output filtered for credential leakage | ☐ |
| Anomaly detection alerts configured | ☐ |
| Regular red team testing scheduled | ☐ |
This guide is part of the DataGate.ch AI Security series. For more on securing AI systems, see our Red Teaming Guide and Safety Benchmarks pages.
