Prompt Injection Defense Strategies for AI Agents

body{font-family:-apple-system,BlinkMacSystemFont,’Segoe UI‘,Roboto,sans-serif;max-width:900px;margin:0 auto;padding:2rem;line-height:1.7;color:#1a1a1a}
h1{color:#1a1a1a;border-bottom:3px solid #6366f1;padding-bottom:.5rem}
h2{color:#334155;margin-top:2rem}
h3{color:#475569}
code{background:#f1f5f9;padding:.2rem .5rem;border-radius:4px;font-size:.9em}
pre{background:#1e293b;color:#e2e8f0;padding:1.5rem;border-radius:8px;overflow-x:auto;font-size:.9em}
blockquote{border-left:4px solid #6366f1;padding-left:1rem;color:#64748b;font-style:italic}
table{border-collapse:collapse;width:100%;margin:1rem 0}
th,td{border:1px solid #e2e8f0;padding:.75rem;text-align:left}
th{background:#f8fafc}
.tag{display:inline-block;background:#e0e7ff;color:#4338ca;padding:.2rem .6rem;border-radius:999px;font-size:.85em;margin-right:.5rem}

Prompt Injection Defense Strategies for AI Agents

Reviewed: June 4, 2026

Published: May 26, 2026 | Reading time: 12 min | Topics: AI Security Prompt Injection Agent Architecture

Prompt injection remains the #1 attack vector against AI agents in 2026. As agents gain more autonomy — calling tools, accessing files, sending emails — the blast radius of a single injection grows dramatically. This guide covers defense-in-depth strategies that actually work.

Understanding the Threat Landscape

Prompt injection attacks against AI agents fall into three categories:

Attack Type Description Severity
Direct Injection Attacker embeds malicious instructions in user input High
Indirect Injection Malicious content in tool outputs, files, or web pages the agent reads Critical
Multi-turn Injection Attacker gradually manipulates context across multiple interactions Medium-High

Unlike chatbot-only systems, AI agents amplify injection risk because they act on instructions. A compromised agent can exfiltrate data, send unauthorized messages, or execute destructive commands.

Defense Layer 1: Input Sanitization

The first line of defense is treating all user input as untrusted. Here’s a practical sanitization pipeline:

import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class SanitizationResult:
    cleaned: str
    warnings: list[str]
    blocked: bool

class InputSanitizer:
    """Multi-layer input sanitization for AI agent prompts."""
    
    # Patterns commonly used in injection attempts
    INJECTION_PATTERNS = [
        r'ignores+(alls+)?(previous|prior|above|earlier)s+instructions?',
        r'forgets+(everything|all|what)s+(you|i)s+(said|told|know)',
        r'yous+ares+nows+a?s*(different|new|other)',
        r'systems*:s*',
        r'</?system>',
        r'[INST]|[/INST]',
        r'###s*(SYSTEM|HUMAN|ASSISTANT)',
    ]
    
    def __init__(self):
        self.compiled = [re.compile(p, re.IGNORECASE) for p in self.INJECTION_PATTERNS]
    
    def sanitize(self, user_input: str) -> SanitizationResult:
        warnings = []
        cleaned = user_input
        
        # Check for injection patterns
        for pattern in self.compiled:
            matches = pattern.findall(cleaned)
            if matches:
                warnings.append(f"Injection pattern detected: {pattern.pattern[:40]}...")
                cleaned = pattern.sub('[REDACTED]', cleaned)
        
        # Enforce length limits
        if len(cleaned) > 10000:
            warnings.append("Input truncated to 10000 chars")
            cleaned = cleaned[:10000]
        
        # Block if too many patterns triggered
        blocked = len(warnings) >= 3
        
        return SanitizationResult(
            cleaned=cleaned,
            warnings=warnings,
            blocked=blocked
        )

# Usage
sanitizer = InputSanitizer()
result = sanitizer.sanitize(user_message)
if result.blocked:
    raise SecurityError(f"Input blocked: {result.warnings}")

Key principles:

  • Never concatenate user input directly into system prompts — use structured message formats
  • Escape delimiters — attackers exploit XML tags, markdown, and special tokens
  • Length limits — oversized inputs are often injection carriers

Defense Layer 2: Output Filtering

Even with clean inputs, agents can be manipulated through tool outputs. Output filtering checks the agent’s responses before they reach the user or trigger actions:

class OutputFilter:
    """Filter agent outputs to prevent data exfiltration and action abuse."""
    
    # Patterns indicating potential data exfiltration
    SUSPICIOUS_OUTPUTS = [
        r'(?:password|secret|key|token|credential)s*[:=]s*S+',
        r'b[A-Za-z0-9+/]{40,}={0,2}b',  # Base64 blobs
        r'(?:send|post|email|upload)s+(?:to|at)s+S+',
    ]
    
    BLOCKED_DOMAINS = ['evil.com', 'attacker.net', 'pastebin.com']
    
    def filter_response(self, response: str, action: Optional[str] = None) -> dict:
        issues = []
        
        # Check for credential leakage
        for pattern in self.SUSPICIOUS_OUTPUTS:
            if re.search(pattern, response, re.IGNORECASE):
                issues.append("Potential credential/sensitive data in output")
                break
        
        # Check actions for unauthorized destinations
        if action:
            for domain in self.BLOCKED_DOMAINS:
                if domain in action:
                    issues.append(f"Action targets blocked domain: {domain}")
        
        return {
            "allowed": len(issues) == 0,
            "filtered_response": response if not issues else "[Output filtered for security]",
            "issues": issues
        }

# Wrap tool calls
def safe_tool_call(tool_func, output_filter, *args, **kwargs):
    result = tool_func(*args, **kwargs)
    check = output_filter.filter_response(str(result))
    if not check["allowed"]:
        raise SecurityError(f"Tool output filtered: {check['issues']}")
    return result

Defense Layer 3: Sandboxing and Capability Restrictions

The most robust defense is limiting what an agent can do. Principle of least privilege applies:


class AgentSandbox:
    """Capability-restricted execution environment for AI agents."""
    
    def __init__(self, allowed_tools: list[str], max_iterations: int = 10):
        self.allowed_tools = set(allowed_tools)
        self.max_iterations = max_iterations
        self.execution_count = 0
        self.audit_log = []
    
    def call_tool(self, tool_name: str, params: dict) -> dict:
        # Enforce tool whitelist
        if tool_name not in self.allowed_tools:
            raise SecurityError(
                f"Tool '{tool_name}' not in allowed set: {self.allowed_tools}"
            )
        
        # Enforce iteration limits (prevents infinite loops from injection)
        self.execution_count += 1
        if self.execution_count > self.max_iterations:
            raise SecurityError(f"Max iterations ({self.exhausted}) exceeded")
        
        # Log every action for audit
        self.audit_log.append({
            "tool": tool_name,
            "params": {k: v for k, v in params.items() if k != 'password'},
            "timestamp": datetime.utcnow().isoformat()
        })
        
        return execute_tool(tool_name, params)

    def require_approval(self, tool_name: str, params: dict) -> bool:
        """Require human approval for high-risk actions."""
        high_risk_tools = {'send_email', 'execute_code', 'delete_file', 'http_post'}
        return tool_name in high_risk_tools

# Example: Restricted agent for customer support
sandbox = AgentSandbox(
    allowed_tools=['search_knowledge_base', 'lookup_order', 'send_ticket'],
    max_iterations=5
)

Defense Layer 4: Structured Prompt Architecture

Instead of string concatenation, use structured message formats that separate instructions from data:

# ❌ VULNERABLE: String concatenation
prompt = f"""
You are a helpful assistant. Answer the user's question.
User says: {user_input}
"""

# ✅ SECURE: Structured messages with explicit role separation
messages = [
    {
        "role": "system",
        "content": "You are a helpful customer support agent. You may only answer questions about orders and products.",
        # System instructions are NOT influenced by user content
        "metadata": {"immutable": True}
    },
    {
        "role": "user", 
        "content": user_input,  # Treated as data, not instructions
        "metadata": {"source": "user", "trust_level": "untrusted"}
    }
]

# Even better: Use tool/function calling instead of free-text instructions
tools = [{
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Look up order status",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string", "pattern": "^ORD-[0-9]{6}$"}
            },
            "required": ["order_id"]
        }
    }
}]

Defense Layer 5: Anomaly Detection

Monitor agent behavior for signs of compromise:


class AgentAnomalyDetector:
    """Detect anomalous agent behavior that may indicate injection."""
    
    BASELINE = {
        "avg_tools_per_request": 2.5,
        "avg_response_length": 500,
        "max_external_calls": 3,
        "max_tool_chain_depth": 4
    }
    
    def check_session(self, session_metrics: dict) -> list[str]:
        alerts = []
        
        # Unusual tool usage
        if session_metrics["tools_called"] > self.BASELINE["avg_tools_per_request"] * 3:
            alerts.append("Excessive tool usage — possible injection loop")
        
        # Unexpected external calls
        if session_metrics["external_calls"] > self.BASELINE["max_external_calls"]:
            alerts.append(f"Too many external calls: {session_metrics['external_calls']}")
        
        # Unusual response patterns
        if session_metrics["response_length"] > self.BASELINE["avg_response_length"] * 5:
            alerts.append("Abnormally long response — possible data exfiltration")
        
        # Tool usage outside normal patterns
        unusual_tools = set(session_metrics["tools_used"]) - set(self.BASELINE.get("expected_tools", []))
        if unusual_tools:
            alerts.append(f"Unexpected tools called: {unusual_tools}")
        
        return alerts

Putting It All Together: Defense-in-Depth

A production AI agent should implement all five layers:

  1. Input Sanitization — clean user inputs before they reach the LLM
  2. Structured Prompts — separate instructions from data using message roles
  3. Sandboxing — whitelist tools, limit iterations, require approval for sensitive actions
  4. Output Filtering — scan responses for data exfiltration and unauthorized actions
  5. Anomaly Detection — monitor for behavioral deviations that signal compromise

Quick Reference: Security Checklist

Check Status
User input never concatenated into system prompts
Tool whitelist enforced at execution layer
Iteration limits prevent infinite loops
High-risk actions require human approval
All tool calls logged for audit
Output filtered for credential leakage
Anomaly detection alerts configured
Regular red team testing scheduled

This guide is part of the DataGate.ch AI Security series. For more on securing AI systems, see our Red Teaming Guide and Safety Benchmarks pages.

Schreibe einen Kommentar

Deine E-Mail-Adresse wird nicht veröffentlicht. Erforderliche Felder sind mit * markiert