Summary
A comprehensive benchmark + multi-layer defense framework for prompt injection in RAG-enabled agents. Reduces attack success from 73.2% to 8.7% across 847 adversarial test cases in 5 attack categories.
Source: arXiv 2511.15759 — Securing AI Agents Against Prompt Injection Attacks: A Comprehensive Benchmark and Defense Framework
Badrinath Ramakrishnan, Akshaya Balaji. Published 2025-11-19.
Key Results
- 847 adversarial test cases, 5 categories: direct injection, context manipulation, instruction override, data exfiltration, cross-context contamination
- Defense = content filtering + prompt architecture improvements + response verification (post-LLM check)
- 89.4% attack mitigation, 94.3% legitimate functionality preserved
- Evaluated across 7 LLMs — model-specific vulnerability profiles identified
Applicability to Zeph
Zeph already has ContentSanitizer + ExfiltrationGuard (epic #1195) covering content filtering and exfiltration.
Gap: The response verification layer is missing — no post-LLM check that the agent's output wasn't compromised by injected instructions.
Integration point: AgentLoop::turn() after LLM response, before tool execution dispatch.
- Scan the LLM response for injected-instruction patterns (overrides of `autonomy_level`, unauthorized memory writes, unexpected exfiltration paths)
- Cross-reference with known injection patterns from `ContentSanitizer::injection_patterns()`
- If flagged: escalate to WARN and optionally block tool execution (configurable)
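The escalate-vs-block step above could be sketched as a small decision helper. This is a hypothetical sketch, not Zeph's actual code: the `Action` and `VerificationConfig` names are assumptions, with the config fields mirroring the `[security.response_verification]` keys proposed below.

```rust
// Hypothetical wiring for the integration point in AgentLoop::turn():
// after the LLM response is verified, decide whether tool execution
// is dispatched, dispatched with a WARN escalation, or blocked.

#[derive(Debug, PartialEq)]
pub enum Action {
    Dispatch,        // response clean (or verification disabled): proceed
    WarnAndDispatch, // flagged: escalate to WARN, still dispatch tools
    Block,           // flagged and block_on_detection: stop tool execution
}

/// Assumed shape of the [security.response_verification] config section.
pub struct VerificationConfig {
    pub enabled: bool,
    pub block_on_detection: bool,
}

/// Map a verification flag plus config onto the dispatch decision.
pub fn on_verification(flagged: bool, cfg: &VerificationConfig) -> Action {
    if !cfg.enabled || !flagged {
        Action::Dispatch
    } else if cfg.block_on_detection {
        Action::Block
    } else {
        Action::WarnAndDispatch
    }
}
```

With `block_on_detection = false` (the default suggested below), a flagged response only escalates to WARN, preserving legitimate functionality while still surfacing the SEC panel alert.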
Complements #1651 (PromptArmor, which pre-screens at input); this issue adds post-LLM response verification.
Implementation Sketch
- `ResponseVerifier` struct in `zeph-core::security`
- `verify_response(response: &str, injection_context: &InjectionContext) -> VerificationResult`
- Config: `[security.response_verification]` with `enabled = true`, `block_on_detection = false`
- TUI: show a SEC panel alert when response verification fires
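A minimal sketch of the proposed verifier, assuming simple case-insensitive substring matching against both a built-in marker list and patterns handed down from the sanitizer stage. The shapes of `InjectionContext` and `VerificationResult` and the pattern strings are illustrative assumptions, not Zeph's actual definitions.

```rust
/// Assumed context carried from earlier pipeline stages, e.g. patterns
/// surfaced upstream by ContentSanitizer::injection_patterns().
pub struct InjectionContext {
    pub known_patterns: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum VerificationResult {
    Clean,
    /// Patterns that matched, for the WARN log / SEC panel alert.
    Flagged(Vec<String>),
}

pub struct ResponseVerifier {
    /// Built-in injected-instruction markers (illustrative examples only).
    builtin_patterns: Vec<&'static str>,
}

impl ResponseVerifier {
    pub fn new() -> Self {
        Self {
            builtin_patterns: vec![
                "autonomy_level", // attempted autonomy override
                "memory.write",   // unauthorized memory write
                "exfiltrate",     // unexpected exfiltration path
            ],
        }
    }

    /// Post-LLM check: scan the response text against built-in and
    /// upstream injection patterns before tool execution dispatch.
    pub fn verify_response(
        &self,
        response: &str,
        ctx: &InjectionContext,
    ) -> VerificationResult {
        let lower = response.to_lowercase();
        let mut hits: Vec<String> = Vec::new();
        for p in &self.builtin_patterns {
            if lower.contains(&p.to_lowercase()) {
                hits.push((*p).to_string());
            }
        }
        for p in &ctx.known_patterns {
            if lower.contains(&p.to_lowercase()) {
                hits.push(p.clone());
            }
        }
        if hits.is_empty() {
            VerificationResult::Clean
        } else {
            VerificationResult::Flagged(hits)
        }
    }
}
```

A production version would likely use compiled regexes and structural checks on proposed tool calls rather than raw substring matching, but the interface matches the `verify_response` signature sketched above.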