research(security): multi-layer prompt injection defense with response verification (SecureAgent) #1862

@bug-ops

Summary

The paper presents a benchmark and a multi-layer defense framework for prompt injection in RAG-enabled agents. It reports a drop in attack success rate from 73.2% to 8.7% across 847 adversarial test cases in 5 attack categories.

Source: arXiv 2511.15759 — Securing AI Agents Against Prompt Injection Attacks: A Comprehensive Benchmark and Defense Framework
Badrinath Ramakrishnan, Akshaya Balaji. Published 2025-11-19.

Key Results

  • 847 adversarial test cases, 5 categories: direct injection, context manipulation, instruction override, data exfiltration, cross-context contamination
  • Defense = content filtering + prompt architecture improvements + response verification (post-LLM check)
  • 89.4% attack mitigation, 94.3% legitimate functionality preserved
  • Evaluated across 7 LLMs — model-specific vulnerability profiles identified

Applicability to Zeph

Zeph already has ContentSanitizer + ExfiltrationGuard (epic #1195) covering content filtering and exfiltration.

Gap: the response verification layer is missing. There is no post-LLM check that the agent's output was not compromised by injected instructions.

Integration point: AgentLoop::turn() after LLM response, before tool execution dispatch.

  1. Scan LLM response for injected-instruction patterns (overrides of autonomy_level, unauthorized memory writes, unexpected exfiltration paths)
  2. Cross-reference with known injection patterns from ContentSanitizer::injection_patterns()
  3. If flagged: escalate to WARN, optionally block tool execution (configurable)
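
The ordering described above (check the response first, then decide whether tools run) can be sketched as a small gate function. This is only an illustration under stated assumptions: `ToolCall`, `TurnOutcome`, and `gate_tool_dispatch` are hypothetical names, the pattern list stands in for whatever `ContentSanitizer::injection_patterns()` returns, and plain case-insensitive substring matching is a placeholder for the real detection logic.

```rust
/// Hypothetical stand-in for a tool call parsed out of the LLM response.
#[derive(Debug, Clone, PartialEq)]
pub struct ToolCall {
    pub name: String,
}

/// What the turn does with the requested tool calls (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum TurnOutcome {
    /// Nothing suspicious: tools run as usual.
    Dispatched(Vec<ToolCall>),
    /// Patterns matched, but blocking is off: tools run and the caller logs a WARN.
    DispatchedWithWarning { calls: Vec<ToolCall>, findings: Vec<String> },
    /// Patterns matched and blocking is on: no tool is executed.
    Blocked { findings: Vec<String> },
}

/// Runs after the LLM response arrives and before any tool is dispatched.
/// `known_patterns` is assumed to come from the existing sanitizer.
pub fn gate_tool_dispatch(
    response: &str,
    calls: Vec<ToolCall>,
    known_patterns: &[&str],
    block_on_detection: bool,
) -> TurnOutcome {
    // Steps 1 and 2: case-insensitive scan against the known patterns.
    let lowered = response.to_lowercase();
    let findings: Vec<String> = known_patterns
        .iter()
        .filter(|p| lowered.contains(&p.to_lowercase()))
        .map(|p| p.to_string())
        .collect();

    // Step 3: escalate, and block only when configured to do so.
    if findings.is_empty() {
        TurnOutcome::Dispatched(calls)
    } else if block_on_detection {
        TurnOutcome::Blocked { findings }
    } else {
        TurnOutcome::DispatchedWithWarning { calls, findings }
    }
}
```

Returning the findings instead of logging inside the function keeps the gate pure, so the caller can decide how to surface the WARN.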

Complements #1651 (PromptArmor, which pre-screens at input); this issue adds post-LLM response verification.

Implementation Sketch

  • ResponseVerifier struct in zeph-core::security
  • verify_response(response: &str, injection_context: &InjectionContext) -> VerificationResult
  • Config: [security.response_verification] enabled = true, block_on_detection = false
  • TUI: show SEC panel alert when response verification fires
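
To make the proposed shapes concrete, here is a minimal sketch of the types listed above. Only the names `ResponseVerifier`, `verify_response`, `InjectionContext`, `VerificationResult`, and the two config keys come from this issue. Everything else is an assumption: the fields of `InjectionContext`, the variants of `VerificationResult`, the `ResponseVerificationConfig` struct, and the substring matching are hypothetical placeholders.

```rust
/// Assumed mirror of `[security.response_verification]`; defaults follow the issue text.
#[derive(Debug, Clone)]
pub struct ResponseVerificationConfig {
    pub enabled: bool,
    pub block_on_detection: bool,
}

impl Default for ResponseVerificationConfig {
    fn default() -> Self {
        Self { enabled: true, block_on_detection: false }
    }
}

/// Hypothetical contents: the injection patterns relevant to the current turn.
pub struct InjectionContext {
    pub patterns: Vec<String>,
}

/// Hypothetical result shape for the verifier.
#[derive(Debug, PartialEq)]
pub enum VerificationResult {
    Clean,
    Flagged { matched: Vec<String>, block_tools: bool },
}

pub struct ResponseVerifier {
    config: ResponseVerificationConfig,
}

impl ResponseVerifier {
    pub fn new(config: ResponseVerificationConfig) -> Self {
        Self { config }
    }

    /// Checks the LLM response against the context's patterns (case-insensitive).
    pub fn verify_response(
        &self,
        response: &str,
        injection_context: &InjectionContext,
    ) -> VerificationResult {
        // A disabled verifier never flags anything.
        if !self.config.enabled {
            return VerificationResult::Clean;
        }
        let lowered = response.to_lowercase();
        let matched: Vec<String> = injection_context
            .patterns
            .iter()
            .filter(|p| lowered.contains(&p.to_lowercase()))
            .cloned()
            .collect();
        if matched.is_empty() {
            VerificationResult::Clean
        } else {
            VerificationResult::Flagged {
                matched,
                block_tools: self.config.block_on_detection,
            }
        }
    }
}
```

Carrying `block_tools` in the result lets both the agent loop and the TUI alert read the decision from one value without consulting the config again.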

Labels: P4 (Long-term / exploratory), research (Research-driven improvement)