
research(testing): TDAD behavioral spec testing for skills and system prompt blocks #1842

@bug-ops

Description


Research Finding

Paper: Test-Driven AI Agent Definition (TDAD) (arXiv:2603.08806, 2026)

Treats agent system prompts and skill definitions as compiled artifacts: behavioral specs → executable tests (generated by a coding agent) → iterative prompt refinement until the tests pass. Adds semantic mutation testing (deliberately faulty prompt variants) to measure test-suite robustness. Reports 92% compilation success and 86–100% mutation scores.

Applicability to Zeph

Directly applicable to Zeph's continuous improvement protocol and self-learning pipeline:

1. Skill behavioral specs

Each SKILL.md could have a companion SKILL_TESTS.md containing expected input/output behavior pairs. After self-learning mutates a skill, run the behavioral tests to validate that the mutation didn't regress behavior.
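A minimal sketch of that regression gate, assuming a spec format of input/expected-substring pairs. `BehaviorSpec` and `failing_specs` are hypothetical names, not existing Zeph API; the skill itself is stubbed as a closure.

```rust
/// One input/expected-behavior pair parsed from a SKILL_TESTS.md companion file.
/// (Hypothetical structure — the on-disk format is not specified by the issue.)
struct BehaviorSpec {
    input: &'static str,
    /// Substring the skill's output must contain for the spec to pass.
    expect_contains: &'static str,
}

/// Run every spec against a skill (stubbed here as a closure over its input)
/// and return the inputs of the failing specs.
fn failing_specs<'a>(
    skill: impl Fn(&str) -> String,
    specs: &'a [BehaviorSpec],
) -> Vec<&'a str> {
    specs
        .iter()
        .filter(|s| !skill(s.input).contains(s.expect_contains))
        .map(|s| s.input)
        .collect()
}

fn main() {
    let specs = [
        BehaviorSpec { input: "summarize: hello world", expect_contains: "hello" },
        BehaviorSpec { input: "summarize: rust", expect_contains: "rust" },
    ];
    // Stub skill: echoes its input; a real run would invoke the mutated skill.
    let skill = |input: &str| input.to_string();
    // Gate: reject the self-learning mutation if any spec fails.
    let failures = failing_specs(skill, &specs);
    println!("regressions: {}", failures.len());
}
```

The failure list, rather than a bare pass/fail, would give the self-learning pipeline something concrete to feed back into the next refinement round.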

2. System prompt block testing

Zeph's system prompt is structured into blocks (Block 1: base identity, Block 2: volatile env). TDAD mutation testing could verify that removing or altering a block causes a measurable behavior change — confirming the block is actually load-bearing.
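A sketch of that block-level mutation check: drop each block in turn and test whether an observable behavior signal changes. The harness names are hypothetical; the behavior signal is stubbed as prompt length, where a real harness would diff agent outputs on a fixed eval set.

```rust
/// Assemble prompt blocks into a single system prompt (sketch).
fn assemble(blocks: &[&str]) -> String {
    blocks.join("\n\n")
}

/// A block is "load-bearing" if removing it changes some observable behavior
/// signal. `behavior` is a stand-in for running the agent and scoring output.
fn load_bearing(blocks: &[&str], idx: usize, behavior: impl Fn(&str) -> usize) -> bool {
    let baseline = behavior(&assemble(blocks));
    let mutated: Vec<&str> = blocks
        .iter()
        .enumerate()
        .filter(|(i, _)| *i != idx)
        .map(|(_, b)| *b)
        .collect();
    behavior(&assemble(&mutated)) != baseline
}

fn main() {
    let blocks = ["Block 1: base identity", "Block 2: volatile env"];
    // Stub behavior signal; a real check would compare eval-set outputs.
    let signal = |prompt: &str| prompt.len();
    for i in 0..blocks.len() {
        println!("block {} load-bearing: {}", i + 1, load_bearing(&blocks, i, signal));
    }
}
```

A block whose removal leaves the signal unchanged on every eval case would be a candidate for pruning — the same conclusion TDAD's mutation scores are designed to surface.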

3. Two-agent loop integration

The TDAD two-agent loop (test writer + prompt refiner) maps naturally onto Zeph's orchestration: spawn a test-writer sub-agent to generate behavioral tests for a skill, then a skill-refiner sub-agent to improve the skill until the tests pass. The existing AgentTestHarness (ARCH-08) can serve as the test executor.
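The loop's control flow can be sketched as below. The sub-agents and the harness are injected as closures (`write_tests`, `run_tests`, `refine` are hypothetical stand-ins; AgentTestHarness would sit where `run_tests` is stubbed).

```rust
/// A skill definition under refinement (sketch).
struct Skill { body: String }

/// TDAD loop: generate tests once, then refine the skill until all tests
/// pass or the iteration budget is exhausted. Returns the final skill and
/// whether it converged.
fn refine_until_green(
    mut skill: Skill,
    write_tests: impl Fn(&Skill) -> Vec<String>,      // test-writer sub-agent
    run_tests: impl Fn(&Skill, &[String]) -> usize,   // test executor; returns failure count
    refine: impl Fn(Skill, usize) -> Skill,           // skill-refiner sub-agent
    max_iters: usize,
) -> (Skill, bool) {
    let tests = write_tests(&skill);
    for _ in 0..max_iters {
        let failures = run_tests(&skill, &tests);
        if failures == 0 {
            return (skill, true);
        }
        skill = refine(skill, failures);
    }
    (skill, false)
}

fn main() {
    let skill = Skill { body: String::new() };
    // Stubs: three tests; each refinement round fixes one failure.
    let write_tests = |_: &Skill| vec!["t1".to_string(), "t2".to_string(), "t3".to_string()];
    let run_tests = |s: &Skill, t: &[String]| t.len().saturating_sub(s.body.len());
    let refine = |mut s: Skill, _| { s.body.push('x'); s };
    let (_, green) = refine_until_green(skill, write_tests, run_tests, refine, 5);
    println!("converged: {green}");
}
```

The iteration budget matters: without it, a skill that can never satisfy its tests would spin the refiner sub-agent indefinitely.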

References

  • arXiv:2603.08806
  • Zeph crates: zeph-skills (learning.rs, registry.rs), zeph-core (agent/), AgentTestHarness (ARCH-08)

Metadata


Labels

  • P4 — Long-term / exploratory
  • research — Research-driven improvement
  • skills — zeph-skills crate
