Deeptrace is a multimodal sensitive-data leak simulation engine that generates realistic synthetic leak artifacts across image, video, audio, and document channels.
The project is designed for security testing, DLP model training, red-team exercises, and awareness programs where real leaked data is either unavailable or too risky to use.
Modern leak incidents are no longer text-only. In practice, secrets leak through screenshots, recordings, meetings, and office documents. Traditional DLP coverage is often insufficient for these channels.
Deeptrace simulates realistic leak trajectories end to end. Instead of randomly inserting secret strings, it uses LLM-driven contextual generation to place secrets into coherent host environments, then renders them into multimodal artifacts.
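As a hedged sketch of that contextual-generation step (the real prompt templates live in llm_utils.py and are not reproduced here; `build_context_prompt` is a hypothetical helper, not Deeptrace's actual API):

```python
# Illustrative only: build_context_prompt is a hypothetical helper, not
# Deeptrace's actual API; the real prompts live in llm_utils.py.
def build_context_prompt(secret: str, secret_type: str, scene: str) -> str:
    """Ask an LLM for a coherent host context with the secret embedded once."""
    return (
        f"Write a realistic {scene} artifact that a developer might produce.\n"
        f"Embed this {secret_type} naturally, exactly once: {secret}\n"
        "Surround it with plausible, semantically coherent context."
    )
```

The point of prompting for a whole host artifact, rather than splicing the secret into random text, is that the surrounding context (imports, config keys, chat turns) stays consistent with where such a secret would actually appear.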
- DLP cold start: generate large synthetic datasets without exposing production secrets.
- Red/blue exercises: produce believable honey artifacts for adversary engagement.
- Security awareness: create immersive leak examples for internal training.
Simulating the unseen, to trace the unknown.
- Multimodal coverage instead of text-only assumptions. Traditional DLP workflows mostly focus on text logs, while real incidents often happen through screenshots, recordings, shared slides, and meeting audio. Deeptrace fills that practical gap.
- Safe synthetic data for model training. High-quality leak datasets are hard to collect because real secrets are regulated and dangerous to circulate. Deeptrace produces high-fidelity synthetic samples without exposing production credentials.
- Better realism for offensive and defensive drills. Security exercises become more actionable when bait files and leak traces look operationally authentic.
- LLM-based contextual generation. Deeptrace does not merely paste a secret string into random text; it generates semantically coherent host context such as code snippets, config fragments, chat logs, and business documents.
- Adaptive scene routing. For visual modalities, the system can infer where a secret is most likely to appear (IDE, CLI, chat, config, UI), reducing manual rule authoring.
- End-to-end multimodal rendering pipeline. The same scenario can be rendered into static images, scrolling videos, speech, PDF, Word, and PPT outputs to emulate real-world leakage surfaces.
- Robustness-oriented augmentation. Optional noise and compression-like perturbations help evaluate OCR and vision models under imperfect capture conditions.
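As a minimal sketch of the augmentation idea above (assuming images as uint8 numpy arrays; the helper names are illustrative, not Deeptrace's actual API):

```python
# Illustrative augmentation helpers; names are assumptions, not Deeptrace's API.
import numpy as np

def add_capture_noise(img: np.ndarray, sigma: float = 8.0, seed: int = 0) -> np.ndarray:
    """Simulate sensor noise: add Gaussian noise, then clip to the valid pixel range."""
    rng = np.random.default_rng(seed)
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def downsample_upsample(img: np.ndarray, factor: int = 2) -> np.ndarray:
    """Approximate compression/rescaling loss by dropping and repeating pixels."""
    small = img[::factor, ::factor]
    return np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)
```

Running OCR or detection models on both the clean and the perturbed renderings gives a rough measure of how robust they are to imperfect capture conditions.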
Deeptrace was intentionally built in layers, and this evolution is important for understanding the current architecture:
- Phase 1: Scenario generation and visual rendering. The initial version focused on LLM-driven scene synthesis and image/video outputs.
- Phase 2: Audio and document modalities. Audio, PDF, Word, and PPT generators were added to cover broader enterprise leakage channels.
- Phase 3: Secret generator module (added later). The key and network-string generator was introduced as a later enhancement to support scalable synthetic secret creation. This module is an incremental extension, not the original core, and is now integrated as an optional upstream source for scenario payloads.
This staged growth explains why some modules look more mature than others and why interfaces evolved over time.
- Image: IDE/CLI/chat/config/UI screenshots.
- Video: Pan/scroll videos generated from visual scenes.
- Audio: Secret mention simulation via TTS.
- PDF: Formal leak-like document artifacts.
- Word: Internal memo and handoff style documents.
- PPT: Presentation-style sensitive content artifacts.
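The modality list above maps naturally onto a dispatch table. The sketch below is hypothetical: the modules under generators/ are real, but the entry-point signatures shown here are assumptions.

```python
# Hypothetical modality dispatch; generator entry points are assumptions.
from typing import Callable, Dict

def _stub(name: str) -> Callable[[dict], str]:
    # A real generator would render an artifact file; stubs just report the route.
    return lambda item: f"{name}:{item['secret_type']}"

RENDERERS: Dict[str, Callable[[dict], str]] = {
    "image": _stub("image"),
    "video": _stub("video"),
    "audio": _stub("audio"),
    "pdf": _stub("pdf"),
    "word": _stub("word"),
    "ppt": _stub("ppt"),
}

def render(item: dict) -> str:
    """Route a generation item to its modality-specific renderer."""
    fn = RENDERERS.get(item["modality"])
    if fn is None:
        raise ValueError(f"unsupported modality: {item['modality']}")
    return fn(item)
```

Keeping the per-modality logic behind a single dispatch point is what lets the same scenario be rendered across all six output channels.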
```
.
├── main.py
├── llm_utils.py
├── config/
│   ├── config.json
│   └── config_sample.json
├── generators/
│   ├── secret_generators/
│   │   ├── Key_Gen.py
│   │   ├── NetworkStr_Gen.py
│   │   └── secret_gen.py
│   ├── audio_gen.py
│   ├── chat_gen.py
│   ├── cli_gen.py
│   ├── config_gen.py
│   ├── pdf_gen.py
│   ├── ppt_gen.py
│   ├── ui_gen.py
│   ├── video_utils.py
│   ├── vscode_gen.py
│   └── word_gen.py
├── output/
└── output_sample/
```
- Python 3.8+
- Node.js 18+
- LLM API key
```shell
pip install -r requirements.txt
sudo apt-get install -y espeak-ng
```

Create or edit config/config.json based on config/config_sample.json.
Key fields:
- `api_key`: LLM API key
- `base_url`: LLM API endpoint
- `output_dir`: output folder
- `add_noise`: whether to add visual noise
- `items`: generation list; each entry specifies `secret`, `secret_type`, `modality`, and an optional `scene`
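As a hedged illustration, a minimal config/config.json might look like the fragment below. All values are placeholders, and the exact accepted values for `secret_type` and `scene` are assumptions; the authoritative enums live in the generator sources.

```json
{
  "api_key": "YOUR_LLM_API_KEY",
  "base_url": "https://api.example.com/v1",
  "output_dir": "output",
  "add_noise": true,
  "items": [
    {
      "secret": "AKIAEXAMPLEKEY123456",
      "secret_type": "aws_access_key",
      "modality": "image",
      "scene": "ide"
    },
    {
      "secret": "placeholder-token",
      "secret_type": "api_token",
      "modality": "pdf"
    }
  ]
}
```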
```shell
python main.py
python main.py --config config/config_sample.json
```

To run the backend:

```shell
cd backend
pip install -r requirements.txt
python main.py
```

Backend endpoints:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
```shell
cd frontend
npm install
npm run dev
```

Frontend: http://localhost:3000
Typical web flow:
- Register and log in.
- Create a task from dashboard.
- Configure secret type, modality, and scene.
- Submit task and review outputs.
Internal generation pipeline:

- Analyze `secret_type` and `modality`.
- Select or infer a scene for image/video tasks.
- Generate context-rich content with the embedded secret.
- Render the result with the modality-specific generator.
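The first two steps above can be sketched as a small routing function. This is illustrative only: the real routing logic lives in main.py, and the hint table below is an assumption, not Deeptrace's actual mapping.

```python
# Illustrative scene inference; the mapping is a hypothetical example,
# not Deeptrace's actual routing table (which lives in main.py).
from typing import Optional

SCENE_HINTS = {
    "ssh_key": "cli",
    "aws_access_key": "config",
    "db_password": "config",
    "chat_token": "chat",
}

def infer_scene(secret_type: str, modality: str,
                scene: Optional[str] = None) -> Optional[str]:
    """Scenes apply only to image/video; an explicit user choice wins."""
    if modality not in ("image", "video"):
        return None
    if scene:
        return scene
    return SCENE_HINTS.get(secret_type, "ide")
```

An explicit `scene` in the task config always overrides inference, so adaptive routing only kicks in when the user leaves the field empty.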
The secret generator was added after the initial multimodal pipeline and is now available as a reusable component.
- Key generator APIs: `Keygen(modeset: int) -> str`, `RandKeygen() -> str`
- Network string APIs: `Strgen(modeset: int) -> str`, `RandStrgen() -> str`
- Unified wrapper: `AllRandgen() -> str`
See source files under generators/secret_generators/ for supported enum types and format constraints.
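To illustrate what format-constrained synthetic secret generation looks like, here is a minimal standalone sketch. It is not the Key_Gen.py implementation; the function names and the two formats shown are assumptions for demonstration.

```python
# Standalone sketch of format-constrained fake secrets; not Key_Gen.py's API.
import secrets
import string

def fake_aws_access_key() -> str:
    """AWS access key IDs are 'AKIA' plus 16 uppercase alphanumerics (20 chars total)."""
    alphabet = string.ascii_uppercase + string.digits
    return "AKIA" + "".join(secrets.choice(alphabet) for _ in range(16))

def fake_hex_token(nbytes: int = 20) -> str:
    """Generic hex API token for bearer-style secrets (2 hex chars per byte)."""
    return secrets.token_hex(nbytes)
```

Because the outputs match real key formats character-for-character, they exercise the same regex and entropy heuristics a DLP detector would apply, without ever touching a production credential.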
To keep docs maintainable, previous standalone pitch and usage docs are merged into this README.