Highlights · Environment Setup · Quick Start · Repository Structure · Acknowledgements · License · Citation
SpecEyes is a speculative perception and planning framework for agentic multimodal LLMs. It uses a lightweight vision-language model to quickly assess a visual input and question, then applies answer separability gating to either accept the fast answer or defer to a more powerful large model with tool usage. This approach significantly reduces latency and computation in complex multimodal reasoning while maintaining strong accuracy. This repository includes evaluation code, judge scripts, confidence analysis, and result aggregation tools for SpecEyes.
## Highlights ✨

| Direction | Description |
|---|---|
| Stateful Bottleneck Analysis | Reveal the sequential tool-use dependency limiting latency and concurrency in agentic MLLMs. |
| Agentic-Level Speculation | Propose speculative reasoning that skips full tool invocation loops for easy queries. |
| Answer Separability Gating | Introduce a new confidence metric based on top-K logit gaps to decide safe bypass. |
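The answer separability gate described above can be pictured with a short sketch. The exact metric here (a top-1/top-2 softmax-probability gap) and the threshold semantics are illustrative assumptions, not the repository's actual implementation:

```python
import math

def separability_gate(logits, threshold=0.98):
    """Toy answer-separability score: softmax the answer logits and
    measure how far the top-1 probability separates from the top-2.
    The metric SpecEyes actually uses may differ; this is a sketch."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = sorted((e / total for e in exps), reverse=True)
    score = probs[0] - probs[1]  # top-2 gap in probability space
    return score, score >= threshold  # accept the fast answer only if well separated
```

A sharply peaked answer distribution clears the gate and keeps the fast path; a flat one defers to the large model.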
## Table of Contents

- Highlights ✨
- Table of Contents
- 1. Environment Setup 🛠️
- 2. Quick Start 🚀
- 3. Repository Structure 🗂️
- 4. Acknowledgements 🙏
- 5. License ⚖️
- 6. Citation 📚
## 1. Environment Setup 🛠️

We recommend Python 3.11. Install the PyTorch build matching your CUDA version first, then install the project requirements:
```bash
pip install -r requirements.txt
```

Recommended optional packages:

- `flash-attn`: useful for higher throughput on supported GPUs
- `vllm==0.12.0`: recommended in a separate environment for the judge model service
This repository also relies on a patched image-loading behavior in `qwen-vl-utils`. After installing `qwen-vl-utils`, run:
```bash
python scripts/patch_qwen_vl_utils.py
```

Download the datasets and models into the following directories, or pass explicit paths at runtime:

- V*: `data/vstar`
- HR-Bench: `data/HR-Bench`
- POPE: `data/POPE`
- Deepeyes: `ChenShawn/DeepEyes-7B`
- Thyme: `Kwai-Keye/Thyme-RL`
- Qwen3-VL-2B: `Qwen/Qwen3-VL-2B-Instruct`
- Qwen2.5-72B: `Qwen/Qwen2.5-72B-Instruct`
## 2. Quick Start 🚀

```bash
# Deepeyes baseline
python eval_code_deepeyes/SpecEyes.py --baseline

# Deepeyes with confidence gating
python eval_code_deepeyes/SpecEyes.py --score_threshold 0.98

# Thyme baseline
python eval_code_thyme/SpecEyes.py --baseline

# Thyme with confidence gating
python eval_code_thyme/SpecEyes.py --score_threshold 0.98
```

For the code-reasoning variant, replace `SpecEyes.py` with `SpecReason.py`.
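At a high level, the gated pipeline behaves like the sketch below; `small_model`, `large_model_with_tools`, and the returned score are hypothetical placeholders, not actual APIs from this repository:

```python
def speculate(question, small_model, large_model_with_tools, score_threshold=0.98):
    """Speculative perception sketch: answer with the fast small model,
    and defer to the agentic large model only when the gate rejects.
    All callables here are illustrative stand-ins, not repo APIs."""
    answer, score = small_model(question)
    if score >= score_threshold:  # answer-separability gate accepts
        return answer, "fast"
    return large_model_with_tools(question), "deferred"
```

With `--baseline`, every query would take the slow path; with `--score_threshold`, confident queries skip the tool-use loop entirely.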
```bash
bash scripts/start_qwen2.5_72b_vllm.sh
```

The default judge endpoint is `http://localhost:23333/v1`. Override it with `--api_url` if needed.
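Since the judge service exposes an OpenAI-compatible endpoint, a request to it has the standard chat-completions shape. The sketch below only builds the payload; the prompt wording and model name are assumptions, not the repository's actual judge prompt:

```python
def build_judge_request(question, reference, prediction,
                        model="Qwen2.5-72B-Instruct"):
    """Build an OpenAI-compatible chat-completions payload for the judge.
    The prompt wording and model name here are illustrative assumptions;
    the payload would be POSTed to <api_url>/chat/completions."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Is the model answer correct? Reply yes or no."
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    }
```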
```bash
bash scripts/run_judges.sh
```

You can also run them manually:
```bash
python judge_code/judge_vstar.py --input_folder eval_results_qwen3vl-2b-Instruct
python judge_code/judge_hr.py --input_folder eval_results_qwen3vl-2b-Instruct
python judge_code/judge_pope.py --input_folder eval_results_qwen3vl-2b-Instruct
```

```bash
# Run batched small-model inference
python scripts/small_model_batch_inference.py

# Judge the generated outputs
python judge_code/judge_vstar.py --input_folder eval_results_qwen3vl-2b-Instruct
python judge_code/judge_hr.py --input_folder eval_results_qwen3vl-2b-Instruct
```
```bash
# Analyze judge results
python scripts/analyze_small_confidence.py --input_folder judge_results_qwen3vl-2b-Instruct
python scripts/analyze_small_conf_percentage.py --input_folder judge_results_qwen3vl-2b-Instruct
```

## 3. Repository Structure 🗂️

```
SpecEyes/
├── data/
│   ├── vstar/
│   ├── HR-Bench/
│   └── POPE/
├── eval_code_deepeyes/
├── eval_code_thyme/
├── judge_code/
├── scripts/
├── vis/
├── eval_results_deepeyes/
├── eval_results_thyme/
└── ...
```
Core directories:
| Path | Description |
|---|---|
| `eval_code_deepeyes/` | SpecEyes and SpecReason evaluation code built on Deepeyes |
| `eval_code_thyme/` | SpecEyes and SpecReason evaluation code built on Thyme |
| `judge_code/` | Judge scripts using a vLLM OpenAI-compatible endpoint |
| `scripts/small_model_batch_inference.py` | Batched small-model inference and confidence signal export |
| `scripts/gather_result.py` | Aggregation of speedup and accuracy results |
| `scripts/analyze_small_confidence.py` | Confidence-distribution and performance analysis |
| `vis/` | Plotting and visualization utilities used in the paper |
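As a rough picture of the kind of aggregation `scripts/gather_result.py` performs, consider this sketch; the per-record field names are assumptions for illustration, not the script's actual schema:

```python
def aggregate(records):
    """Aggregate per-sample results into overall accuracy and speedup.
    Each record is assumed to carry: correct (bool), baseline_latency,
    and latency (seconds). Field names are illustrative only."""
    n = len(records)
    accuracy = sum(r["correct"] for r in records) / n
    # End-to-end speedup: total baseline wall time over total gated wall time
    speedup = sum(r["baseline_latency"] for r in records) / sum(r["latency"] for r in records)
    return {"accuracy": accuracy, "speedup": speedup}
```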
Additional notes:

- `eval_code_thyme/sandbox.py` is a localized sandbox copy used by the Thyme evaluation pipeline
- Temporary processed images are written to `eval_code_thyme/temp_processed_images/`
- Result folders and cache directories are intentionally excluded through `.gitignore`
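The confidence analysis scripts can be thought of as bucketing samples by gate score and measuring accuracy per bucket; a hypothetical sketch (the real binning scheme may differ):

```python
def accuracy_by_confidence(samples, n_bins=10):
    """Bucket (score, correct) pairs into equal-width confidence bins
    and report per-bin accuracy (None for empty bins). A toy version
    of the analysis; the repo's actual binning may differ."""
    bins = [[] for _ in range(n_bins)]
    for score, correct in samples:
        idx = min(int(score * n_bins), n_bins - 1)  # clamp score == 1.0
        bins[idx].append(correct)
    return [sum(b) / len(b) if b else None for b in bins]
```

A well-calibrated gate shows accuracy rising with confidence, which is what justifies a high acceptance threshold like 0.98.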
## 4. Acknowledgements 🙏

This repository benefits from code references from the DeepEyes repository. We sincerely thank the authors and maintainers for their open-source contributions, which helped inform parts of our implementation and experimentation workflow.
## 5. License ⚖️

This repository is released under Apache-2.0. See `LICENSE` for the full license text.

The repository also includes notes about third-party code and patches, including:

- the upstream source attribution for `eval_code_thyme/sandbox.py`
- the patching behavior for `qwen-vl-utils`

See `THIRD_PARTY_NOTICES.md` for the relevant attribution and redistribution notes. If you redistribute or modify those third-party components, you should also follow the corresponding upstream license requirements.
## 6. Citation 📚

If you use this repository, please cite the corresponding paper:

```bibtex
@article{huang2026,
  title={SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning},
  author={Huang, Haoyu and Huang, Jinfa and Wan, Zhongwei and Zheng, Xiawu and Ji, Rongrong and Luo, Jiebo},
  journal={arXiv preprint arXiv:2603.23483},
  year={2026}
}
```