A Modular Platform for Agent Inference, Evaluation, and Training
AgentInferKit is an open-source platform for building, running, and analyzing LLM/VLM/Agent systems across text, multimodal, RAG, and tool-use settings. It is designed for agent inference today, and built to extend toward agent reasoning, agent training, and RL-based optimization in the future.
AgentInferKit follows a three-layer design:
- Platform Layer: unified model access, inference execution, tool simulation, batch evaluation, visualization, and engineering management
- Data Layer: dataset organization, preprocessing, standardization, versioning, and custom data loading
- Experiment Layer: benchmark protocols, controlled comparisons, and research-oriented analysis
At the current stage, the project mainly focuses on the platform layer and data layer.
- Unified access to API models, local models, and multimodal models
- Pluggable reasoning strategies such as Direct Prompting, CoT, Long-CoT, and ToT
- Built-in RAG pipeline with chunking, indexing, retrieval, and evidence tracking
- Support for API / function calling and tool-use simulation
- Batch inference, single-sample debugging, logging, retry, and resume
- Configurable evaluation with metrics for text, retrieval, and agent tasks
- Research-friendly visualization for predictions, traces, evidence, and errors
- Modular architecture for future extension to training and RL
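As a rough illustration of how the retry and resume features listed above could fit together in a batch run, here is a minimal sketch. All names in it (`run_batch`, `done`, `infer`, `flaky_infer`) are hypothetical and are not taken from AgentInferKit's actual runner code.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def run_batch(
    samples: Iterable[Tuple[str, str]],
    infer: Callable[[str], str],
    done: Dict[str, str],
    max_retries: int = 3,
    backoff_s: float = 0.0,
) -> Dict[str, str]:
    """Run `infer` over (sample_id, prompt) pairs, skipping ids already in `done`."""
    for sample_id, prompt in samples:
        if sample_id in done:  # resume: finished in an earlier run
            continue
        for attempt in range(1, max_retries + 1):
            try:
                done[sample_id] = infer(prompt)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up; `done` still holds the finished samples
                time.sleep(backoff_s * attempt)  # linear backoff between retries
    return done


# Stand-in for a model call (hypothetical): fails once, then succeeds.
calls = {"n": 0}


def flaky_infer(prompt: str) -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient error")
    return prompt.upper()


results = run_batch([("q1", "a"), ("q2", "b")], flaky_infer, done={"q0": "cached"})
```

In a real run the `done` mapping would presumably be loaded from and flushed to disk so that an interrupted job can pick up where it stopped.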
AgentInferKit currently targets the following task types:
- Text QA
- Knowledge-oriented text exam
- Image understanding
- API / function calling
- Retrieval-augmented reasoning
- Prompt-based reasoning strategy comparison
The platform layer is the engineering foundation of the project, including:
- model adapters
- reasoning strategies
- RAG pipeline
- task runners
- tool simulation
- evaluators
- visualization dashboard
- config and logging system
The data layer standardizes heterogeneous data into reusable benchmark assets, including:
- QA data
- text-exam data
- image understanding data
- agent API function calling data
The data layer is designed to make data runnable, evaluable, traceable, and versioned.
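One way to picture "traceable and versioned" data is a standardized record that carries its dataset version and a content fingerprint. The sketch below is an assumption for illustration only: the `Sample` class and its field names are hypothetical and do not reflect the project's real schemas under `data/`.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Sample:
    """Hypothetical standardized record shared by all task types."""
    sample_id: str
    task_type: str            # e.g. "qa", "text_exam", "image", "function_calling"
    inputs: Dict[str, Any]    # question, choices, image path, tool schema, ...
    reference: Any            # gold answer used by evaluators
    dataset_version: str      # bumped whenever the dataset is re-processed
    source: str = "unknown"   # where the raw item came from

    def fingerprint(self) -> str:
        """Short content hash, so a prediction can be traced to exact input data."""
        payload = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


sample = Sample(
    sample_id="exam-0001",
    task_type="text_exam",
    inputs={"question": "2 + 2 = ?", "choices": ["3", "4"]},
    reference="4",
    dataset_version="v1",
)
print(sample.fingerprint())
```

Storing the fingerprint next to each prediction would make it possible to detect when a result was produced from a different revision of the same sample.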
# Clone the repo
git clone https://github.com/CodeSoul-co/AgentInferKit.git
cd AgentInferKit
# Create conda environment
conda create -n benchmark python=3.11 -y
conda activate benchmark
pip install -r requirements.txt
# Configure API key
cp .env.example .env
# Edit .env and fill in your DEEPSEEK_API_KEY
PYTHONPATH=$(pwd) uvicorn src.main:app --host 0.0.0.0 --port 8000
Open browser: http://localhost:8000/docs to see all API endpoints.
Direct mode (fast, concise):
curl -s -X POST http://localhost:8000/chat/complete \
-H "Content-Type: application/json" \
-d '{
"model_id": "deepseek-chat",
"strategy": "direct",
"messages": [{"role": "user", "content": "What is machine learning?"}]
}' | python3 -m json.tool
Chain-of-Thought mode (step-by-step reasoning):
curl -s -X POST http://localhost:8000/chat/complete \
-H "Content-Type: application/json" \
-d '{
"model_id": "deepseek-chat",
"strategy": "cot",
"messages": [{"role": "user", "content": "A train travels 120km in 2 hours. What is its speed?"}]
}' | python3 -m json.tool
Streaming mode (token-by-token output):
curl -N -X POST http://localhost:8000/chat/stream \
-H "Content-Type: application/json" \
-d '{
"model_id": "deepseek-chat",
"strategy": "direct",
"messages": [{"role": "user", "content": "Write a short poem about spring"}]
}'
# Direct strategy experiment
PYTHONPATH=$(pwd) python scripts/run_experiment.py \
--config configs/experiments/demo_exam_direct.yaml
# CoT strategy experiment
PYTHONPATH=$(pwd) python scripts/run_experiment.py \
--config configs/experiments/demo_exam_cot.yaml
Results are saved to outputs/predictions/ and outputs/metrics/.
We ran 5 exam questions (math, physics, CS) with two strategies:
| Metric | Direct | CoT |
|---|---|---|
| Accuracy | 80% (4/5) | 100% (5/5) |
| Avg Latency | 2.2s | 10.7s |
| Avg Tokens | 69.6 | 281.4 |
CoT reasoning improves accuracy at the cost of higher latency and token usage.
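The trade-off can be quantified from the table itself. The short calculation below uses only the numbers reported above; the "tokens per correct answer" figure is an extra derived metric introduced here for illustration, not one the project reports.

```python
# Figures copied from the results table above.
direct = {"correct": 4, "total": 5, "avg_latency_s": 2.2, "avg_tokens": 69.6}
cot = {"correct": 5, "total": 5, "avg_latency_s": 10.7, "avg_tokens": 281.4}

latency_ratio = cot["avg_latency_s"] / direct["avg_latency_s"]
token_ratio = cot["avg_tokens"] / direct["avg_tokens"]


def tokens_per_correct(run: dict) -> float:
    """Total tokens spent divided by the number of correct answers."""
    return run["avg_tokens"] * run["total"] / run["correct"]


print(f"latency ratio (CoT / Direct): {latency_ratio:.2f}x")
print(f"token ratio (CoT / Direct):   {token_ratio:.2f}x")
print(f"tokens per correct answer, Direct: {tokens_per_correct(direct):.1f}")
print(f"tokens per correct answer, CoT:    {tokens_per_correct(cot):.1f}")
```

On this five-question demo, CoT costs roughly 4.9 times the latency and 4 times the tokens of Direct in exchange for one additional correct answer.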
AgentInferKit/
├── src/
│ ├── adapters/ # LLM provider adapters (DeepSeek, OpenAI, Anthropic, Qwen)
│ ├── strategies/ # Inference strategies (direct, cot, long_cot, tot, react, self_refine, self_consistency)
│ ├── rag/ # RAG pipeline (chunker, embedder, milvus_store, retriever, pipeline)
│ ├── runners/ # Task runners (qa, exam, batch, agent)
│ ├── evaluators/ # Metrics (text, choice, rag, efficiency)
│ ├── toolsim/ # Tool simulation (registry, executor, tracer)
│ ├── api/ # FastAPI routes (chat, datasets, results, system)
│ └── utils/ # Shared utilities
├── scripts/ # CLI scripts (run_experiment, build_chunks, build_index, build_mcq)
├── configs/ # YAML configs for models and experiments
├── data/ # Datasets and schemas
└── outputs/ # Experiment results (gitignored)
| Endpoint | Method | Description |
|---|---|---|
| /chat/complete | POST | Single chat completion with strategy selection |
| /chat/stream | POST | Streaming chat completion (SSE) |
| /datasets | GET | List available datasets |
| /datasets/upload | POST | Upload a new dataset |
| /results/{id}/metrics | GET | Get experiment metrics |
| /results/{id}/predictions | GET | Get experiment predictions |
| /results/compare | POST | Compare multiple experiments |
| /api/v1/system/health | GET | Health check |
Full interactive docs at: http://localhost:8000/docs
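For scripted use, the same `/chat/complete` request shown in the curl examples can be sent from Python with the standard library. This is a minimal sketch: the helper names `build_chat_request` and `chat` are hypothetical, the base URL assumes the server was started locally on port 8000, and the response is returned as raw JSON because its schema is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local server on port 8000


def build_chat_request(
    prompt: str, strategy: str = "direct", model_id: str = "deepseek-chat"
) -> urllib.request.Request:
    """Build a POST request with the same JSON body as the curl examples."""
    payload = {
        "model_id": model_id,
        "strategy": strategy,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/complete",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, strategy: str = "direct", timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON response as-is."""
    req = build_chat_request(prompt, strategy)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `chat("What is machine learning?", strategy="cot")` would issue the Chain-of-Thought request; inspect the returned dictionary or the interactive docs for the exact response fields.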
| Provider | Model | Status |
|---|---|---|
| DeepSeek | deepseek-chat | Verified |
| OpenAI | gpt-4o, gpt-4o-mini | Ready (needs API key) |
| Anthropic | claude-3.5-sonnet | Ready (needs API key) |
| Qwen | qwen-plus | Ready (needs API key) |
| Strategy | Key | Description |
|---|---|---|
| Direct | direct | Simple prompt, fast response |
| Chain-of-Thought | cot | Step-by-step reasoning |
| Long CoT | long_cot | Extended multi-step reasoning |
| Tree-of-Thought | tot | Multiple reasoning paths + evaluation |
| ReAct | react | Reasoning + tool actions interleaved |
| Self-Refine | self_refine | Generate -> critique -> improve loop |
| Self-Consistency | self_consistency | Multiple paths + majority voting |
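To make the "multiple paths + majority voting" idea behind Self-Consistency concrete, here is a minimal sketch of the voting step. It illustrates the general technique, not the code in `src/strategies/`; the function name, the `sample_answer` callable, and the canned answers are all hypothetical stand-ins for sampled model outputs.

```python
from collections import Counter
from typing import Callable, List


def self_consistency(sample_answer: Callable[[], str], n_paths: int = 5) -> str:
    """Sample several final answers and return the most frequent one."""
    # Normalize lightly so trivially different strings count as the same vote.
    answers: List[str] = [sample_answer().strip().lower() for _ in range(n_paths)]
    # most_common(1) yields the top-voted answer; ties go to the first one seen.
    return Counter(answers).most_common(1)[0][0]


# Stand-in for five independently sampled reasoning paths (hypothetical outputs).
canned = iter(["60 km/h", "60 km/h", "50 km/h", "60 km/h ", "240 km/h"])
answer = self_consistency(lambda: next(canned), n_paths=5)
print(answer)
```

In practice each call to `sample_answer` would run one CoT completion at a non-zero temperature and extract its final answer before voting.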
Contributions are welcome, especially in:
- model adapters
- task runners
- evaluators
- RAG pipelines
- tool simulation
- visualization
- data preprocessing
- documentation
@misc{agentinferkit,
title={AgentInferKit: A Modular Platform for Agent Inference, Evaluation, and Training},
author={Zhenke Duan},
year={2026},
howpublished={GitHub repository}
}