AgentInferKit

A Modular Platform for Agent Inference, Evaluation, and Training

AgentInferKit is an open-source platform for building, running, and analyzing LLM/VLM/Agent systems across text, multimodal, RAG, and tool-use settings. It is designed for agent inference today, and built to extend toward agent reasoning, agent training, and RL-based optimization in the future.

Overview

AgentInferKit follows a three-layer design:

Platform Layer: unified model access, inference execution, tool simulation, batch evaluation, visualization, and engineering management
Data Layer: dataset organization, preprocessing, standardization, versioning, and custom data loading
Experiment Layer: benchmark protocols, controlled comparisons, and research-oriented analysis

The project currently focuses on the platform and data layers; the experiment layer is not yet the main focus.

Features

Unified access to API models, local models, and multimodal models
Pluggable reasoning strategies such as Direct Prompting, CoT, Long-CoT, and ToT
Built-in RAG pipeline with chunking, indexing, retrieval, and evidence tracking
Support for API / function calling and tool-use simulation
Batch inference, single-sample debugging, logging, retry, and resume
Configurable evaluation with metrics for text, retrieval, and agent tasks
Research-friendly visualization for predictions, traces, evidence, and errors
Modular architecture for future extension to training and RL
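
As a rough illustration of the chunking, retrieval, and evidence-tracking steps listed above, here is a minimal sketch. It does not reflect the project's actual RAG code: `Chunk`, `chunk_text`, and `retrieve` are hypothetical names, and the word-overlap score stands in for a real embedding and vector-store lookup.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str  # which source document the chunk came from (evidence tracking)
    start: int   # character offset of the chunk within that document
    text: str


def chunk_text(doc_id: str, text: str, size: int = 40, overlap: int = 10) -> list[Chunk]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    stop = max(len(text) - overlap, 1)  # avoid a trailing chunk that is pure overlap
    return [Chunk(doc_id, i, text[i:i + size]) for i in range(0, stop, step)]


def retrieve(query: str, chunks: list[Chunk], top_k: int = 2) -> list[tuple[float, Chunk]]:
    """Rank chunks by the fraction of query words they contain (toy scorer)."""
    query_words = set(query.lower().split())
    scored = []
    for chunk in chunks:
        chunk_words = set(chunk.text.lower().split())
        score = len(query_words & chunk_words) / len(query_words) if query_words else 0.0
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


windows = chunk_text("doc-1", "a" * 100, size=40, overlap=10)
corpus = [
    Chunk("cats", 0, "cats purr softly"),
    Chunk("dogs", 0, "dogs bark loudly"),
]
best_score, best_chunk = retrieve("do dogs bark", corpus, top_k=1)[0]
```

Because each `Chunk` keeps its `doc_id` and `start` offset, a retrieved passage can be traced back to its position in the source document.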

Current Scope

AgentInferKit currently targets the following task types:

Text QA
Knowledge-oriented text exam
Image understanding
API / function calling
Retrieval-augmented reasoning
Prompt-based reasoning strategy comparison

Architecture

Platform Layer

The engineering foundation of the project, including:

model adapters
reasoning strategies
RAG pipeline
task runners
tool simulation
evaluators
visualization dashboard
config and logging system

Data Layer

Standardizes heterogeneous data into reusable benchmark assets, including:

QA data
text-exam data
image understanding data
agent API / function-calling data

The data layer is designed to make data runnable, evaluable, traceable, and versioned.

Quick Start

1. Environment Setup

# Clone the repo
git clone https://github.com/CodeSoul-co/AgentInferKit.git
cd AgentInferKit

# Create conda environment
conda create -n benchmark python=3.11 -y
conda activate benchmark
pip install -r requirements.txt

# Configure API key
cp .env.example .env
# Edit .env and fill in your DEEPSEEK_API_KEY

2. Start API Server

# Quote the path so the command also works when the repo path contains spaces
PYTHONPATH="$(pwd)" uvicorn src.main:app --host 0.0.0.0 --port 8000

Open http://localhost:8000/docs in a browser to see all API endpoints.

3. Chat with AI (Terminal)

Direct mode (fast, concise):

curl -s -X POST http://localhost:8000/chat/complete \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "deepseek-chat",
    "strategy": "direct",
    "messages": [{"role": "user", "content": "What is machine learning?"}]
  }' | python3 -m json.tool

Chain-of-Thought mode (step-by-step reasoning):

curl -s -X POST http://localhost:8000/chat/complete \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "deepseek-chat",
    "strategy": "cot",
    "messages": [{"role": "user", "content": "A train travels 120km in 2 hours. What is its speed?"}]
  }' | python3 -m json.tool

Streaming mode (token-by-token output):

curl -N -X POST http://localhost:8000/chat/stream \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "deepseek-chat",
    "strategy": "direct",
    "messages": [{"role": "user", "content": "Write a short poem about spring"}]
  }'
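
The same requests can be sent from Python. The sketch below uses only the standard library and mirrors the request bodies from the curl examples; the helper names `build_chat_request` and `chat` are hypothetical, and it assumes `/chat/complete` returns a JSON object, since the response schema is not shown here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes the server from step 2 is running locally


def build_chat_request(prompt: str, strategy: str = "direct",
                       model_id: str = "deepseek-chat",
                       stream: bool = False) -> urllib.request.Request:
    """Build a POST request with the same JSON body as the curl examples."""
    payload = {
        "model_id": model_id,
        "strategy": strategy,
        "messages": [{"role": "user", "content": prompt}],
    }
    path = "/chat/stream" if stream else "/chat/complete"
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, strategy: str = "direct") -> dict:
    """Send a non-streaming request and decode the JSON response (assumed to be an object)."""
    request = build_chat_request(prompt, strategy)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)


# Example call (requires a running server and a configured API key):
# print(json.dumps(chat("What is machine learning?", strategy="cot"), indent=2))
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a live server.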

4. Run Batch Experiment

# Direct strategy experiment
PYTHONPATH="$(pwd)" python scripts/run_experiment.py \
  --config configs/experiments/demo_exam_direct.yaml

# CoT strategy experiment
PYTHONPATH="$(pwd)" python scripts/run_experiment.py \
  --config configs/experiments/demo_exam_cot.yaml

Results are saved to outputs/predictions/ and outputs/metrics/.

5. Demo Experiment Results

We ran 5 exam questions (math, physics, CS) with two strategies:

Metric	Direct	CoT
Accuracy	80% (4/5)	100% (5/5)
Avg Latency	2.2s	10.7s
Avg Tokens	69.6	281.4

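In this five-question demo, CoT answered one more question correctly (5/5 vs. 4/5) at roughly 5x the latency and 4x the token usage. With so few questions, the result is illustrative rather than conclusive.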
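
The trade-off can be quantified directly from the table. The short calculation below uses only the reported numbers; the dictionary layout is an illustrative choice, not the format of the files under outputs/metrics/.

```python
# Values copied from the demo results table above.
results = {
    "direct": {"correct": 4, "total": 5, "latency_s": 2.2, "tokens": 69.6},
    "cot": {"correct": 5, "total": 5, "latency_s": 10.7, "tokens": 281.4},
}


def accuracy(row: dict) -> float:
    return row["correct"] / row["total"]


direct, cot = results["direct"], results["cot"]
accuracy_gain = accuracy(cot) - accuracy(direct)
latency_ratio = cot["latency_s"] / direct["latency_s"]
token_ratio = cot["tokens"] / direct["tokens"]

print(f"Accuracy gain: +{accuracy_gain * 100:.0f} points")  # +20 points
print(f"Latency cost:  {latency_ratio:.1f}x")               # 4.9x
print(f"Token cost:    {token_ratio:.1f}x")                 # 4.0x
```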

Project Structure

AgentInferKit/
├── src/
│   ├── adapters/       # LLM provider adapters (DeepSeek, OpenAI, Anthropic, Qwen)
│   ├── strategies/     # Inference strategies (direct, cot, long_cot, tot, react, self_refine, self_consistency)
│   ├── rag/            # RAG pipeline (chunker, embedder, milvus_store, retriever, pipeline)
│   ├── runners/        # Task runners (qa, exam, batch, agent)
│   ├── evaluators/     # Metrics (text, choice, rag, efficiency)
│   ├── toolsim/        # Tool simulation (registry, executor, tracer)
│   ├── api/            # FastAPI routes (chat, datasets, results, system)
│   └── utils/          # Shared utilities
├── scripts/            # CLI scripts (run_experiment, build_chunks, build_index, build_mcq)
├── configs/            # YAML configs for models and experiments
├── data/               # Datasets and schemas
└── outputs/            # Experiment results (gitignored)
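
To make the registry / executor / tracer split listed for toolsim/ more concrete, here is a minimal sketch of how such pieces could fit together. It is an assumption about the general pattern, not the code in src/toolsim/: `ToolRegistry`, `execute`, the call format, and the `get_weather` tool are all hypothetical.

```python
from typing import Any, Callable


class ToolRegistry:
    """Maps tool names to plain Python callables that simulate real APIs."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}

    def register(self, name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
            self._tools[name] = fn
            return fn
        return decorator

    def get(self, name: str) -> Callable[..., Any]:
        return self._tools[name]  # raises KeyError for unknown tools


def execute(registry: ToolRegistry, call: dict, trace: list[dict]) -> dict:
    """Run one call shaped like {"name": ..., "arguments": {...}} and append it to the trace."""
    record = {"name": call["name"], "arguments": call["arguments"],
              "ok": False, "result": None, "error": None}
    try:
        tool = registry.get(call["name"])
        record["result"] = tool(**call["arguments"])
        record["ok"] = True
    except Exception as exc:  # a failed call is recorded, not raised
        record["error"] = f"{type(exc).__name__}: {exc}"
    trace.append(record)
    return record


registry = ToolRegistry()


@registry.register("get_weather")
def get_weather(city: str) -> dict:
    # Simulated response; no real weather API is called.
    return {"city": city, "temp_c": 21}


trace: list[dict] = []
good = execute(registry, {"name": "get_weather", "arguments": {"city": "Paris"}}, trace)
bad = execute(registry, {"name": "missing_tool", "arguments": {}}, trace)
```

Recording failures in the trace instead of raising keeps a multi-step agent run inspectable after the fact.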

API Endpoints

Endpoint	Method	Description
`/chat/complete`	POST	Single chat completion with strategy selection
`/chat/stream`	POST	Streaming chat completion (SSE)
`/datasets`	GET	List available datasets
`/datasets/upload`	POST	Upload a new dataset
`/results/{id}/metrics`	GET	Get experiment metrics
`/results/{id}/predictions`	GET	Get experiment predictions
`/results/compare`	POST	Compare multiple experiments
`/api/v1/system/health`	GET	Health check

Full interactive docs at: http://localhost:8000/docs

Supported Models

Provider	Model	Status
DeepSeek	deepseek-chat	Verified
OpenAI	gpt-4o, gpt-4o-mini	Ready (needs API key)
Anthropic	claude-3.5-sonnet	Ready (needs API key)
Qwen	qwen-plus	Ready (needs API key)

Inference Strategies

Strategy	Key	Description
Direct	`direct`	Simple prompt, fast response
Chain-of-Thought	`cot`	Step-by-step reasoning
Long CoT	`long_cot`	Extended multi-step reasoning
Tree-of-Thought	`tot`	Multiple reasoning paths + evaluation
ReAct	`react`	Reasoning + tool actions interleaved
Self-Refine	`self_refine`	Generate -> critique -> improve loop
Self-Consistency	`self_consistency`	Multiple paths + majority voting

Contributing

Contributions are welcome, especially in:

model adapters
task runners
evaluators
RAG pipelines
tool simulation
visualization
data preprocessing
documentation

Citation

@misc{agentinferkit,
  title={AgentInferKit: A Modular Platform for Agent Inference, Evaluation, and Training},
  author={Zhenke Duan},
  year={2026},
  howpublished={GitHub repository}
}
