
AgentInferKit

A Modular Platform for Agent Inference, Evaluation, and Training

AgentInferKit is an open-source platform for building, running, and analyzing LLM/VLM/agent systems across text, multimodal, RAG, and tool-use settings. It is designed for agent inference today and built to extend toward agent reasoning, agent training, and RL-based optimization.


Overview

AgentInferKit follows a three-layer design:

  • Platform Layer: unified model access, inference execution, tool simulation, batch evaluation, visualization, and engineering management
  • Data Layer: dataset organization, preprocessing, standardization, versioning, and custom data loading
  • Experiment Layer: benchmark protocols, controlled comparisons, and research-oriented analysis

The project currently focuses on the platform and data layers.


Features

  • Unified access to API models, local models, and multimodal models
  • Pluggable reasoning strategies such as Direct Prompting, CoT, Long-CoT, and ToT
  • Built-in RAG pipeline with chunking, indexing, retrieval, and evidence tracking
  • Support for API / function calling and tool-use simulation
  • Batch inference, single-sample debugging, logging, retry, and resume
  • Configurable evaluation with metrics for text, retrieval, and agent tasks
  • Research-friendly visualization for predictions, traces, evidence, and errors
  • Modular architecture for future extension to training and RL

Current Scope

AgentInferKit currently targets the following task types:

  • Text QA
  • Knowledge-oriented text exams
  • Image understanding
  • API / function calling
  • Retrieval-augmented reasoning
  • Prompt-based reasoning strategy comparison

Architecture

Platform Layer

The engineering foundation of the project, including:

  • model adapters
  • reasoning strategies
  • RAG pipeline
  • task runners
  • tool simulation
  • evaluators
  • visualization dashboard
  • config and logging system

Data Layer

Standardizes heterogeneous data into reusable benchmark assets, including:

  • QA data
  • text-exam data
  • image understanding data
  • agent API function calling data

The data layer is designed to make data runnable, evaluable, traceable, and versioned.
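The README does not document the standardized record format, so as an illustration only, here is a hypothetical QA record with a minimal validator sketch. The field names (`id`, `question`, `answer`, `metadata`) are assumptions for the example, not the project's actual schema:

```python
# Hypothetical standardized QA record -- field names are illustrative,
# not AgentInferKit's actual schema.
REQUIRED_FIELDS = {"id", "question", "answer", "metadata"}

sample_record = {
    "id": "qa-0001",
    "question": "What is machine learning?",
    "answer": "A field of AI that learns patterns from data.",
    "metadata": {"source": "demo", "version": "v1"},
}

def validate_record(record: dict) -> bool:
    """Check that a record carries the fields a task runner would need."""
    return REQUIRED_FIELDS.issubset(record)

print(validate_record(sample_record))  # -> True
```

A check like this at load time is what makes heterogeneous data "runnable and evaluable": every runner can rely on the same minimal contract.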


Quick Start

1. Environment Setup

# Clone the repo
git clone https://github.com/CodeSoul-co/AgentInferKit.git
cd AgentInferKit

# Create conda environment
conda create -n benchmark python=3.11 -y
conda activate benchmark
pip install -r requirements.txt

# Configure API key
cp .env.example .env
# Edit .env and fill in your DEEPSEEK_API_KEY

2. Start API Server

PYTHONPATH=$(pwd) uvicorn src.main:app --host 0.0.0.0 --port 8000

Open http://localhost:8000/docs in a browser to see the interactive API documentation.

3. Chat with AI (Terminal)

Direct mode (fast, concise):

curl -s -X POST http://localhost:8000/chat/complete \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "deepseek-chat",
    "strategy": "direct",
    "messages": [{"role": "user", "content": "What is machine learning?"}]
  }' | python3 -m json.tool
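The same request can be issued from Python. A minimal sketch using only the standard library; the endpoint and payload mirror the curl call above, and a running server is assumed:

```python
import json
import urllib.request

def build_chat_request(model_id: str, strategy: str, content: str,
                       url: str = "http://localhost:8000/chat/complete"):
    """Build a POST request mirroring the curl example above."""
    payload = {
        "model_id": model_id,
        "strategy": strategy,
        "messages": [{"role": "user", "content": content}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("deepseek-chat", "direct", "What is machine learning?")
# With the server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read()))
```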

Chain-of-Thought mode (step-by-step reasoning):

curl -s -X POST http://localhost:8000/chat/complete \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "deepseek-chat",
    "strategy": "cot",
    "messages": [{"role": "user", "content": "A train travels 120km in 2 hours. What is its speed?"}]
  }' | python3 -m json.tool

Streaming mode (token-by-token output):

curl -N -X POST http://localhost:8000/chat/stream \
  -H "Content-Type: application/json" \
  -d '{
    "model_id": "deepseek-chat",
    "strategy": "direct",
    "messages": [{"role": "user", "content": "Write a short poem about spring"}]
  }'
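Consuming the stream from Python takes a small line parser. The `data:` framing below follows the standard SSE convention; the exact event payload shape and the `[DONE]` sentinel are assumptions, not documented behavior of this server:

```python
def parse_sse_line(line: str):
    """Return the payload of an SSE 'data:' line, or None for framing lines."""
    line = line.strip()
    if not line or line.startswith(":"):
        return None  # blank separator, comment, or keep-alive
    if line.startswith("data:"):
        return line[len("data:"):].strip()
    return None

# With the server running, something like (payload shape is assumed):
# for raw in response:
#     chunk = parse_sse_line(raw.decode("utf-8"))
#     if chunk and chunk != "[DONE]":
#         print(chunk, end="", flush=True)

print(parse_sse_line("data: hello"))  # -> hello
```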

4. Run Batch Experiment

# Direct strategy experiment
PYTHONPATH=$(pwd) python scripts/run_experiment.py \
  --config configs/experiments/demo_exam_direct.yaml

# CoT strategy experiment
PYTHONPATH=$(pwd) python scripts/run_experiment.py \
  --config configs/experiments/demo_exam_cot.yaml

Results are saved to outputs/predictions/ and outputs/metrics/.

5. Demo Experiment Results

We ran 5 exam questions (math, physics, CS) with two strategies:

| Metric      | Direct     | CoT         |
| ----------- | ---------- | ----------- |
| Accuracy    | 80% (4/5)  | 100% (5/5)  |
| Avg Latency | 2.2 s      | 10.7 s      |
| Avg Tokens  | 69.6       | 281.4       |

CoT reasoning improves accuracy at the cost of higher latency and token usage.
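The trade-off can be made concrete with a quick calculation from the reported numbers:

```python
# Numbers from the demo table above (5-question exam run).
direct = {"accuracy": 4 / 5, "latency_s": 2.2, "tokens": 69.6}
cot = {"accuracy": 5 / 5, "latency_s": 10.7, "tokens": 281.4}

accuracy_gain = cot["accuracy"] - direct["accuracy"]    # +0.20
latency_ratio = cot["latency_s"] / direct["latency_s"]  # ~4.9x slower
token_ratio = cot["tokens"] / direct["tokens"]          # ~4.0x more tokens

print(f"+{accuracy_gain:.0%} accuracy for {latency_ratio:.1f}x latency "
      f"and {token_ratio:.1f}x tokens")
```

On this small sample, CoT buys 20 points of accuracy for roughly 4–5x the latency and token cost; whether that trade is worth it depends on the task.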


Project Structure

AgentInferKit/
├── src/
│   ├── adapters/       # LLM provider adapters (DeepSeek, OpenAI, Anthropic, Qwen)
│   ├── strategies/     # Inference strategies (direct, cot, long_cot, tot, react, self_refine, self_consistency)
│   ├── rag/            # RAG pipeline (chunker, embedder, milvus_store, retriever, pipeline)
│   ├── runners/        # Task runners (qa, exam, batch, agent)
│   ├── evaluators/     # Metrics (text, choice, rag, efficiency)
│   ├── toolsim/        # Tool simulation (registry, executor, tracer)
│   ├── api/            # FastAPI routes (chat, datasets, results, system)
│   └── utils/          # Shared utilities
├── scripts/            # CLI scripts (run_experiment, build_chunks, build_index, build_mcq)
├── configs/            # YAML configs for models and experiments
├── data/               # Datasets and schemas
└── outputs/            # Experiment results (gitignored)

API Endpoints

| Endpoint                   | Method | Description                                    |
| -------------------------- | ------ | ---------------------------------------------- |
| /chat/complete             | POST   | Single chat completion with strategy selection |
| /chat/stream               | POST   | Streaming chat completion (SSE)                |
| /datasets                  | GET    | List available datasets                        |
| /datasets/upload           | POST   | Upload a new dataset                           |
| /results/{id}/metrics      | GET    | Get experiment metrics                         |
| /results/{id}/predictions  | GET    | Get experiment predictions                     |
| /results/compare           | POST   | Compare multiple experiments                   |
| /api/v1/system/health      | GET    | Health check                                   |

Full interactive docs at: http://localhost:8000/docs
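The result endpoints are parameterized by experiment id; a small URL helper keeps the paths in one place. A sketch only: the paths come from the table above, and the base URL is the local default from Quick Start:

```python
BASE = "http://localhost:8000"

def metrics_url(experiment_id: str) -> str:
    """URL for GET /results/{id}/metrics."""
    return f"{BASE}/results/{experiment_id}/metrics"

def predictions_url(experiment_id: str) -> str:
    """URL for GET /results/{id}/predictions."""
    return f"{BASE}/results/{experiment_id}/predictions"

print(metrics_url("demo_exam_cot"))
# -> http://localhost:8000/results/demo_exam_cot/metrics
```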


Supported Models

| Provider  | Model               | Status                |
| --------- | ------------------- | --------------------- |
| DeepSeek  | deepseek-chat       | Verified              |
| OpenAI    | gpt-4o, gpt-4o-mini | Ready (needs API key) |
| Anthropic | claude-3.5-sonnet   | Ready (needs API key) |
| Qwen      | qwen-plus           | Ready (needs API key) |

Inference Strategies

| Strategy         | Key              | Description                          |
| ---------------- | ---------------- | ------------------------------------ |
| Direct           | direct           | Simple prompt, fast response         |
| Chain-of-Thought | cot              | Step-by-step reasoning               |
| Long CoT         | long_cot         | Extended multi-step reasoning        |
| Tree-of-Thought  | tot              | Multiple reasoning paths + evaluation |
| ReAct            | react            | Reasoning + tool actions interleaved |
| Self-Refine      | self_refine      | Generate -> critique -> improve loop |
| Self-Consistency | self_consistency | Multiple paths + majority voting     |
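The strategy keys suggest a registry pattern for plugging in new strategies. A minimal sketch of how that wiring could look; the decorator, function names, and prompt transforms are illustrative, not the project's actual code:

```python
from typing import Callable, Dict

# Hypothetical registry mapping strategy keys to prompt-transforming functions.
STRATEGIES: Dict[str, Callable[[str], str]] = {}

def register(key: str):
    """Decorator that records a strategy under its key."""
    def wrap(fn: Callable[[str], str]):
        STRATEGIES[key] = fn
        return fn
    return wrap

@register("direct")
def direct(prompt: str) -> str:
    return prompt  # pass the prompt through unchanged

@register("cot")
def cot(prompt: str) -> str:
    return prompt + "\n\nLet's think step by step."

print(STRATEGIES["cot"]("What is 2+2?"))
```

A request's `"strategy"` field can then be resolved with a dictionary lookup, and adding a new strategy means registering one more function.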

Contributing

Contributions are welcome, especially in:

  • model adapters
  • task runners
  • evaluators
  • RAG pipelines
  • tool simulation
  • visualization
  • data preprocessing
  • documentation

Citation

@misc{agentinferkit,
  title={AgentInferKit: A Modular Platform for Agent Inference, Evaluation, and Training},
  author={Zhenke Duan},
  year={2026},
  howpublished={GitHub repository}
}
