Squeeze out the juice, leave the pulp behind.
LLM coding agents waste 80-95% of context tokens on irrelevant tool output. Squeez extracts only the lines that matter, compressing tool output by ~91% while keeping 86% of the relevant information.
Squeez uses a fine-tuned Qwen 3.5 2B model to read tool output alongside a task description and return only the relevant lines.
Task: "Find the test failure related to authentication"
Before: 45 lines, ~1,500 tokens. After: 6 lines, ~200 tokens (87% compression). Only the failing test and its traceback survive.
```shell
$ python -m pytest tests/ -v 2>&1 | squeez "find the test failure related to authentication"
```
## More examples
Filtering git log:
```shell
$ git log --oneline -25 | squeez "find the commit that changed the authentication timeout"
u6v7w8x Change auth timeout from 30m to 1h
```
Filtering build output:
```shell
$ npm run build 2>&1 | squeez "find the TypeScript error"
src/components/Auth.tsx(34,5): error TS2345: Argument of type 'string' is
not assignable to parameter of type 'AuthToken'.
```
Filtering kubectl output:
```shell
$ kubectl describe pod api-server-7d4b | squeez "why is the pod failing"
State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       Error
  Exit Code:    1
Warning  BackOff  3m (x5)  kubelet  Back-off restarting failed container
```
Evaluated on 617 held-out test samples from SWE-bench, across 14 tool types:
| Model | Precision | Recall | F1 | Compression |
|---|---|---|---|---|
| Squeez-2B | 0.8043 | 0.8624 | 0.7895 | 0.9150 |
| Qwen 3.5 35B A3B (zero-shot) | 0.7402 | 0.7498 | 0.7000 | 0.9177 |
| Qwen 3.5 2B (untrained) | 0.4154 | 0.5299 | 0.4075 | 0.8197 |
| BM25 (10%) | 0.1277 | 0.2172 | 0.1314 | 0.9036 |
| Random (10%) | 0.0738 | 0.1009 | 0.0697 | 0.9067 |
Squeez-2B (2B params) outperforms a 35B MoE model run zero-shot, and scores roughly 6x higher than BM25 on Span F1.
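For intuition, here is a minimal sketch of how line-level precision, recall, F1, and compression can be computed, assuming predictions and gold annotations are sets of kept line indices (the exact Span F1 definition used in the evaluation may differ):

```python
def line_metrics(predicted, gold, total_lines):
    """Line-level precision/recall/F1 plus compression ratio.

    predicted, gold: sets of 0-based line indices kept by the extractor
    total_lines: number of lines in the raw tool output
    """
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Compression = fraction of the raw output that was dropped
    compression = 1 - len(predicted) / total_lines
    return precision, recall, f1, compression

# Keep 4 of 45 lines, 3 of which are relevant: P=0.75, R=1.0
p, r, f1, c = line_metrics({3, 4, 5, 9}, {3, 4, 5}, 45)
```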
```shell
pip install squeez

# Start the server
pip install vllm
vllm serve KRLabsOrg/squeez-2b --dtype bfloat16 --max-model-len 16384

# Use from the squeez CLI
export SQUEEZ_SERVER_URL=http://localhost:8000/v1
cat output.txt | squeez "find the bug"

# Or pipe directly
python -m pytest tests/ -v 2>&1 | squeez "find the test failure related to authentication"
```
vLLM keeps the model warm in memory with batched inference and high throughput.
```shell
pip install squeez
cat output.txt | squeez "Find the failing traceback block"
squeez "Fix the CSRF bug" --input-file output.txt
```
Note: Local mode loads the model on every call. Fine for one-off use, but for repeated calls (e.g. an agent piping every tool through squeez), use vLLM.
Works with Groq, Together, or any OpenAI-compatible server. Set the URL, model name, and API key:
```shell
export SQUEEZ_SERVER_URL=https://api.groq.com/openai/v1
export SQUEEZ_SERVER_MODEL=squeez
export SQUEEZ_API_KEY=gsk_...
```
```python
from squeez.inference.extractor import ToolOutputExtractor

# Default: loads KRLabsOrg/squeez-2b locally
extractor = ToolOutputExtractor()

# Or connect to a server
extractor = ToolOutputExtractor(base_url="http://localhost:8000/v1")

# Or use a custom local model
extractor = ToolOutputExtractor(model_path="./output/squeez_qwen")

filtered = extractor.extract(
    task="Find the referer validation block",
    tool_output=raw_output,
)
```
Add to your CLAUDE.md:
```
Whenever you invoke a shell command, pipe it through `squeez` and state exactly what you want to know.
Examples:
- `bun test 2>&1 | squeez "did the tests pass?"`
- `git log --oneline -50 | squeez "find the commit that broke CSRF"`
- `cat src/auth/middleware.py | squeez "find the referer validation logic"`
Do NOT use squeez when:
- You need exact, uncompressed output (e.g. writing a patch)
- The command is interactive
```
Works with other coding agents (Codex CLI, OpenCode, etc.) via their equivalent instruction files.
## Configuration
Resolved in order: CLI flags > environment variables > config file.
Config file is loaded from the first found: ./squeez.yaml, ./configs/default.yaml, ~/.config/squeez/config.yaml.
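The precedence above can be sketched as a first-match lookup (illustrative only; the function name and signature are hypothetical, not squeez's actual implementation):

```python
import os

def resolve_setting(name, cli_value=None, config=None):
    """Resolve a setting: CLI flag > environment variable > config file."""
    if cli_value is not None:
        return cli_value
    env_key = f"SQUEEZ_{name.upper()}"
    if env_key in os.environ:
        return os.environ[env_key]
    return (config or {}).get(name)

os.environ["SQUEEZ_SERVER_URL"] = "http://localhost:8000/v1"
# An explicit CLI flag wins over the environment variable
url = resolve_setting("server_url", cli_value="http://gpu-box:8000/v1")
```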
```yaml
# squeez.yaml
server_url: "http://localhost:8000/v1"
# local_model_path: "./output/squeez_qwen"  # for local inference instead
# backend: null  # auto-detect; or "transformers", "vllm", "encoder"
```
Environment variables:
| Variable | Description |
|---|---|
| `SQUEEZ_SERVER_URL` | Server URL (vLLM, Ollama, etc.) |
| `SQUEEZ_LOCAL_MODEL` | Path to local model directory |
| `SQUEEZ_SERVER_MODEL` | Model name on the server |
| `SQUEEZ_API_KEY` | API key (if needed) |
| `SQUEEZ_BACKEND` | Force backend: `transformers`, `vllm`, `encoder` |
## Encoder models
Squeez also supports encoder-based extraction (ModernBERT, etc.) as an alternative to the generative model. These are faster but less accurate.
Two encoder approaches:
- Token encoder: per-token binary classification, aggregated per line via max-pool
- Pooled encoder: single-pass encoder with line-level mean-pool classification
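The two aggregation schemes can be sketched as follows, assuming per-token relevance scores and a token-to-line mapping (a simplification; the real models classify inside the encoder rather than post-hoc):

```python
def aggregate_lines(token_scores, token_lines, num_lines,
                    mode="max", threshold=0.5):
    """Turn per-token relevance scores into per-line keep/drop decisions.

    token_scores: one float score per token
    token_lines:  the line index each token belongs to
    mode: "max" (token-encoder style) or "mean" (pooled-encoder style)
    """
    buckets = [[] for _ in range(num_lines)]
    for score, line in zip(token_scores, token_lines):
        buckets[line].append(score)
    pooled = [
        (max(b) if mode == "max" else sum(b) / len(b)) if b else 0.0
        for b in buckets
    ]
    return [s >= threshold for s in pooled]

scores = [0.1, 0.8, 0.2, 0.4, 0.4]
lines  = [0,   0,   1,   1,   1]
keep_max  = aggregate_lines(scores, lines, 2, mode="max")   # [True, False]
keep_mean = aggregate_lines(scores, lines, 2, mode="mean")  # [False, False]
```

Max-pooling keeps a line if any token in it looks relevant; mean-pooling requires the line to be relevant on average, which is stricter.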
```python
from squeez.inference.extractor import ToolOutputExtractor

extractor = ToolOutputExtractor(model_path="./output/squeez_encoder")
filtered = extractor.extract(task="Find the bug", tool_output=raw_output)
```
Standalone loading without squeez installed:
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("output/squeez_pooled", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("output/squeez_pooled")
result = model.process(
    task="Find the traceback",
    tool_output=open("output.log").read(),
    tokenizer=tokenizer,
)
print(result["highlighted_lines"])
```
## Training
See TRAINING.md for full training and evaluation commands.
```shell
# Download dataset
python scripts/download_data.py

# Train generative model (Qwen 3.5 2B + LoRA)
squeez train --train-file data/train.jsonl --eval-file data/dev.jsonl

# Train token encoder
python -m squeez.encoder.train \
    --classifier-type token \
    --train-file data/encoder_train.jsonl \
    --eval-file data/encoder_dev.jsonl \
    --base-model answerdotai/ModernBERT-base \
    --output-dir output/squeez_encoder

# Evaluate
squeez eval --extractor-model output/squeez_qwen --eval-file data/test.jsonl
```
## Dataset
Training data: KRLabsOrg/tool-output-extraction-swebench
Built from SWE-bench repositories. Each sample has:
- `query`: a focused extraction request or agent subgoal
- `tool_output`: raw tool output as seen by the agent
- `gold_spans`: contiguous spans over the raw output
From this canonical format, Squeez derives generative SFT files and encoder training files.
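As an illustration of that conversion, here is a sketch deriving per-line binary labels for encoder training from a canonical sample. It assumes `gold_spans` are inclusive, 0-based `(start, end)` line ranges; the dataset's actual span schema may differ.

```python
def sample_to_line_labels(sample):
    """Map a canonical sample to (lines, labels) for line classification.

    Assumes sample["gold_spans"] is a list of inclusive (start, end)
    0-based line ranges over the raw output (hypothetical schema).
    """
    lines = sample["tool_output"].splitlines()
    labels = [0] * len(lines)
    for start, end in sample["gold_spans"]:
        for i in range(start, min(end, len(lines) - 1) + 1):
            labels[i] = 1
    return lines, labels

sample = {
    "query": "find the failing test",
    "tool_output": (
        "collected 3 items\n"
        "test_a PASSED\n"
        "test_b FAILED\n"
        "AssertionError: bad token\n"
    ),
    "gold_spans": [(2, 3)],
}
lines, labels = sample_to_line_labels(sample)  # labels == [0, 0, 1, 1]
```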
To regenerate from scratch:
```shell
python scripts/build_full_dataset.py \
    --output-dir data/v3 \
    --teacher-model openai/gpt-oss-120b \
    --teacher-base-url http://localhost:8000/v1
```
```bibtex
@software{kovacs2026squeez,
  title={Squeez: Compressing Tool Output for LLM Coding Agents},
  author={Adam Kovacs},
  year={2026},
  url={https://github.com/KRLabsOrg/squeez}
}
```
Apache 2.0