A command-line tool for HPC cluster administrators to detect Slurm jobs that reserved GPUs but barely used them. It reads a Slurm accounting CSV export, decodes embedded GPU metrics from the admin_comment field, and flags jobs with low GPU utilization and/or memory usage. Output is a per-job CSV of flagged squatters and a per-user aggregate CSV sorted by total wasted GPU-hours.
Slurm's jobstats plugin periodically samples GPU utilization and memory usage for running jobs, then writes the results into each job's `admin_comment` field using the format:

```
JS1:<base64(gzip(json))>
```

The `JS1:` prefix identifies the encoding version. The payload is a base64-encoded, gzip-compressed JSON object containing per-node, per-GPU metrics:
```json
{
  "nodes": {
    "gpu01": {
      "gpu_utilization": { "0": 45.2, "1": 38.7 },
      "gpu_total_memory": { "0": 81920.0, "1": 81920.0 },
      "gpu_used_memory": { "0": 40960.0, "1": 12288.0 }
    }
  }
}
```

Two sentinel values exist: `JS1:Short` (job too brief to collect stats) and `JS1:None` (no stats available).
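The prefix and sentinel handling above can be sketched as a small classifier over the raw `admin_comment` string (a minimal std-only sketch; the enum and function names are illustrative, not the tool's actual API, and the real decoder continues into base64, gzip, and JSON for the candidate case):

```rust
/// Outcome of inspecting one job's admin_comment (illustrative names).
#[derive(Debug, PartialEq)]
enum CommentStatus {
    Candidate(String), // payload ready for base64 -> gzip -> JSON decoding
    MissingPrefix,     // did not start with "JS1:"
    EmptyPayload,      // nothing after "JS1:"
    ShortValue,        // "JS1:Short": job too brief for stats collection
    NoneValue,         // "JS1:None": no stats available
}

fn classify_admin_comment(comment: &str) -> CommentStatus {
    // All encoded metrics carry the version prefix "JS1:".
    let Some(payload) = comment.strip_prefix("JS1:") else {
        return CommentStatus::MissingPrefix;
    };
    match payload {
        "" => CommentStatus::EmptyPayload,
        "Short" => CommentStatus::ShortValue,
        "None" => CommentStatus::NoneValue,
        p => CommentStatus::Candidate(p.to_string()),
    }
}

fn main() {
    assert_eq!(classify_admin_comment("JS1:Short"), CommentStatus::ShortValue);
    assert_eq!(classify_admin_comment("nope"), CommentStatus::MissingPrefix);
    assert_eq!(
        classify_admin_comment("JS1:H4sIAAAA"),
        CommentStatus::Candidate("H4sIAAAA".to_string())
    );
    println!("ok");
}
```

These outcomes map directly onto the `decode_status` values listed later in this document.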
The pipeline has four stages:

- **Lazy CSV filtering** — A Polars `LazyFrame` pushes row-level filters (GPU count, minimum runtime, QoS, partition) down to the scan phase, avoiding loading the full CSV into memory.
- **Parallel decoding** — Candidate rows are decoded in parallel via Rayon. Each decode performs base64 decoding, gzip decompression, and JSON parsing — a CPU-bound workload that benefits from multi-core parallelism.
- **Squat classification** — Successfully decoded rows are filtered by GPU utilization and (optionally) memory thresholds to identify squatters.
- **Output generation** — Flagged jobs are written to a per-job CSV. A per-user aggregate CSV reports total GPU-hours wasted, average utilization, and average memory ratio. An optional decode-failed CSV captures jobs that could not be decoded, for auditing.
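The fan-out in the parallel decoding stage can be sketched with scoped std threads (the actual tool uses Rayon's `par_iter`; this std-only equivalent, with a hypothetical `decode_one` standing in for the full base64/gzip/JSON pipeline, only illustrates the chunk-and-join pattern):

```rust
use std::thread;

/// Hypothetical decode step: just the cheap prefix check here,
/// standing in for the real base64 -> gzip -> JSON decode.
fn decode_one<'a>(comment: &'a str) -> Result<&'a str, &'static str> {
    comment.strip_prefix("JS1:").ok_or("missing_prefix")
}

/// Decode all comments across a fixed number of worker threads,
/// preserving input order in the output.
fn decode_all<'a>(comments: &'a [&'a str]) -> Vec<Result<&'a str, &'static str>> {
    let n_threads = 4;
    let chunk = comments.len().div_ceil(n_threads).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = comments
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(|x| decode_one(x)).collect::<Vec<_>>()))
            .collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let comments = ["JS1:abc", "bad", "JS1:def"];
    let results = decode_all(&comments);
    assert_eq!(results[0], Ok("abc"));
    assert_eq!(results[1], Err("missing_prefix"));
    println!("ok");
}
```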
| Crate | Role |
|---|---|
| `polars` | Lazy CSV scanning, filtering, aggregation |
| `rayon` | Parallel iteration for CPU-bound decoding |
| `base64` | Decoding the base64 payload from `admin_comment` |
| `flate2` | Gzip decompression of the decoded payload |
| `serde_json` | Parsing the decompressed JSON metrics |
| `clap` | CLI argument parsing with derive macros |
| `anyhow` | Ergonomic error propagation |
The input CSV must contain these columns (as produced by `sacct --format=...` or an equivalent export):
| Column | Type | Description |
|---|---|---|
| `user` | string | Slurm username |
| `job_name` | string | Job name |
| `qos_name` | string | Quality of Service name |
| `partition` | string | Slurm partition |
| `gpus_req` | int | Number of GPUs requested |
| `sec_runtime` | int | Job runtime in seconds |
| `admin_comment` | string | Encoded GPU metrics (see Background) |
```
cargo build --release
cargo run --release -- [OPTIONS]
```

| Option | Default | Description |
|---|---|---|
| `--input` | `input/oscar_all_jobs.csv` | Path to the input Slurm accounting CSV |
| `--output-jobs` | `output/gpu_squat_jobs.csv` | Path for the per-job flagged output CSV |
| `--output-users` | `output/gpu_squat_users.csv` | Path for the per-user aggregate output CSV |
| `--output-decode-failed` | (none) | Optional path for a CSV of jobs that failed decoding |
| `--min-runtime-sec` | `300` | Minimum runtime (seconds) for a job to be evaluated |
| `--util-threshold` | `5.0` | Mean GPU utilization threshold (percent) |
| `--mem-threshold` | `10.0` | GPU memory usage threshold (percent) |
| `--squat-mode` | `util-and-mem` | Classification mode: `util-only` or `util-and-mem` |
| `--qos` | (all) | Case-sensitive QoS filter; repeatable (OR within QoS) |
| `--partition` | (all) | Case-sensitive partition filter; repeatable (OR within partitions) |
If both --qos and --partition are provided, both filters are applied (AND across dimensions, OR within each).
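This "OR within a dimension, AND across dimensions" semantics can be sketched as a predicate (illustrative function, not the tool's internal API; in the actual pipeline these filters are pushed down into the Polars scan):

```rust
/// Returns true if a job passes the QoS/partition filters.
/// An empty filter list means "accept all" for that dimension.
fn passes_filters(
    job_qos: &str,
    job_partition: &str,
    qos: &[&str],
    partitions: &[&str],
) -> bool {
    // OR within each dimension: any listed value matches (case-sensitive).
    let qos_ok = qos.is_empty() || qos.contains(&job_qos);
    let part_ok = partitions.is_empty() || partitions.contains(&job_partition);
    // AND across dimensions: both must pass.
    qos_ok && part_ok
}

fn main() {
    // Equivalent to: --qos pri-gpu --qos batch --partition gpu
    assert!(passes_filters("pri-gpu", "gpu", &["pri-gpu", "batch"], &["gpu"]));
    assert!(!passes_filters("pri-gpu", "cpu", &["pri-gpu", "batch"], &["gpu"]));
    // No --qos given: the QoS dimension accepts everything.
    assert!(passes_filters("anything", "gpu", &[], &["gpu"]));
    println!("ok");
}
```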
```
cargo run --release -- \
  --input slurm_jobs.csv \
  --output-jobs gpu_squat_jobs.csv \
  --output-users gpu_squat_users.csv \
  --min-runtime-sec 300 \
  --util-threshold 5.0 \
  --mem-threshold 10.0 \
  --squat-mode util-and-mem \
  --qos pri-gpu+ \
  --partition gpu \
  --output-decode-failed gpu_decode_failed.csv
```

- `util-only` — Flags jobs where `mean_gpu_util < util_threshold`. Use this when you only care about compute utilization (e.g., some workflows legitimately use large GPU memory with low compute).
- `util-and-mem` — Flags jobs where both `mean_gpu_util < util_threshold` and `gpu_mem_used_ratio < mem_threshold / 100`. Use this (the default) to avoid false positives from jobs that reserve GPUs primarily for their memory.
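The two modes reduce to a small threshold check, sketched below (illustrative names, not the tool's actual code; note the division by 100 because `--mem-threshold` is a percent while `gpu_mem_used_ratio` is 0.0–1.0):

```rust
#[derive(Clone, Copy)]
enum SquatMode {
    UtilOnly,
    UtilAndMem,
}

/// Returns true if a decoded job should be flagged as a GPU squatter.
fn is_squatter(
    mean_gpu_util: f64,       // percent, 0-100
    gpu_mem_used_ratio: f64,  // ratio, 0.0-1.0
    util_threshold: f64,      // percent, e.g. 5.0
    mem_threshold_pct: f64,   // percent, e.g. 10.0
    mode: SquatMode,
) -> bool {
    let low_util = mean_gpu_util < util_threshold;
    match mode {
        SquatMode::UtilOnly => low_util,
        SquatMode::UtilAndMem => {
            low_util && gpu_mem_used_ratio < mem_threshold_pct / 100.0
        }
    }
}

fn main() {
    // Low compute but heavy memory use: only flagged in util-only mode.
    assert!(is_squatter(2.0, 0.9, 5.0, 10.0, SquatMode::UtilOnly));
    assert!(!is_squatter(2.0, 0.9, 5.0, 10.0, SquatMode::UtilAndMem));
    // Low compute and low memory: flagged in both modes.
    assert!(is_squatter(2.0, 0.05, 5.0, 10.0, SquatMode::UtilAndMem));
    println!("ok");
}
```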
| Column | Description |
|---|---|
| `user` | Slurm username |
| `job_name` | Job name |
| `qos_name` | Quality of Service |
| `partition` | Slurm partition |
| `gpus_req` | GPUs requested |
| `runtime_sec` | Job runtime in seconds |
| `mean_gpu_util` | Mean GPU utilization across all GPUs (percent, 0–100) |
| `max_gpu_util` | Max GPU utilization across all GPUs (percent, 0–100) |
| `gpu_mem_used_ratio` | Ratio of used to total GPU memory (0.0–1.0) |
| `decode_status` | Decode outcome (always `decoded` in this file) |
| `squat_reason` | Why the job was flagged (see squat modes) |
| Column | Description |
|---|---|
| `user` | Slurm username |
| `flagged_jobs` | Number of flagged squatting jobs |
| `total_gpu_hours_flagged` | Total GPU-hours wasted (`runtime_sec / 3600 * gpus_req`) |
| `avg_mean_gpu_util` | Average mean GPU utilization across flagged jobs |
| `avg_gpu_mem_used_ratio` | Average GPU memory ratio across flagged jobs |
| Column | Description |
|---|---|
| `user` | Slurm username |
| `job_name` | Job name |
| `qos_name` | Quality of Service |
| `partition` | Slurm partition |
| `gpus_req` | GPUs requested |
| `runtime_sec` | Job runtime in seconds |
| `decode_status` | Why decoding failed (see decode status values below) |
| Status | Meaning |
|---|---|
| `decoded` | Successfully decoded and extracted GPU metrics |
| `missing_prefix` | `admin_comment` did not start with `JS1:` |
| `empty_payload` | Payload after `JS1:` was empty |
| `short_value` | Payload was `Short` — job too brief for stats collection |
| `none_value` | Payload was `None` — no stats available for this job |
| `base64_error` | Base64 decoding failed |
| `gzip_error` | Gzip decompression failed |
| `json_error` | JSON parsing failed |
| `missing_nodes` | JSON lacked a `nodes` object |
| `missing_gpu_util` | No `gpu_utilization` data found in any node |
```
cargo test
```

The test suite includes:

- **Unit tests** — decode logic (`decode.rs`), filter case-sensitivity (`filters.rs`), squat classification thresholds (`pipeline.rs`)
- **Integration tests** — end-to-end runs with fixture CSVs verifying row counts, output content, filter behavior, and squat mode differences (`tests/integration_api.rs`)