brown-ccv/gpu-node-squatters
gpu-node-squatters

A command-line tool for HPC cluster administrators to detect Slurm jobs that reserved GPUs but barely used them. It reads a Slurm accounting CSV export, decodes embedded GPU metrics from the admin_comment field, and flags jobs with low GPU utilization and/or memory usage. Output is a per-job CSV of flagged squatters and a per-user aggregate CSV sorted by total wasted GPU-hours.

Background

Slurm's jobstats plugin periodically samples GPU utilization and memory usage for running jobs, then writes the results into each job's admin_comment field using the format:

JS1:<base64(gzip(json))>

The JS1: prefix identifies the encoding version. The payload is a base64-encoded, gzip-compressed JSON object containing per-node, per-GPU metrics:

{
  "nodes": {
    "gpu01": {
      "gpu_utilization": { "0": 45.2, "1": 38.7 },
      "gpu_total_memory": { "0": 81920.0, "1": 81920.0 },
      "gpu_used_memory":  { "0": 40960.0, "1": 12288.0 }
    }
  }
}

Two sentinel values exist: JS1:Short (job too brief to collect stats) and JS1:None (no stats available).
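
The encoding round-trips with standard tools. The sketch below uses Python's standard library for brevity (the tool itself is written in Rust); the field and key names follow the format above:

```python
import base64, gzip, json

# Build a JS1 payload the way jobstats would: JSON -> gzip -> base64, "JS1:" prefix.
metrics = {
    "nodes": {
        "gpu01": {
            "gpu_utilization": {"0": 45.2, "1": 38.7},
            "gpu_total_memory": {"0": 81920.0, "1": 81920.0},
            "gpu_used_memory": {"0": 40960.0, "1": 12288.0},
        }
    }
}
payload = base64.b64encode(gzip.compress(json.dumps(metrics).encode())).decode()
admin_comment = f"JS1:{payload}"

# Decoding reverses the steps: strip the prefix, base64-decode, gunzip, parse JSON.
assert admin_comment.startswith("JS1:")
decoded = json.loads(gzip.decompress(base64.b64decode(admin_comment[4:])))
print(decoded["nodes"]["gpu01"]["gpu_utilization"]["0"])  # → 45.2
```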

Architecture

The pipeline has four stages:

  1. Lazy CSV filtering — Polars LazyFrame pushes row-level filters (GPU count, minimum runtime, QoS, partition) down to the scan phase, avoiding loading the full CSV into memory.
  2. Parallel decoding — Candidate rows are decoded in parallel via Rayon. Each decode performs base64 decoding, gzip decompression, and JSON parsing — a CPU-bound workload that benefits from multi-core parallelism.
  3. Squat classification — Successfully decoded rows are filtered by GPU utilization and (optionally) memory thresholds to identify squatters.
  4. Output generation — Flagged jobs are written as a per-job CSV. A per-user aggregate CSV is computed with total GPU-hours wasted, average utilization, and average memory ratio. An optional decode-failed CSV captures jobs that couldn't be decoded for audit.
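
The four stages can be condensed into a standard-library Python sketch (illustrative only — the real pipeline uses Polars and Rayon in Rust, and these function names are invented for this example):

```python
import base64, gzip, json
from statistics import mean

def decode_mean_util(comment):
    # Stage 2: decode one admin_comment; sentinels and bad prefixes yield no metrics.
    if not comment.startswith("JS1:") or comment[4:] in ("Short", "None", ""):
        return None
    data = json.loads(gzip.decompress(base64.b64decode(comment[4:])))
    utils = [u for node in data["nodes"].values()
             for u in node["gpu_utilization"].values()]
    return mean(utils)

def flag_squatters(rows, min_runtime=300, util_threshold=5.0):
    flagged = []
    for r in rows:
        # Stage 1: row-level filters (done lazily at scan time by Polars).
        if int(r["gpus_req"]) < 1 or int(r["sec_runtime"]) < min_runtime:
            continue
        # Stage 2 runs in parallel via Rayon in the real tool.
        u = decode_mean_util(r["admin_comment"])
        # Stage 3: squat classification against the utilization threshold.
        if u is not None and u < util_threshold:
            flagged.append({**r, "mean_gpu_util": u})
    return flagged  # Stage 4 would write these rows to the per-job CSV
```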

Key dependencies

| Crate | Role |
| --- | --- |
| polars | Lazy CSV scanning, filtering, aggregation |
| rayon | Parallel iteration for CPU-bound decoding |
| base64 | Decoding the base64 payload from admin_comment |
| flate2 | Gzip decompression of the decoded payload |
| serde_json | Parsing the decompressed JSON metrics |
| clap | CLI argument parsing with derive macros |
| anyhow | Ergonomic error propagation |

Input format

The input CSV must contain these columns (as produced by sacct --format=... or equivalent export):

| Column | Type | Description |
| --- | --- | --- |
| user | string | Slurm username |
| job_name | string | Job name |
| qos_name | string | Quality of Service name |
| partition | string | Slurm partition |
| gpus_req | int | Number of GPUs requested |
| sec_runtime | int | Job runtime in seconds |
| admin_comment | string | Encoded GPU metrics (see Background) |
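
A minimal input file with the required columns might look like the following (all values are made up for illustration, and the first admin_comment payload is truncated):

```
user,job_name,qos_name,partition,gpus_req,sec_runtime,admin_comment
alice,train_resnet,pri-gpu,gpu,2,7200,JS1:H4sIAAAA...
bob,short_probe,gpu,gpu,1,120,JS1:Short
```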

Build

cargo build --release

Usage

cargo run --release -- [OPTIONS]

CLI options

| Option | Default | Description |
| --- | --- | --- |
| --input | input/oscar_all_jobs.csv | Path to the input Slurm accounting CSV |
| --output-jobs | output/gpu_squat_jobs.csv | Path for per-job flagged output CSV |
| --output-users | output/gpu_squat_users.csv | Path for per-user aggregate output CSV |
| --output-decode-failed | (none) | Optional path for CSV of jobs that failed decoding |
| --min-runtime-sec | 300 | Minimum runtime (seconds) for a job to be evaluated |
| --util-threshold | 5.0 | Mean GPU utilization threshold (percent) |
| --mem-threshold | 10.0 | GPU memory usage threshold (percent) |
| --squat-mode | util-and-mem | Classification mode: util-only or util-and-mem |
| --qos | (all) | Case-sensitive QoS filter; repeatable (OR within QoS) |
| --partition | (all) | Case-sensitive partition filter; repeatable (OR within partition) |

If both --qos and --partition are provided, both filters are applied (AND across dimensions, OR within each).
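
The combined filter semantics amount to the following (a Python sketch; the function name is illustrative, and the real implementation lives in filters.rs):

```python
def passes_filters(row, qos=None, partitions=None):
    # OR within each repeated flag, AND across the two dimensions.
    # Matching is case-sensitive, mirroring the CLI's behavior.
    if qos and row["qos_name"] not in qos:
        return False
    if partitions and row["partition"] not in partitions:
        return False
    return True
```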

Example

cargo run --release -- \
  --input slurm_jobs.csv \
  --output-jobs gpu_squat_jobs.csv \
  --output-users gpu_squat_users.csv \
  --min-runtime-sec 300 \
  --util-threshold 5.0 \
  --mem-threshold 10.0 \
  --squat-mode util-and-mem \
  --qos pri-gpu+ \
  --partition gpu \
  --output-decode-failed gpu_decode_failed.csv

Squat modes

  • util-only — Flags jobs where mean_gpu_util < util_threshold. Use this when you only care about compute utilization (e.g., some workflows legitimately use large GPU memory with low compute).
  • util-and-mem — Flags jobs where both mean_gpu_util < util_threshold and gpu_mem_used_ratio < mem_threshold / 100. Use this (the default) to avoid false positives from jobs that reserve GPUs primarily for their memory.
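
The two modes reduce to a small decision function (a sketch; the name and reason strings are illustrative, not the tool's exact output):

```python
def squat_reason(mean_gpu_util, gpu_mem_used_ratio, mode="util-and-mem",
                 util_threshold=5.0, mem_threshold=10.0):
    # Returns a reason string for flagged jobs, or None if the job is not a squatter.
    low_util = mean_gpu_util < util_threshold
    low_mem = gpu_mem_used_ratio < mem_threshold / 100.0  # percent -> ratio
    if mode == "util-only" and low_util:
        return "low_util"
    if mode == "util-and-mem" and low_util and low_mem:
        return "low_util_and_mem"
    return None
```

Note how a memory-heavy job (high gpu_mem_used_ratio) escapes flagging under the default mode even with idle compute, which is exactly the false positive util-and-mem is meant to avoid.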

Output columns

gpu_squat_jobs.csv

| Column | Description |
| --- | --- |
| user | Slurm username |
| job_name | Job name |
| qos_name | Quality of Service |
| partition | Slurm partition |
| gpus_req | GPUs requested |
| runtime_sec | Job runtime in seconds |
| mean_gpu_util | Mean GPU utilization across all GPUs (percent, 0–100) |
| max_gpu_util | Max GPU utilization across all GPUs (percent, 0–100) |
| gpu_mem_used_ratio | Ratio of used to total GPU memory (0.0–1.0) |
| decode_status | Decode outcome (always decoded in this file) |
| squat_reason | Why the job was flagged (see squat modes) |

gpu_squat_users.csv

| Column | Description |
| --- | --- |
| user | Slurm username |
| flagged_jobs | Number of flagged squatting jobs |
| total_gpu_hours_flagged | Total GPU-hours wasted (runtime_sec / 3600 * gpus_req) |
| avg_mean_gpu_util | Average mean GPU utilization across flagged jobs |
| avg_gpu_mem_used_ratio | Average GPU memory ratio across flagged jobs |
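
The GPU-hours formula is worth a worked example. A sketch (the function name is illustrative):

```python
def gpu_hours(runtime_sec, gpus_req):
    # GPU-hours wasted by one flagged job: runtime_sec / 3600 * gpus_req.
    return runtime_sec / 3600 * gpus_req

# A flagged 2-GPU job that ran for 2 hours wasted 4 GPU-hours,
# regardless of how idle those GPUs actually were.
assert gpu_hours(7200, 2) == 4.0
```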

gpu_decode_failed.csv (optional)

| Column | Description |
| --- | --- |
| user | Slurm username |
| job_name | Job name |
| qos_name | Quality of Service |
| partition | Slurm partition |
| gpus_req | GPUs requested |
| runtime_sec | Job runtime in seconds |
| decode_status | Why decoding failed (see decode status values below) |

Decode status values

| Status | Meaning |
| --- | --- |
| decoded | Successfully decoded and extracted GPU metrics |
| missing_prefix | admin_comment did not start with JS1: |
| empty_payload | Payload after JS1: was empty |
| short_value | Payload was Short — job too brief for stats collection |
| none_value | Payload was None — no stats available for this job |
| base64_error | Base64 decoding failed |
| gzip_error | Gzip decompression failed |
| json_error | JSON parsing failed |
| missing_nodes | JSON lacked a nodes object |
| missing_gpu_util | No gpu_utilization data found in any node |
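
The statuses follow the decode sequence step by step, so they can be mirrored in a short Python sketch (illustrative only; the actual logic lives in decode.rs):

```python
import base64, binascii, gzip, json

def decode_status(admin_comment):
    # Walk the decode steps in order, returning the first failing status.
    if not admin_comment.startswith("JS1:"):
        return "missing_prefix"
    payload = admin_comment[4:]
    if payload == "":
        return "empty_payload"
    if payload == "Short":
        return "short_value"
    if payload == "None":
        return "none_value"
    try:
        raw = base64.b64decode(payload, validate=True)
    except binascii.Error:
        return "base64_error"
    try:
        text = gzip.decompress(raw)
    except (OSError, EOFError):
        return "gzip_error"
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return "json_error"
    if "nodes" not in data:
        return "missing_nodes"
    if not any("gpu_utilization" in node for node in data["nodes"].values()):
        return "missing_gpu_util"
    return "decoded"
```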

Testing

cargo test

The test suite includes:

  • Unit tests — decode logic (decode.rs), filter case-sensitivity (filters.rs), squat classification thresholds (pipeline.rs)
  • Integration tests — end-to-end runs with fixture CSVs verifying row counts, output content, filter behavior, and squat mode differences (tests/integration_api.rs)
