brown-ccv/gpu-node-squatters
gpu-node-squatters

A command-line tool for HPC cluster administrators to detect Slurm jobs that reserved GPUs but barely used them. It reads a Slurm accounting CSV export, decodes embedded GPU metrics from the admin_comment field, and flags jobs with low GPU utilization and/or memory usage. Output is a per-job CSV of flagged squatters and a per-user aggregate CSV sorted by total wasted GPU-hours.

Background

Slurm's jobstats plugin periodically samples GPU utilization and memory usage for running jobs, then writes the results into each job's admin_comment field using the format:

JS1:<base64(gzip(json))>

The JS1: prefix identifies the encoding version. The payload is a base64-encoded, gzip-compressed JSON object containing per-node, per-GPU metrics:

{
  "nodes": {
    "gpu01": {
      "gpu_utilization": { "0": 45.2, "1": 38.7 },
      "gpu_total_memory": { "0": 81920.0, "1": 81920.0 },
      "gpu_used_memory":  { "0": 40960.0, "1": 12288.0 }
    }
  }
}

Two sentinel values exist: JS1:Short (job too brief to collect stats) and JS1:None (no stats available).
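
The encoding round-trips with standard tools. The sketch below uses Python's standard library for brevity (the tool itself is written in Rust); the field and key names follow the format above:

```python
import base64, gzip, json

# Build a JS1 payload the way jobstats would: JSON -> gzip -> base64, "JS1:" prefix.
metrics = {
    "nodes": {
        "gpu01": {
            "gpu_utilization": {"0": 45.2, "1": 38.7},
            "gpu_total_memory": {"0": 81920.0, "1": 81920.0},
            "gpu_used_memory": {"0": 40960.0, "1": 12288.0},
        }
    }
}
payload = base64.b64encode(gzip.compress(json.dumps(metrics).encode())).decode()
admin_comment = f"JS1:{payload}"

# Decoding reverses the steps: strip the prefix, base64-decode, gunzip, parse JSON.
assert admin_comment.startswith("JS1:")
decoded = json.loads(gzip.decompress(base64.b64decode(admin_comment[4:])))
print(decoded["nodes"]["gpu01"]["gpu_utilization"]["0"])  # → 45.2
```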

Architecture

The pipeline has four stages:

  1. Lazy CSV filtering — Polars LazyFrame pushes row-level filters (GPU count, minimum runtime, QoS, partition) down to the scan phase, avoiding loading the full CSV into memory.
  2. Parallel decoding — Candidate rows are decoded in parallel via Rayon. Each decode performs base64 decoding, gzip decompression, and JSON parsing — a CPU-bound workload that benefits from multi-core parallelism.
  3. Squat classification — Successfully decoded rows are filtered by GPU utilization and (optionally) memory thresholds to identify squatters.
  4. Output generation — Flagged jobs are written as a per-job CSV. A per-user aggregate CSV is computed with total GPU-hours wasted, average utilization, and average memory ratio. An optional decode-failed CSV captures jobs that couldn't be decoded for audit.
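
The four stages can be condensed into a standard-library Python sketch (illustrative only — the real pipeline uses Polars and Rayon in Rust, and these function names are invented for this example):

```python
import base64, gzip, json
from statistics import mean

def decode_mean_util(comment):
    # Stage 2: decode one admin_comment; sentinels and bad prefixes yield no metrics.
    if not comment.startswith("JS1:") or comment[4:] in ("Short", "None", ""):
        return None
    data = json.loads(gzip.decompress(base64.b64decode(comment[4:])))
    utils = [u for node in data["nodes"].values()
             for u in node["gpu_utilization"].values()]
    return mean(utils)

def flag_squatters(rows, min_runtime=300, util_threshold=5.0):
    flagged = []
    for r in rows:
        # Stage 1: row-level filters (done lazily at scan time by Polars).
        if int(r["gpus_req"]) < 1 or int(r["sec_runtime"]) < min_runtime:
            continue
        # Stage 2 runs in parallel via Rayon in the real tool.
        u = decode_mean_util(r["admin_comment"])
        # Stage 3: squat classification against the utilization threshold.
        if u is not None and u < util_threshold:
            flagged.append({**r, "mean_gpu_util": u})
    return flagged  # Stage 4 would write these rows to the per-job CSV
```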

Key dependencies

| Crate | Role |
| --- | --- |
| polars | Lazy CSV scanning, filtering, aggregation |
| rayon | Parallel iteration for CPU-bound decoding |
| base64 | Decoding the base64 payload from admin_comment |
| flate2 | Gzip decompression of the decoded payload |
| serde_json | Parsing the decompressed JSON metrics |
| clap | CLI argument parsing with derive macros |
| anyhow | Ergonomic error propagation |

Input format

The input CSV must contain these columns (as produced by sacct --format=... or equivalent export):

| Column | Type | Description |
| --- | --- | --- |
| user | string | Slurm username |
| job_name | string | Job name |
| qos_name | string | Quality of Service name |
| partition | string | Slurm partition |
| gpus_req | int | Number of GPUs requested |
| sec_runtime | int | Job runtime in seconds |
| admin_comment | string | Encoded GPU metrics (see Background) |
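
A minimal input file with the required columns might look like the following (all values are made up for illustration, and the first admin_comment payload is truncated):

```
user,job_name,qos_name,partition,gpus_req,sec_runtime,admin_comment
alice,train_resnet,pri-gpu,gpu,2,7200,JS1:H4sIAAAA...
bob,short_probe,gpu,gpu,1,120,JS1:Short
```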

Build

cargo build --release

Usage

cargo run --release -- [OPTIONS]

CLI options

| Option | Default | Description |
| --- | --- | --- |
| --input | input/oscar_all_jobs.csv | Path to the input Slurm accounting CSV |
| --output-jobs | output/gpu_squat_jobs.csv | Path for per-job flagged output CSV |
| --output-users | output/gpu_squat_users.csv | Path for per-user aggregate output CSV |
| --output-decode-failed | (none) | Optional path for CSV of jobs that failed decoding |
| --min-runtime-sec | 300 | Minimum runtime (seconds) for a job to be evaluated |
| --util-threshold | 5.0 | Mean GPU utilization threshold (percent) |
| --mem-threshold | 10.0 | GPU memory usage threshold (percent) |
| --squat-mode | util-and-mem | Classification mode: util-only or util-and-mem |
| --qos | (all) | Case-sensitive QoS filter; repeatable (OR within QoS) |
| --partition | (all) | Case-sensitive partition filter; repeatable (OR within partition) |

If both --qos and --partition are provided, both filters are applied (AND across dimensions, OR within each).
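
The combined filter semantics amount to the following (a Python sketch; the function name is illustrative, and the real implementation lives in filters.rs):

```python
def passes_filters(row, qos=None, partitions=None):
    # OR within each repeated flag, AND across the two dimensions.
    # Matching is case-sensitive, mirroring the CLI's behavior.
    if qos and row["qos_name"] not in qos:
        return False
    if partitions and row["partition"] not in partitions:
        return False
    return True
```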

Example

cargo run --release -- \
  --input slurm_jobs.csv \
  --output-jobs gpu_squat_jobs.csv \
  --output-users gpu_squat_users.csv \
  --min-runtime-sec 300 \
  --util-threshold 5.0 \
  --mem-threshold 10.0 \
  --squat-mode util-and-mem \
  --qos pri-gpu+ \
  --partition gpu \
  --output-decode-failed gpu_decode_failed.csv

Squat modes

  • util-only — Flags jobs where mean_gpu_util < util_threshold. Use this when you only care about compute utilization (e.g., some workflows legitimately use large GPU memory with low compute).
  • util-and-mem — Flags jobs where both mean_gpu_util < util_threshold and gpu_mem_used_ratio < mem_threshold / 100. Use this (the default) to avoid false positives from jobs that reserve GPUs primarily for their memory.
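
The two modes reduce to a small decision function (a sketch; the name and reason strings are illustrative, not the tool's exact output):

```python
def squat_reason(mean_gpu_util, gpu_mem_used_ratio, mode="util-and-mem",
                 util_threshold=5.0, mem_threshold=10.0):
    # Returns a reason string for flagged jobs, or None if the job is not a squatter.
    low_util = mean_gpu_util < util_threshold
    low_mem = gpu_mem_used_ratio < mem_threshold / 100.0  # percent -> ratio
    if mode == "util-only" and low_util:
        return "low_util"
    if mode == "util-and-mem" and low_util and low_mem:
        return "low_util_and_mem"
    return None
```

Note how a memory-heavy job (high gpu_mem_used_ratio) escapes flagging under the default mode even with idle compute, which is exactly the false positive util-and-mem is meant to avoid.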

Output columns

gpu_squat_jobs.csv

| Column | Description |
| --- | --- |
| user | Slurm username |
| job_name | Job name |
| qos_name | Quality of Service |
| partition | Slurm partition |
| gpus_req | GPUs requested |
| runtime_sec | Job runtime in seconds |
| mean_gpu_util | Mean GPU utilization across all GPUs (percent, 0–100) |
| max_gpu_util | Max GPU utilization across all GPUs (percent, 0–100) |
| gpu_mem_used_ratio | Ratio of used to total GPU memory (0.0–1.0) |
| decode_status | Decode outcome (always decoded in this file) |
| squat_reason | Why the job was flagged (see squat modes) |

gpu_squat_users.csv

| Column | Description |
| --- | --- |
| user | Slurm username |
| flagged_jobs | Number of flagged squatting jobs |
| total_gpu_hours_flagged | Total GPU-hours wasted (runtime_sec / 3600 * gpus_req) |
| avg_mean_gpu_util | Average mean GPU utilization across flagged jobs |
| avg_gpu_mem_used_ratio | Average GPU memory ratio across flagged jobs |
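
The GPU-hours formula is worth a worked example. A sketch (the function name is illustrative):

```python
def gpu_hours(runtime_sec, gpus_req):
    # GPU-hours wasted by one flagged job: runtime_sec / 3600 * gpus_req.
    return runtime_sec / 3600 * gpus_req

# A flagged 2-GPU job that ran for 2 hours wasted 4 GPU-hours,
# regardless of how idle those GPUs actually were.
assert gpu_hours(7200, 2) == 4.0
```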

gpu_decode_failed.csv (optional)

| Column | Description |
| --- | --- |
| user | Slurm username |
| job_name | Job name |
| qos_name | Quality of Service |
| partition | Slurm partition |
| gpus_req | GPUs requested |
| runtime_sec | Job runtime in seconds |
| decode_status | Why decoding failed (see decode status values below) |

Decode status values

| Status | Meaning |
| --- | --- |
| decoded | Successfully decoded and extracted GPU metrics |
| missing_prefix | admin_comment did not start with JS1: |
| empty_payload | Payload after JS1: was empty |
| short_value | Payload was Short — job too brief for stats collection |
| none_value | Payload was None — no stats available for this job |
| base64_error | Base64 decoding failed |
| gzip_error | Gzip decompression failed |
| json_error | JSON parsing failed |
| missing_nodes | JSON lacked a nodes object |
| missing_gpu_util | No gpu_utilization data found in any node |
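
The statuses follow the decode sequence step by step, so they can be mirrored in a short Python sketch (illustrative only; the actual logic lives in decode.rs):

```python
import base64, binascii, gzip, json

def decode_status(admin_comment):
    # Walk the decode steps in order, returning the first failing status.
    if not admin_comment.startswith("JS1:"):
        return "missing_prefix"
    payload = admin_comment[4:]
    if payload == "":
        return "empty_payload"
    if payload == "Short":
        return "short_value"
    if payload == "None":
        return "none_value"
    try:
        raw = base64.b64decode(payload, validate=True)
    except binascii.Error:
        return "base64_error"
    try:
        text = gzip.decompress(raw)
    except (OSError, EOFError):
        return "gzip_error"
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return "json_error"
    if "nodes" not in data:
        return "missing_nodes"
    if not any("gpu_utilization" in node for node in data["nodes"].values()):
        return "missing_gpu_util"
    return "decoded"
```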

Testing

cargo test

The test suite includes:

  • Unit tests — decode logic (decode.rs), filter case-sensitivity (filters.rs), squat classification thresholds (pipeline.rs)
  • Integration tests — end-to-end runs with fixture CSVs verifying row counts, output content, filter behavior, and squat mode differences (tests/integration_api.rs)
