
Add Hugging Face dataset export flow for competition submissions #459

Open
SinatrasC wants to merge 6 commits into gpu-mode:main from SinatrasC:Sinatras-add-private-dataset

Conversation

@SinatrasC

Add Hugging Face dataset export flow for competition submissions

Summary

This PR adds a Hugging Face export path for competition submissions.

The goal is to make it easy to keep a private, continuously updated dataset of live competition submissions, while also giving admins a clean way to publish final competition data to a public dataset repo when a competition is over.

What changed

New export module

I added a new libkernelbot.hf_export module that handles the export flow end to end:

  • defines the parquet schema to match the existing GPUMODE/kernelbot-data layout
  • fetches leaderboard submissions from the database
  • normalizes data into parquet bytes with pyarrow
  • uploads files to Hugging Face dataset repos with huggingface-hub
  • filters for "real" active competitions by excluding expired boards, -dev boards, and effectively permanent/practice boards
  • blocks public exports for still-active competitions
  • uses submission_job_status for exported submission status, with a fallback for older rows that do not have job-status records yet
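
The "real active competition" filter described above can be sketched roughly as follows. This is a minimal illustration, not the code in libkernelbot.hf_export: the Leaderboard record, the function names, and the one-year PERMANENT_THRESHOLD used to detect "effectively permanent" boards are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional

# Hypothetical cutoff: a deadline further away than this is treated as an
# "effectively permanent" practice board. The PR does not state the real rule.
PERMANENT_THRESHOLD = timedelta(days=365)


@dataclass(frozen=True)
class Leaderboard:
    """Illustrative stand-in for a leaderboard row (not the real model)."""
    name: str
    deadline: datetime


def is_active_competition(board: Leaderboard, now: datetime) -> bool:
    """Return True only for boards that look like live competitions."""
    if board.deadline <= now:
        return False  # expired board
    if board.name.endswith("-dev"):
        return False  # development board
    if board.deadline - now > PERMANENT_THRESHOLD:
        return False  # effectively permanent / practice board
    return True


def active_competitions(
    boards: Iterable[Leaderboard], now: Optional[datetime] = None
) -> List[Leaderboard]:
    """Filter boards down to the ones a scheduled export would snapshot."""
    now = now or datetime.now(timezone.utc)
    return [b for b in boards if is_active_competition(b, now)]
```

Passing `now` explicitly keeps the filter deterministic, which makes it straightforward to unit test without patching the clock.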

Discord admin support

I added a new admin slash command:

  • /admin export-hf

That gives admins a direct way to export a specific competition leaderboard to either the private or public HF dataset repo.

I also added a scheduled daily export that snapshots all active competitions to the private repo as active_submissions.parquet when HF_TOKEN is configured.

Config and dependencies

This PR adds the required environment/config wiring:

  • HF_TOKEN
  • HF_PRIVATE_DATASET
  • HF_PUBLIC_DATASET
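
As a rough sketch of how these three variables might be read, the snippet below loads them into a small config object. Only the variable names come from this PR; the HFExportConfig class, the load_hf_config helper, and the choice to treat empty strings as unset are assumptions for illustration and may differ from what src/kernelbot/env.py does.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HFExportConfig:
    """Hypothetical container for the HF export settings."""
    token: Optional[str]
    private_dataset: Optional[str]
    public_dataset: Optional[str]

    @property
    def scheduled_export_enabled(self) -> bool:
        # The PR says the daily export runs when HF_TOKEN is configured.
        return bool(self.token)


def load_hf_config(environ: Mapping[str, str] = os.environ) -> HFExportConfig:
    """Read HF settings from an environment mapping, treating blanks as unset."""
    def get(name: str) -> Optional[str]:
        value = environ.get(name, "").strip()
        return value or None

    return HFExportConfig(
        token=get("HF_TOKEN"),
        private_dataset=get("HF_PRIVATE_DATASET"),
        public_dataset=get("HF_PUBLIC_DATASET"),
    )
```

Accepting the environment as a parameter lets tests pass a plain dict instead of mutating os.environ.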

It also adds the new dependencies needed for parquet generation and HF uploads:

  • huggingface-hub
  • pyarrow
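
To illustrate how the two dependencies could fit together, here is a hedged sketch that serializes rows to parquet bytes with pyarrow and uploads them with huggingface-hub. The column names and types are placeholders, not the actual GPUMODE/kernelbot-data schema, and the helper names are hypothetical. The third-party imports are deferred into the functions that need them so the normalization step can be used on its own.

```python
import io
from typing import Any, Dict, Iterable, Mapping

# Placeholder column set; the real schema mirrors GPUMODE/kernelbot-data.
COLUMNS = ("submission_id", "leaderboard_name", "user_id", "status", "score")


def normalize_row(row: Mapping[str, Any]) -> Dict[str, Any]:
    """Project a raw row onto the fixed column set, filling gaps with None."""
    return {name: row.get(name) for name in COLUMNS}


def rows_to_parquet_bytes(rows: Iterable[Mapping[str, Any]]) -> bytes:
    """Serialize rows into an in-memory parquet file."""
    import pyarrow as pa  # imported lazily; requires the pyarrow dependency
    import pyarrow.parquet as pq

    schema = pa.schema([
        ("submission_id", pa.int64()),
        ("leaderboard_name", pa.string()),
        ("user_id", pa.string()),
        ("status", pa.string()),
        ("score", pa.float64()),
    ])
    table = pa.Table.from_pylist([normalize_row(r) for r in rows], schema=schema)
    sink = io.BytesIO()
    pq.write_table(table, sink)
    return sink.getvalue()


def upload_parquet(data: bytes, repo_id: str, path_in_repo: str, token: str) -> None:
    """Upload parquet bytes to a Hugging Face dataset repo."""
    from huggingface_hub import HfApi  # imported lazily; requires huggingface-hub

    HfApi(token=token).upload_file(
        path_or_fileobj=data,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type="dataset",
    )
```

Building the file in memory avoids temporary files on the bot host, at the cost of holding the whole snapshot in RAM; for very large leaderboards a temp file might be the better choice.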

Why this is useful

Before this, exporting competition submission data was effectively an ad hoc task.

With this PR:

  • live competition data can be mirrored automatically to a private dataset
  • admins have a repeatable export path instead of one-off scripts
  • final competition data can be published in a more controlled way
  • the export logic lives in the codebase with tests instead of being operational tribal knowledge

Notes on behavior

A couple of intentional behaviors are worth calling out:

  • Public exports are blocked for active competitions.
  • The scheduled export targets only active competition boards and skips dev/permanent boards.
  • Exported status now comes from submission_job_status where available, instead of relying on the stale default value on leaderboard.submission.status.
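
The first and third behaviors can be sketched as two small helpers. This is an illustration under assumptions: the function names, the ExportBlockedError type, and the idea that the fallback is the legacy leaderboard.submission.status value are not confirmed by the PR text, which only says a fallback exists for rows without job-status records.

```python
from typing import Optional


class ExportBlockedError(ValueError):
    """Hypothetical error raised when an export is not allowed."""


def check_export_allowed(target: str, is_active: bool) -> None:
    """Reject public exports while the competition is still active."""
    if target not in ("private", "public"):
        raise ValueError(f"unknown export target: {target!r}")
    if target == "public" and is_active:
        raise ExportBlockedError("public export is blocked for active competitions")


def resolve_status(job_status: Optional[str], legacy_status: Optional[str]) -> Optional[str]:
    """Prefer the job-status record; fall back to the legacy column for older rows.

    The fallback source is an assumption made for this sketch.
    """
    return job_status if job_status is not None else legacy_status
```

An API layer could translate ExportBlockedError into the 400 response that the admin API test checks for.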

Testing

Ran:

  • uv run ruff check . --exclude examples/ --line-length 120
  • .venv/bin/pytest -s tests/test_hf_export.py

Results:

  • ruff passed
  • HF export tests passed

Copilot AI review requested due to automatic review settings March 8, 2026 21:40
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces an end-to-end Hugging Face dataset export flow for competition submissions, including both a scheduled “live” private export and explicit admin-triggered exports (Discord + HTTP admin API).

Changes:

  • Added libkernelbot.hf_export to fetch/normalize leaderboard submissions and export them as parquet to HF dataset repos.
  • Added admin controls: a Discord /admin export-hf command plus a new POST /admin/export-hf endpoint; also added a daily scheduled export of active competitions to the private repo.
  • Wired configuration/env (HF_TOKEN, HF_PRIVATE_DATASET, HF_PUBLIC_DATASET) and added huggingface-hub + pyarrow dependencies, with test coverage for the export logic.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.

File                              Description
uv.lock                           Locks new dependencies required for HF uploads and parquet generation.
pyproject.toml                    Adds huggingface-hub and pyarrow to project dependencies.
src/libkernelbot/hf_export.py     New HF export implementation (querying, normalization, parquet schema, upload).
src/kernelbot/env.py              Adds HF-related environment/config variables.
src/kernelbot/cogs/admin_cog.py   Adds Discord admin command + scheduled daily private export task.
src/kernelbot/api/main.py         Adds /admin/export-hf API endpoint for admin-triggered exports.
tests/test_hf_export.py           Adds unit tests for active leaderboard filtering, parquet generation, and query behavior.
tests/test_admin_api.py           Adds API test ensuring active public exports are rejected (400).

@SinatrasC SinatrasC requested a review from S1ro1 March 10, 2026 17:23

3 participants