Add Hugging Face dataset export flow for competition submissions #459
Open
SinatrasC wants to merge 6 commits into gpu-mode:main
Pull request overview
This PR introduces an end-to-end Hugging Face dataset export flow for competition submissions, including both a scheduled “live” private export and explicit admin-triggered exports (Discord + HTTP admin API).
Changes:
- Added `libkernelbot.hf_export` to fetch/normalize leaderboard submissions and export them as parquet to HF dataset repos.
- Added admin controls: a Discord `/admin export-hf` command plus a new `POST /admin/export-hf` endpoint; also added a daily scheduled export of active competitions to the private repo.
- Wired configuration/env (`HF_TOKEN`, `HF_PRIVATE_DATASET`, `HF_PUBLIC_DATASET`) and added `huggingface-hub` + `pyarrow` dependencies, with test coverage for the export logic.
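As a rough illustration of the configuration wiring, the three variables could be read into a small settings object. This is only a sketch: `HFExportConfig` and `load_hf_config` are hypothetical names, not taken from the PR, and the sketch assumes that the scheduled export is gated purely on `HF_TOKEN` being set, as the PR description states.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HFExportConfig:
    """Hypothetical container for the three settings named in this PR."""

    token: Optional[str]
    private_dataset: Optional[str]
    public_dataset: Optional[str]

    @property
    def scheduled_export_enabled(self) -> bool:
        # Assumption: the daily private export runs only when a token is configured.
        return bool(self.token)


def load_hf_config(environ: Mapping[str, str] = os.environ) -> HFExportConfig:
    """Read the HF settings from an environment-like mapping."""

    def _get(name: str) -> Optional[str]:
        # Treat unset and blank values the same way.
        value = environ.get(name, "").strip()
        return value or None

    return HFExportConfig(
        token=_get("HF_TOKEN"),
        private_dataset=_get("HF_PRIVATE_DATASET"),
        public_dataset=_get("HF_PUBLIC_DATASET"),
    )
```

Passing the mapping explicitly keeps the loader easy to exercise in tests without touching the real process environment.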
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| uv.lock | Locks new dependencies required for HF uploads and parquet generation. |
| pyproject.toml | Adds huggingface-hub and pyarrow to project dependencies. |
| src/libkernelbot/hf_export.py | New HF export implementation (querying, normalization, parquet schema, upload). |
| src/kernelbot/env.py | Adds HF-related environment/config variables. |
| src/kernelbot/cogs/admin_cog.py | Adds Discord admin command + scheduled daily private export task. |
| src/kernelbot/api/main.py | Adds /admin/export-hf API endpoint for admin-triggered exports. |
| tests/test_hf_export.py | Adds unit tests for active leaderboard filtering, parquet generation, and query behavior. |
| tests/test_admin_api.py | Adds API test ensuring active public exports are rejected (400). |
S1ro1 reviewed Mar 8, 2026
Add Hugging Face dataset export flow for competition submissions
Summary
This PR adds a Hugging Face export path for competition submissions.
The goal is to make it easy to keep a private, continuously updated dataset of live competition submissions, while also giving admins a clean way to publish final competition data to a public dataset repo when a competition is over.
What changed
New export module
I added a new `libkernelbot.hf_export` module that handles the export flow end to end:
- … `GPUMODE/kernelbot-data` layout
- … `pyarrow`
- … `huggingface-hub`
- … `-dev` boards, and effectively permanent/practice boards
- … `submission_job_status` for exported submission status, with a fallback for older rows that do not have job-status records yet
Discord admin support
I added a new admin slash command:
- `/admin export-hf`
That gives admins a direct way to export a specific competition leaderboard to either the private or public HF dataset repo.
I also added a scheduled daily export that snapshots all active competitions to the private repo as `active_submissions.parquet` when `HF_TOKEN` is configured.
Config and dependencies
This PR adds the required environment/config wiring:
- `HF_TOKEN`
- `HF_PRIVATE_DATASET`
- `HF_PUBLIC_DATASET`
It also adds the new dependencies needed for parquet generation and HF uploads:
- `huggingface-hub`
- `pyarrow`
Why this is useful
Before this, exporting competition submission data was effectively an ad hoc task.
With this PR:
Notes on behavior
A couple of intentional behaviors are worth calling out:
- … `submission_job_status` where available, instead of relying on the stale default value on `leaderboard.submission.status`.
Testing
Ran:
- `uv run ruff check . --exclude examples/ --line-length 120`
- `.venv/bin/pytest -s tests/test_hf_export.py`
Results:
- `ruff` passed