feat: enable recording episodes into existent dataset#14

Open
JeronimoMendes wants to merge 2 commits into
huggingface:mainfrom
JeronimoMendes:feat/Enable-recording-episodes-into-existent-dataset

@JeronimoMendes JeronimoMendes commented Jun 7, 2026

This PR:

  • extends the edit dataset page to support adding new episodes
  • shows sync status with the HF Hub
  • removes the upload dataset page in favor of the edit dataset page
  • matches component styling to the job page

Demo

[Two screenshots, taken 2026-06-07 at 02:11:20 and 02:48:53]

Disclaimer

AI tools such as Pi and Claude were used, especially on the frontend part of the PR, although I reviewed all of the generated code.

Closes #11

@nicolas-rabault nicolas-rabault self-assigned this Jun 8, 2026
nicolas-rabault added a commit that referenced this pull request Jun 8, 2026
`_CONTAINER_OUTPUT_DIR = "/tmp/lelab/train"` is a fixed path inside the
remote HF Jobs container (already documented above the line), not a
host-local temp dir, so B108 is a false positive here. The pre-commit
workflow runs bandit with `--all-files` on every PR, so this latent
violation — merged straight to main in 68d7b45, where pre-commit doesn't
run — was failing unrelated PRs (#13, #14). Add an inline `# nosec B108`
with justification to clear it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@nicolas-rabault nicolas-rabault self-requested a review June 8, 2026 10:19

@nicolas-rabault nicolas-rabault left a comment

Member

Very nice PR @JeronimoMendes
A few simple modifications:

  1. Dead code

The resume feature is driven entirely from EditDataset.tsx, which renders RecordingSettingsFields inline and calls rec.startRecording directly. It never uses RecordingModal. Consequences:

  • useRecording.ts:48 openForResume(...) is exported but never called anywhere.
  • The entire isResuming branch of RecordingModal.tsx is unreachable: the modal is only mounted in Landing.tsx, which always opens it via openForNew → resumeRepoId is always null there. So "Record More Episodes" / "Resume Recording" / the resume-repo display block / the "Append new episodes…" description can never render.

So the modal carries a resumeRepoId prop + branch and the hook carries an openForResume path that nothing exercises. Either wire the modal up for resume, or strip the resume branch from RecordingModal and drop openForResume. Given EditDataset already inlines the fields, the latter (delete) is the simpler, consistent choice.

  2. Sync-status mtime comparison is fragile (medium)

handle_dataset_sync_status compares local_root.stat().st_mtime (the top-level dataset dir) with the Hub's lastModified. Two failure modes:

  • A directory's mtime only changes when a direct child entry is added/removed/renamed, not when files deep in data/chunk-*/ or videos/ are written, nor when an existing top-level file is rewritten in place. New episodes can land without the top-level mtime advancing, so needs_sync can be falsely False. This is the worst direction: the UI says "in sync" when it isn't.
  • After a fresh download/clone from the Hub, local mtime = download time > hub_mtime, so it falsely reports needs_sync: True.

A more reliable signal would be a max-mtime walk over the dataset tree, or reading the dataset's own episode count/revision metadata.
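
For illustration, the max-mtime walk suggested above could be sketched as follows. This is a minimal sketch, not code from the PR: the helper name latest_mtime is hypothetical, and it only assumes the dataset lives in an ordinary local directory.

```python
import os
from pathlib import Path


def latest_mtime(root: Path) -> float:
    """Return the newest mtime of root or of any file/directory beneath it.

    Hypothetical helper: the name and signature are assumptions, not PR code.
    """
    newest = root.stat().st_mtime
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                entry_mtime = (Path(dirpath) / name).stat().st_mtime
            except FileNotFoundError:
                # Entry vanished mid-walk (or is a broken symlink); skip it.
                continue
            newest = max(newest, entry_mtime)
    return newest
```

Comparing latest_mtime(local_root) against the Hub timestamp would catch writes deep in the tree, which is the first failure mode. It would not help with the second one: a fresh download still stamps every file with the download time.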

@JeronimoMendes JeronimoMendes force-pushed the feat/Enable-recording-episodes-into-existent-dataset branch from 9eb6c58 to f53933d Compare June 8, 2026 23:30
@JeronimoMendes
Author

Thanks a lot for the quick review, @nicolas-rabault.

  1. Totally right about that; it should be fixed now.
  2. I thought about this and concluded that a simple file count might not be enough (I'm already thinking ahead to a possible episode deletion feature 😉, where a delete followed by a new episode would be considered "synced" with the Hub). In fact, this is a far more complex issue than this feature branch calls for, because to truly be in sync with the Hub we need to consider both pushes to and pulls from the Hub. To keep it simple, I've changed it so that we compute the local manifest (files + sizes) and match it against the one on the Hub, whilst still keeping the local dataset as the reference. Full bidirectional sync seems out of scope for this branch, let me know what you think!
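
As a rough illustration of the manifest comparison described in point 2, the check might look like the sketch below. It is not the PR's implementation: the names local_manifest and needs_sync are hypothetical, the Hub manifest is assumed to already be available as a plain path-to-size mapping (how it is fetched is left out), and "local as the reference" is interpreted here as ignoring files that exist only on the Hub.

```python
from pathlib import Path


def local_manifest(root: Path) -> dict[str, int]:
    """Map each file's POSIX-style path relative to root onto its size in bytes."""
    return {
        p.relative_to(root).as_posix(): p.stat().st_size
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def needs_sync(local: dict[str, int], hub: dict[str, int]) -> bool:
    """True if any local file is missing on the Hub or differs in size.

    Assumption: the local dataset is the reference, so Hub-only files
    do not count as a mismatch.
    """
    return any(hub.get(path) != size for path, size in local.items())
```

Sizes alone cannot detect a file rewritten with different content of the same length; hashing the files would close that gap at the cost of reading every file.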

Development

Successfully merging this pull request may close these issues.

Enable recording episodes into existent dataset

2 participants