docs(trainer): add JIT checkpointing to trainer recipes #46826

Merged
stevhliu merged 3 commits into huggingface:main from efazal:docs/jit-checkpointing
Jun 23, 2026

efazal (Contributor) commented Jun 22, 2026

Adds documentation for JIT checkpointing (enable_jit_checkpoint) to the Checkpointing section of trainer_recipes.md, as requested by @SunMarc in #41723.

Covers what JIT checkpointing is, how it differs from periodic checkpointing, usage, and orchestrator grace period configuration for Kubernetes and Slurm.
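
To illustrate the difference from periodic checkpointing, here is a minimal sketch of the general just-in-time pattern: a signal handler only records that termination was requested, and the training loop saves once when it sees that flag. This is not the Trainer's implementation of `enable_jit_checkpoint`; `JITCheckpointFlag`, `train`, `save_fn`, and `interrupt_at` are hypothetical names used only for this example, and the signal is raised in-process to stand in for the orchestrator.

```python
import signal


class JITCheckpointFlag:
    """Hypothetical helper: remembers that a termination signal arrived."""

    def __init__(self, signum=signal.SIGTERM):
        self.requested = False
        # Must be called from the main thread of the main interpreter.
        signal.signal(signum, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler tiny: set a flag and let the loop do the save.
        self.requested = True


def train(num_steps, flag, save_fn, interrupt_at=None):
    """Toy loop that saves only after a termination signal has been seen."""
    for step in range(num_steps):
        # ... one optimizer step would run here ...
        if interrupt_at is not None and step == interrupt_at:
            # Simulate the orchestrator sending SIGTERM to this process.
            signal.raise_signal(signal.SIGTERM)
        if flag.requested:
            save_fn(step)  # single "just in time" save, then stop
            return step
    return num_steps


saved = []
flag = JITCheckpointFlag()
stopped_at = train(10, flag, saved.append, interrupt_at=3)
print(f"saved at step(s) {saved}, stopped at step {stopped_at}")
```

A periodic scheme would instead call `save_fn` every N steps regardless of signals; the two can be combined, with the signal-driven save covering the work done since the last periodic checkpoint.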

Add documentation for the JIT checkpointing feature (enable_jit_checkpoint)
in the Checkpointing section of trainer_recipes.md, as requested during
PR huggingface#41723 review.

SunMarc (Member) left a comment

Thanks!

SunMarc requested a review from stevhliu June 23, 2026 15:07

stevhliu (Member) left a comment

Thanks for adding docs! Would you also mind updating the docstring to match the docs (`--signal=USR1@<seconds>` --> `--signal=TERM@300`)?

(default 30s is usually insufficient). For Slurm, use `--signal=USR1@<seconds>`.

Comment thread docs/source/en/trainer_recipes.md Outdated
efazal added 2 commits June 23, 2026 21:49
- Reword docs for clarity and tone consistency
- Clarify sentinel file behavior and that Trainer does not auto-check it
- Include kill_wait delay in grace period calculation
- Fix Slurm signal flag in training_args.py docstring (USR1 -> TERM)
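
As a rough illustration of the grace period arithmetic mentioned in these commits, the sketch below adds a `kill_wait` delay to a padded estimate of checkpoint save time. It assumes `kill_wait` is a delay that elapses before the save starts; the function name, the `safety_factor` parameter, and the example numbers are hypothetical and not taken from the merged docs.

```python
import math


def required_grace_period(save_seconds, kill_wait_seconds, safety_factor=1.5):
    """Estimate the grace period, in whole seconds, to request from the orchestrator.

    Assumption: the kill_wait delay passes first, then the checkpoint is written,
    so both have to fit before the orchestrator force-kills the process.
    """
    if save_seconds < 0 or kill_wait_seconds < 0:
        raise ValueError("durations must be non-negative")
    if safety_factor < 1:
        raise ValueError("safety_factor must be at least 1")
    return math.ceil(kill_wait_seconds + save_seconds * safety_factor)


# Hypothetical numbers: a 120 s save and a 3 s kill_wait delay.
grace = required_grace_period(save_seconds=120, kill_wait_seconds=3)

# Same flag format as the Slurm example discussed in the review.
slurm_flag = f"--signal=TERM@{grace}"
print(grace, slurm_flag)
```

The same number can be used wherever the orchestrator expects a termination grace period; measure the actual save time for your model and storage rather than relying on the values shown here.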
github-actions (Contributor)

CI Dashboard: View test results in Grafana

stevhliu added this pull request to the merge queue Jun 23, 2026

HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Merged via the queue into huggingface:main with commit c5deba2 Jun 23, 2026
29 checks passed

4 participants