From 4520b7df74c0902ed581da2a70655ca42f5ca428 Mon Sep 17 00:00:00 2001 From: Esa Fazal Date: Mon, 22 Jun 2026 23:46:55 +0100 Subject: [PATCH 1/2] docs(trainer): add JIT checkpointing section to trainer recipes Add documentation for the JIT checkpointing feature (enable_jit_checkpoint) in the Checkpointing section of trainer_recipes.md, as requested during PR #41723 review. --- docs/source/en/trainer_recipes.md | 54 +++++++++++++++++++++++++++++++ 1 file changed, 54 insertions(+) diff --git a/docs/source/en/trainer_recipes.md b/docs/source/en/trainer_recipes.md index 406e742986b0..56c2d4be1051 100644 --- a/docs/source/en/trainer_recipes.md +++ b/docs/source/en/trainer_recipes.md @@ -247,3 +247,57 @@ trainer.train(resume_from_checkpoint="out/checkpoint-1000") When resuming, [`Trainer`] restores the optimizer state, scheduler state, and RNG state. Checkpoint resuming requires optimizer and scheduler state files in the checkpoint directory. If those files are missing (for example, when `save_only_model=True`), the optimizer restarts from scratch. + +### JIT checkpointing + +With periodic checkpointing (`save_strategy="steps"` or `"epoch"`), any training progress between the last saved checkpoint and an interruption is lost. On shared clusters with preemptible workloads (e.g., [Kueue](https://kueue.sigs.k8s.io/)), jobs can be terminated at any time, and that gap can mean hours of wasted compute. + +JIT (Just-In-Time) checkpointing closes this gap. When the trainer receives a SIGTERM signal, it saves a checkpoint at the exact point training was interrupted, so you resume with zero loss of progress. It works alongside periodic checkpointing: periodic saves guard against crashes and hardware failures, while JIT saves guard against preemption and graceful shutdowns. + +Enable it by setting `enable_jit_checkpoint=True` in [`TrainingArguments`]. 
+ +```py +from transformers import TrainingArguments + +training_args = TrainingArguments( + output_dir="your-model", + enable_jit_checkpoint=True, +) +``` + +When SIGTERM is received, the trainer waits for the current training step to finish, saves a checkpoint, and stops training gracefully. A sentinel file (`checkpoint-is-incomplete.txt`) is written during the save and removed once the checkpoint is fully written, so incomplete checkpoints are easy to detect when resuming. + +Resume from the JIT checkpoint the same way as any other checkpoint. + +```py +trainer.train(resume_from_checkpoint=True) +``` + +> [!WARNING] +> You must configure your orchestrator to allow enough time for the checkpoint to complete. The default Kubernetes graceful shutdown period is only 30 seconds, which is typically not enough for larger models. + + + + +Set `terminationGracePeriodSeconds` in your Pod or Job spec. The exact field location varies by trainer (Kubeflow Training Operator, Ray, etc.). + +```yaml +spec: + template: + spec: + terminationGracePeriodSeconds: 300 +``` + + + + +Use `--signal=TERM@` in your sbatch script to send SIGTERM before the job time limit expires. + +```bash +#SBATCH --signal=TERM@300 +``` + + + + +Calculate the required grace period as the longest possible training step time plus the checkpoint saving time. For example, if a training step takes up to 2 minutes and saving a checkpoint takes 2 minutes, set at least 240 seconds of grace time. 
From f99502ab22d15ac326ea919d1c62033c908f6a66 Mon Sep 17 00:00:00 2001 From: Esa Fazal Date: Tue, 23 Jun 2026 21:47:36 +0100 Subject: [PATCH 2/2] Address review comments - Reword docs for clarity and tone consistency - Clarify sentinel file behavior and that Trainer does not auto-check it - Include kill_wait delay in grace period calculation - Fix Slurm signal flag in training_args.py docstring (USR1 -> TERM) --- docs/source/en/trainer_recipes.md | 8 ++++---- src/transformers/training_args.py | 2 +- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/source/en/trainer_recipes.md b/docs/source/en/trainer_recipes.md index 56c2d4be1051..f6f2c1e07cb7 100644 --- a/docs/source/en/trainer_recipes.md +++ b/docs/source/en/trainer_recipes.md @@ -250,9 +250,9 @@ Checkpoint resuming requires optimizer and scheduler state files in the checkpoi ### JIT checkpointing -With periodic checkpointing (`save_strategy="steps"` or `"epoch"`), any training progress between the last saved checkpoint and an interruption is lost. On shared clusters with preemptible workloads (e.g., [Kueue](https://kueue.sigs.k8s.io/)), jobs can be terminated at any time, and that gap can mean hours of wasted compute. +With periodic checkpointing (save_strategy="steps" or "epoch"), you lose any training progress between the last saved checkpoint and an interruption. On shared clusters with preemptible workloads such as [Kueue](https://kueue.sigs.k8s.io/), jobs can be terminated at any time, so that gap can mean hours of wasted compute. -JIT (Just-In-Time) checkpointing closes this gap. When the trainer receives a SIGTERM signal, it saves a checkpoint at the exact point training was interrupted, so you resume with zero loss of progress. It works alongside periodic checkpointing: periodic saves guard against crashes and hardware failures, while JIT saves guard against preemption and graceful shutdowns. +JIT (Just-In-Time) checkpointing closes this gap. 
When the trainer receives a SIGTERM signal, it saves a checkpoint at the exact point training was interrupted, so you resume with minimal loss of progress. It works alongside periodic checkpointing. Periodic saves guard against crashes and hardware failures, while JIT saves guard against preemption and graceful shutdowns. Enable it by setting `enable_jit_checkpoint=True` in [`TrainingArguments`]. @@ -265,7 +265,7 @@ training_args = TrainingArguments( ) ``` -When SIGTERM is received, the trainer waits for the current training step to finish, saves a checkpoint, and stops training gracefully. A sentinel file (`checkpoint-is-incomplete.txt`) is written during the save and removed once the checkpoint is fully written, so incomplete checkpoints are easy to detect when resuming. +When SIGTERM is received, [`Trainer`] waits for the current training step to finish, saves a checkpoint, and stops training gracefully. A sentinel file (`checkpoint-is-incomplete.txt`) is written when the save begins and removed once the checkpoint is fully written. If a checkpoint directory still contains this file, the save was interrupted before completing. [`Trainer`] doesn't check for it automatically, so inspect for it yourself before resuming. Resume from the JIT checkpoint the same way as any other checkpoint. @@ -300,4 +300,4 @@ Use `--signal=TERM@` in your sbatch script to send SIGTERM before the j -Calculate the required grace period as the longest possible training step time plus the checkpoint saving time. For example, if a training step takes up to 2 minutes and saving a checkpoint takes 2 minutes, set at least 240 seconds of grace time. +Calculate the required grace period as the longest possible training step time plus the checkpoint saving time, plus the 3 second `kill_wait` delay before the checkpoint begins. For example, if a training step takes up to 2 minutes and saving a checkpoint takes 2 minutes, set at least 243 seconds of grace time. 
diff --git a/src/transformers/training_args.py b/src/transformers/training_args.py index eb1231e02d50..0a1928aca1ad 100644 --- a/src/transformers/training_args.py +++ b/src/transformers/training_args.py @@ -482,7 +482,7 @@ class TrainingArguments: Enable Just-In-Time checkpointing on SIGTERM signal for graceful termination on preemptible workloads. **Important**: Configure your orchestrator's graceful shutdown period to allow sufficient time. For Kubernetes, set `terminationGracePeriodSeconds` - (default 30s is usually insufficient). For Slurm, use `--signal=USR1@`. + (default 30s is usually insufficient). For Slurm, use `--signal=TERM@`. Required grace period ≥ longest iteration time + checkpoint save time. > Hugging Face Hub Integration
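Reviewer note on the sentinel behavior and the grace period arithmetic described in this patch: since [`Trainer`] doesn't check for `checkpoint-is-incomplete.txt` itself, a resume script might filter checkpoints before calling `trainer.train`. The sketch below is only an illustration under stated assumptions. The helper names `latest_complete_checkpoint` and `required_grace_seconds` are hypothetical and not part of `transformers`; it assumes checkpoints live in `checkpoint-<step>` directories under `output_dir` and that the `kill_wait` delay is the 3 seconds quoted in the docs.

```python
from pathlib import Path

# File name taken from the docs in this patch.
SENTINEL_NAME = "checkpoint-is-incomplete.txt"
# Delay quoted in the docs; assumed to be fixed for this sketch.
KILL_WAIT_SECONDS = 3


def latest_complete_checkpoint(output_dir):
    """Return the newest checkpoint-<step> directory that has no sentinel file, or None.

    Hypothetical helper: assumes the checkpoint-<step> naming used in the docs.
    """
    candidates = []
    for path in Path(output_dir).glob("checkpoint-*"):
        step = path.name.rsplit("-", 1)[-1]
        if path.is_dir() and step.isdigit():
            candidates.append((int(step), path))
    # Walk from the highest step down and skip interrupted saves.
    for _, path in sorted(candidates, reverse=True):
        if not (path / SENTINEL_NAME).exists():
            return path
    return None


def required_grace_seconds(max_step_seconds, save_seconds, kill_wait_seconds=KILL_WAIT_SECONDS):
    """Grace period = longest training step + checkpoint save time + kill_wait delay."""
    return max_step_seconds + save_seconds + kill_wait_seconds
```

A resume script could then pass the result explicitly, for example `trainer.train(resume_from_checkpoint=str(path))` when `path` is not `None`. With the numbers from the docs, `required_grace_seconds(120, 120)` gives the 243 seconds mentioned in the example.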