Enable graceful shutdown when an error is raised #92

@lewtun

Description

Hi folks, I'm testing out PipelineRL and have noticed that when I launch a training job, the job keeps running even after a fatal error is raised, like this:

```
[preprocessor]: 2025-11-05 22:58:15,270 - pipelinerl.preprocess - ERROR - Current length: 0, target samples per lead: 32, samples per trainer: 6
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1208, in <module>
[preprocessor]: 2025-11-05 22:58:15,270 - pipelinerl.preprocess - ERROR - AssertionError: Current batch should not be empty when writing
[preprocessor]: 2025-11-05 22:58:15,270 - pipelinerl.preprocess - ERROR - Current length: 0, target samples per lead: 32, samples per trainer: 6
[preprocessor]: 2025-11-05 22:58:15,271 - pipelinerl.preprocess - ERROR - AssertionError: Current batch should not be empty when writing
[preprocessor]: 2025-11-05 22:58:15,271 - pipelinerl.preprocess - ERROR - Current length: 0, target samples per lead: 32, samples per trainer: 6
[preprocessor]: 2025-11-05 22:58:15,271 - pipelinerl.preprocess - ERROR - AssertionError: Current batch should not be empty when writing
[preprocessor]: 2025-11-05 22:58:15,271 - pipelinerl.preprocess - ERROR - Current length: 0, target samples per lead: 32, samples per trainer: 6
    main()
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1204, in main
    launch_command(args)
[preprocessor]: 2025-11-05 22:58:15,272 - pipelinerl.preprocess - ERROR - AssertionError: Current batch should not be empty when writing
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1183, in launch_command
    deepspeed_launcher(args)
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 868, in deepspeed_launcher
[preprocessor]: 2025-11-05 22:58:15,272 - pipelinerl.preprocess - ERROR - Current length: 0, target samples per lead: 32, samples per trainer: 6
    distrib_run.run(args)
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
[preprocessor]: 2025-11-05 22:58:15,272 - pipelinerl.preprocess - ERROR - AssertionError: Current batch should not be empty when writing
[preprocessor]: 2025-11-05 22:58:15,273 - pipelinerl.preprocess - ERROR - Current length: 0, target samples per lead: 32, samples per trainer: 6
[preprocessor]: 2025-11-05 22:58:15,273 - pipelinerl.preprocess - ERROR - AssertionError: Current batch should not be empty when writing
[preprocessor]: 2025-11-05 22:58:15,273 - pipelinerl.preprocess - ERROR - Current length: 0, target samples per lead: 32, samples per trainer: 6
    elastic_launch(
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
[preprocessor]: 2025-11-05 22:58:15,273 - pipelinerl.preprocess - ERROR - AssertionError: Current batch should not be empty when writing
[preprocessor]: 2025-11-05 22:58:15,274 - pipelinerl.preprocess - ERROR - Current length: 0, target samples per lead: 32, samples per trainer: 6
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
[preprocessor]: 2025-11-05 22:58:15,274 - pipelinerl.preprocess - ERROR - AssertionError: Current batch should not be empty when writing
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
pipelinerl/entrypoints/run_finetune.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-11-05_22:58:08
  host      : ip-26-0-172-73.ec2.internal
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 669951)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 669951
=======================================================
[preprocessor]: 2025-11-05 22:58:15,274 - pipelinerl.preprocess - ERROR - Current length: 0, target samples per lead: 32, samples per trainer: 6
```

I know the root cause of this particular error, but more generally: is there a way to shut the whole job down gracefully when an error like this, or a CUDA OOM, is triggered?

Without this, it is quite hard to schedule Slurm jobs with PipelineRL, because the Slurm controller relies on failed jobs actually exiting so they can be flushed from the current pool.
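For what it's worth, here is a minimal sketch of the kind of behavior I have in mind: a supervisor that launches worker commands and, as soon as any one of them exits non-zero, terminates the remaining workers and exits non-zero itself so a scheduler like Slurm sees the job as failed. This is purely illustrative (not PipelineRL code); `WORKERS` and `supervise` are hypothetical names, and it assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys

# Hypothetical example workers: one long-running, one that fails fast.
WORKERS = [
    [sys.executable, "-c", "import time; time.sleep(60)"],  # long-running worker
    [sys.executable, "-c", "raise SystemExit(1)"],          # worker that fails fast
]

def supervise(commands):
    """Run commands concurrently; propagate the first failure to all siblings."""
    # start_new_session=True puts each worker in its own process group, so a
    # later os.killpg also reaches any children the worker itself spawned.
    procs = [subprocess.Popen(cmd, start_new_session=True) for cmd in commands]
    exit_code = 0
    while procs:
        pid, status = os.wait()                    # blocks until any child exits
        code = os.waitstatus_to_exitcode(status)   # negative means killed by signal
        procs = [p for p in procs if p.pid != pid]
        if code != 0 and exit_code == 0:
            exit_code = 1
            for p in procs:                        # tear down the survivors
                try:
                    os.killpg(p.pid, signal.SIGTERM)
                except ProcessLookupError:
                    pass                           # already gone
    return exit_code

rc = supervise(WORKERS)
print("supervisor exit code:", rc)
```

The key point is that the supervisor returns promptly with a non-zero code instead of waiting out the 60-second worker, which is exactly what a Slurm controller needs in order to flush the job.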
