Enable graceful shutdown when an error is raised #92

@lewtun

Description

Hi folks, I'm testing out PipelineRL and have noticed that when I launch a training job, the job keeps running even after a fatal error is raised, like this:

```
[preprocessor]: 2025-11-05 22:58:15,270 - pipelinerl.preprocess - ERROR - Current length: 0, target samples per lead: 32, samples per trainer: 6
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1208, in <module>
[preprocessor]: 2025-11-05 22:58:15,270 - pipelinerl.preprocess - ERROR - AssertionError: Current batch should not be empty when writing
[preprocessor]: 2025-11-05 22:58:15,270 - pipelinerl.preprocess - ERROR - Current length: 0, target samples per lead: 32, samples per trainer: 6
[preprocessor]: 2025-11-05 22:58:15,271 - pipelinerl.preprocess - ERROR - AssertionError: Current batch should not be empty when writing
[preprocessor]: 2025-11-05 22:58:15,271 - pipelinerl.preprocess - ERROR - Current length: 0, target samples per lead: 32, samples per trainer: 6
[preprocessor]: 2025-11-05 22:58:15,271 - pipelinerl.preprocess - ERROR - AssertionError: Current batch should not be empty when writing
[preprocessor]: 2025-11-05 22:58:15,271 - pipelinerl.preprocess - ERROR - Current length: 0, target samples per lead: 32, samples per trainer: 6
    main()
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1204, in main
    launch_command(args)
[preprocessor]: 2025-11-05 22:58:15,272 - pipelinerl.preprocess - ERROR - AssertionError: Current batch should not be empty when writing
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1183, in launch_command
    deepspeed_launcher(args)
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 868, in deepspeed_launcher
[preprocessor]: 2025-11-05 22:58:15,272 - pipelinerl.preprocess - ERROR - Current length: 0, target samples per lead: 32, samples per trainer: 6
    distrib_run.run(args)
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
[preprocessor]: 2025-11-05 22:58:15,272 - pipelinerl.preprocess - ERROR - AssertionError: Current batch should not be empty when writing
[preprocessor]: 2025-11-05 22:58:15,273 - pipelinerl.preprocess - ERROR - Current length: 0, target samples per lead: 32, samples per trainer: 6
[preprocessor]: 2025-11-05 22:58:15,273 - pipelinerl.preprocess - ERROR - AssertionError: Current batch should not be empty when writing
[preprocessor]: 2025-11-05 22:58:15,273 - pipelinerl.preprocess - ERROR - Current length: 0, target samples per lead: 32, samples per trainer: 6
    elastic_launch(
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
[preprocessor]: 2025-11-05 22:58:15,273 - pipelinerl.preprocess - ERROR - AssertionError: Current batch should not be empty when writing
[preprocessor]: 2025-11-05 22:58:15,274 - pipelinerl.preprocess - ERROR - Current length: 0, target samples per lead: 32, samples per trainer: 6
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
[preprocessor]: 2025-11-05 22:58:15,274 - pipelinerl.preprocess - ERROR - AssertionError: Current batch should not be empty when writing
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
pipelinerl/entrypoints/run_finetune.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-11-05_22:58:08
  host      : ip-26-0-172-73.ec2.internal
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 669951)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 669951
=======================================================
[preprocessor]: 2025-11-05 22:58:15,274 - pipelinerl.preprocess - ERROR - Current length: 0, target samples per lead: 32, samples per trainer: 6
```

I know the root cause of this particular error, but more generally: is there a way to shut the whole job down gracefully when an error like this, or a CUDA OOM, is triggered?

Without this, it is quite hard to schedule Slurm jobs with PipelineRL, because the Slurm controller relies on failed jobs actually exiting so they can be flushed from the current pool.
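For what it's worth, here is a minimal sketch of the kind of behavior I have in mind: a supervisor that launches worker commands and, as soon as any one of them exits non-zero, terminates the remaining workers and exits non-zero itself so a scheduler like Slurm sees the job as failed. This is purely illustrative (not PipelineRL code); `WORKERS` and `supervise` are hypothetical names, and it assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys

# Hypothetical example workers: one long-running, one that fails fast.
WORKERS = [
    [sys.executable, "-c", "import time; time.sleep(60)"],  # long-running worker
    [sys.executable, "-c", "raise SystemExit(1)"],          # worker that fails fast
]

def supervise(commands):
    """Run commands concurrently; propagate the first failure to all siblings."""
    # start_new_session=True puts each worker in its own process group, so a
    # later os.killpg also reaches any children the worker itself spawned.
    procs = [subprocess.Popen(cmd, start_new_session=True) for cmd in commands]
    exit_code = 0
    while procs:
        pid, status = os.wait()                    # blocks until any child exits
        code = os.waitstatus_to_exitcode(status)   # negative means killed by signal
        procs = [p for p in procs if p.pid != pid]
        if code != 0 and exit_code == 0:
            exit_code = 1
            for p in procs:                        # tear down the survivors
                try:
                    os.killpg(p.pid, signal.SIGTERM)
                except ProcessLookupError:
                    pass                           # already gone
    return exit_code

rc = supervise(WORKERS)
print("supervisor exit code:", rc)
```

The key point is that the supervisor returns promptly with a non-zero code instead of waiting out the 60-second worker, which is exactly what a Slurm controller needs in order to flush the job.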
