Hi folks, I'm testing out PipelineRL and have noticed that when I launch a training job, the job will continue running even if a fatal error is raised like this:
[preprocessor]: 2025-11-05 22:58:15,270 - pipelinerl.preprocess - ERROR - AssertionError: Current batch should not be empty when writing
[preprocessor]: 2025-11-05 22:58:15,270 - pipelinerl.preprocess - ERROR - Current length: 0, target samples per lead: 32, samples per trainer: 6
[... the two preprocessor ERROR lines above repeat many times, interleaved with the trainer traceback below ...]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1208, in <module>
    main()
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1204, in main
    launch_command(args)
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1183, in launch_command
    deepspeed_launcher(args)
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/accelerate/commands/launch.py", line 868, in deepspeed_launcher
    distrib_run.run(args)
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fsx/lewis/git/pipeline-rl-cmu/prl/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
pipelinerl/entrypoints/run_finetune.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-11-05_22:58:08
  host      : ip-26-0-172-73.ec2.internal
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 669951)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 669951
=======================================================
I know the root cause of this particular error, but more generally: is there a way to shut the whole job down gracefully when a fatal error like this (or a CUDA OOM) is raised in one of the processes?
Without this, it is quite hard to schedule PipelineRL runs under Slurm, since the Slurm controller relies on the job's exit status to flush failed jobs from the current pool.