Fix infinite loop in MPITaskScheduler
#3800
base: master
Conversation
`pending_task_q` is a `multiprocessing.Queue`, and so some (implicit) pickle serialization happens there (I'm not sure exactly when) in order to make the objects movable between processes. So that makes me immediately suspicious about "a related performance issue with tasks potentially getting unpacked multiple times" without seeing some numbers: although the Parsl code is not reserializing everything, is Python reserializing and deserializing everything each time round when it puts things on the multiprocessing queue?
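For what it's worth, that implicit serialization is easy to observe in a standalone demo (nothing Parsl-specific here, and `Noisy` is just an illustrative class): `multiprocessing.Queue` pickles the object in a background feeder thread shortly after `put()`, and `get()` unpickles it, even within a single process:

```python
import multiprocessing

class Noisy:
    """Prints whenever pickle serializes or deserializes an instance."""
    def __getstate__(self):
        print("pickled")
        return {}
    def __setstate__(self, state):
        print("unpickled")

if __name__ == "__main__":
    q = multiprocessing.Queue()
    q.put(Noisy())  # a background feeder thread pickles this soon after put()
    q.get()         # get() deserializes, even though we never left this process
    # Prints "pickled" then "unpickled": one full round trip per put/get cycle.
```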
I guess that doesn't matter too much - a task only goes into the …
Alright, here's the difference: …
I think it's fine - I was misunderstanding the backlog queue to be a …
This needs updating against master (now that #3794 is merged).
…klog_queue and then attempt to schedule them, avoiding the infinite loop.
* Adding regression test `test_tiny_large_loop` that triggers `RecursionError: maximum recursion depth exceeded while getting the repr of an object`
…ng backlog processing
Force-pushed from adfe740 to 27ad9df
see other comment about test hang
This dropped out of the merge queue due to a hang again. The first time it dropped out, I investigated and it looked like a multiprocessing-related monitoring problem, unrelated to this PR. The second time it dropped out, I looked at the logs for that failed test, and it looks like the hang is inside an MPI executor test, so I would like to review that a bit more seriously. The run directory is …

As an example of suspicious behaviour: htex task 106 is placed into the backlog in …

… returns nothing. I haven't audited the other tasks in that log for liveness. If it is only the last task failing, then maybe there is a race condition around final task handling ("more tasks need to keep arriving to keep backlog tasks scheduled", or something like that?). I have also not investigated the code in depth to look at that. I've dropped my approval for this PR until more eyes look at this case.
I'll note that the final task completion is logged in the same millisecond as the final backlog placement:

…
My gut then says there's a race condition:

T1: decide a task needs to be backlogged
on the results side: …

and now no event ever happens to schedule the backlog. I have not really dug into this, and I have not looked at code prior to the PR to see if I get the same gut feeling there.
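Spelled out as one possible interleaving (my reconstruction, not verified against the code; the thread roles are assumed):

```
submit side                          results side
-----------                          ------------
check capacity -> not enough
                                     final result arrives
                                     drain + schedule backlog (still empty)
put task onto backlog_queue
                                     (no further results, so nothing ever
                                      drains the backlog again)
```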
Description
Currently, the `MPITaskScheduler`'s `schedule_backlog_tasks` method takes tasks from the backlog and attempts to schedule them until the queue is empty. However, since calling `put_task` puts the task back onto the backlog queue, this ends up in an infinite loop if at least one task cannot be scheduled. This PR fixes this bug and a related performance issue with tasks potentially getting unpacked multiple times.
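As an illustration of the failure mode, here is a minimal sketch (simplified stand-ins, not Parsl's actual code; `can_schedule`, `dispatch`, and the module-level `backlog_queue` are hypothetical helpers):

```python
import queue

backlog_queue: queue.Queue = queue.Queue()

def can_schedule(task) -> bool:
    # Stand-in for the real resource check; pretend no nodes are free.
    return False

def dispatch(task) -> None:
    print(f"dispatching {task}")

def put_task(task) -> None:
    if can_schedule(task):
        dispatch(task)
    else:
        backlog_queue.put(task)  # unschedulable tasks go back onto the backlog

def schedule_backlog_tasks() -> None:
    # BUG: put_task() re-enqueues any task it cannot schedule, so with at
    # least one unschedulable task in the queue, empty() never becomes
    # true and this loop spins forever.
    while not backlog_queue.empty():
        put_task(backlog_queue.get())
```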
Changed Behaviour

* `schedule_backlog_tasks` is changed to fetch all tasks in the `backlog_queue` and then attempt to schedule them, avoiding the infinite loop.
* `put_task` is divided into packing and scheduling, with a new `schedule_task` method. Backlog processing calls `schedule_task` to avoid redundant unpacking (a sketch follows below).

These are changes split from #3783 to keep the PR concise.
This is split 2 of 3.
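Under the same assumed helpers as the sketch above (`can_schedule`, `dispatch`, `backlog_queue`), the reworked flow might look roughly like this. This is illustrative only; the real `MPITaskScheduler` differs in detail:

```python
def schedule_task(task) -> None:
    # The scheduling half of put_task(): the task is already unpacked,
    # so no repeated unpacking happens on each backlog pass.
    if can_schedule(task):
        dispatch(task)
    else:
        backlog_queue.put(task)

def schedule_backlog_tasks() -> None:
    # Drain the backlog up front so that tasks re-enqueued during this
    # pass are not reconsidered until the next pass: no infinite loop.
    backlogged = []
    while True:
        try:
            backlogged.append(backlog_queue.get(block=False))
        except queue.Empty:
            break
    for task in backlogged:
        schedule_task(task)
```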
Type of change