Description
In `distributed==2022.5.0` we started seeing workers go down repeatedly with the message "Worker process died unexpectedly". I traced it back to #6200 and connected the dots that these tasks hold the GIL for long periods of time (one example is a third-party numerical optimization solver that we call out to).
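To illustrate why a GIL-holding task can look like a dead worker, here is a toy, standard-library-only sketch. It is not how `distributed` implements heartbeats; the `heartbeat_gaps` and `gil_holding_task` names are made up for this example, and `sum(range(...))` merely stands in for a solver that never releases the GIL. The idea is that a background "heartbeat" thread cannot run while another thread sits inside a single long C-level call.

```python
import threading
import time


def heartbeat_gaps(blocking_call, interval=0.01):
    """Run blocking_call while a background thread 'heartbeats'.

    Returns (largest gap between heartbeats, duration of blocking_call),
    both in seconds. Hypothetical helper, for illustration only.
    """
    stamps = []
    stop = threading.Event()

    def beat():
        # Record a timestamp every `interval` seconds until told to stop.
        while not stop.is_set():
            stamps.append(time.monotonic())
            time.sleep(interval)

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    time.sleep(10 * interval)  # let a few normal heartbeats through first

    start = time.monotonic()
    blocking_call()
    elapsed = time.monotonic() - start

    time.sleep(10 * interval)  # let heartbeats resume afterwards
    stop.set()
    thread.join()

    gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    return max(gaps), elapsed


def gil_holding_task():
    # A single C-level loop: no Python bytecode runs in between, so on a
    # standard (GIL-enabled) CPython build other threads cannot be scheduled.
    return sum(range(30_000_000))


max_gap, elapsed = heartbeat_gaps(gil_holding_task)
print(f"task took {elapsed:.2f}s, longest heartbeat gap {max_gap:.2f}s")
```

On a standard CPython build the longest gap should be roughly as long as the task itself, which is the kind of silence a liveness timeout would interpret as a lost worker.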
I follow the reasoning behind the change, but I think it has the potential to be pretty disruptive for some users, especially given the vagueness of the relevant logs. I put this together quickly only because I had been browsing the issues recently and had some inkling that this had been discussed (though I didn't know a change had already been merged and released). For users less familiar with `distributed`, this might be very challenging to get to the bottom of.

At minimum, I think adding some messaging that points to the `worker-ttl` variable would be helpful. Admittedly, the "blah has been holding the GIL for n seconds" message is an existing clue, but personally I tune those out because we see them so often.
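For anyone landing here with the same symptom, a minimal sketch of adjusting the setting follows. It assumes the full config key is `distributed.scheduler.worker-ttl` and that it is read when the scheduler starts, so it would need to be set in the scheduler's process (or in its config file or environment) beforehand. The `"1 hour"` value and the `configure_worker_ttl` helper are illustrative only, not a recommendation.

```python
import dask

# Assumption: this is the full key behind the `worker-ttl` setting mentioned
# above. Check the configuration reference for your installed version.
WORKER_TTL_KEY = "distributed.scheduler.worker-ttl"


def configure_worker_ttl(ttl):
    """Set the worker TTL in the in-process Dask config and return it.

    `ttl` is a duration string such as "1 hour"; passing None is assumed
    to disable the check. Hypothetical helper, for illustration only.
    """
    dask.config.set({WORKER_TTL_KEY: ttl})
    return dask.config.get(WORKER_TTL_KEY)


# Must run before the scheduler is created for the value to take effect.
effective_ttl = configure_worker_ttl("1 hour")
print(f"{WORKER_TTL_KEY} = {effective_ttl!r}")
```

Raising the value trades slower detection of workers that really are dead for tolerance of long GIL-holding tasks, so the right number depends on how long the longest such task runs.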