"Worker process died unexpectedly" when tasks don't release GIL #6324

@bnaul

In distributed==2022.5.0 we started seeing workers go down repeatedly with the message "Worker process died unexpectedly". I traced it back to #6200 and realized that the affected tasks hold the GIL for long periods of time (one example is a third-party numerical optimization solver that we call out to).

I follow the reasoning of the change, but I do think it has the potential to be pretty disruptive for some users, especially given the vagueness of the relevant logs. I was able to put this together quickly only because I had been browsing the issues recently and had some inkling that this had been discussed (though I didn't know a change had already been merged and released). For users less familiar with distributed, this might be very challenging to get to the bottom of.

At minimum, I think adding some messaging that points to the worker-ttl variable would be helpful. (Admittedly, the "blah has been holding the GIL for n seconds" message is an existing clue, but personally I tune those out because we see them so often.)
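For anyone else hitting this, here is a minimal sketch of a possible workaround. It assumes the setting in question is the `distributed.scheduler.worker-ttl` config key and that a longer timeout is acceptable for your workload. The 15-minute value is arbitrary, chosen only for illustration.

```python
import dask

# Assumption: `distributed.scheduler.worker-ttl` is the setting that controls how
# long the scheduler waits for a worker heartbeat before giving up on the worker.
# "15 minutes" is an illustrative value, not a recommendation.
dask.config.set({"distributed.scheduler.worker-ttl": "15 minutes"})

# Read the value back to confirm it was applied in this process.
print(dask.config.get("distributed.scheduler.worker-ttl"))
```

The value most likely needs to be in place in the process that runs the scheduler before the scheduler starts. Setting it only on the client or on the workers may have no effect.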

cc @mrocklin @gjoseph92

Labels: discussion (Discussing a topic with no specific actions yet)
