-
Notifications
You must be signed in to change notification settings - Fork 178
fix: race condition with interruptibles #566
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Hey @elasticspoon, thanks for looking into this! I haven't been able to reproduce this with
unfortunately. However, I'm not sure I get what's happening... 🤔 There's a single scheduler here being configured, not two, and the scheduler runs in a separate process, not a thread. The signal should be handled by the main thread in that process. |
Whoops. You are right @rosa. I got my concurrency primitives mixed up. I fixed up the PR descriptions but just replace Thread in what I was saying with Process. Regarding the reproduction bit, the laptop is pretty crappy so thats probably why it happens pretty consistently for me (similar to CI). But here are maybe some more descriptive logs of what happens during a failure: Registered 38637 solid-queue-dispatcher(1.1.5): dispatching every 1 seconds
Registered 38641 solid-queue-worker(1.1.5): waiting for jobs in first_queue
Creating pipe for solid-queue-supervisor(1.1.5): supervising 38637, 38641, 38646, 38653
Registered 38646 solid-queue-worker(1.1.5): waiting for jobs in second_queue
PID 38631 solid-queue-supervisor(1.1.5): supervising 38637, 38641, 38646, 38653 sleeping. waiting on pipe r: 59 (w: 60)
Creating pipe for solid-queue-dispatcher(1.1.5): dispatching every 1 seconds
Creating pipe for solid-queue-worker(1.1.5): waiting for jobs in first_queue
PID 38641 solid-queue-worker(1.1.5): waiting for jobs in first_queue sleeping. waiting on pipe r: 61 (w: 62)
Creating pipe for solid-queue-worker(1.1.5): waiting for jobs in second_queue
PID 38646 solid-queue-worker(1.1.5): waiting for jobs in second_queue sleeping. waiting on pipe r: 62 (w: 63)
Registered 38653 solid-queue-scheduler(1.1.5): scheduling periodic_store_result
PID 38637 solid-queue-dispatcher(1.1.5): dispatching every 1 seconds sleeping. waiting on pipe r: 62 (w: 63)
Creating pipe for solid-queue-scheduler(1.1.5): scheduling periodic_store_result
Sending TERM to w: 60 (r: 59) from solid-queue-supervisor(1.1.5): supervising 38637, 38641, 38646, 38653
Sending TERM to w: 63 (r: 62) from solid-queue-worker(1.1.5): waiting for jobs in second_queue
Sending TERM to w: 63 (r: 62) from solid-queue-dispatcher(1.1.5): dispatching every 1 seconds
Sending TERM to w: 62 (r: 61) from solid-queue-worker(1.1.5): waiting for jobs in first_queue
Creating pipe for solid-queue-scheduler(1.1.5): scheduling periodic_store_result
Sending TERM to w: 66 (r: 65) from solid-queue-scheduler(1.1.5): scheduling periodic_store_result
PID 38653 solid-queue-scheduler(1.1.5): scheduling periodic_store_result sleeping. waiting on pipe r: 63 (w: 64)
Killing 38653 The issue as I see it is that 6 pipes are created for 5 processes. I believe it is one being created by
and the other by
I assume this is cause the pipe is lazy evaluated def self_pipe
@self_pipe ||= create_self_pipe
end But it is a bit confusing cause all this is happening within the same process so you would not expect a race condition in setting that variable. |
Oh, thank you! These are helpful. Just to make sure I understand this completely, what are the IDs following
And from where are you logging this? Thanks! |
Those numbers are the file descriptors for the ends of the pipe. This is the full debugging code I added: main...elasticspoon:solid_queue:debug-test-failure |
Note: this blocking / nonblocking threaded / forked code is hella confusing to me so I might have gotten some stuff wrong here.
The Problem
When running
TARGET_DB=sqlite bin/rails test --seed 38423
I consistently had failures like:I believe these failures are not sqlite exclusive but are more noticeable there cause it was not running in docker.
The test:
SolidQueue.on_scheduler_start {...}
pid = run_supervisor_as_fork(...)
terminate_process(pid)
JobResults
from the callbacks were createdInstead of the Scheduler receiving the
TERM
signal, gracefully shutting down and running callbacks it would not do anything in response to the signal, hit the timeout, receive aKILL
signal and not run the callbacks.Race Condition
It seems like the issue was
self_pipe
being slow:One process would call
run
and while it was creating the pipe another process would callstop
.This is a race condition. Both processes end up creating their own pipes and when
interrupt
writes to the pipe the does not go anywhere cause the other process has a different pipe.If this race condition happens Process A will sleep until it gets killed.
Solution
I ended up fixing this by getting rid of the lazy initialization of the pipe. That way any process that is accessing the same scheduler will have the same pipe.
Alternatively maybe there should be a Mutex around the setter for
self_pipe
.