Skip to content

fix: race condition with interruptibles #566

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

elasticspoon
Copy link

@elasticspoon elasticspoon commented May 19, 2025

Note: this blocking / nonblocking threaded / forked code is hella confusing to me so I might have gotten some stuff wrong here.

The Problem

When running TARGET_DB=sqlite bin/rails test --seed 38423 I consistently had failures like:

Failure:
LifecycleHooksTest#test_run_lifecycle_hooks [test/integration/lifecycle_hooks_test.rb:90]:
Expected: 15
  Actual: 13

I believe these failures are not sqlite exclusive but are more noticeable there cause it was not running in docker.

The test:

  • sets a bunch of lifecycle hooks SolidQueue.on_scheduler_start {...}
  • runs the supervisor pid = run_supervisor_as_fork(...)
  • terminates the supervisor terminate_process(pid)
  • then checks that the JobResults from the callbacks were created

Instead of the Scheduler receiving the TERM signal, gracefully shutting down and running callbacks it would not do anything in response to the signal, hit the timeout, receive a KILL signal and not run the callbacks.

Race Condition

It seems like the issue was self_pipe being slow:

def self_pipe
  @self_pipe ||= create_self_pipe
end

def create_self_pipe
  reader, writer = IO.pipe
  { reader: reader, writer: writer }
end

One process would call run and while it was creating the pipe another process would call stop.

This is a race condition. Both processes end up creating their own pipes and when interrupt writes to the pipe the does not go anywhere cause the other process has a different pipe.

Process A: #run
Process A: #self_pipe
Process B: #stop
Process B: #self_pipe
Process B: #self_pipe => {reader: #<IO:fd 65>, writer: #<IO:fd 66>}
Process B: #interrupt => #<IO:fd 66>
Process A: #self_pipe => {reader: #<IO:fd 63>, writer: #<IO:fd 64>}
Process B: done => killing process
Process A: #interruptible_sleep => #<IO:fd 63>

If this race condition happens Process A will sleep until it gets killed.

Solution

I ended up fixing this by getting rid of the lazy initialization of the pipe. That way any process that is accessing the same scheduler will have the same pipe.

Alternatively maybe there should be a Mutex around the setter for self_pipe.

@elasticspoon elasticspoon marked this pull request as ready for review May 19, 2025 05:05
@rosa
Copy link
Member

rosa commented Jun 13, 2025

Hey @elasticspoon, thanks for looking into this! I haven't been able to reproduce this with

TARGET_DB=sqlite bin/rails test --seed 38423

unfortunately. However, I'm not sure I get what's happening... 🤔 There's a single scheduler here being configured, not two, and the scheduler runs in a separate process, not a thread. The signal should be handled by the main thread in that process.

@elasticspoon
Copy link
Author

Hey @elasticspoon, thanks for looking into this! I haven't been able to reproduce this with

TARGET_DB=sqlite bin/rails test --seed 38423

unfortunately. However, I'm not sure I get what's happening... 🤔 There's a single scheduler here being configured, not two, and the scheduler runs in a separate process, not a thread. The signal should be handled by the main thread in that process.

Whoops. You are right @rosa. I got my concurrency primitives mixed up. I fixed up the PR descriptions but just replace Thread in what I was saying with Process.

Regarding the reproduction bit, the laptop is pretty crappy so thats probably why it happens pretty consistently for me (similar to CI). But here are maybe some more descriptive logs of what happens during a failure:

Registered 38637 solid-queue-dispatcher(1.1.5): dispatching every 1 seconds
Registered 38641 solid-queue-worker(1.1.5): waiting for jobs in first_queue
Creating pipe for solid-queue-supervisor(1.1.5): supervising 38637, 38641, 38646, 38653
Registered 38646 solid-queue-worker(1.1.5): waiting for jobs in second_queue
PID 38631 solid-queue-supervisor(1.1.5): supervising 38637, 38641, 38646, 38653 sleeping. waiting on pipe r: 59 (w: 60)
Creating pipe for solid-queue-dispatcher(1.1.5): dispatching every 1 seconds
Creating pipe for solid-queue-worker(1.1.5): waiting for jobs in first_queue
PID 38641 solid-queue-worker(1.1.5): waiting for jobs in first_queue sleeping. waiting on pipe r: 61 (w: 62)
Creating pipe for solid-queue-worker(1.1.5): waiting for jobs in second_queue
PID 38646 solid-queue-worker(1.1.5): waiting for jobs in second_queue sleeping. waiting on pipe r: 62 (w: 63)
Registered 38653 solid-queue-scheduler(1.1.5): scheduling periodic_store_result
PID 38637 solid-queue-dispatcher(1.1.5): dispatching every 1 seconds sleeping. waiting on pipe r: 62 (w: 63)
Creating pipe for solid-queue-scheduler(1.1.5): scheduling periodic_store_result
Sending TERM to w: 60 (r: 59) from solid-queue-supervisor(1.1.5): supervising 38637, 38641, 38646, 38653
Sending TERM to w: 63 (r: 62) from solid-queue-worker(1.1.5): waiting for jobs in second_queue
Sending TERM to w: 63 (r: 62) from solid-queue-dispatcher(1.1.5): dispatching every 1 seconds
Sending TERM to w: 62 (r: 61) from solid-queue-worker(1.1.5): waiting for jobs in first_queue
Creating pipe for solid-queue-scheduler(1.1.5): scheduling periodic_store_result
Sending TERM to w: 66 (r: 65) from solid-queue-scheduler(1.1.5): scheduling periodic_store_result
PID 38653 solid-queue-scheduler(1.1.5): scheduling periodic_store_result sleeping. waiting on pipe r: 63 (w: 64)
Killing 38653

The issue as I see it is that 6 pipes are created for 5 processes. I believe it is one being created by

Creating pipe for solid-queue-scheduler(1.1.5): scheduling periodic_store_result
Sending TERM to w: 66 (r: 65) from solid-queue-scheduler(1.1.5): scheduling periodic_store_result

and the other by

Creating pipe for solid-queue-scheduler(1.1.5): scheduling periodic_store_result
PID 38653 solid-queue-scheduler(1.1.5): scheduling periodic_store_result sleeping. waiting on pipe r: 63 (w: 64)

I assume this is cause the pipe is lazy evaluated

def self_pipe
  @self_pipe ||= create_self_pipe
end

But it is a bit confusing cause all this is happening within the same process so you would not expect a race condition in setting that variable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants