fix: race condition with interruptibles #566

elasticspoon · 2025-05-19T05:00:40Z

Note: this blocking / nonblocking threaded / forked code is hella confusing to me so I might have gotten some stuff wrong here.

The Problem

When running TARGET_DB=sqlite bin/rails test --seed 38423 I consistently had failures like:

Failure:
LifecycleHooksTest#test_run_lifecycle_hooks [test/integration/lifecycle_hooks_test.rb:90]:
Expected: 15
  Actual: 13

I believe these failures are not sqlite exclusive but are more noticeable there cause it was not running in docker.

The test:

sets a bunch of lifecycle hooks SolidQueue.on_scheduler_start {...}
runs the supervisor pid = run_supervisor_as_fork(...)
terminates the supervisor terminate_process(pid)
then checks that the JobResults from the callbacks were created

Instead of the Scheduler receiving the TERM signal, gracefully shutting down and running callbacks it would not do anything in response to the signal, hit the timeout, receive a KILL signal and not run the callbacks.

Race Condition

It seems like the issue was self_pipe being slow:

def self_pipe
  @self_pipe ||= create_self_pipe
end

def create_self_pipe
  reader, writer = IO.pipe
  { reader: reader, writer: writer }
end

One process would call run and while it was creating the pipe another process would call stop.

This is a race condition. Both processes end up creating their own pipes and when interrupt writes to the pipe the does not go anywhere cause the other process has a different pipe.

Process A: #run
Process A: #self_pipe
Process B: #stop
Process B: #self_pipe
Process B: #self_pipe => {reader: #<IO:fd 65>, writer: #<IO:fd 66>}
Process B: #interrupt => #<IO:fd 66>
Process A: #self_pipe => {reader: #<IO:fd 63>, writer: #<IO:fd 64>}
Process B: done => killing process
Process A: #interruptible_sleep => #<IO:fd 63>

If this race condition happens Process A will sleep until it gets killed.

Solution

I ended up fixing this by getting rid of the lazy initialization of the pipe. That way any process that is accessing the same scheduler will have the same pipe.

Alternatively maybe there should be a Mutex around the setter for self_pipe.

rosa · 2025-06-13T14:31:25Z

Hey @elasticspoon, thanks for looking into this! I haven't been able to reproduce this with

TARGET_DB=sqlite bin/rails test --seed 38423

unfortunately. However, I'm not sure I get what's happening... 🤔 There's a single scheduler here being configured, not two, and the scheduler runs in a separate process, not a thread. The signal should be handled by the main thread in that process.

elasticspoon · 2025-06-17T21:48:43Z

Hey @elasticspoon, thanks for looking into this! I haven't been able to reproduce this with
TARGET_DB=sqlite bin/rails test --seed 38423
unfortunately. However, I'm not sure I get what's happening... 🤔 There's a single scheduler here being configured, not two, and the scheduler runs in a separate process, not a thread. The signal should be handled by the main thread in that process.

Whoops. You are right @rosa. I got my concurrency primitives mixed up. I fixed up the PR descriptions but just replace Thread in what I was saying with Process.

Regarding the reproduction bit, the laptop is pretty crappy so thats probably why it happens pretty consistently for me (similar to CI). But here are maybe some more descriptive logs of what happens during a failure:

Registered 38637 solid-queue-dispatcher(1.1.5): dispatching every 1 seconds
Registered 38641 solid-queue-worker(1.1.5): waiting for jobs in first_queue
Creating pipe for solid-queue-supervisor(1.1.5): supervising 38637, 38641, 38646, 38653
Registered 38646 solid-queue-worker(1.1.5): waiting for jobs in second_queue
PID 38631 solid-queue-supervisor(1.1.5): supervising 38637, 38641, 38646, 38653 sleeping. waiting on pipe r: 59 (w: 60)
Creating pipe for solid-queue-dispatcher(1.1.5): dispatching every 1 seconds
Creating pipe for solid-queue-worker(1.1.5): waiting for jobs in first_queue
PID 38641 solid-queue-worker(1.1.5): waiting for jobs in first_queue sleeping. waiting on pipe r: 61 (w: 62)
Creating pipe for solid-queue-worker(1.1.5): waiting for jobs in second_queue
PID 38646 solid-queue-worker(1.1.5): waiting for jobs in second_queue sleeping. waiting on pipe r: 62 (w: 63)
Registered 38653 solid-queue-scheduler(1.1.5): scheduling periodic_store_result
PID 38637 solid-queue-dispatcher(1.1.5): dispatching every 1 seconds sleeping. waiting on pipe r: 62 (w: 63)
Creating pipe for solid-queue-scheduler(1.1.5): scheduling periodic_store_result
Sending TERM to w: 60 (r: 59) from solid-queue-supervisor(1.1.5): supervising 38637, 38641, 38646, 38653
Sending TERM to w: 63 (r: 62) from solid-queue-worker(1.1.5): waiting for jobs in second_queue
Sending TERM to w: 63 (r: 62) from solid-queue-dispatcher(1.1.5): dispatching every 1 seconds
Sending TERM to w: 62 (r: 61) from solid-queue-worker(1.1.5): waiting for jobs in first_queue
Creating pipe for solid-queue-scheduler(1.1.5): scheduling periodic_store_result
Sending TERM to w: 66 (r: 65) from solid-queue-scheduler(1.1.5): scheduling periodic_store_result
PID 38653 solid-queue-scheduler(1.1.5): scheduling periodic_store_result sleeping. waiting on pipe r: 63 (w: 64)
Killing 38653

The issue as I see it is that 6 pipes are created for 5 processes. I believe it is one being created by

Creating pipe for solid-queue-scheduler(1.1.5): scheduling periodic_store_result
Sending TERM to w: 66 (r: 65) from solid-queue-scheduler(1.1.5): scheduling periodic_store_result

and the other by

Creating pipe for solid-queue-scheduler(1.1.5): scheduling periodic_store_result
PID 38653 solid-queue-scheduler(1.1.5): scheduling periodic_store_result sleeping. waiting on pipe r: 63 (w: 64)

I assume this is cause the pipe is lazy evaluated

def self_pipe
  @self_pipe ||= create_self_pipe
end

But it is a bit confusing cause all this is happening within the same process so you would not expect a race condition in setting that variable.

rosa · 2025-06-21T09:19:44Z

Oh, thank you! These are helpful. Just to make sure I understand this completely, what are the IDs following w and r in your logs? These:

Sending TERM to w: 60 (r: 59)

And from where are you logging this?

Thanks!

elasticspoon · 2025-06-22T16:35:19Z

Oh, thank you! These are helpful. Just to make sure I understand this completely, what are the IDs following w and r in your logs? These:
Sending TERM to w: 60 (r: 59)
And from where are you logging this?

Thanks!

Those numbers are the file descriptors for the ends of the pipe. This is the full debugging code I added: main...elasticspoon:solid_queue:debug-test-failure

fix: race condition with interruptibles

7f1c439

elasticspoon marked this pull request as ready for review May 19, 2025 05:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: race condition with interruptibles #566

fix: race condition with interruptibles #566

elasticspoon commented May 19, 2025 •

edited

Loading

Uh oh!

rosa commented Jun 13, 2025

Uh oh!

elasticspoon commented Jun 17, 2025

Uh oh!

rosa commented Jun 21, 2025

Uh oh!

elasticspoon commented Jun 22, 2025

Uh oh!

Uh oh!

fix: race condition with interruptibles #566

Are you sure you want to change the base?

fix: race condition with interruptibles #566

Conversation

elasticspoon commented May 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

The Problem

Race Condition

Solution

Uh oh!

rosa commented Jun 13, 2025

Uh oh!

elasticspoon commented Jun 17, 2025

Uh oh!

rosa commented Jun 21, 2025

Uh oh!

elasticspoon commented Jun 22, 2025

Uh oh!

Uh oh!

elasticspoon commented May 19, 2025 •

edited

Loading