Skip to content

Adding 100ms minimum delay for recurring tasks #615

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

p-schlickmann
Copy link
Contributor

@p-schlickmann p-schlickmann commented Jul 25, 2025

Hey @rosa

While I was trying to reduce the flakiness of test/integration/instrumentation_test.rb, I uncovered a real issue:

On rare occasions the scheduler would recurse with a near-zero delay and enqueue 2-3 jobs in the same second, which 100% shouldn't happen. The output was something like:

SolidQueue::Job.all.map(&:scheduled_at)
# ⇒ [
#   2025-08-01T02:41:38.995992Z,
#   2025-08-01T02:41:38.998557Z,
#   2025-08-01T02:41:38.999440Z
# ]

# Note they are all scheduled when it's very close to the next whole-second.

Explanation:

def schedule(task)
  # 1) Create a scheduled task that will wait `delay_from_now` seconds and then run the block
  scheduled_task = Concurrent::ScheduledTask.new(task.delay_from_now, args: [ self, task, task.next_time ]) do |thread_schedule, thread_task, thread_task_run_at|
    # 2) Schedule the next occurrence, which calls #schedule again
    thread_schedule.schedule_task(thread_task)
  
    wrap_in_app_executor do
      # 3) Enqueues the run and creates SolidQueue::Job
      thread_task.enqueue(at: thread_task_run_at)
    end
  end
  
  scheduled_task.add_observer do |_, _, error|
    handle_thread_error(error) if error && !error.is_a?(Concurrent::CancelledOperationError)
  end
  
  # 4) If `delay_from_now` is very small, this will execute the block immediately
  scheduled_task.tap(&:execute)
end

So what happens is:

  • delay_from_now function returns something very close to 0, because the timestamp is 2025-08-01T02:41:38.995992Z
  • Step 4 with delay_from_now ≈ 0 executes the block right away
  • Step 2 inside that block calls #schedule_task, which invokes #schedule again (still zero delay), and so on, nesting 2-3 immediate runs before any enqueue happens.
  • Only after all that recursion, Step 3 enqueues the first job (and creates the DB record).

The fix

By clamping the delay to at least 100 ms - when the raw delay is just a few milliseconds before the next whole-second mark - that extra 0.1 s pushes the scheduled timestamp into the following second, guaranteeing Step 3 (the enqueue) has time to complete and prevents Step 2 from going into recursion.

It's kinda confusing 😅, but the risk should be very low, and I have confirmed the test no longer fails:

# Runs ok!
for i in {1..1000}; do bin/rails test test/integration/instrumentation_test.rb || break; done

@p-schlickmann p-schlickmann changed the title Attempt to reduce flakiness of test/integration/instrumentation_test.rb Adding 100ms minimum delay for recurring tasks Aug 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant