
When the worker dies, the concurrency limit is not reset #546


Open
OlegChuev opened this issue Apr 4, 2025 · 3 comments · May be fixed by #547

OlegChuev commented Apr 4, 2025

I have a job that can run for an extended period — let’s call it InfiniteSleepJob:

class InfiniteSleepJob < ApplicationJob
  # Limit to one simultaneous execution per key
  limits_concurrency to: 1, key: ->(key, *_args) { key }, group: "my_group", duration: 1.day

  def perform(*args)
    # do something
  end
end

When I add a few jobs to the queue and the worker unexpectedly dies during execution (e.g., due to an HA event, or if I forcibly kill the Docker container with docker kill <container>), the concurrency limit gets stuck, even after the job is pruned. The block is released only when it expires.

Here’s an example of a job that was pruned, but the concurrency limit wasn’t removed:
[Screenshots: the pruned job record and the concurrency limit entry that remains]
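
For reference, here is roughly how I inspected the lingering lock from a Rails console. This is a rough sketch on my side: it assumes the lock is stored in Solid Queue's SolidQueue::Semaphore model (the solid_queue_semaphores table) with key, value and expires_at columns, which may differ in your version.

SolidQueue::Semaphore.where("expires_at > ?", Time.current).each do |semaphore|
  # Each row is a concurrency lock that is still held; expires_at is when it
  # will be released automatically (roughly acquisition time + duration).
  puts "#{semaphore.key} -> value: #{semaphore.value}, expires_at: #{semaphore.expires_at}"
end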

Here’s my queue.yml:

default: &default
  dispatchers:
    - polling_interval: 1
      batch_size: 500
      concurrency_maintenance_interval: 120
  workers:
    - queues: "default"
      threads: 3
      processes: 1
      polling_interval: 1

I’d expect the lock to be released as soon as the job fails, allowing the blocked jobs to proceed.

Could you kindly clarify whether this is the expected behavior? Am I missing something?

rosa (Member) commented Apr 4, 2025

Hey @OlegChuev, yeah, it's expected behaviour if the worker dies unceremoniously (e.g. a kill -9 signal): it doesn't get a chance to clean up after itself, and that cleanup includes releasing the concurrency limit. The duration parameter determines how long the system will wait before unblocking jobs that weren't successfully unblocked for any reason. So if your long-running job starts at, say, 17:00, another one gets enqueued and blocked at 18:00, and the first one dies at 20:00, the blocked job won't be considered for unblocking until 17:00 the following day (the 1.day duration).
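
As a rough interim mitigation (just a sketch, not a proper fix), you could pick a duration closer to the job's realistic worst-case runtime, so a lock left behind by a killed worker expires much sooner than a day:

class InfiniteSleepJob < ApplicationJob
  # Same setup as above, but with a shorter duration (2 hours is an arbitrary
  # example): a lock orphaned by a killed worker now blocks other jobs for at
  # most ~2 hours instead of a full day.
  limits_concurrency to: 1, key: ->(key, *_args) { key }, group: "my_group", duration: 2.hours

  def perform(*args)
    # long-running work
  end
end

The trade-off is that a legitimately long run that outlives the duration would let a blocked job start alongside it, so this only helps if you can bound the job's runtime.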

rosa (Member) commented Apr 4, 2025

Although I think I could improve this case when releasing claimed jobs whose worker got killed... 🤔 I'll work on this next week.

nerisa commented Apr 10, 2025

Hello @rosa, I've attempted a fix for this issue in PR #547.
