
Conversation

@adamruzicka (Contributor) commented Nov 12, 2025

If an exception was raised when trying to acquire the lock, the process would end up in a half-dead state where it would stay "up", but would not attempt to acquire the lock again when the connection came back up.
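
Below is a minimal sketch of the failure mode and the intended behaviour, using a plain redis-rb client rather than Dynflow's real coordinator; the key name, TTL, retry interval and error class here are illustrative assumptions, not the actual implementation:

```ruby
require 'redis' # assumption: a plain redis-rb client stands in for the real coordinator

LOCK_KEY = 'orchestrator-lock' # hypothetical key name
LOCK_TTL = 60                  # seconds, matching the "up to 1 minute" expiry mentioned below

# Try to take the lock exclusively, with an expiry, so a dead holder
# eventually frees it. Returns true only if we became the holder.
def try_acquire(redis, id)
  redis.set(LOCK_KEY, id, nx: true, ex: LOCK_TTL)
end

# Keep polling for the lock. The important part is the rescue: without it,
# a single connection error escapes the loop and the process ends up "up"
# but never tries to acquire the lock again.
def acquisition_loop(redis, id)
  loop do
    begin
      return if try_acquire(redis, id) # we are now the active orchestrator
    rescue Redis::BaseConnectionError => e
      warn "lock acquisition failed (#{e.class}: #{e.message}), will retry"
    end
    sleep 5
  end
end

# Example: block until we hold the lock on a local Redis.
acquisition_loop(Redis.new, 'orchestrator-1')
```

Without the rescue, the sketch ends up in exactly the half-dead state described above: the process stays up but never re-enters the race for the lock.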

How to test this?

Setup

  1. Run postgres
     podman run -p 5432:5432 -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=dynflow postgres
  2. Run redis
     podman run -p 6379:6379 redis:6
  3. Run orchestrator 1
     export DB_CONN_STRING=postgresql://postgres:postgres@localhost:5432/dynflow
     bundle exec sidekiq -r ./examples/remote_executor.rb -q dynflow_orchestrator
  4. Run orchestrator 2
     export DB_CONN_STRING=postgresql://postgres:postgres@localhost:5432/dynflow
     bundle exec sidekiq -r ./examples/remote_executor.rb -q dynflow_orchestrator
  5. Run a worker
     export DB_CONN_STRING=postgresql://postgres:postgres@localhost:5432/dynflow
     bundle exec sidekiq -r ./examples/remote_executor.rb -q default
  6. Start spawning things
     export DB_CONN_STRING=postgresql://postgres:postgres@localhost:5432/dynflow
     bundle exec ruby ./examples/remote_executor.rb client

The actual test

  1. Stop the redis container
  2. Observe how everything starts spewing errors
  3. Stop orchestrator 1
  4. Start the redis container again

Expected results:
Things may appear stuck until orchestrator 1's lock expires (up to 1 minute); after that, orchestrator 2 should kick in.
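
To watch the handover during the test, something like the following can run alongside the client; it just polls the lock key (the key name is an assumption carried over from the sketch above) and prints the current holder and remaining TTL:

```ruby
require 'redis'

# The key name is an assumption; check the actual keys in your Redis
# instance (e.g. via SCAN) if it differs.
redis = Redis.new(url: ENV.fetch('REDIS_URL', 'redis://localhost:6379'))

loop do
  begin
    holder = redis.get('orchestrator-lock')
    ttl    = redis.ttl('orchestrator-lock') # -2: key missing, -1: no expiry set
    puts "#{Time.now.strftime('%H:%M:%S')} holder=#{holder.inspect} ttl=#{ttl}s"
  rescue Redis::BaseConnectionError => e
    puts "redis unreachable (#{e.message})"
  end
  sleep 5
end
```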

Includes #462

@adamruzicka force-pushed the redis-conn-drop branch 5 times, most recently from feb876b to ee9b565 on November 20, 2025 at 12:17
@adamruzicka force-pushed the redis-conn-drop branch 4 times, most recently from c5e9e01 to 8859c92 on November 26, 2025 at 15:09
@adamruzicka (Contributor, Author) commented:

not ok 8 active orchestrator can survive a longer redis connection drop
...
# o1: [2025-11-26 15:29:35.743 #6666] FATAL -- dynflow: The orchestrator lock was stolen by 24ade3e2-60a6-41e8-acbf-076998697081, aborting.

waiiit, what? In a test where there's only one orchestrator running?

@adamruzicka (Contributor, Author) commented:

Guess a SIGTERM wasn't enough

@ofedoren left a comment:

Thanks, @adamruzicka, LGTM. Just one probably unrelated question, but since we're in redis/sidekiq world...

I've just noticed that dynflow uses gitlab-sidekiq-fetcher, which seems to no longer be maintained: https://gitlab.com/gitlab-org/ruby/gems/sidekiq-reliable-fetch#this-gem-is-no-longer-updated. Is it something that affects us aside from depending on a gem that is no longer maintained? Note that this gem locks the used version of sidekiq to ~> 6.1, so it might need some refactoring if we end up moving to the latest sidekiq/redis versions, since we don't seem to have any locks on those ourselves.
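
For context, the pessimistic version constraint mentioned above works like this (a hypothetical Gemfile excerpt, not Dynflow's actual one):

```ruby
# Hypothetical Gemfile excerpt -- the effective constraint actually comes from
# gitlab-sidekiq-fetcher's own gemspec, not from anything Dynflow declares.
source 'https://rubygems.org'

gem 'gitlab-sidekiq-fetcher'  # its gemspec pins sidekiq to ~> 6.1
gem 'sidekiq', '~> 6.1'       # "~> 6.1" means >= 6.1 and < 7.0, so sidekiq 7 is blocked
```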

@adamruzicka (Contributor, Author) commented:

> Is it something that affects us aside from depending on a gem that is no longer maintained?

It is holding us back from moving to newer sidekiq, but nothing else apart from that. I have some ideas about how we could get out of this, but resolving it isn't an immediate priority right now.

@adamruzicka merged commit b7507cd into Dynflow:master on Dec 3, 2025 (11 checks passed)
@adamruzicka deleted the redis-conn-drop branch on December 3, 2025 at 10:31
@adamruzicka (Contributor, Author) commented:

Thank you, @ofedoren!
