Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

okcomputer: Sidekiq worker count check #2216

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

jmartin-sul
Copy link
Member

@jmartin-sul jmartin-sul commented Mar 31, 2023

Why was this change made? 🤔

fail the health check if a worker box has an unexpected number of sidekiq worker processes or threads

How was this change tested? 🤨

after rebasing on the latest main (to pick up the upgrade to sidekiq 7), i deployed this branch to QA, and tested the following scenarios:

web host (doesn't have workers, so doesn't run this new check)

Screen Shot 2023-03-31 at 12 35 38 PM

worker VM

normal operation (exactly the number of expected management processes and worker threads)

Screen Shot 2023-03-31 at 12 35 51 PM

some but not all worker management processes running

Screen Shot 2023-03-31 at 12 39 52 PM

no worker management processes running

Screen Shot 2023-03-31 at 12 40 28 PM

more than expected number of worker management processes running

this might happen when a long running job spans a deployment, though in this case i faked it by starting an extra process manually on the VM. happy to kick off a long running CV job and re-deploy for a more realistic test. i'd just wanted to avoid messing with cocina-models release testing that was in progress at the time, since infra integration tests are very sensitive to delays.

Screen Shot 2023-03-31 at 12 44 00 PM

⚡ ⚠ If this change has cross service impact, or if it changes code used internally for cloud replication, run integration test preassembly_image_accessioning_spec.rb against stage as it tests preservation, and/or test in stage environment, in addition to specs. The main classes relevant to replication are ZipmakerJob, DeliveryDispatcherJob, *DeliveryJob, ResultsRecorderJob, and DruidVersionZip; see here for overview diagram of replication pipeline.⚡

@jmartin-sul jmartin-sul force-pushed the sidekiq-worker-count-check branch from 8b55361 to b9779bf Compare March 31, 2023 18:51
@jmartin-sul jmartin-sul changed the title [HOLD] okcomputer: Sidekiq worker count check okcomputer: Sidekiq worker count check Mar 31, 2023
@jmartin-sul jmartin-sul marked this pull request as ready for review March 31, 2023 23:32
@jmartin-sul jmartin-sul force-pushed the sidekiq-worker-count-check branch from b9779bf to c49af28 Compare June 20, 2023 23:21
@jmartin-sul
Copy link
Member Author

rebased on latest main, and deployed the rebased branch to stage for some quick testing to make sure all was still working (happy path works, check fails if i manually kill some sidekiq processes).

Copy link
Contributor

@justinlittman justinlittman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why you're checking thread counts (instead of just process counts)?

@jmartin-sul
Copy link
Member Author

Curious why you're checking thread counts (instead of just process counts)?

some combination of... it was pretty easy to get at that info (both the actual and expected thread counts), it seemed potentially useful at a glance for infra FR and ops folks, and we've seen occasional sidekiq worker issues but not enough to have a great idea of what the most common misbehavior modes are.

so i figured (a little more) info == better for triaging alerts. i also was thinking that the sidekiq API is pretty stable and pretty well documented, and this is unlikely to need changes in the near future, so even though thread counts adds a bit of complexity over just doing process counts, it didn't seem like complexity we'd have to pay much for going forward? and then if e.g. sidekiq 8 or 9 breaks this, we can simplify if fixing doesn't seem worth the effort?

@justinlittman
Copy link
Contributor

My concern is that this is just making work for our future selves to troubleshoot. Do we have any indication that there is any value in monitoring this or that Sidekiq does not repair any damaged threads? Do we know this won't set off false alarms? How will our future selves know why we were monitoring threads in the first place if it breaks?

Might be worth getting some additional opinions on this.

@jmartin-sul
Copy link
Member Author

backburnered this since it was reviewed. haven't seen any issues with sidekiq processes running the wrong number of worker threads compared to what's configured in the year or so this change has hung out. we did run into an issue in hydra_etd that was along the lines of what justin worried about above. after discussing the hydra_etd issue in slack with aaron and laura, i was convinced to simplify the hydra_etd check to just look for expected process count (and to give a bit of helpful guidance about when to worry if there's an alert for too many workers).

next unscheduled week, i'll circle back to this PR and simplify it in the same way as https://github.com/sul-dlss/hydra_etd/pull/1689

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants