feat(scheduler): liveness check to detect deadlocks #6908
Motivation
We have observed in customers' logs that the scheduler sometimes appears to hit a deadlock and stops responding to gRPC requests. Customers then have to restart the scheduler manually to mitigate this.
This PR introduces a liveness check, which verifies there are no deadlocks in any of the critical services. It attempts to acquire their locks and release them. If a lock cannot be acquired, the attempt blocks, the liveness check times out and is marked as failed, and this eventually causes a restart.
Summary of changes
- `20 seconds` to avoid causing delays in processing control plane events
- `10 seconds` to give a generous amount of time if the scheduler is busy processing events
- `agent grpc server`
- `data-flow-engine grpc server`
- `scheduler server`
- `experiment svc`

which will acquire and immediately release the locks that could cause blocking behaviour.

Also introduced a new `Makefile` target, `kind-install-scheduler`, as it was taking a long time to test changes using Ansible. This new target builds the scheduler docker image, tags it with the same tag that is currently deployed in Kind, and then restarts the scheduler. Additionally, to speed up the build, I removed the target that ran the tests prior to building. IMHO this isn't needed and shouldn't be part of the build process, as this is what the pipeline is for.

The scheduler can now be built and deployed by running the `kind-install-scheduler` target.
We should probably change the builds of all the other docker images to skip the tests as well, to speed up builds and deployments and give quicker testing feedback.
Checklist
Testing