@domsolutions domsolutions commented Oct 30, 2025

Motivation

We have observed from customer logs that the scheduler sometimes appears to hit a deadlock and stops responding to gRPC requests. Customers then have to restart the scheduler manually to mitigate this.

This PR introduces a liveness check that verifies there are no deadlocks in any of the critical services. The check attempts to acquire each lock and then release it. If a lock cannot be acquired, the attempt blocks, the liveness check times out and is marked as failed, and repeated failures eventually cause a restart.

Summary of changes

  • increased the liveness period to 20 seconds to avoid delaying the processing of control plane events
  • set a liveness timeout of 10 seconds, which leaves a generous margin when the scheduler is busy processing events
  • added heartbeats on the agent gRPC server, the data-flow-engine gRPC server, the scheduler server, and the experiment service; each heartbeat acquires and immediately releases the locks that could cause blocking behaviour

Also introduced a new Makefile target, kind-install-scheduler, because testing changes through Ansible was taking a long time. The new target builds the scheduler Docker image, tags it with the same tag that is currently deployed in Kind, and then restarts the scheduler. To speed up the build further, I removed the step that runs the tests before building. In my opinion this isn't needed and shouldn't be part of the build process, as that is what the pipeline is for.

To build and deploy the scheduler, you can now run:

make -C scheduler kind-install-scheduler

We should probably stop running the tests in the builds of all the other Docker images as well, to speed up builds and deployments and get quicker testing feedback.

Checklist

  • Added/updated unit tests
  • Added/updated documentation
  • Checked for typos in variable names, comments, etc.
  • Added licences for new files

Testing

@domsolutions domsolutions requested a review from lc525 as a code owner October 30, 2025 10:42