feat(scheduler): liveness check to detect deadlocks #6908

domsolutions · 2025-10-30T10:42:16Z

Motivation

Customer logs show that the scheduler sometimes appears to hit a deadlock and stops responding to gRPC requests. Customers then have to restart the scheduler manually to mitigate this.

This PR introduces a liveness check that verifies there are no deadlocks in any of the critical services. It attempts to acquire each lock and then release it. If a lock cannot be acquired, the attempt blocks, the liveness check times out and is marked as failed, and the scheduler is eventually restarted.

Summary of changes

Increased the liveness period to 20 seconds to avoid delaying the processing of control plane events.
Set a liveness timeout of 10 seconds to give a generous margin when the scheduler is busy processing events.
Added heartbeats on the agent gRPC server, the data-flow-engine gRPC server, the scheduler server, and the experiment service. Each heartbeat acquires and immediately releases the locks that could cause blocking behaviour.

Also introduced a new Makefile target, kind-install-scheduler, because testing changes with Ansible was taking a long time. The new target builds the scheduler Docker image, tags it with the tag currently deployed in Kind, and then restarts the scheduler. Additionally, to speed up the build, I removed the step that runs the tests before building. IMHO this isn't needed and shouldn't be part of the build process, as this is what the pipeline is for.

To build and deploy the scheduler, you can now run:

make -C scheduler kind-install-scheduler

We should probably stop running the tests in the other Docker image builds as well, to speed up builds and deployments and give quicker testing feedback.

Checklist

Added/updated unit tests
Added/updated documentation
Checked for typos in variable names, comments, etc.
Added licences for new files

Testing

domsolutions added 2 commits October 30, 2025 09:38

wip

2b97bff

liveness check to detect deadlocks

faaca1c

domsolutions requested a review from lc525 as a code owner October 30, 2025 10:42

copyright

0959fbc
