Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Invalidate session check if associated session is deleted #22227

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

srahul3
Copy link

@srahul3 srahul3 commented Mar 17, 2025

Background

The ESM is also requested to go agentless, being agentless it becomes challenging to keep track of the health of the artificial node.

Proposal

This pull request introduces a new feature to support Consul Session to update the state of the health-checks. The constraint is, the health-check needs to be of type session and both session and the health-check shall belong to common parent node.

The idea is to use the sessions to keep track of health status. During partition of ESM crash, the session is deleted and hence, the changes in this PR enables to make the associated session checks to be critical. On thus event, the other ESM instance picks up the leadership and re-distribute the responsibility among remaining health ESM instances.

There is need for a regular session to be created with name <session-name>
There is new check which is proposed of type Type="session" and Definition.SessionName=<session-name>.

** Code sequence **

  1. Session invalidates / ttl timer to no reset
  2. Session is destroyed
  3. The destruction process also checks if the session node has as special check of type session. This session health-check also has a field which suggest the session name. If this check's session name matches with this session name, the check's name is marked critical.

Testing

Unit test added

PR Checklist

  • updated test coverage
  • external facing docs updated
  • appropriate backport labels added
  • not a security concern

@srahul3 srahul3 requested a review from a team as a code owner March 17, 2025 07:22
@github-actions github-actions bot added theme/api Relating to the HTTP API interface theme/health-checks Health Check functionality labels Mar 17, 2025
@srahul3 srahul3 added pr/no-changelog PR does not need a corresponding .changelog entry backport/all Apply backports for all active releases per .release/versions.hcl labels Mar 17, 2025
Copy link
Collaborator

@dhiaayachi dhiaayachi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋 @srahul3, Thank you for working on this.
I left a comment about how this would work in the context of snapshot restoration. My understanding is that the session deletion don't create any tombstone, so we won't have any trace of it in a snapshot, so when a snapshot is applied the health-check associated with the session will have the wrong state.
A test that could be done to confirm that:

  • Single server environment (not in dev mode)
  • Create a session
  • Create a health-check of type session pointing to the session
  • Destroy the session
  • Observe the health-check state (critical)
  • wait for some time until a raft snapshot is performed
  • restart the server
  • observe the health-check state (should be still critical)

I guess this can be mitigated by adding logic in the health-check creation code to set the health-check state based on the session lookup but you need to be careful about order in that case (what if the snapshot replay happen to have the session creation after the health check?).

That bring me to my second question, how would we mark the health-check back to healthy? Are you planning to add that in the session creation code path?

@@ -448,5 +449,28 @@ func (s *Store) deleteSessionTxn(tx WriteTxn, idx uint64, sessionID string, entM
}
}

// session invalidating the health-checks
return s.markSessionCheckCritical(tx, idx, session)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this works properly, specifically in the context of snapshot restoration. As in that case, the session would be gone and we don't have any trace in the snapshot to replay the session deletion and mark the health-check critical.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New updates:

@dhiaayachi I have made some changes to the code as suggested. This code even works with snapshot.

  1. At the very beginning, the ESM will register the session check in critical state.
  2. The session registration will discount the session check even if it is critical hence allowing the registration.
  3. On successful session registration, the session health check will be marked healthy
  4. On deletion of session, the session health check will be marked critical

@srahul3 srahul3 requested a review from dhiaayachi March 18, 2025 13:47
Copy link
Collaborator

@dhiaayachi dhiaayachi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srahul3 This look great! 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/all Apply backports for all active releases per .release/versions.hcl pr/no-changelog PR does not need a corresponding .changelog entry theme/api Relating to the HTTP API interface theme/health-checks Health Check functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants