-
Notifications
You must be signed in to change notification settings - Fork 4.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] Invalidate session check if associated session is deleted #22227
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👋 @srahul3, Thank you for working on this.
I left a comment about how this would work in the context of snapshot restoration. My understanding is that the session deletion don't create any tombstone, so we won't have any trace of it in a snapshot, so when a snapshot is applied the health-check associated with the session will have the wrong state.
A test that could be done to confirm that:
- Single server environment (not in dev mode)
- Create a session
- Create a health-check of type session pointing to the session
- Destroy the session
- Observe the health-check state (critical)
- wait for some time until a raft snapshot is performed
- restart the server
- observe the health-check state (should be still critical)
I guess this can be mitigated by adding logic in the health-check creation code to set the health-check state based on the session lookup but you need to be careful about order in that case (what if the snapshot replay happen to have the session creation after the health check?).
That bring me to my second question, how would we mark the health-check back to healthy? Are you planning to add that in the session creation code path?
agent/consul/state/session.go
Outdated
@@ -448,5 +449,28 @@ func (s *Store) deleteSessionTxn(tx WriteTxn, idx uint64, sessionID string, entM | |||
} | |||
} | |||
|
|||
// session invalidating the health-checks | |||
return s.markSessionCheckCritical(tx, idx, session) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this works properly, specifically in the context of snapshot restoration. As in that case, the session would be gone and we don't have any trace in the snapshot to replay the session deletion and mark the health-check critical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New updates:
@dhiaayachi I have made some changes to the code as suggested. This code even works with snapshot.
- At the very beginning, the ESM will register the session check in critical state.
- The session registration will discount the
session
check even if it is critical hence allowing the registration. - On successful session registration, the session health check will be marked healthy
- On deletion of session, the session health check will be marked critical
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@srahul3 This look great! 🚀
Background
The ESM is also requested to go agentless, being agentless it becomes challenging to keep track of the health of the artificial node.
Proposal
This pull request introduces a new feature to support Consul Session to update the state of the health-checks. The constraint is, the health-check needs to be of type
session
and bothsession
and thehealth-check
shall belong to common parent node.The idea is to use the sessions to keep track of health status. During partition of ESM crash, the session is deleted and hence, the changes in this PR enables to make the associated
session
checks to be critical. On thus event, the other ESM instance picks up the leadership and re-distribute the responsibility among remaining health ESM instances.There is need for a regular session to be created with name
<session-name>
There is new check which is proposed of type
Type
="session"
andDefinition.SessionName
=<session-name>
.** Code sequence **
session
. This session health-check also has a field which suggest the session name. If this check's session name matches with this session name, the check's name is marked critical.Testing
Unit test added
PR Checklist