[AMORO-3752] Resolve data inconsistency issue during HA failover #3795
Why are the changes needed?
Close #3752.
Brief change log
When HA is enabled, the new primary node creates a ZooKeeper path named disposeCompletePath. During a primary-standby switchover, the old primary node deletes this path after it has released the AMS service, which notifies the new primary node that it can start the AMS service.
1. Detect previous primary node information:
Check the current primary node information stored in ZooKeeper
Determine if it matches the current node information
If it is the same node (e.g., during restart), no waiting is required
If it is a different node, wait for the previous primary node to complete resource release
2. Wait for the previous primary node to finish releasing resources:
Create a special ZooKeeper path, disposeCompletePath, to serve as the notification signal
Wait up to 30 seconds for the previous primary node to delete this path
If no signal is received before the timeout, proceed with primary node operations (so that an unresponsive previous node cannot block the system)
3. Send the completion signal once resources are released:
In the signalDisposeComplete() method, the primary node deletes disposeCompletePath on shutdown
This lets the new primary node detect that the previous primary node has finished releasing its resources
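The three steps above can be sketched roughly as follows. This is only an illustration, not the code in this PR. An in-memory set stands in for ZooKeeper. Every name except disposeCompletePath and signalDisposeComplete() is hypothetical, including the path value, the class name and the polling approach. Treating a missing primary record as "no wait needed" is also an assumption.

```java
import java.util.Objects;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: an in-memory set stands in for ZooKeeper.
final class HaFailoverSketch {
    // Hypothetical path value; the real path is not given in this description.
    static final String DISPOSE_COMPLETE_PATH = "/example/ams/dispose-complete";

    // Stand-in for the ZooKeeper namespace shared by the old and new primary.
    private final Set<String> paths = ConcurrentHashMap.newKeySet();

    /** Step 1: wait only when a different node was primary before. */
    static boolean mustWait(String storedPrimary, String currentNode) {
        // Assumption: with no recorded primary there is nothing to wait for.
        return storedPrimary != null && !Objects.equals(storedPrimary, currentNode);
    }

    /**
     * Step 2: create the signal path, then poll until it is deleted or the
     * timeout expires. Returns true if the signal arrived, false on timeout
     * or interruption; the caller starts the service in either case.
     */
    boolean awaitPreviousPrimary(long timeoutMillis, long pollMillis) {
        paths.add(DISPOSE_COMPLETE_PATH);
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (paths.contains(DISPOSE_COMPLETE_PATH)) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out: do not block on an unresponsive node
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return true;
    }

    /** True while the signal path is present (used by the old primary below). */
    boolean disposeCompletePathExists() {
        return paths.contains(DISPOSE_COMPLETE_PATH);
    }

    /** Step 3: called by the old primary after it has released its resources. */
    void signalDisposeComplete() {
        paths.remove(DISPOSE_COMPLETE_PATH); // no-op if the path is absent
    }
}
```

In this sketch, the new primary would call mustWait(...) first and, if it returns true, awaitPreviousPrimary(30_000, 100) to match the 30-second limit described above. A false return corresponds to the timeout case in step 2.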
Because the code cannot reliably distinguish a primary-standby switchover from a restart of Amoro, this PR has an unavoidable limitation:
when HA is enabled and the AMS service is restarted after running for some time, the newly elected primary node, if it differs from the primary before the restart, will wait 30 seconds before starting the AMS service.
How was this patch tested?
Add some test cases that check the changes thoroughly including negative and positive cases if possible
Add screenshots for manual tests if appropriate
Run test locally before making a pull request
Documentation