Contributor

@zhangwl9 zhangwl9 commented Sep 24, 2025

Why are the changes needed?

Close #3752.

Brief change log

When HA is enabled, the new primary node creates a ZooKeeper path named disposeCompletePath. During a primary-standby switchover, the old primary node deletes this path after it has released the AMS service, which tells the new primary node that it can start the AMS service.

1. Detect previous primary node information:

  • Check the current primary node information stored in ZooKeeper

  • Determine if it matches the current node information

  • If it is the same node (e.g., during restart), no waiting is required

  • If it is a different node, wait for the previous primary node to complete resource release
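The check in step 1 can be sketched as a small pure function. This is only an illustration, not the code in this PR: the class name `HandoverCheck`, the method name, and the string representation of node information are hypothetical, and treating a missing record as "no waiting required" is an assumption.

```java
import java.util.Objects;

/** Hypothetical helper: decides whether a newly elected primary must wait for a handover. */
final class HandoverCheck {
    private HandoverCheck() {}

    /**
     * @param recordedPrimary primary node info previously stored in ZooKeeper, or null if none
     * @param currentNode     info of the node that has just been elected
     * @return true if a different node was primary before and may still be releasing resources
     */
    static boolean mustWaitForPreviousPrimary(String recordedPrimary, String currentNode) {
        Objects.requireNonNull(currentNode, "currentNode");
        if (recordedPrimary == null || recordedPrimary.isEmpty()) {
            // Assumption: with no recorded primary there is nobody to wait for.
            return false;
        }
        // Same node (e.g. a restart): no waiting. Different node: wait for its release.
        return !recordedPrimary.equals(currentNode);
    }
}
```

For example, `HandoverCheck.mustWaitForPreviousPrimary("host-a:1260", "host-b:1260")` returns `true`, while passing the same value twice returns `false`.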

2. Wait for previous primary node release completion:

  • Create a special ZooKeeper path disposeCompletePath for signal notification

  • Wait up to 30 seconds until the previous primary node deletes this path

  • If no signal is received before the timeout, proceed with primary node operations anyway (so that an unresponsive previous primary cannot block the system)
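The bounded wait in step 2 could look roughly like the sketch below. It uses a `CountDownLatch` from the JDK in place of a real ZooKeeper watch, so the class `DisposeWaiter` and its methods are hypothetical; only the 30-second limit and the "proceed on timeout" behavior come from the description above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical waiter: blocks until the dispose signal arrives or the timeout expires. */
final class DisposeWaiter {
    /** Upper bound taken from the PR description (30 seconds). */
    static final long DEFAULT_TIMEOUT_MILLIS = 30_000L;

    private final CountDownLatch pathDeleted = new CountDownLatch(1);

    /** Assumed to be called by whatever observes the deletion of disposeCompletePath. */
    void onPathDeleted() {
        pathDeleted.countDown();
    }

    /**
     * Waits for the deletion signal.
     *
     * @return true if the signal arrived in time, false on timeout or interruption;
     *         in both cases the caller goes on to start the service
     */
    boolean awaitDispose(long timeoutMillis) {
        try {
            return pathDeleted.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }
}
```

The return value is informational only: a caller might log a warning when it is `false`, but it would start the AMS service either way, which is what keeps an unresponsive previous primary from blocking the new one.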

3. Send completion signal upon resource release:

  • In the signalDisposeComplete() method, the old primary node deletes disposeCompletePath when it shuts down

  • This allows the new primary node to detect that the previous primary node has completed resource release
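The signal in step 3 can be illustrated with an in-memory stand-in for ZooKeeper. Everything here except the method name signalDisposeComplete() is hypothetical (`FakePaths`, `OldPrimary`, and the example path string), and the ordering shown, release first and delete second, is the behavior the description implies rather than a copy of the actual implementation.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for the ZooKeeper paths used in this sketch (not a real client). */
final class FakePaths {
    private final Set<String> paths = ConcurrentHashMap.newKeySet();

    void create(String path) { paths.add(path); }
    boolean delete(String path) { return paths.remove(path); }
    boolean exists(String path) { return paths.contains(path); }
}

/** Hypothetical model of the node that is giving up the primary role. */
final class OldPrimary {
    private final FakePaths zk;
    private final String disposeCompletePath;
    private boolean serviceRunning = true;

    OldPrimary(FakePaths zk, String disposeCompletePath) {
        this.zk = zk;
        this.disposeCompletePath = disposeCompletePath;
    }

    /** Releases resources first, then signals, so the signal never precedes the release. */
    void dispose() {
        serviceRunning = false; // placeholder for stopping the AMS service
        signalDisposeComplete();
    }

    /** Deleting the path is the signal; returns false if the path was already gone. */
    boolean signalDisposeComplete() {
        return zk.delete(disposeCompletePath);
    }

    boolean isServiceRunning() { return serviceRunning; }
}
```

Because deleting an absent path is a no-op in this sketch, calling signalDisposeComplete() twice is harmless, which is a convenient property if shutdown hooks can run more than once.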

Known limitation: the change cannot reliably distinguish a primary-standby switchover from a restart of Amoro, so this PR has an unavoidable side effect. When HA is enabled and the AMS service is restarted after running for some time, the newly elected primary node, if it differs from the primary before the restart, waits 30 seconds before starting the AMS service.

How was this patch tested?

  • Add some test cases that check the changes thoroughly including negative and positive cases if possible

  • Add screenshots for manual tests if appropriate

  • Run test locally before making a pull request

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@zhangwl9 zhangwl9 force-pushed the AMORO-FixPrimaryBackupSwitch-dev branch from 5d68b30 to cb36f40 Compare September 24, 2025 10:50
@zhangwl9 zhangwl9 force-pushed the AMORO-FixPrimaryBackupSwitch-dev branch from d1c356b to a50b4fb Compare October 10, 2025 09:15
@tcodehuber tcodehuber changed the title from "[AMORO-3752] Fixed an issue where, when enabling HA in Amoro and switching between primary and standby nodes, the data loaded by the primary node was inconsistent with the database." to "[AMORO-3752] Resolve data inconsistency issue during HA failover" Oct 10, 2025
Contributor Author

zhangwl9 commented Oct 13, 2025

@xxubai Could you help me to review it when you are free? Thank you very much!

… primary and standby nodes, the data loaded by the primary node was inconsistent with the database.
@zhangwl9 zhangwl9 force-pushed the AMORO-FixPrimaryBackupSwitch-dev branch from e655b30 to 69132f2 Compare October 17, 2025 09:13
Development

Successfully merging this pull request may close these issues.

[Bug]: Data inconsistency caused by switching between primary and backup nodes
