ACL replication breaks after upgrade from 1.9.5 to 1.14.3 #16273

Open
kemko opened this issue Feb 15, 2023 · 6 comments

Comments

@kemko

kemko commented Feb 15, 2023

Overview of the Issue

After upgrading Consul from 1.9.5 to 1.14.3, ACL replication breaks. It can be fixed by a rather strange sequence of actions, described below. We are filing a bug report because we could not find any notes about this behavior in the documentation.

Reproduction Steps

  1. Deploy at least three Consul clusters on version 1.9.5. One of them must be declared the primary datacenter; the rest must be configured to replicate ACLs from it.
  2. Upgrade all clusters to 1.14.3. After that, the secondary clusters periodically log ACL replication errors. See log 1.
  3. In the web UI of any secondary datacenter, create an empty policy. This temporarily fixes the problem in that DC, but not in the others. Later, replication breaks again on its own, without any manual intervention. See log 2.
  4. Bind the empty policy from step 3 to any existing token. (For example, we bound it to the initial management token.)

After that, the replication error will disappear for all datacenters and replication will work as expected.
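
For anyone who wants to script steps 3 and 4 instead of using the web UI, here is a minimal sketch against the Consul HTTP API (`PUT /v1/acl/policy` and `PUT /v1/acl/token/:AccessorID`). The address, the policy name, and all helper names are assumptions for illustration; the token object passed to `bind_policy_request` is assumed to have been read from the API first, because updating a token replaces its whole policy list. This has not been verified as a fix beyond what is described above.

```python
import json
import urllib.request

# Assumption: address of a Consul server's HTTP API; adjust for your setup.
CONSUL_ADDR = "http://127.0.0.1:8500"


def build_request(method, path, payload, acl_token, base_url=CONSUL_ADDR):
    """Build (but do not send) a JSON request to the Consul HTTP API."""
    return urllib.request.Request(
        url=base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        method=method,
        headers={"X-Consul-Token": acl_token, "Content-Type": "application/json"},
    )


def create_empty_policy_request(name, acl_token):
    """Step 3: create a policy with no rules (PUT /v1/acl/policy)."""
    payload = {"Name": name, "Description": "workaround for issue #16273", "Rules": ""}
    return build_request("PUT", "/v1/acl/policy", payload, acl_token)


def with_policy(token, policy_id):
    """Return a copy of a token object with one more policy link.

    Existing links are kept, and the policy is not added twice.
    """
    updated = dict(token)
    links = list(token.get("Policies") or [])
    if not any(link.get("ID") == policy_id for link in links):
        links.append({"ID": policy_id})
    updated["Policies"] = links
    return updated


def bind_policy_request(token, policy_id, acl_token):
    """Step 4: attach the policy to an existing token."""
    updated = with_policy(token, policy_id)
    path = "/v1/acl/token/" + updated["AccessorID"]
    return build_request("PUT", path, updated, acl_token)


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A typical flow would be to call `send(create_empty_policy_request(...))`, take the `ID` from the response, and pass it to `bind_policy_request` together with the token object to modify. Building the requests separately from sending them makes it easy to inspect what would be written before touching a production cluster.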

Operating system and Environment details

Ubuntu 20.04.5 LTS, x86_64 GNU/Linux

Log Fragments

Log 1
2023-02-15T12:40:01.391+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T12:40:01.391+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: local=27 remote=27
2023-02-15T12:40:01.391+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=1
2023-02-15T12:40:01.391+0300 [DEBUG] agent.server.replication.acl.policy: acl replication - downloaded updates: amount=1
2023-02-15T12:40:01.391+0300 [DEBUG] agent.server.replication.acl.policy: acl replication - performing updates
2023-02-15T12:40:01.391+0300 [WARN]  agent.server.replication.acl.policy: ACL replication error (will retry if still leader): error="failed to update local ACL policies: Failed to apply policy upserts: node is not the leader"
Log 2
2023-02-15T13:10:29.486+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T13:10:29.486+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: local=27 remote=27
2023-02-15T13:10:29.487+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=0
2023-02-15T13:10:29.487+0300 [DEBUG] agent.server.replication.acl.policy: ACL replication completed through remote index: index=1867938962
2023-02-15T13:12:44.368+0300 [INFO]  agent.server.replication.acl.policy: started ACL Policy replication
2023-02-15T13:12:44.373+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T13:12:44.373+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: local=27 remote=27
2023-02-15T13:12:44.373+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=0
2023-02-15T13:12:44.373+0300 [DEBUG] agent.server.replication.acl.policy: ACL replication completed through remote index: index=1867938962
2023-02-15T13:15:40.920+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T13:17:53.696+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T13:17:53.696+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: local=27 remote=27
2023-02-15T13:17:53.696+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=0
2023-02-15T13:17:53.696+0300 [DEBUG] agent.server.replication.acl.policy: ACL replication completed through remote index: index=1867938962
2023-02-15T13:23:00.541+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T13:23:00.541+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: local=27 remote=27
2023-02-15T13:23:00.541+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=0
2023-02-15T13:23:00.541+0300 [DEBUG] agent.server.replication.acl.policy: ACL replication completed through remote index: index=1867938962
2023-02-15T13:28:01.043+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T13:28:01.043+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: local=27 remote=27
2023-02-15T13:28:01.043+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=0
2023-02-15T13:28:01.043+0300 [DEBUG] agent.server.replication.acl.policy: ACL replication completed through remote index: index=1867938962
2023-02-15T13:33:11.701+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T13:33:11.701+0300 [WARN]  agent.server.replication.acl.policy: ACL replication remote index moved backwards, forcing a full ACL sync: from=1867938962 to=1692767365
2023-02-15T13:33:11.701+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: local=27 remote=27
2023-02-15T13:33:11.701+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=1
2023-02-15T13:33:11.705+0300 [DEBUG] agent.server.replication.acl.policy: acl replication - downloaded updates: amount=1
2023-02-15T13:33:11.706+0300 [DEBUG] agent.server.replication.acl.policy: acl replication - performing updates
2023-02-15T13:33:11.713+0300 [DEBUG] agent.server.replication.acl.policy: acl replication - upserted batch: number_upserted=1 batch_size=497
2023-02-15T13:33:11.713+0300 [DEBUG] agent.server.replication.acl.policy: acl replication - finished updates
2023-02-15T13:33:11.713+0300 [DEBUG] agent.server.replication.acl.policy: ACL replication completed through remote index: index=1692767365
2023-02-15T13:33:11.718+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T13:33:11.718+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: local=27 remote=27
2023-02-15T13:33:11.718+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=0
2023-02-15T13:33:11.718+0300 [DEBUG] agent.server.replication.acl.policy: ACL replication completed through remote index: index=1867938962
2023-02-15T13:38:26.839+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T13:38:26.839+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: local=27 remote=27
2023-02-15T13:38:26.839+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=0
2023-02-15T13:38:26.839+0300 [DEBUG] agent.server.replication.acl.policy: ACL replication completed through remote index: index=1867938962
2023-02-15T13:43:32.062+0300 [DEBUG] agent.server.replication.acl.policy: finished fetching acls: amount=27
2023-02-15T13:43:32.062+0300 [WARN]  agent.server.replication.acl.policy: ACL replication remote index moved backwards, forcing a full ACL sync: from=1867938962 to=1692767365
2023-02-15T13:43:32.062+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: local=27 remote=27
2023-02-15T13:43:32.062+0300 [DEBUG] agent.server.replication.acl.policy: acl replication: deletions=0 updates=1
2023-02-15T13:43:32.067+0300 [DEBUG] agent.server.replication.acl.policy: acl replication - downloaded updates: amount=1
2023-02-15T13:43:32.067+0300 [DEBUG] agent.server.replication.acl.policy: acl replication - performing updates
2023-02-15T13:43:32.083+0300 [WARN]  agent.server.replication.acl.policy: ACL replication error (will retry if still leader): error="failed to update local ACL policies: Failed to apply policy upserts: Changing the Rules for the builtin global-management policy is not permitted"
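
The two symptoms visible in log 2 (the remote index moving backwards, and the failed upsert that follows) can be pulled out of a longer log with a small script. This is only a sketch: the regular expressions are derived from the log lines quoted above, not from the Consul source, and the function and variable names are hypothetical.

```python
import re

# Matches the warning logged when the remote index goes backwards.
BACKWARDS = re.compile(r"remote index moved backwards.*from=(\d+) to=(\d+)")
# Matches the quoted error text of a failed replication round.
ERROR = re.compile(r'ACL replication error.*error="(.*)"')

# Two lines copied from log 2 above.
SAMPLE = [
    '2023-02-15T13:33:11.701+0300 [WARN]  agent.server.replication.acl.policy: '
    'ACL replication remote index moved backwards, forcing a full ACL sync: '
    'from=1867938962 to=1692767365',
    '2023-02-15T13:43:32.083+0300 [WARN]  agent.server.replication.acl.policy: '
    'ACL replication error (will retry if still leader): error="failed to update '
    'local ACL policies: Failed to apply policy upserts: Changing the Rules for '
    'the builtin global-management policy is not permitted"',
]


def summarize(lines):
    """Collect backwards index jumps and replication errors from log lines."""
    jumps, errors = [], []
    for line in lines:
        match = BACKWARDS.search(line)
        if match:
            jumps.append((int(match.group(1)), int(match.group(2))))
            continue
        match = ERROR.search(line)
        if match:
            errors.append(match.group(1))
    return {"backwards_jumps": jumps, "errors": errors}


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Running it over a full server log shows how often the index flips between the two values and which error text accompanies each failure.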
@huikang
Collaborator

huikang commented Feb 16, 2023

Given the large gap between 1.9.x and 1.14.x, I wonder whether upgrading to an intermediate version first, such as 1.9.x to 1.10.x and so on, would help narrow down the root cause.

https://developer.hashicorp.com/consul/docs/upgrading/instructions

@akotlyar

I got the same error: replication does not work between 1.12.8 and 1.13.7.
If the primary and the secondary both run 1.13.7, everything works, but if one of them runs 1.12.8, policy replication fails with an error.

@akotlyar

Policy replication does not work with any 1.13.x version if one of the DCs is below 1.13.x.

@akotlyar

Is there any solution to this problem? We run 8 datacenters and have always upgraded according to the following instructions:
"Upgrade the Consul agents in all DCs to version 1.x.x by following our General Upgrade Process. This should be done one DC at a time, leaving the primary DC for last"

But this scheme does not work when upgrading from 1.12.8 to 1.13.7. After the first DC is upgraded, ACL replication on it fails. We set up a test environment with 3 datacenters and found that 1.12.8 (9) cannot, in principle, synchronize ACLs with 1.13.x versions.
If you upgrade the primary DC to 1.13.x, replication breaks on all the other DCs still on 1.12.8; if you upgrade only one of the secondaries, replication breaks on that one.

The only option I see is to upgrade Consul in all DCs at the same time, but that would affect more critical services, which I would like to avoid. Is this upgrade behavior intended, or is it a bug?
The Specific Version Details page does not mention any change to the replication system in 1.13.x.
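
As a pre-flight check before a rolling upgrade, the condition described in this comment (some DCs below 1.13.x while others are on 1.13.x or later) can be detected with a few lines of code. This is a sketch only: the 1.13 boundary comes from the observation above and is not confirmed by the Consul documentation, and the function names and the datacenter-to-version mapping are hypothetical.

```python
def parse_version(text):
    """Turn a concrete version such as '1.13.7' into (1, 13, 7)."""
    return tuple(int(part) for part in text.split("."))


def split_by_boundary(versions, boundary=(1, 13)):
    """Return (older, newer) lists of datacenter names around the boundary.

    `versions` maps a datacenter name to its Consul version string.
    The default boundary is an assumption taken from this thread.
    """
    older = sorted(dc for dc, v in versions.items() if parse_version(v)[:2] < boundary)
    newer = sorted(dc for dc, v in versions.items() if parse_version(v)[:2] >= boundary)
    return older, newer


def replication_at_risk(versions, boundary=(1, 13)):
    """True when datacenters sit on both sides of the boundary."""
    older, newer = split_by_boundary(versions, boundary)
    return bool(older) and bool(newer)


if __name__ == "__main__":
    fleet = {"dc1": "1.13.7", "dc2": "1.12.8", "dc3": "1.12.8"}
    print(split_by_boundary(fleet), replication_at_risk(fleet))
```

A check like this does not fix anything; it only makes the mixed-version window visible so it can be kept as short as possible.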

@garry-t

garry-t commented Oct 10, 2024

Same case here. We upgraded 1.11.4 -> 1.12.x -> 1.13.x -> 1.14.x -> 1.15.x -> 1.16.x -> 1.17.x -> 1.19.x. Partway through this upgrade chain, replication stopped working and all keys were deleted in a secondary DC.

@satroutr

satroutr commented Feb 3, 2025

The same issue appears between 1.16.x and 1.17.3.

5 participants