
Cilium installation with BGP Control Plane enabled on OpenShift fails #31499

Closed
akaliwod opened this issue Mar 19, 2024 · 8 comments
Labels
area/bgp info-completed The GH issue has received a reply from the author kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. sig/agent Cilium agent related. stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.

Comments

@akaliwod

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

I am trying to install Cilium on an OpenShift cluster (running on ESXi). As soon as the following lines are added to the manifest:

  bgpControlPlane:
    enabled: true

Cluster bootstrap fails.
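
For reference, a sketch of what the full CiliumConfig CR looks like with this change applied; the apiVersion, kind, name and namespace are taken from the error log further below, while the rest of the spec is assumed:

apiVersion: cilium.io/v1alpha1
kind: CiliumConfig
metadata:
  name: cilium
  namespace: cilium
spec:
  # ...other Helm values from the original install...
  bgpControlPlane:
    enabled: true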

I also tried installing OCP with the Cilium CNI without the BGP control plane enabled. That works fine and the cluster comes up with the Cilium CNI. However, when the ciliumconfig CR is then changed to enable bgpControlPlane, OLM reports that the cluster upgrade has failed with the following crash:

2024-03-19T13:41:10Z    ERROR   helm.controller Release failed  {"namespace": "cilium", "name": "cilium", "apiVersion": "cilium.io/v1alpha1", "kind": "CiliumConfig", "release": "cilium", "error": "upgrade failed; rollback required"}
github.com/operator-framework/operator-sdk/internal/helm/controller.HelmOperatorReconciler.Reconcile
        /workspace/internal/helm/controller/reconcile.go:328
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:118
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:314
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226
2024-03-19T13:41:10Z    ERROR   Reconciler error        {"controller": "ciliumconfig-controller", "object": {"name":"cilium","namespace":"cilium"}, "namespace": "cilium", "name": "cilium", "reconcileID": "f6e30445-896f-4e05-891a-36e163a49994", "error": "upgrade failed; rollback required"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226

Cilium Version

1.15.1

Kernel Version

5.14.0-284.25.1.el9_2.x86_64

Kubernetes Version

v1.26.1

Regression

No response

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
@akaliwod akaliwod added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels Mar 19, 2024
@aditighag
Member

I also tried installing OCP with the Cilium CNI without the BGP control plane enabled. That works fine and the cluster comes up with the Cilium CNI. However, when the ciliumconfig CR is then changed to enable bgpControlPlane, OLM reports that the cluster upgrade has failed with the following crash

Did you check the cilium agent logs to see what the issue might be? Please attach a sysdump.
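
For example (the cilium namespace matches this install; oc logs against the DaemonSet picks one agent pod):

# Cilium agent logs
oc -n cilium logs ds/cilium --all-containers --tail=500
# Full sysdump with the Cilium CLI
cilium sysdump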

@aditighag aditighag added the need-more-info More information is required to further debug or fix the issue. label Mar 19, 2024
@akaliwod
Author

I think it might be related to a missing ClusterRole and ClusterRoleBinding.

I solved it by adding the following:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cilium-cilium-olm-secrets
rules:
- apiGroups:
  - ""
  resources:
  - secrets
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cilium-cilium-olm-secrets
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cilium-cilium-olm-secrets
subjects:
- kind: ServiceAccount
  name: cilium-olm
  namespace: cilium

Then, after restarting the Cilium agents, the config is reconciled, new images are pulled, and BGP is ready to be used.
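
A sketch of the commands for these steps (the manifest file name is assumed; the namespace and DaemonSet name match this install):

# apply the workaround RBAC above
oc apply -f cilium-olm-secrets-rbac.yaml
# restart the Cilium agents
oc -n cilium rollout restart daemonset/cilium

With the agents back up, the BGP peering policy can be created and the sessions checked: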

[root@cilium-installer ~]# oc create -f bgp-policy.yaml
ciliumbgppeeringpolicy.cilium.io/xrd created
[root@cilium-installer ~]# cilium bgp peers --namespace cilium
Node                          Local AS   Peer AS   Peer Address   Session State   Uptime   Family         Received   Advertised
cilium-2vx97-master-0         64513      64512     192.168.66.1   active          0s       ipv4/unicast   0          0
                                                                                           ipv6/unicast   0          0
cilium-2vx97-master-1         64513      64512     192.168.66.1   active          0s       ipv4/unicast   0          0
                                                                                           ipv6/unicast   0          0
cilium-2vx97-master-2         64513      64512     192.168.66.1   active          0s       ipv4/unicast   0          0
                                                                                           ipv6/unicast   0          0
cilium-2vx97-worker-0-9cnpm   64513      64512     192.168.66.1   active          0s       ipv4/unicast   0          0
                                                                                           ipv6/unicast   0          0
cilium-2vx97-worker-0-gf8jj   64513      64512     192.168.66.1   active          0s       ipv4/unicast   0          0
                                                                                           ipv6/unicast   0          0
cilium-2vx97-worker-0-p4dff   64513      64512     192.168.66.1   active          0s       ipv4/unicast   0          0
                                                                                           ipv6/unicast   0          0
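
The contents of bgp-policy.yaml are not included in the issue; a minimal CiliumBGPPeeringPolicy consistent with the output above might look like this (the ASNs and peer address are taken from the output; exportPodCIDR and the absence of a nodeSelector are assumptions):

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeeringPolicy
metadata:
  name: xrd
spec:
  virtualRouters:
  - localASN: 64513
    exportPodCIDR: true
    neighbors:
    - peerAddress: "192.168.66.1/32"
      peerASN: 64512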

@github-actions github-actions bot added info-completed The GH issue has received a reply from the author and removed need-more-info More information is required to further debug or fix the issue. labels Mar 20, 2024
@squeed
Contributor

squeed commented Mar 26, 2024

That's odd, we should be configuring those values correctly.

@squeed squeed added sig/agent Cilium agent related. and removed needs/triage This issue requires triaging to establish severity and next steps. labels Mar 26, 2024

This issue has been automatically marked as stale because it has not
had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label May 26, 2024

github-actions bot commented Jun 9, 2024

This issue has not seen any activity since it was marked stale.
Closing.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jun 9, 2024
@simu
Contributor

simu commented Jun 20, 2024

I just ran into the same issue when trying to enable the BGP control plane on an existing OCP cluster. Applying the workaround as mentioned in #31499 (comment) worked for me as well.

I had a quick look at the role that's deployed for the OLM operator service account (that role grants the OLM SA verbs: ['*'] which is a superset of the permissions of the denied role). However, it seems like something in Kubernetes doesn't realize that creating RBAC with specific verbs for a resource isn't a privilege escalation for a principal which holds permissions for verbs: ['*'] for that resource.

Edit: I didn't read the comments carefully enough earlier. The actual problem is that the current OLM RBAC only works if Cilium is installed in namespace kube-system (at least for Cilium 1.14), since the Helm chart used by the OLM install for 1.14 doesn't support customizing the bgp-secrets-namespace flag which defaults to kube-system. If the OLM operator runs in a different namespace (e.g. cilium for us), it doesn't have sufficient permissions to create a Role or RoleBinding to access secrets in namespace kube-system out of the box.

I haven't checked the Helm chart for 1.15.1 yet, but from the docs it seems like with 1.15 this issue can be avoided by setting the Helm value bgpControlPlane.secretsNamespace.name to the name of the namespace in which Cilium is installed.

@saintdle
Contributor

The issue has also been reported here:
isovalent/olm-for-cilium#91

@saintdle
Contributor

saintdle commented Jun 24, 2024

Using these values should mean that you do not get the warning message, even if you don't intend to use BGP secrets:

    bgpControlPlane:
      enabled: true
      secretsNamespace:
        name: cilium
        create: false
