Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

machine-controller-manager is not waiting for detach of CSI PVs #508

Closed
ialidzhikov opened this issue Sep 10, 2020 · 8 comments · Fixed by #509
Closed

machine-controller-manager is not waiting for detach of CSI PVs #508

ialidzhikov opened this issue Sep 10, 2020 · 8 comments · Fixed by #509
Assignees
Labels
area/storage Storage related kind/bug Bug priority/2 Priority (lower number equals higher priority)
Milestone

Comments

@ialidzhikov
Copy link
Member

ialidzhikov commented Sep 10, 2020

What happened:
Currently machine-controller-manager Driver interface has a func GetVolNames which for given slice of PVs, returns the provider volume id. For example this func looks like this for the aws driver:

https://github.com/gardener/machine-controller-manager/blob/2177dc7df83197c70624f243463d4f2c58c14017/pkg/driver/driver_aws.go#L492_L505

Note that this funcs relies on .spec.awsElasticBlockStore.volumeID to be present.

This func only works for in-tree PVs where the the volume spec looks:

apiVersion: v1
kind: PersistentVolume
metadata:
  # omitted
spec:
  accessModes:
  - ReadWriteOnce
  awsElasticBlockStore:
    fsType: ext4
    volumeID: aws://eu-west-1a/vol-02c0a3f24fa045d24
  capacity:
    storage: 10Gi
  # omitted

A CSI PV looks like:

apiVersion: v1
kind: PersistentVolume
metadata:
  # omitted
spec:
  capacity:
    storage: 4Gi
  csi:
    driver: ebs.csi.aws.com
    fsType: ext4
    volumeAttributes:
      storage.kubernetes.io/csiProvisionerIdentity: 1599743221444-1847-ebs.csi.aws.com
    volumeHandle: vol-054657884a6973290
  # omitted

Currently GetVolNames won't return any CSI PV volumeID and machine-controller-manager won't wait any CSI PV to be detached during machine deletion.

What you expected to happen:

machine-controller-manager to wait for detach of CSI PVs.

How to reproduce it (as minimally and precisely as possible):

  1. Create a v1.18 or v1.19 Shoot.

  2. Create a PVC

  3. Ensure that drain of Machine is not waiting for detach

Anything else we need to know:

Environment:

@ialidzhikov ialidzhikov added the kind/bug Bug label Sep 10, 2020
@ialidzhikov
Copy link
Member Author

/area storage
/priority critical

@gardener-robot gardener-robot added area/storage Storage related priority/critical Needs to be resolved soon, because it impacts users negatively labels Sep 10, 2020
@ialidzhikov
Copy link
Member Author

/assign @ialidzhikov

@hardikdr
Copy link
Member

Great catch, thanks for pointing out @ialidzhikov.

This should essentially mean, with 1.18+ k8s versions, the serialized eviction will not be effective anymore. cc @ggaurav10 @amshuman-kr

/priority blocker
/milestone 0.35.0

@gardener-robot gardener-robot added priority/blocker Needs to be resolved now, because it breaks the service and removed priority/critical Needs to be resolved soon, because it impacts users negatively labels Sep 11, 2020
@hardikdr
Copy link
Member

/milestone v0.35.0

@gardener-robot gardener-robot added this to the v0.35.0 milestone Sep 11, 2020
@gardener gardener deleted a comment from gardener-robot Sep 11, 2020
@hardikdr hardikdr added the priority/critical Needs to be resolved soon, because it impacts users negatively label Sep 11, 2020
@gardener-robot gardener-robot removed the priority/blocker Needs to be resolved now, because it breaks the service label Sep 11, 2020
@prashanth26
Copy link
Contributor

Something to be kept in mind - gardener/gardener-extension-provider-gcp#172 (comment).

@hardikdr
Copy link
Member

hardikdr commented Sep 11, 2020

Btw the interface method GetVolNames is optional for providers and implemented only to enable the serialized eviction.

  • GetVolNames definitely has to be adapted.

The root cause could be a little more than that, the dangling volumeAttachments are probably because machines are force-deleted during hibernation[no-drain], but this needs investigation as mentioned by @prashanth26 above.

@hardikdr
Copy link
Member

Also, I think the machine could be force deleted due to multiple reasons, and the volumeAttachments ideally should be garbage collected by the responsible csi-controller, I don't have enough understanding there though.

@hardikdr
Copy link
Member

Likely related: kubernetes-csi/external-attacher#215

@gardener-robot gardener-robot added priority/2 Priority (lower number equals higher priority) and removed priority/critical Needs to be resolved soon, because it impacts users negatively labels Mar 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/storage Storage related kind/bug Bug priority/2 Priority (lower number equals higher priority)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants