Commit 968dc5e

Move volume health monitoring to beta
1 parent 6a4aadc commit 968dc5e

3 files changed: +13 -14 lines changed

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 kep-number: 1432
-alpha:
+beta:
 approver: "@deads2k"

keps/sig-storage/1432-volume-health-monitor/README.md

Lines changed: 10 additions & 11 deletions
@@ -139,7 +139,7 @@ Two main parts are involved here in the architecture.
 - Note that currently we do not have CSI support for local storage. When the support is available, we will implement relavant CSI monitoring interfaces as well.
 - Expose Volume Health information as Kubelet VolumeStats Metrics.

-The volume health monitoring by Kubelet will be controlled by a new feature gate called `VolumeHealth`.
+The volume health monitoring by Kubelet will be controlled by a new feature gate called `CSIVolumeHealth`.

 ## Implementation

@@ -515,7 +515,7 @@ In addition to volume stats collected already, Kubelet will also check the mount
 If abnormal volume condition is detected from NodeGetVolumeStats, Kubelet will retrieve all the pods used by the particular volume and report events on the pod objects. If multiple pods are using the same volume, events will be reported on all pods. This can be done by adding logic in csi_client after the NodeGetVolumeStats call to send events to pods if volume condition is abnormal.
 https://github.com/kubernetes/kubernetes/blob/v1.21.0-alpha.2/pkg/volume/csi/csi_client.go#L608

-This new volume health monitoring by Kubelet will be gated by the `VolumeHealth` feature gate. If enabled, Kubelet will monitor volume health when calling NodeGetVolumeStats CSI function and report events on pods when abnormal volume condition is detected. If not enabled, Kubelet works the same as before and will not check volume health when calling NodeGetVolumeStats CSI function.
+This new volume health monitoring by Kubelet will be gated by the `CSIVolumeHealth` feature gate. If enabled, Kubelet will monitor volume health when calling NodeGetVolumeStats CSI function and report events on pods when abnormal volume condition is detected. If not enabled, Kubelet works the same as before and will not check volume health when calling NodeGetVolumeStats CSI function.

 ### Alternatives

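Review note: the gated behavior described in this hunk can be sketched roughly as follows. This is a minimal illustration, not the Kubelet code. The types `VolumeCondition` and `Event` and the function `reportVolumeHealth` are hypothetical names introduced here; the real logic lives near the NodeGetVolumeStats call in csi_client and uses the Kubernetes event recorder.

```go
package main

import "fmt"

// VolumeCondition is a hypothetical stand-in for the volume condition
// returned alongside NodeGetVolumeStats.
type VolumeCondition struct {
	Abnormal bool
	Message  string
}

// Event is a hypothetical stand-in for an event reported on a pod.
type Event struct {
	Pod     string
	Message string
}

// reportVolumeHealth returns the events that would be reported for one volume.
// When the gate is off, or the condition is normal, nothing is reported.
// When the condition is abnormal, every pod using the volume gets an event.
func reportVolumeHealth(gateEnabled bool, cond VolumeCondition, pods []string) []Event {
	if !gateEnabled || !cond.Abnormal {
		return nil
	}
	events := make([]Event, 0, len(pods))
	for _, p := range pods {
		events = append(events, Event{Pod: p, Message: cond.Message})
	}
	return events
}

func main() {
	cond := VolumeCondition{Abnormal: true, Message: "volume is not mounted"}
	pods := []string{"pod-a", "pod-b"}
	fmt.Println(len(reportVolumeHealth(true, cond, pods)))  // one event per pod
	fmt.Println(len(reportVolumeHealth(false, cond, pods))) // gate off: no events
}
```

The sketch keeps the gate check and the abnormal-condition check in one place so that the disabled path stays identical to the pre-feature behavior.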
@@ -732,7 +732,7 @@ _This section must be completed when targeting alpha to a release._
 * **How can this feature be enabled / disabled in a live cluster?**
 - [x] Other
 - Describe the mechanism:
-This feature has a feature gate called `VolumeHealth` for Kubelet.
+This feature has a feature gate called `CSIVolumeHealth` for Kubelet.
 It is enabled when the feature gate in turned on.
 The health monitoring feature in external controller does not have a
 feature gate because it is out of tree.
@@ -766,7 +766,7 @@ _This section must be completed when targeting alpha to a release._
 detected again and the new metric will be emitted by Kubelet again.

 * **Are there any tests for feature enablement/disablement?**
-There will be unit tests for the feature `VolumeHealth` enablement/disablement.
+There will be unit tests for the feature `CSIVolumeHealth` enablement/disablement.
 Since there is no feature gate for this feature on the controller side and the only way to
 enable or disable this feature is to install or unistall the sidecar, we cannot write
 tests for feature enablement/disablement on the controller side.
@@ -785,7 +785,7 @@ _This section must be completed when targeting beta graduation to a release._
 the health monitoring controller cannot be deployed, no events on volume
 condition will be reported on PVCs.

-If enabling the `VolumeHealth` feature fails, no event on volume condition will be
+If enabling the `CSIVolumeHealth` feature fails, no event on volume condition will be
 reported on the pod and the new `volume_stats_health_abnormal` metric won't be emitted.

 * **What specific metrics should inform a rollback?**
@@ -802,7 +802,7 @@ _This section must be completed when targeting beta graduation to a release._
 Describe manual testing that was done and the outcomes.
 Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
-Manual testing will be done to upgrade from 1.22 to 1.23 and downgrade from 1.23 back to 1.22.
+Manual testing will be done to upgrade from 1.24 to 1.25 and downgrade from 1.25 back to 1.24.

 * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
 fields of API types, flags, etc.?**
@@ -826,7 +826,7 @@ _This section must be completed when targeting beta graduation to a release._
 The `csi_sidecar_operations_seconds` metric should be sliced by process after
 they are aggregated to show metrics for different sidecars.

-In Kubelet, an operator can check whether the feature gate `VolumeHealth`
+In Kubelet, an operator can check whether the feature gate `CSIVolumeHealth`
 is enabled and whether the new metric `volume_stats_health_abnormal` is emitted.

 * **What are the SLIs (Service Level Indicators) an operator can use to determine
@@ -883,7 +883,7 @@ _This section must be completed when targeting beta graduation to a release._
 - Usage description:
 - Impact of its outage on the feature: Installation of csi-external-health-monitor-controller sidecar is required for the feature to work from the controller side. If csi-external-health-monitor-controller is not installed, abnormal volume conditions will not be reported as events on PVCs.
 Note that CSI driver needs to be updated to implement volume health RPCs in controller/node plugins. The minimum kubernetes version should be 1.13: https://kubernetes-csi.github.io/docs/introduction.html#kubernetes-releases. K8s v1.13 is the minimum supported version for CSI driver to work, however, different CSI drivers have different requirements on supported k8s versions so users are supposed to check documentation of the CSI drivers. If the CSI node plugin on one node has been upgraded to support volume health while it is not upgraded on 3 other nodes, then we will only expect to see volume health events on pods running on that one upgraded node.
-In addition, since Kubelet is doing volume health monitoring from the node side, the supported Kubernetes version will have to be the version that supports `VolumeHealth` feature. So the minimum Kubernetes version will be 1.21.
+In addition, since Kubelet is doing volume health monitoring from the node side, the supported Kubernetes version will have to be the version that supports `CSIVolumeHealth` feature when we moved volume health events report to Kubelet. So the minimum Kubernetes version will be 1.21 for the events to be reported on the pods. In Kubernetes 1.24, we also added volume_stats_health_abnormal to metrics in Kubelet. So 1.24 is the minimum required version for metrics support.
 - Impact of its degraded performance or high-error rates on the feature: If abnormal volume conditions are reported with degraded performance or high-error rates, that would affect how soon or how accurately users could manually react to these conditions.


@@ -899,15 +899,14 @@ previous answers based on experience in the field._

 * **Will enabling / using this feature result in any new API calls?**
 Describe them, providing:
-- API call type (e.g. PATCH pods): Only events will be reported to PVCs or Pods if this feature is enabled.
+- API call type (e.g. PATCH pods): Events will be reported to PVCs and Pods and metrics will be in Kubelet if this feature is enabled.
 - estimated throughput
 - originating component(s) (e.g. Kubelet, Feature-X-controller)
 focusing mostly on:
 - components listing and/or watching resources they didn't before
 csi-external-health-monitor-controller sidecar.
 There is a monitor interval for the controller to control how often to check the volume health.
-It is configurable with 1 minute as default. Will consider changing it to 5 minutes by default
-to avoid overloading the K8s API server.
+It is configurable with 5 minute as default.
 When scaled out across many nodes, low frequency checks can still produce high volumes of
 events. To control this, we should use options on the eventrecorder to control QPS per key.
 This way we can collapse keys and have a slow update cadence per key.

keps/sig-storage/1432-volume-health-monitor/kep.yaml

Lines changed: 2 additions & 2 deletions
@@ -20,8 +20,8 @@ approvers:
 see-also:
 replaces:

-latest-milestone: "v1.24"
-stage: "alpha"
+latest-milestone: "v1.25"
+stage: "beta"
 milestone:
 alpha: "v1.21"
 beta: "v1.25"
