keps/sig-storage/1432-volume-health-monitor/README.md
10 additions & 11 deletions
@@ -139,7 +139,7 @@ Two main parts are involved here in the architecture.
 - Note that currently we do not have CSI support for local storage. When the support is available, we will implement relevant CSI monitoring interfaces as well.
 - Expose Volume Health information as Kubelet VolumeStats Metrics.

-The volume health monitoring by Kubelet will be controlled by a new feature gate called `VolumeHealth`.
+The volume health monitoring by Kubelet will be controlled by a new feature gate called `CSIVolumeHealth`.

 ## Implementation
@@ -515,7 +515,7 @@ In addition to volume stats collected already, Kubelet will also check the mount
 If abnormal volume condition is detected from NodeGetVolumeStats, Kubelet will retrieve all the pods used by the particular volume and report events on the pod objects. If multiple pods are using the same volume, events will be reported on all pods. This can be done by adding logic in csi_client after the NodeGetVolumeStats call to send events to pods if volume condition is abnormal.

-This new volume health monitoring by Kubelet will be gated by the `VolumeHealth` feature gate. If enabled, Kubelet will monitor volume health when calling NodeGetVolumeStats CSI function and report events on pods when abnormal volume condition is detected. If not enabled, Kubelet works the same as before and will not check volume health when calling NodeGetVolumeStats CSI function.
+This new volume health monitoring by Kubelet will be gated by the `CSIVolumeHealth` feature gate. If enabled, Kubelet will monitor volume health when calling NodeGetVolumeStats CSI function and report events on pods when abnormal volume condition is detected. If not enabled, Kubelet works the same as before and will not check volume health when calling NodeGetVolumeStats CSI function.

 ### Alternatives
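The gating behaviour described in this hunk can be sketched as follows. This is a minimal illustration in Python, not Kubelet's implementation (which is written in Go). `VolumeCondition`, `report_abnormal_volume`, and the `(pod, message)` event shape are hypothetical names introduced here. The sketch assumes the caller already holds the result of a NodeGetVolumeStats call and a mapping from each volume to the pods using it.

```python
from dataclasses import dataclass


@dataclass
class VolumeCondition:
    """Hypothetical stand-in for the condition returned with NodeGetVolumeStats."""
    abnormal: bool
    message: str


def report_abnormal_volume(feature_gates, volume_id, condition, pods_by_volume):
    """Return one (pod, message) event per pod using an abnormal volume.

    Nothing is reported when the CSIVolumeHealth gate is off or when the
    volume condition is normal, mirroring the "works the same as before" case.
    """
    if not feature_gates.get("CSIVolumeHealth", False):
        return []
    if condition is None or not condition.abnormal:
        return []
    # Every pod that uses the volume gets its own event.
    return [(pod, condition.message) for pod in pods_by_volume.get(volume_id, [])]


events = report_abnormal_volume(
    {"CSIVolumeHealth": True},
    "vol-1",
    VolumeCondition(abnormal=True, message="volume is out of space"),
    {"vol-1": ["pod-a", "pod-b"]},
)
```

With the gate enabled and two pods sharing `vol-1`, both pods receive an event. With the gate absent or disabled, the function returns an empty list.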
@@ -732,7 +732,7 @@ _This section must be completed when targeting alpha to a release._
 * **How can this feature be enabled / disabled in a live cluster?**
 - [x] Other
 - Describe the mechanism:
-This feature has a feature gate called `VolumeHealth` for Kubelet.
+This feature has a feature gate called `CSIVolumeHealth` for Kubelet.
 It is enabled when the feature gate is turned on.
 The health monitoring feature in external controller does not have a
 feature gate because it is out of tree.
@@ -766,7 +766,7 @@ _This section must be completed when targeting alpha to a release._
 detected again and the new metric will be emitted by Kubelet again.

 * **Are there any tests for feature enablement/disablement?**
-There will be unit tests for the feature `VolumeHealth` enablement/disablement.
+There will be unit tests for the feature `CSIVolumeHealth` enablement/disablement.
 Since there is no feature gate for this feature on the controller side and the only way to
 enable or disable this feature is to install or uninstall the sidecar, we cannot write
 tests for feature enablement/disablement on the controller side.
@@ -785,7 +785,7 @@ _This section must be completed when targeting beta graduation to a release._
 the health monitoring controller cannot be deployed, no events on volume
 condition will be reported on PVCs.

-If enabling the `VolumeHealth` feature fails, no event on volume condition will be
+If enabling the `CSIVolumeHealth` feature fails, no event on volume condition will be
 reported on the pod and the new `volume_stats_health_abnormal` metric won't be emitted.

 * **What specific metrics should inform a rollback?**
@@ -802,7 +802,7 @@ _This section must be completed when targeting beta graduation to a release._
 Describe manual testing that was done and the outcomes.
 Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
-Manual testing will be done to upgrade from 1.22 to 1.23 and downgrade from 1.23 back to 1.22.
+Manual testing will be done to upgrade from 1.24 to 1.25 and downgrade from 1.25 back to 1.24.

 * **Is the rollout accompanied by any deprecations and/or removals of features, APIs,
 fields of API types, flags, etc.?**
@@ -826,7 +826,7 @@ _This section must be completed when targeting beta graduation to a release._
 The `csi_sidecar_operations_seconds` metric should be sliced by process after
 they are aggregated to show metrics for different sidecars.

-In Kubelet, an operator can check whether the feature gate `VolumeHealth`
+In Kubelet, an operator can check whether the feature gate `CSIVolumeHealth`
 is enabled and whether the new metric `volume_stats_health_abnormal` is emitted.

 * **What are the SLIs (Service Level Indicators) an operator can use to determine
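One way an operator might check whether the metric named in this hunk is emitted is to scan the Prometheus text output scraped from Kubelet. The sketch below is illustrative only: `find_metric_samples` and the sample text (including its label names and help string) are hypothetical. The metric name is taken as a parameter because the name a given Kubelet version exposes may differ from the one written in the KEP. It also assumes sample lines carry no trailing timestamp.

```python
def find_metric_samples(metrics_text, metric_name):
    """Return (labels, value) pairs for metric_name from Prometheus text output.

    Assumes each sample line has the form `name{labels} value` or `name value`,
    with no trailing timestamp.
    """
    samples = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name, _, labels = name_and_labels.partition("{")
        if name == metric_name:
            samples.append((labels.rstrip("}"), float(value)))
    return samples


# Hypothetical scrape output, for illustration only.
SAMPLE = """\
# HELP volume_stats_health_abnormal hypothetical help text
# TYPE volume_stats_health_abnormal gauge
volume_stats_health_abnormal{namespace="default",persistentvolumeclaim="data"} 1
some_other_metric 42
"""

found = find_metric_samples(SAMPLE, "volume_stats_health_abnormal")
```

An empty result means the metric is not present in the scrape, which is what one would expect when the feature gate is disabled.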
@@ -883,7 +883,7 @@ _This section must be completed when targeting beta graduation to a release._
 - Usage description:
 - Impact of its outage on the feature: Installation of csi-external-health-monitor-controller sidecar is required for the feature to work from the controller side. If csi-external-health-monitor-controller is not installed, abnormal volume conditions will not be reported as events on PVCs.
 Note that CSI driver needs to be updated to implement volume health RPCs in controller/node plugins. The minimum kubernetes version should be 1.13: https://kubernetes-csi.github.io/docs/introduction.html#kubernetes-releases. K8s v1.13 is the minimum supported version for CSI driver to work, however, different CSI drivers have different requirements on supported k8s versions so users are supposed to check documentation of the CSI drivers. If the CSI node plugin on one node has been upgraded to support volume health while it is not upgraded on 3 other nodes, then we will only expect to see volume health events on pods running on that one upgraded node.
-In addition, since Kubelet is doing volume health monitoring from the node side, the supported Kubernetes version will have to be the version that supports `VolumeHealth` feature. So the minimum Kubernetes version will be 1.21.
+In addition, since Kubelet is doing volume health monitoring from the node side, the supported Kubernetes version will have to be the version that supports `CSIVolumeHealth` feature when we moved volume health events report to Kubelet. So the minimum Kubernetes version will be 1.21 for the events to be reported on the pods. In Kubernetes 1.24, we also added volume_stats_health_abnormal to metrics in Kubelet. So 1.24 is the minimum required version for metrics support.
 - Impact of its degraded performance or high-error rates on the feature: If abnormal volume conditions are reported with degraded performance or high-error rates, that would affect how soon or how accurately users could manually react to these conditions.
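The two version thresholds stated in the added line (1.21 for pod events, 1.24 for the Kubelet metric) can be written as a small lookup. This is only a sketch of that statement: `supported_signals` and the signal names `"pod-events"` and `"metrics"` are hypothetical, and the function does not check whether the `CSIVolumeHealth` gate is actually enabled or whether the CSI driver implements the volume health RPCs.

```python
def supported_signals(kubelet_version):
    """Map a Kubelet version such as "1.24" or "v1.24.3" to node-side signals."""
    parts = kubelet_version.lstrip("v").split(".")[:2]
    major, minor = (int(part) for part in parts)
    signals = set()
    if (major, minor) >= (1, 21):
        signals.add("pod-events")  # events on pods, per the text: 1.21+
    if (major, minor) >= (1, 24):
        signals.add("metrics")     # Kubelet metric, per the text: 1.24+
    return signals
```

For example, `supported_signals("1.22")` yields only `{"pod-events"}`, while a 1.24 or later Kubelet yields both signals.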
@@ -899,15 +899,14 @@ previous answers based on experience in the field._
 * **Will enabling / using this feature result in any new API calls?**
 Describe them, providing:
-- API call type (e.g. PATCH pods): Only events will be reported to PVCs or Pods if this feature is enabled.
+- API call type (e.g. PATCH pods): Events will be reported to PVCs and Pods and metrics will be in Kubelet if this feature is enabled.