Commit 3b845e8

KEP-3545: graduate to GA
Signed-off-by: pprokop <[email protected]>
1 parent e85182b commit 3b845e8

2 files changed: +46 -15 lines changed

keps/sig-node/3545-improved-multi-numa-alignment/README.md

Lines changed: 43 additions & 12 deletions
@@ -49,6 +49,9 @@ tags, and then generate with `hack/update-toc.sh`.
 - [Scalability](#scalability)
 - [Troubleshooting](#troubleshooting)
 - [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
 <!-- /toc -->

 ## Release Signoff Checklist
@@ -74,10 +77,10 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [x] (R) Design details are appropriately documented
 - [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
 - [ ] e2e Tests for all Beta API Operations (endpoints)
-- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
 - [x] (R) Graduation criteria is in place
-- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [x] (R) Production readiness review completed
 - [ ] (R) Production readiness review approved
 - [x] "Implementation History" section is up-to-date for milestone
@@ -124,7 +127,7 @@ This limitation surfaces in multi-socket, as well as single-socket multi NUMA sy


 ### Proposed Change
-
+
 We propose to
 - add a new flag in Kubelet called `TopologyManagerPolicyOptions` in the kubelet config or command line argument called `topology-manager-policy-options` which allows the user to specify the Topology Manager policy option.
 - add a new topology manager option called `prefer-closest-numa-nodes`; if present, this option will enable further refinements of the existing `restricted` and `best-effort` policies, this option has no effect for `none` and `single-numa-node` policies.
@@ -141,7 +144,7 @@ When `prefer-closest-numa-nodes` policy is enabled, we need to retrieve informat
 Right now Topology manager discovers Node layout using [CAdvisor API](https://github.com/google/cadvisor/blob/master/info/v1/machine.go#L40).

 We will need to extend the `MachineInfo` struct with a `Distances` field which will describe the distance between a given NUMA node and other NUMA nodes in the system.
-This is already implemented in `cadvisor` by this [patch](https://github.com/google/cadvisor/pull/3179) but it is not yet present in any of the released versions.
+This is already implemented in `cadvisor` by this [patch](https://github.com/google/cadvisor/pull/3179) but it is not yet present in any of the released versions.
 Until a new release of `cadvisor` includes this patch (and it gets vendored into the `kubelet`) we will need to replicate this logic in the `kubelet` code itself.

 ### Implementation strategy
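
Note on the hunk above: the per-node distance information it describes is, on Linux, commonly exposed as one whitespace-separated line per NUMA node under `/sys/devices/system/node/node<N>/distance`. The following is a minimal sketch of parsing one such line, offered only as an illustration. It is not the code from the referenced `cadvisor` patch or from the kubelet, and `parseDistances` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseDistances converts one distance line (for example "10 21") into a
// slice of distances, one entry per NUMA node in the system.
// Hypothetical helper for illustration; not taken from cadvisor or the kubelet.
func parseDistances(line string) ([]uint64, error) {
	fields := strings.Fields(line)
	if len(fields) == 0 {
		return nil, fmt.Errorf("empty distance line")
	}
	distances := make([]uint64, 0, len(fields))
	for _, f := range fields {
		d, err := strconv.ParseUint(f, 10, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid distance %q: %w", f, err)
		}
		distances = append(distances, d)
	}
	return distances, nil
}

func main() {
	// Sample input in the shape of a two-node system's distance line.
	d, err := parseDistances("10 21\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(d)
}
```

Collecting one parsed row per node would yield a matrix in the shape of the `NUMADistance` type introduced later in this diff.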
@@ -155,7 +158,7 @@ Until a new release of `cadvisor` includes this patch (and it gets vendored into
 - When `TopologyManager` is being created it discovers distances between NUMA nodes and stores them inside `manager` struct. This is temporary until `distance` information lands in `cadvisor`.
 - Pass `TopologyManagerPolicyOptions` to best-effort and restricted policy. When this is specified best-hint is picked based on average distance between NUMA nodes. This would require modification to `compareHints` function to change how the best hint is calculated:

-```go
+```go

 // NUMADistance is a matrix representing distances between NUMA nodes
 type NUMADistance [][]uint64
@@ -178,16 +181,16 @@ func compareHints(bestNonPreferredAffinityCount int, current *TopologyHint, cand
     if current.Preferred && candidate.Preferred {
         if candidate.NUMANodeAffinity.IsNarrowerThan(current.NUMANodeAffinity) {
             return candidate
-        }
+        }
         if policyOpts.PreferClosestNuma && candidate.NUMANodeAffinity.IsEqual(current.NUMANodeAffinity) {
             candidateDistance := policyOpts.Distances.CalculateAvgDistanceFor(candidate)
             currentDistance := policyOpts.Distances.CalculateAvgDistanceFor(current)
             // candidate avg distance is lower
             if candidateDistance < currentDistance {
                 return candidate
-            }
+            }

-            return current
+            return current
         }
     }

@@ -289,7 +292,7 @@ These cases will be added in the existing e2e tests:

 In 1.26 we are releasing this feature to Alpha. We propose the following management of TopologyManager policy options graduation:

-- `TopologyManagerPolicyOptions` for enabling/disabling the entire feature. As this is an alpha feature, this feature gate would be disabled by default.
+- `TopologyManagerPolicyOptions` for enabling/disabling the entire feature. As this is an alpha feature, this feature gate would be disabled by default.
 Explicitly enabling `TopologyManagerPolicyOptions` feature gate provides us the ability to supply `TopologyManagerPolicyOptions` or `topology-manager-policy-options` flag in Kubelet.

 - `TopologyManagerPolicyAlphaOptions` is not enabled by default. Topology Manager alpha options (only one as of 1.26), are hidden by default
@@ -309,6 +312,10 @@ In 1.28 this feature is being promoted to Beta. We propose following changes to
 - `TopologyManagerPolicyBetaOptions` feature flag for enabling/disabling beta options will be enabled by default.
 - `prefer-closest-numa-nodes` will be moved to Beta options.

+In 1.32 this feature is being promoted to GA. We propose the following changes to TopologyManager policy options default visibility:
+
+- `prefer-closest-numa-nodes` will be moved to stable options.
+
 The graduation Criteria of options is described below:

 #### Graduation of Options to `Beta-quality` (non-hidden)
@@ -342,7 +349,7 @@ No changes needed
 - Components depending on the feature gate: kubelet
 - [x] Change the kubelet configuration to set a `TopologyManager` policy to `restricted` or `best-effort` and a `TopologyManagerPolicyOptions` to `prefer-closest-numa-nodes`
 - Will enabling / disabling the feature require downtime of the control
-plane?
+plane?
 No.
 - Will enabling / disabling the feature require downtime or reprovisioning
 of a node?
@@ -455,11 +462,11 @@ N/A.

 There are 2 scenarios where Kubelet may fail to start due to using this feature:

-- Bad policy option name or using policy option without enabling appropriate feature flag. we are emitting appropriate error message for this case,
+- Bad policy option name or using policy option without enabling appropriate feature flag. We are emitting an appropriate error message for this case,
 Kubelet will fail to start and print error message what happened. To recover one just have to provide fix policy option name or disable/enable feature flags.

 - Cadvisor is not exposing distances for NUMA domains. In this case Kubelet will fail with `error getting NUMA distances from cadvisor` message.
-Reading NUMA distances is only performed when `prefer-clostest-numa-nodes` option is specified.
+Reading NUMA distances is only performed when `prefer-closest-numa-nodes` option is specified.
 To recover one has to either disable `TopologyManagerPolicyOptions` feature-flag or stop using `prefer-closest-numa-nodes` option.

 ###### What steps should be taken if SLOs are not being met to determine the problem?
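
Note on the hunk above: its first failure scenario (an unknown option name, or an option used without the required feature gate) can be pictured with the sketch below. This is an illustration under stated assumptions, not the kubelet's validation code: `checkPolicyOption` and `requiredGate` are hypothetical names, the error strings are invented, and the table treats `prefer-closest-numa-nodes` as stable, as this commit proposes.

```go
package main

import "fmt"

// requiredGate maps each known option to the extra feature gate its maturity
// level needs; an empty string means no extra gate (a stable option).
// Hypothetical table for illustration only.
var requiredGate = map[string]string{
	"prefer-closest-numa-nodes": "",
}

// checkPolicyOption rejects unknown option names and options whose gates are
// not enabled. Hypothetical helper; messages are not the kubelet's.
func checkPolicyOption(name string, gates map[string]bool) error {
	if !gates["TopologyManagerPolicyOptions"] {
		return fmt.Errorf("policy options require the TopologyManagerPolicyOptions feature gate")
	}
	gate, known := requiredGate[name]
	if !known {
		return fmt.Errorf("unknown Topology Manager policy option %q", name)
	}
	if gate != "" && !gates[gate] {
		return fmt.Errorf("option %q requires the %s feature gate", name, gate)
	}
	return nil
}

func main() {
	gates := map[string]bool{"TopologyManagerPolicyOptions": true}
	// A misspelled option name is reported as unknown.
	fmt.Println(checkPolicyOption("prefer-clostest-numa-nodes", gates))
	// The correctly spelled option passes.
	fmt.Println(checkPolicyOption("prefer-closest-numa-nodes", gates))
}
```

In this sketch a misspelled option name fails fast with a descriptive error, which is the recovery path the text describes: correct the option name or adjust the feature gates.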
@@ -470,3 +477,27 @@ N/A.

 - 2021-09-26: KEP created
 - 2023-06-12: KEP updated for Beta release
+- 2024-09-30: KEP updated for Stable release
+
+
+## Drawbacks
+
+<!--
+Why should this KEP _not_ be implemented?
+-->
+
+## Alternatives
+
+<!--
+What other approaches did you consider, and why did you rule them out? These do
+not need to be as detailed as the proposal, but should include enough
+information to express the idea and why it was not acceptable.
+-->
+
+## Infrastructure Needed (Optional)
+
+<!--
+Use this section if you need things from the project/SIG. Examples include a
+new subproject, repos requested, or GitHub details. Listing these here allows a
+SIG to get the process for these resources started right away.
+-->

keps/sig-node/3545-improved-multi-numa-alignment/kep.yaml

Lines changed: 3 additions & 3 deletions
@@ -14,18 +14,18 @@ see-also: []
 replaces: []

 # The target maturity stage in the current dev cycle for this KEP.
-stage: beta
+stage: stable

 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.28"
+latest-milestone: "v1.32"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.26"
   beta: "v1.28"
-  stable: "v1.30"
+  stable: "v1.32"

 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
