-
Notifications
You must be signed in to change notification settings - Fork 479
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature: OpenShift Virtualization Higher Density #1679
base: master
Are you sure you want to change the base?
Conversation
Hi @iholder101. Thanks for your PR. I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
||
#### Timeline | ||
|
||
* GA higher workload density in OpenShift Virtualization in 2024 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this timeline still accurate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I believe it is.
Please keep me honest @stu-gott.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@haircommander Hi, I've checked the Jira planning, we are on track, so yes this is indeed accurate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@haircommander The timeline bullet GA higher workload density in OpenShift Virtualization in 2024
relates to the phase 1 only. Maybe we should add it in brackets
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @haircommander!
@enp0s3 and I reworked the PR. Can you please have another look?
/ok-to-test |
A feature describing CNV's path to higher density based on - phase 1: wasp - phase 2: kube swap Signed-off-by: Itamar Holder <[email protected]>
Signed-off-by: Itamar Holder <[email protected]>
Signed-off-by: Itamar Holder <[email protected]>
Signed-off-by: Itamar Holder <[email protected]>
Signed-off-by: Itamar Holder <[email protected]>
Signed-off-by: Itamar Holder <[email protected]>
Signed-off-by: Itamar Holder <[email protected]>
Signed-off-by: Itamar Holder <[email protected]>
Signed-off-by: Itamar Holder <[email protected]>
Signed-off-by: Itamar Holder <[email protected]>
Signed-off-by: Itamar Holder <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this proposal for a much-requested feature. I think it is high time to have it merged.
## Summary | ||
|
||
Fit more workloads onto a given node - achieve a higher workload | ||
density - by overcommitting it's memory resources. Due to timeline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
density - by overcommitting it's memory resources. Due to timeline | |
density - by overcommitting its memory resources. Due to timeline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
## Motivation | ||
|
||
Today, OpenShift Virtualization is reserving memory (`requests.memory`) | ||
according to the needs of the virtual machine and it's infrastructure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
according to the needs of the virtual machine and it's infrastructure | |
according to the needs of the virtual machine and its infrastructure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
given node leads to the observation that _on average_ there is no memory | ||
ressure and often a rather low memory utilization - despite the fact that | ||
much memory has been reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given node leads to the observation that _on average_ there is no memory | |
ressure and often a rather low memory utilization - despite the fact that | |
much memory has been reserved. | |
given node leads to the observation that _on average_ much of the reserved memory is not utilized. |
I think it is not precises to say there is no pressure. There is. But we can reduce it, because the memory causing the pressure is not used and can be swapped out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've re-phrased the section
|
||
### Non-Goals | ||
|
||
* Complete life-cycling of the WASP Agent. We are not intending to write |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the first time WASP agent is mentioned. Please add a URL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
* Complete life-cycling of the WASP Agent. We are not intending to write | ||
an Operator for memory over commit for two reasons: | ||
* [Kubernetes SWAP] is close, writing a fully fledged operator seems | ||
to be no good use of resources |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be no good use of resources | |
to be no good use of developer resources |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
## Test Plan | ||
|
||
Add e2e tests for the WASP agent repository for regression testing against | ||
OpenShift. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we should include here a bit more details about how we are (already) testing it. Most importantly: configure 200% over-commitment, fill up the cluster with dormant VMs and verify that the cluster is responsive and survives upgrade.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Consider the following in developing an upgrade/downgrade strategy for this | ||
enhancement: | ||
- What changes (in invocations, configurations, API use, etc.) is an existing | ||
cluster required to make on upgrade in order to keep previous behavior? | ||
- What changes (in invocations, configurations, API use, etc.) is an existing | ||
cluster required to make on upgrade in order to make use of the enhancement? | ||
|
||
Upgrade expectations: | ||
- Each component should remain available for user requests and | ||
workloads during upgrades. Ensure the components leverage best practices in handling [voluntary | ||
disruption](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). Any exception to | ||
this should be identified and discussed here. | ||
- Micro version upgrades - users should be able to skip forward versions within a | ||
minor release stream without being required to pass through intermediate | ||
versions - i.e. `x.y.N->x.y.N+2` should work without requiring `x.y.N->x.y.N+1` | ||
as an intermediate step. | ||
- Minor version upgrades - you only need to support `x.N->x.N+1` upgrade | ||
steps. So, for example, it is acceptable to require a user running 4.3 to | ||
upgrade to 4.5 with a `4.3->4.4` step followed by a `4.4->4.5` step. | ||
- While an upgrade is in progress, new component versions should | ||
continue to operate correctly in concert with older component | ||
versions (aka "version skew"). For example, if a node is down, and | ||
an operator is rolling out a daemonset, the old and new daemonset | ||
pods must continue to work correctly even while the cluster remains | ||
in this partially upgraded state for some time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems like generic content. should we not replace it with something specific, or drop it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
How will the component handle version skew with other components? | ||
What are the guarantees? Make sure this is in the test plan. | ||
|
||
Consider the following in developing a version skew strategy for this | ||
enhancement: | ||
- During an upgrade, we will always have skew among components, how will this impact your work? | ||
- Does this enhancement involve coordinating behavior in the control plane and | ||
in the kubelet? How does an n-2 kubelet without this feature available behave | ||
when this feature is used? | ||
- Will any other components on the node change? For example, changes to CSI, CRI | ||
or CNI may require updating that component before the kubelet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WASP from CNV-X.Y must work with OCP-X.Y as well as OCP-X.(Y+1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dankenigsberg Added.
|
||
## Operational Aspects of API Extensions | ||
|
||
None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this is a place to discuss the fact that all workers have to have to same memory size and the same disk topology, that deploying and upgrading WASP is a manual step
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Examples: | ||
- The mutating admission webhook "xyz" has FailPolicy=Ignore and hence | ||
will not block the creation or updates on objects when it fails. When the | ||
webhook comes back online, there is a controller reconciling all objects, applying | ||
labels that were not applied during admission webhook downtime. | ||
- Namespaces deletion will not delete all objects in etcd, leading to zombie | ||
objects when another namespace with the same name is created. | ||
|
||
TBD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let us replace this generic content.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
26e4e09
to
96db93a
Compare
|--------------------------------------------------------------------|--------------|-----------------|----------------------| | ||
| Phase 1 - Out-Of-Tree SWAP with WASP | 2024 | Tech-Preview | Beta | | ||
| Phase 2 - Transition to Kubernetes SWAP. Limited to CNV-only users | mid-end 2025 | GA | GA | | ||
| Phase 3 - Kubernetes SWAP for all Openshift users | TBD | GA | GA | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Phase 3 - Kubernetes SWAP for all Openshift users | TBD | GA | GA | | |
| Phase 3 - Kubernetes SWAP for all OpenShift users | TBD | GA | GA | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
* **General Availability** | ||
* Limited to burstable QoS class pods. | ||
* Uses `LimitedSwap`. | ||
* Limited to non-high-priority pods. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Starts containers with
memory.swap.max=max
as in TechPreivew - Sets
memory.swap.max
to according to the container request, For more info, refer to the upstream documentation on how to calculate
limited swap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
For more info, refer to the upstream documentation on how to calculate | ||
[limited swap](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2400-node-swap#steps-to-calculate-swap-limit). | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would benefit from seeing a diagram (or a link to a diagram) showing the execution order of the CRI, OCI hook and Wasp update.
I would also state clearly that we have a race condition here: container binary may allocate memory before WASP sets a limited value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dankenigsberg Added both.
421cb05
to
5ee0ec2
Compare
#### Hypershift / Hosted Control Planes | ||
|
||
The `MachineConfig` based swap provisioning will not work, as HCP does | ||
not provide the `MachineConfig` APIs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe there's a way for HCP to deploy rhcos with needed changes, but you can leave this as it is for now.
#### Single-node Deployments or MicroShift | ||
|
||
Single-node and MicroShift deployments are out of scope of this | ||
proposal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a word about why: we don't support swapping on control-plane node. I'd also be explicit about not supporting Compact cluster for this reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
59ce661
to
ea40818
Compare
@haircommander @mrunalp Hi, can you please have another review? |
### Non-Goals | ||
|
||
* Complete life-cycling of the [WASP](https://github.com/OpenShift-virtualization/wasp-agent) Agent. We are not intending to write | ||
an Operator for memory over commit for two reasons: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this still true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
* **Phase 1** - OpenShift Virtualization will provide an out-of-tree | ||
solution (WASP) to enable higher workload density and swap-based eviction mechanisms. | ||
* **Phase 2** - WASP will include swap-based eviction mechanisms. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you describe these have been done (we're in phase 2 right?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
In this phase, WASP deprecation will start. | ||
* **Phase 4** - OpenShift will GA SWAP for every user, even if OpenShift Virtualization | ||
is not installed on the cluster. In this phase WASP will be removed, machine-level configuration | ||
will be managed by Machine Config Operator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Swap Operator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
|
||
### API Extensions | ||
|
||
#### Phase 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this phase glosses over some other MC pieces that are needed (to actually provision swap). that's part of the motivation of doing SwapOperator: to couple the actual swap provisioning with pointing the kubelet to use swap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@haircommander Good point! Let me then suggest the following - since in phase 3 kube swap will be GAed, but wasp-agent will still be there (deprecated but not removed), the user actually doesn't need to move to kube swap right away. This transition can happen in phase 4 by the swap operator (and its even better, less error prone).
Therefore transition to phase 3 can be no-op from user perspective.
More info about kubelet swap API can be found [here](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md#api-changes) | ||
|
||
#### Phase 4 | ||
In previous phases the machine-level confgiuration was deployed manually using machine configs. The transition to phase 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spellcheck: configuration*
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
#### Phase 4 | ||
In previous phases the machine-level confgiuration was deployed manually using machine configs. The transition to phase 4 | ||
will require re-deployment of the configuration by the operator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"the operator" this is not very clear. I think we should spell out what will be taken over by the swap operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there are open questions, call them out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added content to the open question section regarding the swap operator
### Topology Considerations | ||
|
||
#### Hosted Control Planes: Hosting cluster with OCP Virt | ||
No special consideration required since it's identical to the regular cluster case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're not able to use MCs in HCP AFAIU. it's also not clear once we get to SwapOperator world how exactly that'd work. Probably need to extend the HostedCluster https://github.com/openshift/hypershift/blob/d03bec5285b9dcca819686037a75027e514c6d64/api/hypershift/v1beta1/hostedcluster_types.go#L1617
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I see nevermind, I got the cases wrong
|
||
#### Hosted Control Planes: Hosted cluster with OCP Virt | ||
Since nested virt is not supported, the only topology that is supported is when the data plane resides on a bare metal. | ||
In that case the abovementioned `KubeletCofnig` and `MachineConfig` should be injected into the `NodePool` CR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KubeletConfig*
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
Dealing with memory pressure on a node is differentiating the TP fom GA. | ||
|
||
** **Technology Preview** - `memory.high` is set on the `kubepods.slice` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
has this been implemented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was. @enp0s3 @iholder101 is it still?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fabiand @haircommander This has been implemented in OCP Virt 4.16 inside the MachineConfig, but since 4.17 (GA) we've removed it in favor to swap-based evictions.
|
||
| | TP | GA | | ||
|------------------------------|---------------------|---------------------| | ||
| SWAP Provisioning | MachineConfig | MachineConfig | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GA: operator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In CNV higher density with wasp is already GA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is worth speaking about what can be contributed to openshift in order to take openshift close to GA, and let CNV benefit from it at the same time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One clear advantage of having the swap operator contributed directly to Openshift rather than CNV->Openshift is less API transitions for the end user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fabiand Perhaps this table is redundant since we have the phases table.
|
||
| Risk | Mitigation | | ||
|--------------------------------------------|----------------------------------------------------------------------------------------------------------| | ||
| Miss details and introduce instability | * Adjust overcommit ratio <br/> * Tweak eviction thresholds <br/> * Use de-scheduler to balance the load | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is anything needed in the descheduler to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added link.
Signed-off-by: Igor Bezukh <[email protected]>
@iholder101: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
A feature describing OpenShift Virtualization's path to higher density based on:
This is a replacement for #1630.