
Add option to skip similar nodegroup recomputation #6926

Open

rrangith wants to merge 2 commits into master from skip-similar-nodegroup-recomputation

@rrangith (Contributor) commented Jun 14, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Related to #6940

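Similar NodeGroup recomputation for the bestOption used to occur only when the bestOption NodeGroup did not exist, but #5802 changed it so that it always runs. However, an expander may deliberately modify the bestOption's similar NodeGroups, for example through custom logic behind the gRPC expander, and the recomputation then overwrites those changes.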

In such cases, there should be a CLI option to trust the expander's similar NodeGroups and skip the recomputation.

If a user does not enable this option, then by default the behaviour will stay the same. This will only skip similar nodegroup recomputation for users who enable this option.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Added `skip-similar-node-group-recomputation` flag. If enabled, skips similar NodeGroup recomputation for the best option returned by the expander during scaleup.
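A minimal sketch of how a boolean flag like this could be registered and read with Go's standard `flag` package. The flag name comes from the release note above and the `false` default reflects the PR description (behaviour unchanged unless enabled). The helper `parseSkipFlag`, the dedicated `FlagSet`, and the help text are assumptions for illustration, not the cluster-autoscaler's actual flag wiring.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// parseSkipFlag is a hypothetical helper that parses args and reports whether
// skip-similar-node-group-recomputation was enabled. It defaults to false, so
// the existing behaviour (always recompute) is kept unless the flag is set.
func parseSkipFlag(args []string) (bool, error) {
	fs := flag.NewFlagSet("cluster-autoscaler-sketch", flag.ContinueOnError)
	// Illustrative help text only; the real flag description may differ.
	skip := fs.Bool("skip-similar-node-group-recomputation", false,
		"If true, skips similar NodeGroup recomputation for the best option returned by the expander during scale-up. "+
			"Only has an effect when balance-similar-node-groups is enabled.")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *skip, nil
}

func main() {
	skip, err := parseSkipFlag(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println("skip similar node group recomputation:", skip)
}
```

Go's `flag` package accepts both `-name` and `--name`, and a bare boolean flag sets the value to `true`.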

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 14, 2024
@k8s-ci-robot (Contributor)

Hi @rrangith. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added area/cluster-autoscaler size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 14, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 12, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 10, 2024
@rrangith rrangith force-pushed the skip-similar-nodegroup-recomputation branch from 32774c7 to 450a753 Compare November 6, 2024 21:11
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 6, 2024
@rrangith rrangith force-pushed the skip-similar-nodegroup-recomputation branch from 450a753 to 35c514d Compare November 6, 2024 21:14
@rrangith (Contributor Author) commented Nov 6, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 6, 2024
@elmiko (Contributor) left a comment

there had been some discussions about trying to limit the new flags we are adding to the autoscaler, but i'm not sure if there was ever any guidance about that. regardless, i think this new flag should also be mentioned in the FAQ, see https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-the-parameters-to-ca

@rrangith rrangith force-pushed the skip-similar-nodegroup-recomputation branch from 35c514d to 2627751 Compare December 2, 2024 15:51
@rrangith rrangith requested a review from elmiko December 2, 2024 15:51
@elmiko (Contributor) commented Dec 2, 2024

just a question as i'm reviewing, is there any relationship or interaction between this flag and the balance-similar-node-groups flag? (eg does the latter need to be enabled or anything special like that)

@rrangith (Contributor Author) commented Dec 2, 2024

> just a question as i'm reviewing, is there any relationship or interaction between this flag and the balance-similar-node-groups flag? (eg does the latter need to be enabled or anything special like that)

There is no strict dependency between the 2 flags; neither requires the other to function. However, if you enable skip-similar-node-group-recomputation while balance-similar-node-groups is disabled, the new flag simply has no effect: https://github.com/rrangith/autoscaler/blob/26277519bfe0728a532076f22e4485f1e7ba4adb/cluster-autoscaler/core/scaleup/orchestrator/orchestrator.go#L722-L730

I was debating mentioning that in the FAQ or cli arg description, but wasn't sure if I should add even more words to it
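The interaction described above can be modeled as a small decision function. This is a simplified, hypothetical sketch: `similarGroupsToBalance`, the string node-group IDs, and the `recompute` callback are illustrative stand-ins rather than the orchestrator code linked above, and the sketch assumes, as stated in the comment, that with balancing disabled there are no similar node groups to act on.

```go
package main

import "fmt"

// similarGroupsToBalance is a hypothetical model of the flag interaction:
//   - balance-similar-node-groups disabled: similar groups are not used, so the
//     skip flag changes nothing and only the best option is scaled.
//   - balancing enabled, skip disabled: similar groups are recomputed (default).
//   - balancing enabled, skip enabled: the expander's list is trusted as-is.
func similarGroupsToBalance(balance, skip bool, fromExpander []string, recompute func() []string) []string {
	if !balance {
		return nil
	}
	if skip {
		return fromExpander
	}
	return recompute()
}

func main() {
	fromExpander := []string{"B"}
	recompute := func() []string { return []string{"B", "C"} }
	for _, balance := range []bool{false, true} {
		for _, skip := range []bool{false, true} {
			fmt.Printf("balance=%v skip=%v -> %v\n", balance, skip,
				similarGroupsToBalance(balance, skip, fromExpander, recompute))
		}
	}
}
```

Running all four combinations shows that the skip flag only changes the result when balancing is enabled.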

@elmiko (Contributor) commented Dec 2, 2024

> I was debating mentioning that in the FAQ or cli arg description, but wasn't sure if I should add even more words to it

i think it's worth mentioning, if only so people know they need to have balance-similar-node-groups enabled to get this behavior.

@rrangith rrangith force-pushed the skip-similar-nodegroup-recomputation branch from 2627751 to 0a2c91f Compare December 2, 2024 22:58
@elmiko (Contributor) left a comment

/lgtm

we will need a review from a core maintainer for the flag change.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 3, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 6, 2025
@rrangith rrangith force-pushed the skip-similar-nodegroup-recomputation branch from 0a2c91f to 3f9efa1 Compare January 9, 2025 16:36
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 9, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rrangith
Once this PR has been reviewed and has the lgtm label, please assign x13n for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 9, 2025
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 9, 2025
@jackfrancis (Contributor)

/test pull-cluster-autoscaler-e2e-azure-master

```diff
 	newNodes int,
 	nodeInfos map[string]*framework.NodeInfo,
 	schedulablePodGroups map[string][]estimator.PodEquivalenceGroup,
 ) ([]nodegroupset.ScaleUpInfo, errors.AutoscalerError) {
 	// Recompute similar node groups in case they need to be updated
-	similarNodeGroups := o.ComputeSimilarNodeGroups(nodeGroup, nodeInfos, schedulablePodGroups, now)
+	similarNodeGroups := bestOption.SimilarNodeGroups
```

Contributor

It does seem like we're reliably populating bestOption.SimilarNodeGroups as part of the flow before invoking this balanceScaleUps method (L148 where we invoke o.ComputeExpansionOption).

That said, do we want to do some checking here, to make sure we have a good default value if "skip similar nodegroup recomputation" is enabled?

Contributor Author

I don't think there is a good default value, since the intent of this feature is to rely fully on whatever the expander deems the similar nodegroups. For example, if the bestOption is nodegroup A with similar nodegroups B and C, but the expander removes both B and C, then CA should respect that and scale up only nodegroup A.
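The A/B/C example can be sketched as follows, assuming balance-similar-node-groups is enabled. `ExpansionOption`, `pruningExpander`, and `scaleUpTargets` are hypothetical names chosen for illustration, not the cluster-autoscaler's API, and the real orchestrator also splits node counts across groups (the `nodegroupset.ScaleUpInfo` values in the diff), which is omitted here. The key design point is that skipping has no fallback: an empty list from the expander is taken at face value.

```go
package main

import "fmt"

// ExpansionOption is a hypothetical, trimmed-down stand-in for an expansion
// option: the best node group plus the node groups considered similar to it.
type ExpansionOption struct {
	NodeGroup         string
	SimilarNodeGroups []string
}

// pruningExpander mimics a custom (e.g. gRPC-backed) expander that decides the
// best option should not be balanced with any other node group.
func pruningExpander(best ExpansionOption) ExpansionOption {
	best.SimilarNodeGroups = []string{}
	return best
}

// scaleUpTargets returns every node group the scale-up would be spread across.
// With skip enabled there is deliberately no fallback: an empty similar list
// from the expander means "scale only the best option".
func scaleUpTargets(best ExpansionOption, skip bool, recompute func(string) []string) []string {
	similar := best.SimilarNodeGroups
	if !skip {
		// Default behaviour: recompute, discarding whatever the expander chose.
		similar = recompute(best.NodeGroup)
	}
	return append([]string{best.NodeGroup}, similar...)
}

func main() {
	recompute := func(string) []string { return []string{"B", "C"} }
	best := pruningExpander(ExpansionOption{NodeGroup: "A", SimilarNodeGroups: []string{"B", "C"}})

	fmt.Println("skip=false:", scaleUpTargets(best, false, recompute))
	fmt.Println("skip=true:", scaleUpTargets(best, true, recompute))
}
```

With skipping disabled the targets are A, B and C, because recomputation restores the groups the expander dropped; with skipping enabled only A is scaled.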

Contributor

makes sense, thx

@jackfrancis (Contributor)

/lgtm

/assign @x13n @towca

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 27, 2025
Labels
area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
7 participants