Allow to prefix provisioningClassName to filter provisioning requests #7676

Open
wants to merge 1 commit into
base: master
from


macsko
Member

@macsko macsko commented Jan 8, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds the ability to set a provisioningClassName prefix; the CA then processes only provisioning requests whose provisioningClassName has a matching prefix. This makes it simple to run multiple CA instances and route specific provisioning requests to each of them, while remaining backward compatible.
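
As a rough illustration of the filtering idea, here is a minimal Go sketch. It is not the code from this PR: the function name `shouldProcess` and the `team-a-` prefix are hypothetical, and the exact matching rule (configured prefix followed by a known class name, with an empty prefix matching only unprefixed names) is an assumption made for the example. Only the two class-name strings come from the ProvisioningRequest API.

```go
package main

import (
	"fmt"
	"strings"
)

// Class names defined by the ProvisioningRequest API.
var supportedClasses = map[string]bool{
	"check-capacity.autoscaling.x-k8s.io":              true,
	"best-effort-atomic-scale-up.autoscaling.x-k8s.io": true,
}

// shouldProcess is a hypothetical helper: it reports whether a CA instance
// configured with the given prefix would handle a request with className.
// Assumed rule: className must equal prefix + a supported class name.
func shouldProcess(className, prefix string) bool {
	if !strings.HasPrefix(className, prefix) {
		return false
	}
	return supportedClasses[strings.TrimPrefix(className, prefix)]
}

func main() {
	names := []string{
		"check-capacity.autoscaling.x-k8s.io",
		"team-a-check-capacity.autoscaling.x-k8s.io",
	}
	// Compare an instance without a prefix to one with a hypothetical prefix.
	for _, prefix := range []string{"", "team-a-"} {
		for _, name := range names {
			fmt.Printf("prefix=%q class=%q process=%v\n", prefix, name, shouldProcess(name, prefix))
		}
	}
}
```

Under this assumed rule, an instance started without a prefix keeps handling the unprefixed class names exactly as before, which is one way the change could stay backward compatible.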

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Added provisioningClassPrefix option that filters ProvisioningRequests by provisioningClassName, so that a specific Cluster Autoscaler instance processes only the matching ones.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/cluster-autoscaler labels Jan 8, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: macsko
Once this PR has been reviewed and has the lgtm label, please assign aleksandra-malinowska for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 8, 2025
@macsko
Member Author

macsko commented Jan 8, 2025

/cc @aleksandra-malinowska

@macsko macsko force-pushed the allow_to_prefix_provisioning_class_name_to_filter_prs branch from 001b922 to 4fa718f Compare January 9, 2025 14:38
@macsko macsko requested a review from gabesaba January 9, 2025 14:40
Contributor

@gabesaba gabesaba left a comment


lgtm after addressing last 2 nits

cluster-autoscaler/processors/provreq/injector_test.go Outdated Show resolved Hide resolved
cluster-autoscaler/processors/provreq/injector_test.go Outdated Show resolved Hide resolved
@macsko macsko force-pushed the allow_to_prefix_provisioning_class_name_to_filter_prs branch from 4fa718f to 43513b7 Compare January 10, 2025 10:19
@macsko macsko requested a review from gabesaba January 10, 2025 10:20
Contributor

@gabesaba gabesaba left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 10, 2025
@x13n
Member

x13n commented Jan 23, 2025

Running multiple CA instances takes more than adding a prefix for provisioning requests. For instance, regular pending pods will trigger scale up in every CA and different provisioning requests may trigger scaling in the same node group. Is this a part of some broader feature you're trying to build?

@macsko
Member Author

macsko commented Jan 27, 2025

> Running multiple CA instances takes more than adding a prefix for provisioning requests. For instance, regular pending pods will trigger scale up in every CA and different provisioning requests may trigger scaling in the same node group. Is this a part of some broader feature you're trying to build?

If we want one CA that does scale-up, (basic) provreq processing, etc., and a second CA that does only check-capacity provreq processing (with a prefixed class name), then this should be enough to work correctly. The check-capacity Provision doesn't take node pools into consideration (ref, nodeInfos are not used). It bases its assumptions only on the global cluster state, that is, all the nodes and pods within the cluster. Given that, if we run the second CA without node pools configured, there won't be any scale-up/down activity in that CA, only provreq processing.

If the above assumptions are not enough, we could add yet another flag to CA that disables scale-up and scale-down activities, leaving only provreq processing.

@aleksandra-malinowska
Contributor

> Running multiple CA instances takes more than adding a prefix for provisioning requests. For instance, regular pending pods will trigger scale up in every CA and different provisioning requests may trigger scaling in the same node group. Is this a part of some broader feature you're trying to build?

It's already possible to 'shard' the cluster and run multiple CA instances by passing a different set of node groups (or node group prefixes) to each of them. In practice this works only if each workload's requirements fit node groups from a single shard; otherwise multiple instances can trigger scale-up.

For regular pods that fulfill these requirements, the only consequence of sharding would be spamming fake NotTriggerScaleUp events from instances that can't request scale-up. Not perfect, but possible to ignore in the absence of a solution.

For ProvisioningRequest, the instances that can't do anything for them won't just spam events though - they'll actually modify the ProvisioningRequest object, updating the condition. This actually needs to be fixed for any multi-CA setup to work.

@x13n
Member

x13n commented Jan 27, 2025

Sharding can work only if you can split workloads and node groups at the same time.

However, is this specific scenario intended to work only with check-capacity Provisioning Requests? That sounds safe, as there's no responsibility overlap once prefixes are configured correctly on each instance.
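
To make "no responsibility overlap" concrete, the following Go sketch checks a set of instance prefixes for class names that more than one instance would claim. It is only an illustration: `findOverlaps`, the instance names, and the prefixes are hypothetical, and it assumes each instance claims exactly the names formed by its prefix followed by a known class name, which may differ from what the PR implements.

```go
package main

import (
	"fmt"
	"sort"
)

// Class names defined by the ProvisioningRequest API.
var classes = []string{
	"check-capacity.autoscaling.x-k8s.io",
	"best-effort-atomic-scale-up.autoscaling.x-k8s.io",
}

// findOverlaps takes a map of hypothetical instance name -> configured prefix
// and returns every full class name claimed by more than one instance,
// mapped to the sorted names of the instances claiming it.
func findOverlaps(instances map[string]string) map[string][]string {
	claims := map[string][]string{}
	for name, prefix := range instances {
		for _, c := range classes {
			full := prefix + c
			claims[full] = append(claims[full], name)
		}
	}
	overlaps := map[string][]string{}
	for full, owners := range claims {
		if len(owners) > 1 {
			sort.Strings(owners)
			overlaps[full] = owners
		}
	}
	return overlaps
}

func main() {
	// Distinct prefixes: each class name has a single owner.
	distinct := map[string]string{"ca-main": "", "ca-capacity": "capacity-"}
	// Same prefix on two instances: both would update the same requests.
	shared := map[string]string{"ca-a": "team-", "ca-b": "team-"}
	fmt.Println("distinct prefixes, overlaps:", len(findOverlaps(distinct)))
	fmt.Println("shared prefix, overlaps:", len(findOverlaps(shared)))
}
```

A check like this could run at deployment time to catch two instances that would both update the conditions of the same ProvisioningRequest.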

Labels
area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
5 participants