
KEP-3243: Update the design to mutate the label selector based on matchLabelKeys at api-server instead of the scheduler handling it #5033

Conversation

mochizuki875
Member

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 10, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 10, 2025
@mochizuki875
Member Author

/cc @sanposhiho

@sanposhiho
Member

sanposhiho commented Jan 10, 2025

/cc @alculquicondor
Do we need /lead-opt-in at #3243?

@wojtek-t (or @alculquicondor) do we need PRR review in this case? (I suppose Yes?)

@sanposhiho
Member

/retitle KEP-3243: Update the design to mutate the label selector based on matchLabelKeys at api-server instead of the scheduler handling it

@k8s-ci-robot k8s-ci-robot changed the title KEP-3243: Update description related to a labelSelector KEP-3243: Update the design to mutate the label selector based on matchLabelKeys at api-server instead of the scheduler handling it Jan 10, 2025
@alculquicondor
Member

cc @dom4ha

@sanposhiho sanposhiho left a comment
Member

Update the content based on the conclusion kubernetes/kubernetes#129480 (comment)

@sanposhiho sanposhiho left a comment
Member

You need to update other sections such as Test Plan and other PRR questions.

@sanposhiho sanposhiho left a comment
Member

Also, please add the current implementation for Alternative section and describe why we decided to move to a new approach.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 14, 2025
@mochizuki875
Member Author

I appreciate your comments.
I've addressed them.

@mochizuki875 mochizuki875 force-pushed the matchlabelkeys-to-podtopologyspread branch from 71f1c8c to c6e0a76 Compare January 14, 2025 09:04
@sanposhiho
Member

/assign @wojtek-t

for a PRR part reviewing. Please assign another person if needed.

@wojtek-t
Member

for a PRR part reviewing. Please assign another person if needed.

Queued - although I will wait for SIG approval to happen first.

@mochizuki875
Member Author

mochizuki875 commented Jan 17, 2025

@sanposhiho
Thank you for your comments.

I have a question.
I understand that kube-scheduler has two roles:

  1. Read matchLabelKeys from the cluster-level default constraints in the scheduler configuration, get the key-value labels corresponding to matchLabelKeys from the pod's labels, and logically combine them with labelSelector internally to schedule the pod.
  2. Check whether the key-value labels corresponding to matchLabelKeys are already set in labelSelector by kube-apiserver or users, and if not, logically combine them with labelSelector internally to schedule the pod.

Is my understanding correct?
I'm concerned about 2.

For example, if the following Pod manifest is applied

apiVersion: v1
kind: Pod
metadata:
  name: sample
  labels:
    app: sample
...
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector: {}
    matchLabelKeys:
    - app

kube-apiserver will usually merge key-value labels corresponding to matchLabelKeys into labelSelector, so kube-scheduler will not handle matchLabelKeys.

  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchExpressions:
+      - key: app
+        operator: In
+        values:
+        - sample
    matchLabelKeys:
    - app

It is the same if a user sets the key-value labels corresponding to matchLabelKeys in labelSelector:

apiVersion: v1
kind: Pod
metadata:
  name: sample
  labels:
    app: sample
...
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - sample
    matchLabelKeys:
    - app

However, if the key-value labels corresponding to matchLabelKeys are not merged into labelSelector by kube-apiserver for some reason, kube-scheduler will get the key-value labels from the pod's labels and logically combine them with labelSelector internally to schedule the pod. (I'm wondering if this case should be considered.)

  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector: 
      matchExpressions:
      - key: foo
        operator: In
        values:
        - bar
    matchLabelKeys:
    - app

@sanposhiho
Copy link
Member

sanposhiho commented Jan 17, 2025

As my suggested changes above imply, the scheduler only has one role, which is to handle matchLabelKeys within the cluster-wide default constraints. In other cases (like when a user creates a Pod with matchLabelKeys), we assume kube-apiserver handles it and kube-scheduler doesn't have to worry about matchLabelKeys, right?

... and you know what, while writing this comment I realized I forgot about the update-path issue that I raised myself. 😅
I'll revisit the suggestions I made above.

So, answering your questions,

However, if key-value labels corresponding to matchLabelKeys are not merged into labelSelector by kube-apiserver for some reason, kube-scheduler will get key-value labels from pod labels and logically combine them with labelSelector internally to schedule pod.(I'm wondering if this case should be considered.)

Usually, no. As my strikethrough-ed comment above argued, we can just assume kube-apiserver handles all matchLabelKeys that users added, and the scheduler only has to worry about the cluster-wide default constraints.
HOWEVER, there's an exception, which is during a cluster upgrade. We need to consider scenarios where kube-apiserver does not handle matchLabelKeys because, while a cluster upgrade is running, there could be unscheduled pods that were created through the old kube-apiserver and have to be handled by the new-version kube-scheduler. In this case, kube-scheduler needs to handle matchLabelKeys even though those matchLabelKeys came from users, not from the cluster-wide default constraints.
That being said, after one release with this new design, so basically in the next release cycle, we can change kube-scheduler to only handle matchLabelKeys from the cluster-wide default constraints.
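
(For illustration: a cluster-wide default constraint carrying matchLabelKeys, the one case kube-scheduler would keep handling, might look roughly like the sketch below. This is hypothetical; matchLabelKeys in defaultConstraints is still only a proposal, see kubernetes/kubernetes#129198.)

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: PodTopologySpread
    args:
      defaultingType: List
      defaultConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        matchLabelKeys:   # hypothetical field here: only a proposal today (kubernetes/kubernetes#129198)
        - app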

@mochizuki875 mochizuki875 force-pushed the matchlabelkeys-to-podtopologyspread branch from f9e848a to eea447d Compare January 23, 2025 07:21
@mochizuki875
Member Author

@sanposhiho
I've addressed them and added a description related to kubernetes/kubernetes#129480 (comment).

@mochizuki875 mochizuki875 force-pushed the matchlabelkeys-to-podtopologyspread branch from 02b3f71 to 820dcf8 Compare January 24, 2025 06:20
@mochizuki875 mochizuki875 force-pushed the matchlabelkeys-to-podtopologyspread branch from a38566f to 688ecc1 Compare January 24, 2025 08:37
@mochizuki875
Member Author

@sanposhiho
Thank you, I've incorporated your suggestion.

@sanposhiho sanposhiho left a comment
Member

/lgtm

/cc @dom4ha @macsko
/assign @alculquicondor

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 24, 2025
Comment on lines 324 to 325
2. Even if it happens, the topology spread on the pod is just ignored (since `labelSelector` no longer matches the pod itself)
and the pod could still be schedulable.
Member

Is it true? Even if the labelSelector no longer matches the pod itself, the topology spreading is not ignored and could still make the pod unschedulable (of course, selfMatchNum will be 0). If that's what you mean, I'd rephrase it.

Member

Wouldn't simply the label change become ignored? Does the pod itself need to be covered by the selector? I think the bigger problem is that, since we'd have both implementations (in api-server and scheduler), won't the scheduler add a new label to the labelSelector and make the pod completely unschedulable?

@sanposhiho sanposhiho Jan 25, 2025
Member

Is it true? Even if the labelSelector no longer matches the pod itself, the topology spreading is not ignored and could still make the pod unschedulable (of course, selfMatchNum will be 0). If that's what you mean, I'd rephrase it.

Yeah, sorry, it's my suggestion and you're right: "ignored" was misleading and we should rephrase; matchNum - minMatchNum could still be bigger than maxSkew, and nodes in some domains could be unschedulable, depending on the current number of matching pods there.

won't the scheduler add a new label to the labelSelector and make the pod completely unschedulable?

The scheduler adds it, but the pod won't be completely unschedulable. Rather, the label addition in this case makes the pod easier to schedule because selfMatchNum would be 0.
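
(To illustrate with made-up pod counts, using the filter-time skew check matchNum + selfMatchNum - minMatchNum <= maxSkew: suppose maxSkew: 1 on topologyKey: kubernetes.io/hostname, the merged labelSelector matches 3 existing pods on node-a and 1 on node-b, and the incoming pod no longer matches the selector itself, so selfMatchNum = 0.

  node-a: 3 + 0 - 1 = 2 > maxSkew  -> constraint violated on node-a
  node-b: 1 + 0 - 1 = 0 <= maxSkew -> constraint satisfied on node-b

The pod can still land on node-b, and since a lower selfMatchNum can only lower the computed skew, the extra selector entry never makes the pod harder to schedule.)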

Member Author

Thank you for your comments.
I've tried to rephrase this sentence.

#### the update to labels specified at `matchLabelKeys` isn't supported

`MatchLabelKeys` is handled and merged into `LabelSelector` at _a pod's creation_.
It means this feature doesn't support the label's update even though users, of course,
Member

Suggested change
It means this feature doesn't support the label's update even though users, of course,
It means this feature doesn't support the label's update even though a user
could update the label that is specified at `matchLabelKeys` after a pod's creation.

`TopologySpreadConstraint` which represent a set of label keys only. The scheduler
will use those keys to look up label values from the incoming pod; and those key-value
labels are ANDed with `LabelSelector` to identify the group of existing pods over
`TopologySpreadConstraint` which represent a set of label keys only.
Member

Suggested change
`TopologySpreadConstraint` which represent a set of label keys only.
`TopologySpreadConstraint` which represents a set of label keys only.

could update the label that is specified at `matchLabelKeys` after a pod's creation.
So, in such cases, the update of the label isn't reflected onto the merged `LabelSelector`,
even though users might expect it to be.
On the documentation, we'll declare it's not recommended using `matchLabelKeys` with labels that is frequently updated.
Member

I'd rather say that updating the labels used is not recommended at all (instead of just frequent updates), as it would break the original placement of such pods.


to `labelSelector` of `topologySpreadConstraints` to filter and group pods.
The pods belonging to the same group will be part of the spreading in
`PodTopologySpread`.
In addition, kube-scheduler handles `matchLabelKeys` if the cluster-level default constraints is configured with `matchLabelKeys`.
Member

Isn't it only a proposal? kubernetes/kubernetes#129198 (comment)

Also, we cannot implement this feature only within kube-apiserver because it'd make it
impossible to handle `MatchLabelKeys` within the cluster-level default constraints in the scheduler configuration.

So we decided to go with the current design that implements this feature within both
Member

Suggested change
So we decided to go with the current design that implements this feature within both
So we decided to go with the design that implements this feature within both

But, it may confuse users because this behavior would be different from PodAffinity's `MatchLabelKeys`.

Also, we cannot implement this feature only within kube-apiserver because it'd make it
impossible to handle `MatchLabelKeys` within the cluster-level default constraints in the scheduler configuration.
Member

Suggested change
impossible to handle `MatchLabelKeys` within the cluster-level default constraints in the scheduler configuration.
impossible to handle `MatchLabelKeys` within the cluster-level default constraints in the scheduler configuration in the future (see https://github.com/kubernetes/kubernetes/issues/129198).

labels are ANDed with `LabelSelector` to identify the group of existing pods over
`TopologySpreadConstraint` which represent a set of label keys only.
kube-apiserver will use those keys to look up label values from the incoming pod
and those key-value labels are ANDed with `LabelSelector` to identify the group of existing pods over
Member

Suggested change
and those key-value labels are ANDed with `LabelSelector` to identify the group of existing pods over
and those key-value labels will be merged with existing `LabelSelector` to identify the group of existing pods over

will use those keys to look up label values from the incoming pod; and those key-value
labels are ANDed with `LabelSelector` to identify the group of existing pods over
`TopologySpreadConstraint` which represent a set of label keys only.
kube-apiserver will use those keys to look up label values from the incoming pod
Member

Suggested change
kube-apiserver will use those keys to look up label values from the incoming pod
At a pod creation, kube-apiserver will use those keys to look up label values from the incoming pod

which the spreading skew will be calculated.
kube-scheduler will also handle it if the cluster-level default constraints have the one with `MatchLabelKeys`.
Member

Suggested change
kube-scheduler will also handle it if the cluster-level default constraints have the one with `MatchLabelKeys`.
Note that in case `MatchLabelKeys` is supported in the cluster-level default constraints (see https://github.com/kubernetes/kubernetes/issues/129198), kube-scheduler will also handle it separately.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 27, 2025
@mochizuki875
Member Author

@dom4ha @macsko
Thank you for your comments.
I've addressed them; please check.

@macsko
Member

macsko commented Jan 27, 2025

lgtm from me. Let's wait for @dom4ha

@alculquicondor alculquicondor left a comment
Member

/approve

Time is running out, so better approve now.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mochizuki875, sanposhiho

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 27, 2025
@alculquicondor
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 27, 2025
@k8s-ci-robot k8s-ci-robot merged commit db09024 into kubernetes:master Jan 27, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Jan 27, 2025
@sanposhiho
Member

@wojtek-t We haven't modified the PRR file and this was merged without your approval.
Please take a look; we can follow up accordingly if you have any suggestions.

@dom4ha
Member

dom4ha commented Jan 29, 2025

No further concerns from my side. I have limited availability this week, so thanks @macsko and @alculquicondor for the approvals.
