
KEP-3243: Update the design to mutate the label selector based on matchLabelKeys at api-server instead of the scheduler handling it #5033

Conversation

mochizuki875
Member

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 10, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 10, 2025
@mochizuki875
Member Author

/cc @sanposhiho

@sanposhiho
Member

sanposhiho commented Jan 10, 2025

/cc @alculquicondor
Do we need /lead-opt-in at #3243?

@wojtek-t (or @alculquicondor) do we need PRR review in this case? (I suppose Yes?)

@sanposhiho
Member

/retitle KEP-3243: Update the design to mutate the label selector based on matchLabelKeys at api-server instead of the scheduler handling it

@k8s-ci-robot k8s-ci-robot changed the title KEP-3243: Update description related to a labelSelector KEP-3243: Update the design to mutate the label selector based on matchLabelKeys at api-server instead of the scheduler handling it Jan 10, 2025
@alculquicondor
Member

cc @dom4ha

@sanposhiho sanposhiho left a comment
Member

Update the content based on the conclusion kubernetes/kubernetes#129480 (comment)

@sanposhiho sanposhiho left a comment
Member

You need to update other sections such as Test Plan and other PRR questions.

@sanposhiho sanposhiho left a comment
Member

Also, please add the current implementation for Alternative section and describe why we decided to move to a new approach.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 14, 2025
@mochizuki875
Member Author

I appreciate your comments.
I've addressed them.

@mochizuki875 mochizuki875 force-pushed the matchlabelkeys-to-podtopologyspread branch from 71f1c8c to c6e0a76 Compare January 14, 2025 09:04
@sanposhiho
Member

/assign @wojtek-t

for a PRR part reviewing. Please assign another person if needed.

@wojtek-t
Member

for a PRR part reviewing. Please assign another person if needed.

Queued - although I will wait for SIG approval to happen first.

@mochizuki875
Member Author

mochizuki875 commented Jan 17, 2025

@sanposhiho
Thank you for your comments.

I have a question.
I understand that kube-scheduler has two roles:

  1. Read matchLabelKeys from the cluster-level default constraints in the scheduler configuration, get the key-value labels corresponding to matchLabelKeys from the pod's labels, and logically combine them with labelSelector internally to schedule the pod.
  2. Check whether the key-value labels corresponding to matchLabelKeys are already set in labelSelector by kube-apiserver or users, and if not, logically combine them with labelSelector internally to schedule the pod.

Is my understanding correct?
I'm concerned about 2.

For example, if the following Pod manifest is applied

apiVersion: v1
kind: Pod
metadata:
  name: sample
  labels:
    app: sample
...
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector: {}
    matchLabelKeys:
    - app

kube-apiserver will usually merge key-value labels corresponding to matchLabelKeys into labelSelector, so kube-scheduler will not handle matchLabelKeys.

  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchExpressions:
+      - key: app
+        operator: In
+        values:
+        - sample
    matchLabelKeys:
    - app

It is the same if a user sets the key-value labels corresponding to matchLabelKeys in labelSelector:

apiVersion: v1
kind: Pod
metadata:
  name: sample
  labels:
    app: sample
...
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - sample
    matchLabelKeys:
    - app

However, if the key-value labels corresponding to matchLabelKeys are not merged into labelSelector by kube-apiserver for some reason, kube-scheduler will get the key-value labels from the pod's labels and logically combine them with labelSelector internally to schedule the pod. (I'm wondering if this case should be considered.)

  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector: 
      matchExpressions:
      - key: foo
        operator: In
        values:
        - bar
    matchLabelKeys:
    - app

@sanposhiho
Copy link
Member

sanposhiho commented Jan 17, 2025

As my suggested changes above imply, the scheduler only has one role, which is to handle matchLabelKeys within the cluster-wide default constraints. In other cases (like when a user creates a Pod with matchLabelKeys), we assume kube-apiserver handles it and kube-scheduler doesn't have to worry about matchLabelKeys, right?

... and you know what, while writing this comment I realized I forgot about the update-path issue that I raised myself. 😅
I'll revisit the suggestions I made above.

So, answering your questions,

However, if key-value labels corresponding to matchLabelKeys are not merged into labelSelector by kube-apiserver for some reason, kube-scheduler will get key-value labels from pod labels and logically combine them with labelSelector internally to schedule pod.(I'm wondering if this case should be considered.)

Usually, no. As my strikethrough-ed comment above argued, we can just assume kube-apiserver handles all matchLabelKeys that users added, and the scheduler only has to worry about the cluster-wide default constraints.
HOWEVER, there's an exception, which is during a cluster upgrade. We need to consider scenarios where kube-apiserver does not handle matchLabelKeys because, while a cluster upgrade is running, there could be unscheduled pods that were created through the old kube-apiserver and have to be handled by the new-version kube-scheduler. In this case, kube-scheduler needs to handle matchLabelKeys even though those matchLabelKeys came from users, not from the cluster-wide default constraints.
That being said, after one release with this new design, so basically in the next release cycle, we can change kube-scheduler to only handle matchLabelKeys from the cluster-wide default constraints.
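
(For illustration: a cluster-wide default constraint carrying matchLabelKeys, the one case kube-scheduler would keep handling, might look roughly like the sketch below. This is hypothetical; matchLabelKeys in defaultConstraints is still only a proposal, see kubernetes/kubernetes#129198.)

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: PodTopologySpread
    args:
      defaultingType: List
      defaultConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        matchLabelKeys:   # hypothetical field here: only a proposal today (kubernetes/kubernetes#129198)
        - app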

@mochizuki875 mochizuki875 force-pushed the matchlabelkeys-to-podtopologyspread branch from f9e848a to eea447d Compare January 23, 2025 07:21
@mochizuki875
Member Author

@sanposhiho
I've addressed them and added a description related to kubernetes/kubernetes#129480 (comment).

@mochizuki875 mochizuki875 force-pushed the matchlabelkeys-to-podtopologyspread branch from 02b3f71 to 820dcf8 Compare January 24, 2025 06:20
@mochizuki875 mochizuki875 force-pushed the matchlabelkeys-to-podtopologyspread branch from a38566f to 688ecc1 Compare January 24, 2025 08:37
@mochizuki875
Member Author

@sanposhiho
Thank you, I've incorporated your suggestion.

@sanposhiho sanposhiho left a comment
Member

/lgtm

/cc @dom4ha @macsko
/assign @alculquicondor

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 24, 2025
Comment on lines 324 to 325
2. Even if it happens, the topology spread on the pod is just ignored (since `labelSelector` no longer matches the pod itself)
and the pod could still be schedulable.
Member

Is it true? Even if the labelSelector no longer matches the pod itself, the topology spreading is not ignored and could still make the pod unschedulable (of course, selfMatchNum will be 0). If that's what you mean, I'd rephrase it.

Member

Wouldn't simply the label change become ignored? Does the pod itself need to be covered by the selector? I think the bigger problem is that, since we'd have both implementations (in api-server and scheduler), won't the scheduler add a new label to the labelSelector and make the pod completely unschedulable?

@sanposhiho sanposhiho Jan 25, 2025
Member

Is it true? Even if the labelSelector no longer matches the pod itself, the topology spreading is not ignored and could still make the pod unschedulable (of course, selfMatchNum will be 0). If that's what you mean, I'd rephrase it.

Yeah, sorry, it's my suggestion and you're right: "ignored" was misleading and we should rephrase; matchNum - minMatchNum could still be bigger than maxSkew, and nodes in some domains could be unschedulable, depending on the current number of matching pods there.

won't the scheduler add a new label to the labelSelector and make the pod completely unschedulable?

The scheduler adds it, but the pod won't be completely unschedulable. Rather, the label addition in this case makes the pod easier to schedule because selfMatchNum would be 0.
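
(To illustrate with made-up pod counts, using the filter-time skew check matchNum + selfMatchNum - minMatchNum <= maxSkew: suppose maxSkew: 1 on topologyKey: kubernetes.io/hostname, the merged labelSelector matches 3 existing pods on node-a and 1 on node-b, and the incoming pod no longer matches the selector itself, so selfMatchNum = 0.

  node-a: 3 + 0 - 1 = 2 > maxSkew  -> constraint violated on node-a
  node-b: 1 + 0 - 1 = 0 <= maxSkew -> constraint satisfied on node-b

The pod can still land on node-b, and since a lower selfMatchNum can only lower the computed skew, the extra selector entry never makes the pod harder to schedule.)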

Member Author

Thank you for your comments.
I've tried to rephrase this sentence.

#### the update to labels specified at `matchLabelKeys` isn't supported

`MatchLabelKeys` is handled and merged into `LabelSelector` at _a pod's creation_.
It means this feature doesn't support the label's update even though users, of course,
Member

Suggested change
It means this feature doesn't support the label's update even though users, of course,
It means this feature doesn't support the label's update even though a user
could update the label that is specified at `matchLabelKeys` after a pod's creation.

`TopologySpreadConstraint` which represent a set of label keys only. The scheduler
will use those keys to look up label values from the incoming pod; and those key-value
labels are ANDed with `LabelSelector` to identify the group of existing pods over
`TopologySpreadConstraint` which represent a set of label keys only.
Member

Suggested change
`TopologySpreadConstraint` which represent a set of label keys only.
`TopologySpreadConstraint` which represents a set of label keys only.

could update the label that is specified at `matchLabelKeys` after a pod's creation.
So, in such cases, the update of the label isn't reflected onto the merged `LabelSelector`,
even though users might expect it to be.
On the documentation, we'll declare it's not recommended using `matchLabelKeys` with labels that is frequently updated.
Member

I'd rather say that updating the labels used is not recommended at all (instead of just frequent updates), as it would break the original placement of such pods.


to `labelSelector` of `topologySpreadConstraints` to filter and group pods.
The pods belonging to the same group will be part of the spreading in
`PodTopologySpread`.
In addition, kube-scheduler handles `matchLabelKeys` if the cluster-level default constraints is configured with `matchLabelKeys`.
Member

Isn't it only a proposal? kubernetes/kubernetes#129198 (comment)

Also, we cannot implement this feature only within kube-apiserver because it'd make it
impossible to handle `MatchLabelKeys` within the cluster-level default constraints in the scheduler configuration.

So we decided to go with the current design that implements this feature within both
Member

Suggested change
So we decided to go with the current design that implements this feature within both
So we decided to go with the design that implements this feature within both

But, it may confuse users because this behavior would be different from PodAffinity's `MatchLabelKeys`.

Also, we cannot implement this feature only within kube-apiserver because it'd make it
impossible to handle `MatchLabelKeys` within the cluster-level default constraints in the scheduler configuration.
Member

Suggested change
impossible to handle `MatchLabelKeys` within the cluster-level default constraints in the scheduler configuration.
impossible to handle `MatchLabelKeys` within the cluster-level default constraints in the scheduler configuration in the future (see https://github.com/kubernetes/kubernetes/issues/129198).

labels are ANDed with `LabelSelector` to identify the group of existing pods over
`TopologySpreadConstraint` which represent a set of label keys only.
kube-apiserver will use those keys to look up label values from the incoming pod
and those key-value labels are ANDed with `LabelSelector` to identify the group of existing pods over
Member

Suggested change
and those key-value labels are ANDed with `LabelSelector` to identify the group of existing pods over
and those key-value labels will be merged with existing `LabelSelector` to identify the group of existing pods over

will use those keys to look up label values from the incoming pod; and those key-value
labels are ANDed with `LabelSelector` to identify the group of existing pods over
`TopologySpreadConstraint` which represent a set of label keys only.
kube-apiserver will use those keys to look up label values from the incoming pod
Member

Suggested change
kube-apiserver will use those keys to look up label values from the incoming pod
At a pod creation, kube-apiserver will use those keys to look up label values from the incoming pod

which the spreading skew will be calculated.
kube-scheduler will also handle it if the cluster-level default constraints have the one with `MatchLabelKeys`.
Member

Suggested change
kube-scheduler will also handle it if the cluster-level default constraints have the one with `MatchLabelKeys`.
Note that in case `MatchLabelKeys` is supported in the cluster-level default constraints (see https://github.com/kubernetes/kubernetes/issues/129198), kube-scheduler will also handle it separately.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 27, 2025
@mochizuki875
Member Author

@dom4ha @macsko
Thank you for your comments.
I've addressed them; please check.

@macsko
Member

macsko commented Jan 27, 2025

lgtm from me. Let's wait for @dom4ha

@alculquicondor alculquicondor left a comment
Member

/approve

Time is running out, so better approve now.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mochizuki875, sanposhiho

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 27, 2025
@alculquicondor
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 27, 2025
@k8s-ci-robot k8s-ci-robot merged commit db09024 into kubernetes:master Jan 27, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Jan 27, 2025
@sanposhiho
Member

@wojtek-t We haven't modified the PRR file and this was merged without your approval.
Please take a look; we can follow up accordingly if you have any suggestions.

@dom4ha
Member

dom4ha commented Jan 29, 2025

No further concerns from my side. I have limited availability this week, so thanks @macsko and @alculquicondor for the approvals.
