Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-4974: initial proposal for deprecating v1.Endpoints #4975

Merged
merged 1 commit into from
Feb 11, 2025

Conversation

danwinship
Copy link
Contributor

The `v1.Endpoints` API has been essentially deprecated since
EndpointSlices became GA in 1.21. Several new Service features (such
as dual-stack and topology, not to mention "services with more than
1000 endpoints") are implemented only for EndpointSlice, not for
Endpoints. Kube-proxy no longer uses Endpoints ever, for anything, and
the Gateway API conformance tests also require implementations to use
EndpointSlices.

Despite this, kube-controller-manager still does all of the work of
managing Endpoints objects for all Services, and a cluster cannot pass
the conformance test suite unless the Endpoints and EndpointSlice
Mirroring controllers are running, even though in many cases nothing
will ever look at the output of the Endpoints controller (and the
EndpointSlice Mirroring controller will never output anything).

While Kubernetes's API guarantees make it essentially impossible to
ever actually fully remove Endpoints, we should at least move toward a
world where most users run Kubernetes with the Endpoints and
EndpointSlice Mirroring controllers disabled.

#4974

/sig network
/cc @aojea @robscott

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 22, 2024
@danwinship danwinship mentioned this pull request Nov 22, 2024
4 tasks
Copy link

@sftim sftim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some feedback

@aojea
Copy link
Member

aojea commented Nov 26, 2024

I think this makes a lot of sense, we need to find a way to move the ecosystem to EndpointSlices in a way that Endpoints will not even be required ... that I think is a good litmus test.
There still be cluster that will need endpoints, kub-apiserver aggregated-routing, kube-dns or knative project comes to mind, but I'm sure there will be a lot of projects or custom users using custom Endpoints, leader election used to use an Endpoints as lock IIRC.

One important thing is that we can not regress on quality, we can not drop just everything that uses Endpoints from e2e or we miss a lot of coverage so SIG Testing should review carefully the test plan cc: @BenTheElder @pohly

I also think this is a new challenge in Kubernetes, as v1.Endpoints is a GA API, and some of the proposals like removing them from conformance are tricky or AFAIK we never did it, I also think we have this target of 100% API endpoints coverage that will fail if we remove v1.Endpoints of comformance ... we need SIG Architecture input here @johnbelamaric @dims

Independently, definitive a topic we should work toward it, though not one of the highest priority at this point IMHO
/cc @thockin

@k8s-ci-robot k8s-ci-robot requested a review from thockin November 26, 2024 16:46
@danwinship
Copy link
Contributor Author

I also think this is a new challenge in Kubernetes, as v1.Endpoints is a GA API, and some of the proposals like removing them from conformance are tricky or AFAIK we never did it

The KEP does not propose to demote the Endpoints API from conformance; it only proposes to demote the Endpoints controller from conformance.

I also think we have this target of 100% API endpoints coverage that will fail if we remove v1.Endpoints of comformance

I think you're referring to tests like "framework.ConformanceIt("should test the lifecycle of an Endpoint"", and that would remain a conformance test under this plan, because it doesn't depend on the Endpoints controller at all, it just confirms that you can create/update/get/watch Endpoints objects as expected.

@robscott
Copy link
Member

robscott commented Nov 27, 2024

The KEP does not propose to demote the Endpoints API from conformance; it only proposes to demote the Endpoints controller from conformance.

I'm generally supportive of this direction, but it does raise the question of "what even is an API"? The API would technically still be there, but this change would effectively force any system that had been reading Endpoints managed by the Endpoints controller to transition to the EndpointSlice API which does feel pretty close to a deprecation even if the API does technically still exist.

As long as the EndpointSlice Mirroring controller is still running, people will still be able to create Endpoints and have them automatically translated to EndpointSlices, so this is primarily a question of systems that are reading from Endpoints.

I do think that we need to have an answer for what we do if/when we migrate functionality from one v1 API to another. In practice, a rather small portion of components are still relying on the Endpoints API, and an even smaller portion are relying on the Endpoints controller. Given that most of our existing e2e test coverage is focused on functionality that is now provided by the EndpointSlice controller, it's possible that a regression in the Endpoints controller could go unnoticed for a very long time.

There's also the practical matter that running the Endpoints controller is likely wasting resources on >90% of existing Kubernetes clusters. More clearly communicating that running the Endpoints controller is optional could be a good step here.

To refer to the Kubernetes deprecation policy, making the Endpoints controller optional may fall under "Deprecating a feature or behavior", which has the following guidelines:

  1. Deprecated behaviors must function for no less than 1 year after their announced deprecation
  2. The feature of behavior must not be deprecated in favor of an alternative implementation that is less stable

1 just requires lots of advance notice, and I think it's safe to say that the EndpointSlice controller is more widely tested and likely more reliable than the Endpoints controller at this point.

@BenTheElder
Copy link
Member

To refer to the Kubernetes deprecation policy, making the Endpoints controller optional may fall under "Deprecating a feature or behavior", which has the following guidelines:

Yeah, I think we need to give a lot of heads up, even if we're not using it in SIG-Network owned projects/tools anymore aside from these controller we don't know how users are still relying on the endpoints being populated.

The v1.Endpoints API has been essentially deprecated since
EndpointSlices became GA in 1.21.

But not actually documented as deprecated, right?
https://kubernetes.io/docs/reference/kubernetes-api/service-resources/endpoints-v1/

Several new Service features (such
as dual-stack and topology, not to mention "services with more than
1000 endpoints") are implemented only for EndpointSlice, not for
Endpoints. Kube-proxy no longer uses Endpoints ever, for anything, and
the Gateway API conformance tests also require implementations to use
EndpointSlices.

"Clusters with <= 1000 endpoints per service that don't use dual-stack / topology" is probably still a LOT of clusters that may be using it successfully with some other controller / tooling, even if we aren't with kube-proxy?

I don't think gateway-api conformance tests are super relevant in this context?

The KEP does not propose to demote the Endpoints API from conformance; it only proposes to demote the Endpoints controller from conformance.

Important distinction that still still seems like a big expectation change, I don't think we have a good precedent for this yet.

This seems relevant to @kubernetes/api-approvers in addition to SIG-Arch.

@danwinship
Copy link
Contributor Author

Given that most of our existing e2e test coverage is focused on functionality that is now provided by the EndpointSlice controller, it's possible that a regression in the Endpoints controller could go unnoticed for a very long time.

I had assumed this too, and that's true to the extent that lots of tests depend on the behavior of kube-proxy, and kube-proxy only looks at EndpointSlices. But there are still a surprising number of tests that look at Endpoints explicitly, and even some that look at Endpoints and not EndpointSlices (including, disturbingly, the dual-stack Service tests!)

My plan here was to ensure that everything outside of test/e2e/network/ only looks at slices, but to preserve all of the Endpoints-specific and "Endpoints match EndpointSlices" tests in test/e2e/network (with all of the Endpoints stuff eventually feature-tagged and skippable).

1 just requires lots of advance notice

Right, I should have been clearer in the KEP but I was expecting this might take a while. (Antonio mentioned deprecating the in-tree cloud providers as a comparison point.)

On that note, the KEP currently claims that demoting the Endpoints controller tests from conformance is part of "step 1", which I was thinking would sort of constitute the official announcement that we were doing this, but Antonio suggested that maybe it should come later in the process, so providers don't end up disabling the controller while there are still lots of third-party components that aren't ready for it.

But not actually documented as deprecated, right?

Yeah... I had kinda been thinking it was, though it's obvious in hindsight that it's not (I look at those API docs all the time and should have noticed this).

"Clusters with <= 1000 endpoints per service that don't use dual-stack / topology" is probably still a LOT of clusters

Yes; I was thinking more in terms of "how much third-party software is still using Endpoints rather than EndpointSlices?". Even if most individual users of FooBarProxy™ have small single-stack clusters, the fact that some don't should mean that the authors would have been under pressure to port away from Endpoints for a while now.

I don't think gateway-api conformance tests are super relevant in this context?

I was mentioning it because it means we don't have to worry that there are Gateway implementations that are based on Endpoints rather than EndpointSlice.

@sftim
Copy link

sftim commented Nov 27, 2024

Nothing in our stability guarantees stops us from, eg, making and promoting a ValidatingAdmissionPolicy that blocks Endpoints creation.

@BenTheElder
Copy link
Member

On that note, the KEP currently claims that demoting the Endpoints controller tests from conformance is part of "step 1", which I was thinking would sort of constitute the official announcement that we were doing this, but Antonio suggested that maybe it should come later in the process, so providers don't end up disabling the controller while there are still lots of third-party components that aren't ready for it.

+1, I think it should look closer to cloud provider removal:

  1. Formally deprecated. Release note highlight, blog, etc.
  2. Wait
  3. Disable controllers + drop from conformance. New release note highlight, blog, etc.
  4. Wait
  5. Remove controllers?

Copy link
Member

@thockin thockin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM idea

because a controller _isn't_ enabled seems slightly magical? But
perhaps there is precedent somewhere else?)

Another possibility would be to create an `endpoints-cleanup`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer this, but it feels sort of permanent

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 but I would go even a step further - this controller is automatically enabled if endpoints controller is disabled (and vice versa).

I agree it's permanent, but I think it's ok.

@thockin thockin self-assigned this Jan 18, 2025
Comment on lines +122 to +135
- Update the conformance test suite to not require the Endpoints and
EndpointSlice Mirroring controllers to be running, by rewriting some
tests and demoting others from conformance.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conformance and e2e should be behavior driven so this is ok if the behavior is covered ... IIRC most of the tests use the endpoints/endpoint slices objects to synchronize the test and avoid races and flakes because of that, so a conformance test that uses these helpers should be ok to use just EndpointSlices

Comment on lines +126 to +139
- Explicitly document that disabling `endpoints-controller` and/or
`endpointslice-mirroring-controller` via kube-controller-manager's
`--controllers` flag is a supported and conforming configuration.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving towards being able to disable EndpointsController by default and clean up the endpoint-controller-maintained Endpoints objects seems plausible.

I don't think I agree with taking the EndpointSlice Mirroring controller out of conformance... user-created/maintained Endpoints objects have to keep getting turned into EndpointSlice objects automatically ~forever or we'll break v1 API behavior guarantees, right?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What guarantee did we make?

In my mind, Endpoints might go the way of ComponentStatus: still part of the OpenAPI, but it eventually stops working how you expect, per https://kubernetes.io/docs/reference/using-api/deprecation-policy/#deprecating-a-feature-or-behavior

Related to this, we should make the docs clearer about that guarantee. I'm not sure I can tell what we're breaking here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What guarantee did we make?

writing IPs to Endpoints API objects programs the network

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

writing IPs to Endpoints API objects programs the network

I'm happy to accept that we've made this, but I don't think we've a document that shows how Endpoints is a commitment but ComponentStatus is / was not.

Something for another KEP perhaps?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/kubernetes/kubernetes/blob/78ad9a7b974660e5c5a1a4b4f800a9a19b4a1ddd/test/conformance/testdata/conformance.yaml#L1685-L1694

yeah, it seems is going to be complicated to justify the mirroring controller out of conformance

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do need to make sure everything in tree is using endpoint slices and not endpoints to be able to do that, I can't remember if things like aggregated service routing or kube-apiserver service proxying switched to use endpoint slices or not

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@aojea aojea Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danwinship for reference, list of places that need migration to EndpointSlices

EDIT with the exact list

https://github.com/kubernetes/kubernetes/blob/a716ea756d87f60900dbbb500fc27ae30f7bd384/staging/src/k8s.io/kube-aggregator/pkg/apiserver/resolvers.go#L31-L38

// NewEndpointServiceResolver returns a ServiceResolver that chooses one of the
// service's endpoints.
func NewEndpointServiceResolver(services listersv1.ServiceLister, endpoints listersv1.EndpointsLister) ServiceResolver {
	return &aggregatorEndpointRouting{
		services:  services,
		endpoints: endpoints,
	}
}

https://github.com/kubernetes/kubernetes/blob/a716ea756d87f60900dbbb500fc27ae30f7bd384/staging/src/k8s.io/apiserver/pkg/util/proxy/proxy.go#L54-L59

// ResolveEndpoint returns a URL to which one can send traffic for the specified service.
func ResolveEndpoint(services listersv1.ServiceLister, endpoints listersv1.EndpointsLister, namespace, id string, port int32) (*url.URL, error) {

https://github.com/kubernetes/kubernetes/blob/a716ea756d87f60900dbbb500fc27ae30f7bd384/staging/src/k8s.io/kube-aggregator/pkg/controllers/status/remote/remote_available_controller.go#L242-L249

		endpoints, err := c.endpointsLister.Endpoints(apiService.Spec.Service.Namespace).Get(apiService.Spec.Service.Name)
		if apierrors.IsNotFound(err) {
			availableCondition.Status = apiregistrationv1.ConditionFalse
			availableCondition.Reason = "EndpointsNotFound"
			availableCondition.Message = fmt.Sprintf("cannot find endpoints for service/%s in %q", apiService.Spec.Service.Name, apiService.Spec.Service.Namespace)
			apiregistrationv1apihelper.SetAPIServiceCondition(apiService, availableCondition)
			_, err := c.updateAPIServiceStatus(originalAPIService, apiService)
			return err

EDITED

there are 3 places that resolve services using Endpoints: remote_available_controller, apiserver proxy and aggregator routing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ grep -r ResolveEndpoint staging/

Note that staging/src/k8s.io/kube-aggregator/pkg/apiserver/resolvers.go is the only call to proxy.ResolveEndpoint; all of the other hits are unrelated methods.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edited my previous comment with more accurate information

```
<<[UNRESOLVED disable-by-default ]>>

- MAYBE eventually move `endpoints-controller` and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with moving towards being able to turn off Endpoints controller by default. I don't see how we'd ever turn off EndpointsMirroring controller by default without it being a hard break of behavior tied to a v1 API.

@danwinship danwinship changed the title KEP-4794: initial proposal for deprecating v1.Endpoints KEP-4974: initial proposal for deprecating v1.Endpoints Jan 20, 2025
@sftim
Copy link

sftim commented Jan 20, 2025

Have we ever made a commitment such as: “any feature that is part of conformance will remain part of conformance for that major version”?

(maybe we should; we can still deprecate Endpoints even if we do commit to that)

@kannon92
Copy link
Contributor

kannon92 commented Feb 7, 2025

PRR shadow:

Minor feedback but overall LGTM!

@kannon92
Copy link
Contributor

kannon92 commented Feb 7, 2025

/assign @soltysh
for PRR

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 8, 2025
@thockin
Copy link
Member

thockin commented Feb 9, 2025

Thanks!

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 9, 2025
@thockin
Copy link
Member

thockin commented Feb 10, 2025

FTR I am not PRR

@wojtek-t wojtek-t assigned wojtek-t and unassigned soltysh Feb 10, 2025
Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just added few small comments from PRR, but I'm overall aligned with the direction. Once applied, this lgtm.

because a controller _isn't_ enabled seems slightly magical? But
perhaps there is precedent somewhere else?)

Another possibility would be to create an `endpoints-cleanup`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 but I would go even a step further - this controller is automatically enabled if endpoints controller is disabled (and vice versa).

I agree it's permanent, but I think it's ok.

make use of Endpoints. They will need to be updated to use
EndpointSlices (and, perhaps, to fall back to Endpoints for a while,
in case any users were manually creating Endpoints and not mirroring
them?).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would endpoint-slice-mirror controller mirror them in such case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, yes. FYI, there's already a PR for this (kubernetes/kubernetes#129837) and the release note is currently:

When proxying to an aggregated API server, kube-apiserver now uses the
EndpointSlices of the `service` indicated by the `APIServer`, rather than
using Endpoints.

If you are using the aggregated API server feature, and you are writing out
the endpoints for it by hand (rather than letting kube-controller-manager
generate Endpoints and EndpointSlices for it automatically based on the
Service definition), then you should write out an EndpointSlice object rather
than (or in addition to) an Endpoints object.


###### How can this feature be enabled / disabled in a live cluster?

The KEP does not define a new feature, it just proposes that we
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we have a feature-gate for the changes we will do to stop using Endpoint in aggregation etc.?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was considering that a bugfix. The only way the behavior change could actually break someone is if they were writing out the Endpoints by hand rather than having them be generated from a Service and they were explicitly setting the skip-mirroring annotation for some reason...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I don't feel strongly enough about it to fight for FG here.

But can you please extend the answer to this question to reflect your above comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


###### What happens if we reenable the feature if it was previously rolled back?

N/A
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker now, but I think that before Stage 3, we actually should test endpoints-controller disablement -> enablement -> disablement (again).

Can you please mention that such test will be run before stage 3 (I don't want to block you by bothering you about full description now).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Is there some particular failure mode you're worried about? (Is this more about the stale Endpoints cleanup than about the Endpoints controller itself?)

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2025
@wojtek-t
Copy link
Member

/lgtm
/approve PRR

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 11, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, thockin, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 11, 2025
@k8s-ci-robot k8s-ci-robot merged commit 477ce42 into kubernetes:master Feb 11, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Feb 11, 2025
@danwinship danwinship deleted the deprecate-endpoints branch February 11, 2025 13:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/network Categorizes an issue or PR as relevant to SIG Network. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.