Use LinearCache to optimize StreamEndpoint discovery. #6906


Merged: 1 commit into projectcontour:main on May 13, 2025

Conversation

tsaarni
Member

@tsaarni tsaarni commented Feb 19, 2025

This change attempts to improve performance in clusters with a large number of endpoints, as discussed in #6743 (comment).

Envoy does not send a wildcard DiscoveryRequest (a request without a resource name) for EDS / ClusterLoadAssignment resources. Instead, it creates a separate EDS stream for each CDS entry and requests that specific resource by name. For example, with 10000 upstream clusters, each Envoy instance sends 10000 DiscoveryRequests, one per ClusterLoadAssignment, e.g. from echoserver-0000 to echoserver-9999.

As a result, the SotW-style update, where the full set of resources is sent on every update, is not applicable. If echoserver-0000 changes, updates should not be sent to the streams watching echoserver-0001 - echoserver-9999; the SotW update should consist of a single update, for echoserver-0000 only. Effectively, EDS behaves like an incremental update mechanism, since each endpoint has its own stream / watch.

Using SnapshotCache is problematic in this scenario because it broadcasts a DiscoveryResponse to all EDS streams whenever any endpoint changes. Even if only echoserver-0000 is updated, SnapshotCache will send 10000 updates, from echoserver-0000 to echoserver-9999, to each Envoy instance. For each of these updates Contour sends a DiscoveryResponse, and Envoy immediately sends a new DiscoveryRequest back to Contour to watch for further updates. Since these messages are relatively heavy-weight, this creates unnecessary overhead compared to a typical SotW update.

This PR replaces SnapshotCache with LinearCache for EDS. LinearCache addresses the issue by tracking which stream requested which resource and using versioning to ensure that updates are sent only to streams watching the specific endpoints that changed. When echoserver-0000 is updated, only the EDS streams watching echoserver-0000 will receive the update.
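
As a rough sketch of the mechanism (not the exact wiring in this PR; the resource name is illustrative), go-control-plane's LinearCache can hold one ClusterLoadAssignment per cluster and be updated per resource, so only the watches for the changed endpoint are answered:

package main

import (
	endpointv3 "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	resourcev3 "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
)

func main() {
	// One LinearCache holding only EDS (ClusterLoadAssignment) resources.
	eds := cachev3.NewLinearCache(resourcev3.EndpointType)

	// When only echoserver-0000 changes, update just that resource. The cache
	// bumps the version of this single entry, so only EDS streams watching
	// echoserver-0000 receive a DiscoveryResponse; the other 9999 stay quiet.
	cla := &endpointv3.ClusterLoadAssignment{ClusterName: "echoserver-0000"}
	_ = eds.UpdateResources(map[string]types.Resource{"echoserver-0000": cla}, nil)
}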

The LinearCache was previously considered but not adopted due to complications outlined by @skriss in a prior PR:

If Envoy already has config for a given resource at a particular version, then on a control plane restart, the version number of the resource in the cache will be reset to 1 (or close to 1), therefore will not be sent to Envoy since Envoy already has a "later" version of the resource.

This PR attempts to mitigate this by generating a unique version prefix at each startup.
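
As a hedged sketch of that mitigation (assuming a UUID-based prefix; the exact format is an assumption, not necessarily what this PR does), go-control-plane's LinearCache takes a version prefix option:

package main

import (
	"github.com/google/uuid"

	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	resourcev3 "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
)

func main() {
	// A fresh prefix at every startup means the restarted control plane never
	// emits a version string that Envoy already acked from the previous
	// process, even though the per-resource counter restarts from 1.
	edsCache := cachev3.NewLinearCache(
		resourcev3.EndpointType,
		cachev3.WithVersionPrefix(uuid.NewString()+"-"),
	)
	_ = edsCache
}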

Fixes #6743

@tsaarni tsaarni requested a review from a team as a code owner February 19, 2025 14:03
@tsaarni tsaarni requested review from skriss and sunjayBhatia and removed request for a team February 19, 2025 14:03
@sunjayBhatia sunjayBhatia requested review from a team, davinci26 and izturn and removed request for a team February 19, 2025 14:04
@tsaarni tsaarni force-pushed the eds-performance-fix branch from 8f2cf3c to 7bc7a34 Compare February 19, 2025 14:07
@tsaarni tsaarni added the release-note/small A small change that needs one line of explanation in the release notes. label Feb 19, 2025

codecov bot commented Feb 19, 2025

Codecov Report

Attention: Patch coverage is 94.28571% with 2 lines in your changes missing coverage. Please review.

Project coverage is 81.07%. Comparing base (299fd82) to head (49910b7).
Report is 2 commits behind head on main.

Files with missing lines | Patch % | Lines
internal/xdscache/v3/snapshot.go | 93.75% | 1 Missing and 1 partial ⚠️
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6906      +/-   ##
==========================================
+ Coverage   81.04%   81.07%   +0.02%     
==========================================
  Files         130      130              
  Lines       19659    19663       +4     
==========================================
+ Hits        15932    15941       +9     
+ Misses       3442     3438       -4     
+ Partials      285      284       -1     
Files with missing lines | Coverage Δ
internal/xdscache/v3/endpointslicetranslator.go | 87.10% <100.00%> (-0.72%) ⬇️
internal/xdscache/v3/snapshot.go | 86.91% <93.75%> (+8.50%) ⬆️

Contributor

@davinci26 davinci26 left a comment

before I review, we are extremely interested in this but is there a way to have this behind a feature flag?

Coming from a bit of ignorance but could contour be updated in place and things should just work or would it require to restart all the envoy pods?

@tsaarni
Member Author

tsaarni commented Feb 19, 2025

before I review, we are extremely interested in this but is there a way to have this behind a feature flag?

I have not worked with go-control-plane and xDS subscription versioning details, so this should be carefully reviewed. I'd appreciate extra eyes on this.

If necessary, we can add a feature flag, but I'm not sure if it is needed - see below.

Coming from a bit of ignorance but could contour be updated in place and things should just work or would it require to restart all the envoy pods?

I don't believe Envoy pods need to be restarted. As far as I understand, Envoys are completely unaware of the algorithm the server uses; they simply return the last received version info to the server.

I'm still working on fully understanding the difference between the cache implementations. I created https://github.com/tsaarni/grpc-json-sniffer to gain more insight into this issue.

@tsaarni
Member Author

tsaarni commented Feb 20, 2025

I've run some test scenarios and documented them here: https://gist.github.com/tsaarni/db319d5d9935d18f8856fcdd9b2a89ae

The go-control-plane cache implementations have some details that might be interesting to study as well.


github-actions bot commented Mar 7, 2025

The Contour project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 14d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the PR is closed

You can:

  • Ensure your PR is passing all CI checks. PRs that are fully green are more likely to be reviewed. If you are having trouble with CI checks, reach out to the #contour channel in the Kubernetes Slack workspace.
  • Mark this PR as fresh by commenting or pushing a commit
  • Close this PR
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

@github-actions github-actions bot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 7, 2025

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 22, 2025
@tsaarni tsaarni force-pushed the eds-performance-fix branch from 7bc7a34 to 9ab7ac8 Compare March 30, 2025 08:57
@tsaarni tsaarni removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 30, 2025
@tsaarni
Member Author

tsaarni commented Mar 30, 2025

I've updated the PR description to provide a clearer explanation of the issue. I'd really appreciate any reviews when you have time. Thanks!


@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 14, 2025
@tsaarni tsaarni removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 15, 2025
@sunjayBhatia sunjayBhatia modified the milestone: 1.32.0 Apr 29, 2025
@sunjayBhatia sunjayBhatia removed this from Contour Apr 29, 2025
@geomacy
Contributor

geomacy commented May 7, 2025

I've updated the PR description to provide a clearer explanation of the issue. I'd really appreciate any reviews when you have time. Thanks!

Hi @tsaarni @davinci26 just checking in to monitor progress on this PR. The code changes look good to me, and I can tell you that we have grabbed this patch to apply to our own clusters, and it has helped stabilise Contour's behaviour with large numbers of HTTPProxies. We would hope to see this merged so that we can eventually avoid the need to patch.

cc also @sunjayBhatia

@sunjayBhatia
Member

With this change: #7047

and running the below:

kubectl exec -n projectcontour deployment/contour -- contour cli eds --node-id test --cert-file /certs/tls.crt --key-file /certs/tls.key --cafile /certs/ca.crt <cluster 1 name>

kubectl exec -n projectcontour deployment/contour -- contour cli eds --node-id test --cert-file /certs/tls.crt --key-file /certs/tls.key --cafile /certs/ca.crt <cluster 2 name>

Scaling the deployment for cluster 1 up and down with the existing implementation (snapshot cache for EDS), we get updates on both subscriptions. With this PR (and the change from #7047 applied on top), scaling the deployment for cluster 1 up/down only produces updates on the relevant subscription.

@sunjayBhatia
Member

needs a rebase/conflict resolution but this change looks good to me

@tsaarni tsaarni force-pushed the eds-performance-fix branch from 9ab7ac8 to 49910b7 Compare May 13, 2025 11:30
@tsaarni
Member Author

tsaarni commented May 13, 2025

Thanks @sunjayBhatia! I've rebased.

One thing that still bothers me a bit is whether there's any possibility that our cluster configuration is influencing Envoy's behavior (link), or if what we're seeing is actually expected, meaning SnapshotCache might just be a poor fit for EDS by design. If that is the case, I'm curious how other go-control-plane users handle it. I haven't looked into it deeply, but for example Envoy Gateway might be running into the same issue (link).

@sunjayBhatia
Member

One thing that still bothers me a bit is whether there's any possibility that our cluster configuration is influencing Envoy’s behavior (link), or if what we're seeing is actually expected - meaning SnapshotCache might just be a poor fit for EDS by design.

yeah I think the snapshot cache inherently is the issue here

	// SetSnapshot sets a response snapshot for a node. For ADS, the snapshots
	// should have distinct versions and be internally consistent (e.g. all
	// referenced resources must be included in the snapshot).
	//
	// This method will cause the server to respond to all open watches, for which
	// the version differs from the snapshot version.
	SetSnapshot(ctx context.Context, node string, snapshot ResourceSnapshot) error

we always generate a new unique version when we set a new snapshot (which includes all resources) so that means all open watches will be updated on each new snapshot

with the LinearCache change we could have used the cache method SetResources rather than UpdateResources and effectively been in the same situation: the cache would increment the resource versions of all endpoints, rather than the logic we have now that only updates things that have changed or are new (https://github.com/envoyproxy/go-control-plane/blob/6ad1c197ae06e7b916db645768828baac6473f0d/pkg/cache/v3/linear.go#L232-L287)
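
A minimal sketch of that distinction (illustrative resource names taken from the config dump below; this is not the PR's actual code):

package main

import (
	endpointv3 "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	resourcev3 "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
)

func main() {
	eds := cachev3.NewLinearCache(resourcev3.EndpointType)
	all := map[string]types.Resource{
		"default/s1/http":   &endpointv3.ClusterLoadAssignment{ClusterName: "default/s1/http"},
		"marketing/s2/http": &endpointv3.ClusterLoadAssignment{ClusterName: "marketing/s2/http"},
	}

	// SetResources replaces the whole map and re-versions every entry, so all
	// open EDS watches are answered - the same broadcast effect as the
	// snapshot cache.
	eds.SetResources(all)

	// UpdateResources re-versions only the named entries, so only the watches
	// for "default/s1/http" are answered.
	_ = eds.UpdateResources(map[string]types.Resource{
		"default/s1/http": &endpointv3.ClusterLoadAssignment{ClusterName: "default/s1/http"},
	}, nil)
}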

an example of our cluster config/config source is below; Envoy should always be using the same cluster to fetch config from, and the only difference between the eds_cluster_configs is the service_name. I don't think this makes a difference or can be implemented differently, tbh

{
  "version_info": "744889b8-61cb-44a7-882b-5af6b2413e86",
  "cluster": {
    "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
    "name": "default/s1/80/da39a3ee5e",
    "type": "EDS",
    "eds_cluster_config": {
      "eds_config": {
        "api_config_source": {
          "api_type": "GRPC",
          "grpc_services": [
            {
              "envoy_grpc": {
                "cluster_name": "contour",
                "authority": "contour"
              }
            }
          ],
          "transport_api_version": "V3"
        },
        "resource_api_version": "V3"
      },
      "service_name": "default/s1/http"
    },
    "connect_timeout": "2s",
    "common_lb_config": {
      "healthy_panic_threshold": {}
    },
    "alt_stat_name": "default_s1_80"
  },
  "last_updated": "2025-05-12T21:33:45.727Z"
}
{
  "version_info": "744889b8-61cb-44a7-882b-5af6b2413e86",
  "cluster": {
    "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
    "name": "marketing/s2/80/da39a3ee5e",
    "type": "EDS",
    "eds_cluster_config": {
      "eds_config": {
        "api_config_source": {
          "api_type": "GRPC",
          "grpc_services": [
            {
              "envoy_grpc": {
                "cluster_name": "contour",
                "authority": "contour"
              }
            }
          ],
          "transport_api_version": "V3"
        },
        "resource_api_version": "V3"
      },
      "service_name": "marketing/s2/http"
    },
    "connect_timeout": "2s",
    "common_lb_config": {
      "healthy_panic_threshold": {}
    },
    "alt_stat_name": "marketing_s2_80"
  },
  "last_updated": "2025-05-12T21:33:45.728Z"
}

@sunjayBhatia
Member

If that is the case, I'm curious how other go-control-plane users are handling it? I haven't looked into it deeply, but for example Envoy Gateway might be running into the same issue

probably because they are using delta xDS this is less of an issue; go-control-plane will handle sending updates only for the diffs in that case, rather than a whole snapshot every time something changes

for resourceName, previousResource := range previouslyNotifiedResources {
	if newResource, ok := currentResources[resourceName]; ok {
		// Add resources that were updated.
		if !proto.Equal(newResource, previousResource) {

we're effectively trading the req/resp expense for this comparison expense to ensure we don't send unnecessary updates
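
For context, a hedged reconstruction of the comparison loop the snippet above comes from (names other than those in the snippet, including the syncEndpoints helper, are assumptions rather than the exact PR code): the proto.Equal check decides which resources are handed to the LinearCache, so unchanged endpoints generate no xDS traffic at all.

package sketch

import (
	"google.golang.org/protobuf/proto"

	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
)

// syncEndpoints pushes only the diff between the previously notified and the
// current ClusterLoadAssignments into the LinearCache.
func syncEndpoints(eds *cachev3.LinearCache, previouslyNotifiedResources, currentResources map[string]types.Resource) error {
	toUpdate := map[string]types.Resource{}
	var toDelete []string

	for resourceName, previousResource := range previouslyNotifiedResources {
		if newResource, ok := currentResources[resourceName]; ok {
			// Add resources that were updated.
			if !proto.Equal(newResource, previousResource) {
				toUpdate[resourceName] = newResource
			}
		} else {
			// Resource disappeared since the last notification.
			toDelete = append(toDelete, resourceName)
		}
	}

	// Add resources that are completely new.
	for resourceName, newResource := range currentResources {
		if _, ok := previouslyNotifiedResources[resourceName]; !ok {
			toUpdate[resourceName] = newResource
		}
	}

	// Only the changed, new, or removed resources are re-versioned in the cache.
	return eds.UpdateResources(toUpdate, toDelete)
}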

@sunjayBhatia sunjayBhatia merged commit 9556b6a into projectcontour:main May 13, 2025
26 checks passed
Labels
release-note/small A small change that needs one line of explanation in the release notes.
Development

Successfully merging this pull request may close these issues.

Contour leader doesn't update endpoints in xDS cache after upstream pods recreation
4 participants