Use LinearCache to optimize StreamEndpoint discovery. #6906
Conversation
Force-pushed from 8f2cf3c to 7bc7a34
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

@@           Coverage Diff            @@
##             main    #6906      +/-   ##
==========================================
+ Coverage   81.04%   81.07%   +0.02%
==========================================
  Files         130      130
  Lines       19659    19663       +4
==========================================
+ Hits        15932    15941       +9
+ Misses       3442     3438       -4
+ Partials      285      284       -1
Before I review: we are extremely interested in this, but is there a way to put it behind a feature flag?
Coming from a bit of ignorance, but could Contour be updated in place and things should just work, or would it require restarting all the Envoy pods?
I have not worked with the details of go-control-plane and xDS subscription versioning, so this should be carefully reviewed. I'd appreciate extra eyes on this. If necessary, we can add a feature flag, but I'm not sure it is needed - see below.
I don't believe Envoy pods need to be restarted. As far as I understand, Envoys are completely unaware of the algorithm the server uses; they simply return the last received version info to the server. I'm still working on fully understanding the difference between the cache implementations. I created https://github.com/tsaarni/grpc-json-sniffer to gain more insight into this issue.
I've run some test scenarios and documented them here: https://gist.github.com/tsaarni/db319d5d9935d18f8856fcdd9b2a89ae. The go-control-plane cache implementations have some details that might be interesting to study as well.
The Contour project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to its standard rules. Please send feedback to the #contour channel in the Kubernetes Slack.
Force-pushed from 7bc7a34 to 9ab7ac8
I've updated the PR description to provide a clearer explanation of the issue. I'd really appreciate any reviews when you have time. Thanks!
Hi @tsaarni @davinci26, just checking in to monitor progress on this PR. The code changes look good to me, and I can tell you that we have grabbed this patch to apply to our own clusters, and it has helped stabilise Contour's behaviour with large numbers of HTTPProxies. We would hope to see this merged so that we can eventually avoid the need to patch. cc also @sunjayBhatia
With this change, #7047 applied on top, and running the setup below:
Scaling the deployment for cluster 1 up and down with the existing implementation (snapshot cache for EDS), we get updates on both subscriptions. With this PR (and the change from #7047 applied on top), scaling the deployment for cluster 1 up or down produces updates only on the relevant subscription.
Needs a rebase/conflict resolution, but this change looks good to me.
Signed-off-by: Tero Saarni <[email protected]>
Force-pushed from 9ab7ac8 to 49910b7
Thanks @sunjayBhatia! I've rebased. One thing that still bothers me a bit is whether there's any possibility that our cluster configuration is influencing Envoy's behavior (link), or if what we're seeing is actually expected, meaning …
Yeah, I think the snapshot cache is inherently the issue here.
We always generate a new unique version when we set a new snapshot (which includes all resources), so all open watches are updated on each new snapshot. With the LinearCache change we could have used the cache method …. An example of our cluster config/config source is below; it should always be using the same cluster to fetch config from, the only difference between …
Probably because they are using delta xDS this is less of an issue; go-control-plane will handle sending only updates for diffs in that case, rather than a whole snapshot every time it changes.
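To make the versioning difference concrete, here is a minimal sketch using go-control-plane's v3 cache package. It is not Contour's actual wiring: the node ID "envoy-node", the version string "v2", and the cla() helper are illustrative assumptions.

```go
// Contrasts SnapshotCache (one version for the whole snapshot, so every open
// watch is notified) with LinearCache (per-resource versions, so only watchers
// of the changed resource are notified). Illustrative sketch, not Contour code.
package main

import (
	"context"

	endpointv3 "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	resourcev3 "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
)

// cla builds a trivial ClusterLoadAssignment for illustration only.
func cla(name string) types.Resource {
	return &endpointv3.ClusterLoadAssignment{ClusterName: name}
}

func main() {
	// SnapshotCache: the whole resource set shares one version string, so setting
	// a new snapshot notifies every open watch even if only one endpoint changed.
	snapCache := cachev3.NewSnapshotCache(false, cachev3.IDHash{}, nil)
	snap, _ := cachev3.NewSnapshot("v2", map[resourcev3.Type][]types.Resource{
		resourcev3.EndpointType: {cla("echoserver-0000"), cla("echoserver-0001")},
	})
	_ = snapCache.SetSnapshot(context.Background(), "envoy-node", snap)

	// LinearCache: versions are tracked per resource, so updating one resource
	// only wakes the streams that subscribed to that resource by name.
	linear := cachev3.NewLinearCache(resourcev3.EndpointType)
	_ = linear.UpdateResource("echoserver-0000", cla("echoserver-0000"))
}
```

With the snapshot cache, every open EDS watch sees the new snapshot version; with the linear cache, only subscribers of `echoserver-0000` are woken.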
for resourceName, previousResource := range previouslyNotifiedResources {
	if newResource, ok := currentResources[resourceName]; ok {
		// Add resources that were updated.
		if !proto.Equal(newResource, previousResource) {
We're effectively trading the request/response expense for this comparison expense, to ensure we don't send unnecessary updates.
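For context, here is a self-contained sketch of the diff-then-update pattern this hunk belongs to. The `syncEndpoints` function, its `previous`/`current` maps, and the use of `LinearCache.UpdateResources` are assumptions for illustration, not the PR's exact code.

```go
// Compare the previously notified resources against the current ones with
// proto.Equal and push only the names that actually changed, were added, or
// were removed into the LinearCache. Illustrative sketch only.
package endpoints

import (
	"google.golang.org/protobuf/proto"

	endpointv3 "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
)

// syncEndpoints pushes only changed, added, or removed ClusterLoadAssignments
// into the cache, so unchanged endpoints generate no xDS traffic at all.
func syncEndpoints(c *cachev3.LinearCache, previous, current map[string]*endpointv3.ClusterLoadAssignment) error {
	toUpdate := map[string]types.Resource{}
	var toDelete []string

	for name, prev := range previous {
		if cur, ok := current[name]; ok {
			// Resource still exists: send it only if its contents changed.
			if !proto.Equal(cur, prev) {
				toUpdate[name] = cur
			}
		} else {
			// Resource disappeared since the last notification.
			toDelete = append(toDelete, name)
		}
	}
	// Resources that are entirely new.
	for name, cur := range current {
		if _, ok := previous[name]; !ok {
			toUpdate[name] = cur
		}
	}

	// UpdateResources bumps versions only for the named resources,
	// so only their watchers are notified.
	return c.UpdateResources(toUpdate, toDelete)
}
```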
This change attempts to improve performance in clusters with a large number of endpoints, as discussed in #6743 (comment).

Envoy does not send a wildcard `DiscoveryRequest` (a request without a resource name) for EDS / `ClusterLoadAssignment` resources. Instead, it creates a separate EDS stream for each CDS entry and requests the specific resource by name. For example, with 10000 upstream clusters, each Envoy instance sends 10000 `DiscoveryRequest`s, one per endpoint, e.g. from `echoserver-0000` to `echoserver-9999`.

As a result, the SotW-style update, where the full set of resources is sent at every update, is not applicable. If `echoserver-0000` changes, updates should not be sent to streams watching `echoserver-0001` - `echoserver-9999`. The SotW update should consist of a single update, for `echoserver-0000`. Effectively, EDS behaves like an incremental update mechanism, since each endpoint has its own stream/watch.

Using `SnapshotCache` is problematic in this scenario because it broadcasts a `DiscoveryResponse` update to all EDS streams whenever any endpoint changes. Even if only `echoserver-0000` is updated, `SnapshotCache` will send 10000 updates, from `echoserver-0000` to `echoserver-9999`, to each Envoy instance. In each of these updates Contour sends a `DiscoveryResponse`, followed by Envoy immediately sending a new `DiscoveryRequest` back to Contour to watch further updates. Since these messages are relatively heavy-weight, this creates unnecessary overhead compared to a typical SotW update.

This PR replaces `SnapshotCache` with `LinearCache` for EDS. `LinearCache` addresses the issue by tracking which stream requested which resource and using versioning to ensure that updates are sent only to streams watching the specific endpoints that changed. When `echoserver-0000` is updated, only the EDS streams watching `echoserver-0000` will receive the update.

The `LinearCache` was previously considered but not adopted due to complications outlined by @skriss in a prior PR. This PR attempts to mitigate this by generating a unique version prefix at each startup.
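A minimal sketch of the startup-unique version prefix idea, assuming go-control-plane's `WithVersionPrefix` option; the `github.com/google/uuid` dependency and the surrounding wiring are illustrative, not necessarily what this PR does.

```go
// Seed LinearCache versions with a prefix that is unique per process start,
// so a restarted control plane never reuses a version string that an Envoy
// instance already acknowledged. Illustrative sketch only.
package main

import (
	"fmt"

	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	resourcev3 "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
	"github.com/google/uuid"
)

func main() {
	// Each startup gets a fresh prefix, so versions like "<prefix>1", "<prefix>2", ...
	// cannot collide with versions handed out by a previous run of the server.
	prefix := uuid.NewString() + "/"

	edsCache := cachev3.NewLinearCache(
		resourcev3.EndpointType,
		cachev3.WithVersionPrefix(prefix),
	)

	fmt.Printf("EDS cache ready, version prefix %q, %d resources\n", prefix, len(edsCache.GetResources()))
}
```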
Fixes #6743