
ext_authz cache #37953

Closed
wants to merge 15 commits into from

Conversation

naoitoi-v

Commit Message: This is a draft PR to discuss the design of the response cache in the ext_authz filter.

We are seeking feedback on the design to make sure this is headed in the right direction.

Similar requests were discussed in #37449 and #3023

Additional Description:

Purpose:
We would like to cache the response from an external authorization server in the ext_authz filter for scalability. Depending on the use case, if you do not have to check the authentication token on every request, you can reduce the number of calls to the server by caching the response from it.

Functionality:
To use this functionality, you need to configure the name(s) of the header(s) in which the auth token is carried, e.g., "Authorization" or "x-auth-token." You can also configure the max cache size and TTL. When this is configured, ext_authz will cache the response from the external authorization server.

Implementation:
We implemented a simple cache. Responses are stored in a plain unordered_map. We chose a simple FIFO eviction policy over LRU to keep the cache's memory footprint small. In our benchmark, each cache key (35 bytes) needs ~100 bytes of memory.

We store a response in the cache in the onComplete() method.
We read a cached response in the decodeHeaders() method.
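The store/read flow above can be sketched as a minimal, single-threaded model. All names here (ResponseCache, insert, lookup) are illustrative, not the PR's actual API, and the real filter additionally needs thread-safe access:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// Toy model of the cache described above: statuses keyed by the auth-token
// header value, each with an absolute expiry computed from a configured TTL.
class ResponseCache {
public:
  ResponseCache(size_t max_size, std::chrono::seconds ttl)
      : max_size_(max_size), ttl_(ttl) {}

  // Called on the onComplete() path: remember the authz status for this token.
  void insert(const std::string& token, uint16_t status,
              std::chrono::steady_clock::time_point now) {
    if (entries_.size() >= max_size_ && entries_.find(token) == entries_.end()) {
      // Simplest possible eviction: drop whatever bucket iteration yields
      // first. (As the review notes below, this is not FIFO.)
      entries_.erase(entries_.begin());
    }
    entries_[token] = Entry{status, now + ttl_};
  }

  // Called on the decodeHeaders() path: return the status if still fresh.
  std::optional<uint16_t> lookup(const std::string& token,
                                 std::chrono::steady_clock::time_point now) {
    auto it = entries_.find(token);
    if (it == entries_.end() || now >= it->second.expiry) {
      return std::nullopt; // miss or expired: fall through to the side-call
    }
    return it->second.status;
  }

private:
  struct Entry {
    uint16_t status;
    std::chrono::steady_clock::time_point expiry;
  };
  size_t max_size_;
  std::chrono::seconds ttl_;
  std::unordered_map<std::string, Entry> entries_;
};
```

On a hit, the filter can short-circuit the external check; on a miss or expiry, it performs the normal side-call and inserts the result.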

Next Steps:

  • Write tests.
  • Make sure the code complies with the coding style.
  • Do pre-commit checks.
  • May add an option to NOT cache 4xx errors to avoid cache poisoning.
  • May add an option to inject randomness in TTL to avoid synchronization.

Risk Level: Low. This functionality would be turned off by default.

Testing: TBD
Docs Changes:
Release Notes:
Platform Specific Features:


Hi @naoitoi-v, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

🐱

Caused by: #37953 was opened by naoitoi-v.



As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #37953 was opened by naoitoi-v.



CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @markdroth
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).

🐱

Caused by: #37953 was opened by naoitoi-v.


@naoitoi-v naoitoi-v changed the title Ext authz cache ext_authz cache Jan 9, 2025
@ggreenway
Member

Some early feedback:

  • This doesn't do FIFO eviction. I think the evicted element is the one with the lowest hash value. You'll need additional bookkeeping to know the order in which the elements were inserted.
  • The semantics of having multiple possible headers that contain the cache key are confusing, and possibly insecure.
  • You'll need to refactor all of this to conform to Envoy style and norms.

@adisuissa
Contributor

A drive-by suggestion: it may be beneficial to generalize the caching for both ext_authz and ext_proc.
I suggest creating the cache configuration (and implementation) separately, and using it only for ext_authz in the first step.

@ggreenway
Member

Related to what Adi said, I think you need to cache the response message and replay it on each request. It may add or remove headers, etc. Just looking at the status code is insufficient.

@naoitoi-v
Author

Thank you for your early and insightful feedback, @ggreenway, @adisuissa.

  1. This cache eviction implementation is not FIFO.

Great point!

In our (Verkada's) use case, it's important to keep the memory footprint as small as possible. We expect few evictions. We'll change this implementation to random eviction.

In the future, there may be users who want more sophisticated eviction policies (e.g., LRU, random sampling-based LRU). We'll restructure the code so that one can provide a different cache implementation.

  2. Looking for auth tokens in multiple headers

Good point. For now, we'll make it look at only one header.

Our use case may require it to look at multiple headers for legacy reasons. If we have to go that route, we'll use header + auth token as the cache key, so we won't mix auth tokens from multiple headers.

  3. Compliance with the Envoy coding style

Yes, we started this effort, fixing the issues pointed out by //tools/code_format:check_format and //tools/code:check .

  4. Detaching the cache config and code from ext_authz so that it can be reused in ext_proc

This makes sense. We'll try it.

  5. Storing the response message and headers

This makes sense.

In our use case, we'd only need the status code. It's important to minimize the memory footprint.

We'll write the code to store and replay the response messages and all the headers. We'll make this configurable so that the user can decide whether or not to store the response & headers.


Thanks for the great discussion, again. We'll keep you updated on the progress.

Nao

@naoitoi-v
Author

I took care of (3) coding style.

@naoitoi-v
Author

I took care of (2) Looking for auth tokens in multiple headers.
Now it reads only one header.

@naoitoi-v
Author

I added an eviction mechanism to address (1).

It emulates LRU by:

  1. Advancing the iterator over the cache randomly
  2. Reading 1000 items to determine the average TTL
  3. Looking at some items (by default, 1% of the whole cache) and deleting those whose TTLs are below a fraction (by default, 50%) of the average

With these defaults, it was able to complete eviction in around 1ms in my testing (running in Docker, 12 CPUs and 24GB memory, on a MacBook Pro).
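The three steps above can be sketched roughly as follows. This is a hypothetical model over a map of key → absolute expiry; parameter names mirror the ones in the discussion, but the real code also holds a mutex, and this sketch stops at end() instead of wrapping around the map during the eviction scan:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <iterator>
#include <random>
#include <string>
#include <unordered_map>

using Clock = std::chrono::steady_clock;
using Expiry = Clock::time_point;

// Sampling-based eviction sketch: estimate the average remaining TTL from a
// sample, then delete scanned entries below a fraction of that average.
size_t evict(std::unordered_map<std::string, Expiry>& cache, Clock::time_point now,
             double candidate_ratio = 0.01, double threshold_ratio = 0.5,
             size_t ttl_sample = 1000) {
  if (cache.empty()) {
    return 0;
  }

  // Step 1: advance the iterator to a random position, so eviction does not
  // always start at begin() (the bias pointed out in the review).
  std::mt19937 rng{std::random_device{}()};
  auto it = cache.begin();
  std::advance(it, std::uniform_int_distribution<size_t>(0, cache.size() - 1)(rng));

  // Step 2: sample up to ttl_sample entries to estimate the average remaining TTL.
  Clock::duration total{0};
  size_t sampled = 0;
  for (auto probe = it; sampled < std::min(ttl_sample, cache.size()); ++sampled) {
    total += probe->second - now;
    if (++probe == cache.end()) {
      probe = cache.begin();
    }
  }
  const auto threshold = (total / static_cast<int64_t>(sampled)) * threshold_ratio;

  // Step 3: scan a fraction of the cache and delete entries whose remaining
  // TTL is below threshold_ratio of the estimated average.
  const size_t candidates =
      std::max<size_t>(1, static_cast<size_t>(cache.size() * candidate_ratio));
  size_t evicted = 0;
  for (size_t i = 0; i < candidates && it != cache.end(); ++i) {
    if (it->second - now < threshold) {
      it = cache.erase(it); // erase() returns the next valid iterator
      ++evicted;
    } else {
      ++it;
    }
  }
  return evicted;
}
```

Note that both std::advance and the sampling pass are O(n) walks over the hash map, which is the CPU-cache cost raised later in this thread.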

@naoitoi-v
Author

Moved the cache code to a common location to address (4).

uint32 response_cache_ttl = 31;

// Array of header names. ext_authz will use the first header that appears in the HTTP request as the cache key.
string response_cache_header_name = 32;
Contributor

This seems to imply that in order for the ext_authz caching to function, the data plane request needs to have a particular header. This seems sub-optimal, because it basically requires the sending application to have all of the logic needed to set this header.

In practice, I think the cache key will wind up needing to be a combination of various attributes of the data plane request, but the exact set of those attributes is going to depend primarily on the ext_authz filter config, which determines which attributes of the data plane request to include in the ext_authz request. (It may also depend on the ext_authz server itself, which might only use a subset of the fields in the ext_authz request, but that will probably be reflected in the ext_authz filter config in most cases.) So this would basically mean that the application would need logic to determine which attributes of the data plane request to take into account as part of the cache key, thus duplicating the logic in the ext_authz filter config -- which would mean that the ext_authz filter config can't be changed later without also changing the application logic.

I think it would make more sense to have some matching rules here that dictate the cache key as a function of the attributes of the data plane request. That way, all of the logic is self-contained in the ext_authz filter config and can be changed as needed, without breaking the application.

Author

Thank you, @markdroth

I'd like to make sure I understand you correctly.

I believe the problem you are raising is that, currently, the cache is configured by specifying one header.

I have looked into Matcher: https://www.envoyproxy.io/docs/envoy/latest/xds/type/matcher/v3/matcher.proto#envoy-v3-api-msg-xds-type-matcher-v3-matcher

Matcher can express the choice of cache key very richly. For example, we can change the proto file like this:

message ExtAuthz {
  // Other fields...

  // Configuration for the response cache matcher.
  envoy.type.matcher.v3.Matcher response_cache_matcher = 32;
}

Then, we can give a configuration like this:

resp_cache:
  enabled: true
  response_cache_matcher:
    matcher_tree:
      input:
        name: request-headers
        typed_config:
          "@type": type.googleapis.com/envoy.type.matcher.v3.HttpRequestHeaderMatchInput
          header_name: "auth-token"
      exact_match_map:
        map:
          ".*":
            action:
              name: "auth-token"
This would let ext_authz use a header named "auth-token" as the cache key.

In addition, configuration can be changed to use combinations of a lot more parameters - path, query parameters, metadata, etc.

Would this address your concern?

Thank you,

Nao

@duckya Jan 21, 2025

I agree with Mark; this is a much more flexible design, which allows us to build the cache key from more than just a header field. E.g., we can define a cache policy like the one below:

message RequestCachingPolicy {
  // Specifies how the cache key should be built.
  CacheKeyBuilder key_builder = 1;

  // Specifies the default expiration policy for the cache entries.
  // It can be overridden by an individual policy service by setting the
  // TTLs field in the ext_authz response.
  CacheExpirationPolicy expiration_policy = 2;
}

message CacheKeyBuilder {
  // A list of CEL expressions used to build a composite cache key.
  // An empty list indicates no caching.
  // The CEL expressions will be evaluated against the Envoy attributes.
  repeated string cel_expressions = 1
      [(datapol.semantic_type) = ST_NOT_REQUIRED];
}

message CacheExpirationPolicy {
  // Specifies the cache staleness TTL for all cache entries unless it is
  // overwritten by the CacheExpirationPolicy in the ext_authz response.
  google.protobuf.Duration max_age = 1;
}

A sample policy would look like:

caching_policy_matchers:
  matcher_list:
    matchers:
    # Requests with header["x-goog-policy-set-name"]='something' will
    # have a cache key built from the "x-goog-proxy-auth-bin" and
    # "x-goog-request-path" headers.
    - predicate:
        single_predicate:
          input:
            typed_config:
              '@type': type.googleapis.com/envoy.type.matcher.v3.HttpRequestHeaderMatchInput
              header_name: x-goog-policy-set-name
          value_match:
            exact: something
      on_match:
        action:
          typed_config:
            '@type': type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.CachingPolicy
            key_builder:
              cel_expressions:
              - request.headers["x-goog-request-path"]
              - request.headers["x-goog-proxy-auth-bin"]
            expiration_policy:
              max_age:
                seconds: 300

Author

@markdroth

I have spent a couple of days trying to do this with CEL (Common Expression Language). It turns out to be a significant amount of work. It also made us rethink the request.

We agree that there may be different requirements for cache key creation. For example, we can imagine:

  • If header A exists, use that as cache key. If not, try header B.
  • Use regex to extract some part of header value.
  • Combine some parts of request, e.g., URI, path, query parameters.

Script-based filters (e.g., the Lua filter or WASM filters) are capable of this type of modification.

Our recommendation is to keep ext_authz's cache key simple in this PR: one configurable header. For use cases that require a custom cache key, we'd recommend using the Lua filter or other WASM filters to construct the cache key and set it in a header, which ext_authz would read from.

If this approach becomes unacceptable in some cases (e.g., for performance reasons), we can consider implementing optimization to solve that problem. Keep it simple, and revisit it when we know more about the problem.

This PR is getting bigger and I still have some ways to go, so it would be really helpful if I can limit the scope and focus on getting it done well.

Please let us know if this would work.

Thank you,

Nao

Member

@naoitoi-v it's a hassle, both to implement and for users, to change configuration after it is released. I understand that you don't want to make this PR too big, but it is important that we think through what the config will look like, both immediately and for anticipated future changes, and do something sensible.

The more we discuss this and I understand your very-specific requirements, I think the entire thing should be an extension point. I think the interface in the ext_authz filter should be kept very simple: a function call that passes in the entire request, which returns either a message CheckResponse (which can be constructed from some cached data; it doesn't need to be a copy of the response from the original ext_authz service), or a response indicating "this wasn't in the cache; do the normal side-call". And a function call to add items to cache after a successful side-call.

Then the only configuration in the ext_authz filter will be an optional core.v3.TypedExtensionConfig cache_config.

WDYT?
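The interface described above might look roughly like this. All type names here are simplified stand-ins, not Envoy's actual CheckRequest/CheckResponse protos, and HeaderKeyedCache is just one hypothetical implementation behind the extension point:

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Stand-ins for the request/response types the filter would pass through.
struct CheckRequest {
  std::map<std::string, std::string> headers;
};
struct CheckResponse {
  int status;
};

// The minimal interface sketched in the comment above: the filter only sees
// lookup()/store(); key derivation, storage, and eviction live behind it.
class ExtAuthzCache {
public:
  virtual ~ExtAuthzCache() = default;
  // Returns a response to replay, or nullopt meaning "not cached; do the
  // normal side-call".
  virtual std::optional<CheckResponse> lookup(const CheckRequest& req) = 0;
  // Called after a successful side-call to populate the cache.
  virtual void store(const CheckRequest& req, const CheckResponse& resp) = 0;
};

// One possible implementation: key on a single configured header.
class HeaderKeyedCache : public ExtAuthzCache {
public:
  explicit HeaderKeyedCache(std::string header) : header_(std::move(header)) {}

  std::optional<CheckResponse> lookup(const CheckRequest& req) override {
    auto key = req.headers.find(header_);
    if (key == req.headers.end()) return std::nullopt;
    auto hit = entries_.find(key->second);
    if (hit == entries_.end()) return std::nullopt;
    return hit->second;
  }

  void store(const CheckRequest& req, const CheckResponse& resp) override {
    auto key = req.headers.find(header_);
    if (key != req.headers.end()) entries_[key->second] = resp;
  }

private:
  std::string header_;
  std::map<std::string, CheckResponse> entries_;
};
```

With this shape, the Verkada status-only cache and a full-response cache are just different implementations behind the same two calls.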

Author

@ggreenway

I think we are thinking on similar tracks. I have just pushed a commit to split the cache interface (RespCache) from this particular implementation (SimpleCache).

I believe what you are suggesting re: core.v3.TypedExtensionConfig cache_config is -

cache_config specifies which type of cache implementation the user wants to use (e.g., SimpleCache), and provides all the configuration necessary for that implementation. We just need to add one message to ext_authz.proto.

Is this correct?
I'll work on this based on my understanding. Please LMK if this is not consistent with your understanding.

Member

Yes, that's correct.

Author

@ggreenway

I pushed a commit to take care of this.

As discussed, ext_authz.proto has this:

optional envoy.config.core.v3.TypedExtensionConfig response_cache_config = 30;

And it's up to the cache implementation to interpret the config. I've created simple_cache.proto for this.

I left only one other config, response_cache_remember_body_headers, in ext_authz. I considered moving this to simple_cache.proto as well. However, doing so would require SimpleCache to be aware of what's stored in the cache (e.g., Filters::Common::ExtAuthz::Response, which is an ext_authz-specific data structure), which would make the cache less generic. I kept what's stored in the cache a template type; that way, the cache implementation doesn't have to know the type of the stored data.

// Configuration parameters for ext_authz response cache
uint32 response_cache_max_size = 30;

uint32 response_cache_ttl = 31;

We will need more than one TTL: at minimum, one TTL for when the ext_authz server is healthy and one for when the server is not available. https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control describes more TTLs that control cache behavior.

I would recommend grouping TTLs into a separate message for extensibility.

Author

Thank you, @duckya

I believe the motivation for your suggestion of multiple TTLs is: if the external authz server is unavailable, we want to rely on the cached decision for longer. This would enhance the system's availability. Is this correct?

To do this, having multiple TTLs wouldn't be enough. We'd have to implement a new feature, that is, "if ext_authz detects the authz service to be down, extend the TTL of the currently cached objects."

This makes sense, but as we still have some way to go in this PR (cache headers & response body, use matcher, create a new proto message, write tests, rebase, etc.), we'd like to consider this a future feature request instead of including it in this PR. Would this work?

Thanks,

Nao


I am not asking you to implement the multiple TTLs for now, and I understand we could file it as a new feature request. But I think we should design the configuration in an extensible way to accommodate future improvements. We could have a proto message called CacheExpirationPolicy and move the response_cache_ttl field there; we could even rename response_cache_ttl to max_age to match the generic cache-control semantics.
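As a sketch of that grouping (field names and numbers here are illustrative, not a proposed final API):

```proto
// Groups cache-expiration settings so new TTL knobs (e.g., a longer TTL used
// while the authz server is unreachable) can be added later without touching
// the filter message again.
message CacheExpirationPolicy {
  // Renamed from response_cache_ttl to match generic cache-control semantics.
  google.protobuf.Duration max_age = 1;
}

message ExtAuthz {
  // ...
  CacheExpirationPolicy expiration_policy = 31;
}
```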

@@ -310,6 +310,21 @@ message ExtAuthz {
// Field ``latency_us`` is exposed for CEL and logging when using gRPC or HTTP service.
// Fields ``bytesSent`` and ``bytesReceived`` are exposed for CEL and logging only when using gRPC service.
bool emit_filter_state_stats = 29;

// Configuration parameters for ext_authz response cache
uint32 response_cache_max_size = 30;

I think it probably makes more sense to group cache related settings into one message.

Author

Thank you, @duckya

I agree. We can create a new proto file, resp_cache.proto, and move all these cache-related settings in there as a message RespCacheSettings.

Then import resp_cache.proto from ext_authz.proto.

This would open a possibility for another filter, e.g., ext_proc, to use the response cache.

Does this address your concern?

Thanks,

Nao


Yes, that's exactly what we want. Maybe we don't need the "Resp" in the name; just go with CacheSettings.

…e() so that part of it can be called by onDecodeHeaders() when cache is hit.

Signed-off-by: Nao Itoi <[email protected]>
@naoitoi-v
Author

We'd like to summarize where we are. (Especially for @markdroth , who is the assigned reviewer.)

Thank you all for the great feedback.

These are the remaining tasks for this PR:

  1. Cache the response body and headers from ext_authz, and apply them on a cache hit.
  2. Make the cache key more configurable, perhaps by using a matcher.
  3. Make the configuration more contained and extensible, e.g., by introducing new proto messages like CacheSettings.
  4. Write tests and provide coverage.
  5. Rebase; it may take a meaningful amount of work to rebase this PR.

We'd like to complete and merge this PR ASAP, so that we can start running this in our environments to get more data points (that may help us improve it further).

We'd like to consider further enhancements as feature requests post-PR.

Please let us know if this makes sense.

Thank you,

Nao

// than a fraction (by default, 50%) of the average. Eviction takes two configuration parameters:
// - Eviction candidate ratio: The fraction of the cache to read for potential eviction
// - Eviction threshold ratio: Evict items with TTL lower than X% of the average TTL
void Evict() {
Member

This eviction function is problematic. You're always starting at begin() and iterating forward, so the removed items are going to be biased towards certain hash values.

Also, since you're doing a bunch of them at once, while holding a mutex, you'll block all envoy workers during the eviction, which isn't acceptable.

If you really don't want to do a traditional LRU, you'll need to come up with something better than this.

Author

Please see the code starting on line 117, "//Step 1: Advance pointer randomly". It advances the iterator randomly to avoid the bias.

I understand your perspective on this eviction policy.

We believe this eviction policy will work well for our use case. Let me walk through the thought process:

  1. We want to minimize the memory footprint of the cache because we run Envoy on multiple Kubernetes pods.
  2. Our cache key space is predictable. We plan to make cache large enough to fit all cache keys. Therefore, eviction should be rare.
  3. We are storing each item's TTL, so we can use this information.

We measured this eviction mechanism. It seems fast enough for our use case (~1ms to delete 2500 items out of 1 million cached items). It is configurable, so we can lower the number of deletions if we need eviction to take a shorter amount of time.

I suspect the best eviction policy would depend on the use case.

We will make eviction policy configurable. Hopefully, different policies will be implemented, and users will be able to pick the right one for their use case.

Do you have a particular policy in mind? If we have a policy that can achieve better results without increasing memory footprint, we'd be very interested.

Member

We believe this eviction policy will work well for our use case.

I'm trying to make sure that we end up with something that will work for the general case, not just your specific use case.

If you make the policy (and thus the data structure) configurable, I think that will be fine. You could add an extension point for ext_authz caching, and then it would be easy to create any policy you want.

Member

Please see the code starting on line 117, "//Step 1: Advance pointer randomly". It advances the iterator randomly to avoid the bias.

I missed that. That's a very expensive operation on a large cache. It will likely blow out all levels of the CPU cache.

Author

I have measured eviction.

Environment: a Linux Docker container running on my Mac (Apple M3 Pro). I took the average CPU time of 3 tries. I fixed advance_steps to 500,000 to avoid randomness in the measurement.

Cache max size is 1,000,000. eviction_candidate_ratio is 1%, and eviction_threshold_ratio is 50%. It would result in 1M * 0.01 * 0.25 = ~2500 items getting evicted.

  1. Advancing iterator: 593 microseconds
  2. Eviction (including (1)): 6774 microseconds

Advancing the iterator is a small part of the whole Evict(). On average, eviction overall takes ~7ms.
If the user needs a shorter eviction time, they can decrease eviction_candidate_ratio or increase eviction_threshold_ratio.

As we discussed earlier, we will make cache replaceable. We suggest that we move forward with this algorithm in this PR. We can adjust as we receive performance data.

Member

It would be useful to build a speed_test (see other Envoy test examples) for this benchmark, so it's easy for others to run it, run it on different hardware, and run it with different cache config parameters.
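For illustration only, a chrono-based timing helper in the spirit of that request. Envoy's actual speed tests are built on the google/benchmark library, which this sketch does not use; the function name and cache type are placeholders:

```cpp
#include <chrono>
#include <string>
#include <unordered_map>

// Times how long it takes to populate a cache-like map with n entries,
// returning elapsed wall time in milliseconds. A real speed_test would
// parameterize the cache size, TTLs, and eviction ratios the same way.
double time_populate(std::unordered_map<std::string, int>& cache, size_t n) {
  auto start = std::chrono::steady_clock::now();
  for (size_t i = 0; i < n; ++i) {
    cache.emplace("key-" + std::to_string(i), 200);
  }
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count();
}
```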


bool Insert(const char* key, uint16_t value, int ttl_seconds = -1) {
Thread::LockGuard lock{mutex_};
const char* c_key = strdup(key);
Member

Make the hashmap key be std::string and get rid of all the c-style string handling.

Author

When we had std::string, it increased the memory footprint. We tried C-style strings, and that helped reduce the footprint. We could revisit this (avoid C-style strings while making sure the memory footprint doesn't increase), but we would prefer to do it after the initial PR.

Member

No, I won't accept this PR with this style of string/memory management. At a minimum, it all needs to be wrapped in RAII-style objects.
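A minimal sketch of what the RAII version of the quoted Insert() might look like, with std::string keys and std::lock_guard standing in for Envoy's Thread::LockGuard (class and method names are illustrative):

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// std::string owns and frees the key storage automatically, so there is no
// strdup()/free() pairing to get wrong.
class SimpleCache {
public:
  // Returns true if a new entry was inserted, false if an existing one was
  // overwritten.
  bool insert(const std::string& key, uint16_t value) {
    std::lock_guard<std::mutex> lock(mutex_);
    // unordered_map copies the key into the node; the node's destructor
    // releases it.
    return entries_.insert_or_assign(key, value).second;
  }

  bool contains(const std::string& key) {
    std::lock_guard<std::mutex> lock(mutex_);
    return entries_.find(key) != entries_.end();
  }

private:
  std::mutex mutex_;
  std::unordered_map<std::string, uint16_t> entries_;
};
```

On footprint: for the 35-byte keys mentioned earlier, std::string makes one heap allocation just as strdup does; the per-entry difference is roughly the size of the string object itself versus a raw pointer.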

Author

Understood, will work on this.

Author

I took care of this (convert usages of char* to std::string) in the latest commit.

@@ -0,0 +1,209 @@
#pragma once
Member

This only has a single use right now, in ext_authz. Move this file into that extension.

Author

We had the cache code inside ext_authz initially.

We moved it to the common location in response to this comment by @adisuissa :

A drive-by suggestion: it may be beneficial to generalize the caching for both ext_authz and ext_proc.
I suggest creating the cache configuration (and implementation) separately, and using it only for ext_authz in the first step.

I think this feedback makes sense.

Contributor

I don't think we know enough yet to know whether we'll be able to use the same cache structure for ext_authz and ext_proc. I think that at the xDS layer, we should have a separate API for each one. If it turns out that the underlying implementations can actually share code, they can do so easily enough.

@naoitoi-v
Author

Pushed a commit to take care of "Cache response body and headers from ext_authz, and apply them on cache hit."

…mplementation can be replaced without modifying ext_authz.

Signed-off-by: Nao Itoi <[email protected]>

bool response_cache_remember_body_headers = 35;
// Cache configuration to store response from external auth server.
optional envoy.config.core.v3.TypedExtensionConfig response_cache_config = 30;
Member

Suggested change
optional envoy.config.core.v3.TypedExtensionConfig response_cache_config = 30;
envoy.config.core.v3.TypedExtensionConfig response_cache_config = 30;

optional is the default in proto3

Author

Good point, thank you. I'll fix it in the next commit.

// Cache configuration to store response from external auth server.
optional envoy.config.core.v3.TypedExtensionConfig response_cache_config = 30;
// Tells the cache whether to remember the whole response from the authz server (response body, header modification, metadata, status) or just the status.
optional bool response_cache_remember_body_headers = 31;
Member

This should be in the extension's configuration; it's likely that some cache types wouldn't support this option.

Author

We put a lot of thought into this choice. Let me explain the pros & cons.

  • Option 1: the "remember_body_headers" config is in SimpleCache:
    Pro: All cache-related configurations are in one place (inside response_cache_config).
    Con: This would make SimpleCache specific to ext_authz. The full response is defined by Filters::Common::ExtAuthz::Response, which is an ext_authz data structure. If SimpleCache uses this, then it can only be used for ext_authz. Earlier, we received feedback to try to make the cache independent of the use case, so that it can be used by another use case (e.g., ext_proc).

  • Option 2: the "remember_body_headers" config is in ExtAuthz:
    Pro: Keeps RespCache and SimpleCache agnostic to the type of stored data. We think this is generally a good design; it makes the cache more flexible.
    Con: ExtAuthz has two cache-related configurations, instead of one.

The way we think about it, the choice of caching the full response or just the status code is a functionality of ExtAuthz. ExtAuthz does different things depending on this configuration.

I do agree with the earlier feedback that suggested decoupling the cache from ExtAuthz, so that it can be used in other contexts.

Member

You'll need another layer that is the implementation behind the typed_config. It will have this boolean setting. It will instantiate a cache (however that is done) and return a pointer to an interface that is implementation-agnostic.

This setting definitely belongs in the typed_config, not here.

Contributor

I agree that if we have this field, it belongs in the cache config extension, not directly here in the ext_authz filter config.

That having been said, I don't think we should actually have this field at all. See my comment elsewhere about performance considerations: I don't think we want to have to construct a CheckRequest proto or decode a CheckResponse proto in the cache-hit case. Given that, I don't think we want to be caching the CheckResponse proto or its associated headers in the first place.

@@ -0,0 +1,20 @@
syntax = "proto3";
Member

This file needs to be under api/. This is a new type of extension. It could either go in api/envoy/extensions/ext_authz_cache/simple, or maybe it belongs under api/envoy/extensions/filters/http/ext_authz/cache/simple. @envoyproxy/api-shepherds can you provide guidance on the correct location?

Contributor

I think api/envoy/extensions/ext_authz_cache/simple/v3 is probably the right place.
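For illustration, a relocated config proto under that package path might start like the following. The field names are guesses based on the PR description (token header names, max cache size, TTL), not the actual file contents:

```proto
syntax = "proto3";

package envoy.extensions.ext_authz_cache.simple.v3;

import "google/protobuf/duration.proto";

// Hypothetical shape for the relocated cache config; all field names below
// are illustrative.
message SimpleCacheConfig {
  // Names of request headers carrying the auth token,
  // e.g. "authorization" or "x-auth-token".
  repeated string token_headers = 1;

  // Maximum number of cached entries before FIFO eviction kicks in.
  uint32 max_cache_size = 2;

  // Time-to-live for each cached entry.
  google.protobuf.Duration ttl = 3;
}
```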

const std::string auth_header_str(auth_header[0]->value().getStringView());

if (config_->responseCacheConfig().has_value()) {
#ifdef CACHE_PERF_TEST
Member
This type of code should be moved to somewhere under /test. Search around for *_speed_test.cc to see many examples of speed tests and how they're set up, parameterized, etc.

Author

Will do.


if (response_cache_remember_body_headers_) {
// Initialize the cache that stores full response (body, header modification, metadata, status).
auto cache_result = Envoy::Common::RespCache::createCache<Filters::Common::ExtAuthz::Response>(
Member
With an extension config, you'll declare a factory interface, and then call something like

Config::Utility::getAndCheckFactory<ExtAuthzFactory>(
          config)

And each extension will register its factory with that type.
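The registration pattern being described can be modeled in plain C++ as below. Envoy's real mechanism (REGISTER_FACTORY plus Config::Utility::getAndCheckFactory) differs in detail; everything here is an illustrative stand-in.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

// Toy interface standing in for the implementation-agnostic cache type.
struct Cache {
  virtual ~Cache() = default;
  virtual std::string kind() const = 0;
};

using CacheFactoryFn = std::function<std::unique_ptr<Cache>()>;

// Global registry mapping extension names to factories.
std::map<std::string, CacheFactoryFn>& cacheFactoryRegistry() {
  static std::map<std::string, CacheFactoryFn> registry;
  return registry;
}

bool registerCacheFactory(const std::string& name, CacheFactoryFn fn) {
  return cacheFactoryRegistry().emplace(name, std::move(fn)).second;
}

// Lookup in the spirit of getAndCheckFactory: resolve the factory named in
// the typed_config, failing loudly if none is registered.
std::unique_ptr<Cache> getAndCheckCacheFactory(const std::string& name) {
  auto it = cacheFactoryRegistry().find(name);
  if (it == cacheFactoryRegistry().end()) {
    throw std::runtime_error("no registered cache factory named " + name);
  }
  return it->second();
}

// An example extension implementation.
struct SimpleCache : Cache {
  std::string kind() const override { return "simple"; }
};

// Each extension registers itself, typically at static-init time.
const bool simple_cache_registered =
    registerCacheFactory("envoy.ext_authz_cache.simple",
                         [] { return std::make_unique<SimpleCache>(); });
```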

Comment on lines +332 to +356
if (config_->responseCacheRememberBodyHeaders()) {
auto cached_response =
config_->responseCache<Filters::Common::ExtAuthz::Response>().Get(headers);
if (cached_response.has_value()) {
onCompleteSub(std::make_unique<Filters::Common::ExtAuthz::Response>(
std::move(cached_response.value())));
return Http::FilterHeadersStatus::StopIteration;
}
} else {
// Retrieve the HTTP status code from the cache
auto cached_status_code = config_->responseCache<uint16_t>().Get(headers);
if (cached_status_code.has_value()) {
if (*cached_status_code >= 200 && *cached_status_code < 300) {
// Any 2xx response is a success: let the request proceed
return Http::FilterHeadersStatus::Continue;
} else {
// Non-2xx response: reject the request
decoder_callbacks_->streamInfo().setResponseFlag(
StreamInfo::CoreResponseFlag::UnauthorizedExternalService);
decoder_callbacks_->sendLocalReply(
static_cast<Http::Code>(*cached_status_code), "Unauthorized", nullptr, absl::nullopt,
Filters::Common::ExtAuthz::ResponseCodeDetails::get().AuthzDenied);

return Http::FilterHeadersStatus::StopIteration;
}
Member
All of this logic belongs in the cache extension implementation.

The changes here (in the core ext_authz filter) should be the following:

Right before ext_authz gRPC call is made, pass the generated CheckRequest to the cache via absl::optional<CheckResponse> cached = config_->responseCacheConfig().getCachedResponse(check_request); If this has a value, onComplete(cached.value()); return;

When receiving a response from the ext_authz server, call config_->responseCacheConfig().cacheResponse(check_response);

All work to extract various headers, response codes, etc, should be done in your cache extension. The code here should be extremely minimal.

Contributor

I don't think that's actually the right layer at which to be doing this caching. This would mean that even in the cache-hit case, we'd still need to construct the request to the ext_authz server, and we'd still need to decode the cached response from the ext_authz server. For performance reasons, at least in gRPC, I don't think that's going to be feasible.

I think what we actually want here is to use the attributes of the data plane request (see here; I'm not sure what data structure Envoy's implementation uses for this internally) to determine the cache key, and then to have the contents of the cache entry be the already-decoded authz decision. In other words, when we see the data plane request, we do something like this:

absl::optional<AuthzDecision> cached = config_->cache().getCachedResponse(data_plane_request_properties);
if (cached.has_value()) {
  // ...use *cached to determine how to handle data plane request...
} else {
  // ...construct CheckRequest and send it to the ext_authz server...
}

Then when receiving a response from the ext_authz server:

AuthzDecision authz_decision = decodeCheckResponse(check_response);
config_->cache().cacheAuthzDecision(authz_decision);
// ...use authz_decision to determine how to handle data plane request...
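A standalone model of that flow, keeping only the already-decoded decision in the cache so that a cache hit involves no proto work at all. The types and the string key are stand-ins for real data plane request attributes:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// Illustrative already-decoded authz decision; the real one would carry
// whatever the filter needs to act without re-decoding a CheckResponse.
struct AuthzDecision {
  bool allowed;
  uint16_t status_code; // consulted only when denied
};

class AuthzDecisionCache {
public:
  // The real key would be derived from data plane request attributes (path,
  // token header, etc.); a plain string stands in for that here.
  void cacheAuthzDecision(const std::string& key, AuthzDecision decision) {
    decisions_[key] = decision;
  }

  std::optional<AuthzDecision> getCachedResponse(const std::string& key) const {
    auto it = decisions_.find(key);
    if (it == decisions_.end()) {
      return std::nullopt;
    }
    return it->second;
  }

private:
  std::unordered_map<std::string, AuthzDecision> decisions_;
};

// Cache-hit path (sketch): no CheckRequest is built, no CheckResponse decoded.
//   if (auto cached = cache.getCachedResponse(key)) { /* act on *cached */ }
//   else { /* build CheckRequest, call ext_authz, then cacheAuthzDecision */ }
```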

@ggreenway ggreenway self-assigned this Jan 30, 2025
Contributor
@markdroth markdroth left a comment

Just to be clear, I am the assigned reviewer for the xDS API, not for the Envoy implementation. It looks like @ggreenway is the reviewer for the Envoy implementation. The reason for this separation is that the xDS API is not just for Envoy, it is also used by other data planes, such as gRPC, so we need to consider more than just what works for Envoy before we approve API changes.

Please let me know if you have any questions about that.



@@ -0,0 +1,209 @@
#pragma once
Contributor

I don't think we know enough yet to know whether we'll be able to use the same cache structure for ext_authz and ext_proc. I think that at the xDS layer, we should have a separate API for each one. If it turns out that the underlying implementations can actually share code, they can do so easily enough.


@andrii-korotkov-verkada

Hey, Andrii from Verkada and the Argo CD community here. I'm curious to see what's going on here, as it seems like more and more feature requests keep getting added. @markdroth, @ggreenway, for the sake of @naoitoi-v's time, is it possible to define clear goals and steps for what's needed to get the proposal and changes accepted, please? @naoitoi-v is excited about open source contributions, but this effort might just die out if he's pushed back too much. Can I help somehow?

@ggreenway
Member

Hey, Andrii from Verkada and the Argo CD community here. I'm curious to see what's going on here, as it seems like more and more feature requests keep getting added. @markdroth, @ggreenway, for the sake of @naoitoi-v's time, is it possible to define clear goals and steps for what's needed to get the proposal and changes accepted, please? @naoitoi-v is excited about open source contributions, but this effort might just die out if he's pushed back too much. Can I help somehow?

We're having a separate discussion on slack about this.

The reason for most of the additional work is that the initial proposal was a solution optimized for a specific use case, not a general solution, and we are asking for additional work so that adding a general solution, or other specifically-optimized solutions, makes sense in the way envoy is configured. It is painful to change the configuration after it has been released, so it's important to get it right the first time, but that does involve some extra work.

@naoitoi-v
Author

Remaining tasks identified so far:

  1. Move simple_cache.proto to api/envoy/extensions/ext_authz_cache/simple/v3
  2. Move all cache related configurations to simple_cache.proto.
  3. Make cache an extension point.
  4. Write speed test for eviction.
  5. Write unit tests.
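For task 4, Envoy's *_speed_test.cc files are built on the Google Benchmark library. As a rough standalone stand-in for what such a test measures, an eviction-heavy timing loop over a minimal FIFO cache could look like the following; all names are illustrative:

```cpp
#include <chrono>
#include <cstddef>
#include <queue>
#include <string>
#include <unordered_map>

// Minimal FIFO-evicting map standing in for the cache under test.
struct FifoCache {
  explicit FifoCache(std::size_t max) : max_(max) {}

  void insert(const std::string& key, int value) {
    if (map_.count(key) == 0) {
      if (map_.size() >= max_) {
        map_.erase(order_.front()); // evict oldest-inserted key
        order_.pop();
      }
      order_.push(key);
    }
    map_[key] = value;
  }

  std::size_t size() const { return map_.size(); }

  std::size_t max_;
  std::unordered_map<std::string, int> map_;
  std::queue<std::string> order_;
};

// Inserts `n` distinct keys into a cache of capacity `cap`, forcing
// n - cap evictions, and returns the wall time in microseconds.
long long timeEvictions(std::size_t cap, int n) {
  FifoCache cache(cap);
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < n; ++i) {
    cache.insert("key-" + std::to_string(i), i);
  }
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(end - start)
      .count();
}
```

A real speed test would instead register a BENCHMARK() function and let the benchmark framework handle iteration counts and reporting.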


This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale label on Mar 22, 2025

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot closed this Mar 30, 2025
Labels: api, stale (stalebot believes this issue/PR has not been touched recently), waiting
7 participants