CPU spikes result from inconsistent route order after each pod restarts #6140

@fayizk1

Description:

We noticed that Envoy CPU spikes whenever a pod restarts, and the spike worsens when multiple deployments roll out together. The perf output shows that most of this CPU time is spent parsing and compiling CORS regexes. The spike occurs even with zero traffic sent to Envoy.

Image

We can reproduce the same issue in a test environment with a Gateway and multiple HTTPRoutes configured in different namespaces (see details below). After digging a little deeper, we identified the following issues/facts.

  1. Every restart of a pod that backs at least one HTTPRoute triggers a full route configuration reload. The reload also recompiles paths and CORS domains that use regex patterns.
  2. We can confirm this by watching Envoy::Regex::Utility::parseRegex calls (see screenshots).
  3. We also noticed that the order of the route list changes after every pod restart. See attached screenshots: 1 = before the first restart, 2 = after the first pod restart, 3 = after the second pod restart.
  4. Applying a patch policy consistently results in this behavior (see the attached sample config).
  5. I can also see this behaviour without a patch policy when the Envoy proxy is newly started, but it is reproducible only once or twice. I am not sure whether heavy pod churn can also produce it.

I think this is why Envoy spends so much CPU recompiling CORS regexes. In our setup, we have 400+ routes, and most of them use prefix or domain matches. But we also have 15+ regex-based global CORS domains. This means 6000+ regex compilations (400 routes × 15 patterns) on every pod restart.
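The estimate above can be written out as a small calculation. This is only a rough sketch: it assumes every route carries its own copy of every global CORS regex and that each reload recompiles all of them. The function name `regex_compilations` and the `restarts` parameter are illustrative and not taken from Envoy or Envoy Gateway.

```python
def regex_compilations(routes: int, cors_patterns: int, restarts: int = 1) -> int:
    """Estimate regex compilations, assuming each restart triggers a full
    route reload and every route recompiles every CORS pattern."""
    return routes * cors_patterns * restarts


# Lower bounds from the setup described above: 400+ routes, 15+ CORS regexes.
print(regex_compilations(routes=400, cors_patterns=15))
```

Under the same assumption, a rollout that restarts several pods multiplies the count by the number of restarts, which would match the observation that the spike worsens when multiple deployments roll out together.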

Image

Image

Repro steps:

  1. Apply a patch policy on type.googleapis.com/envoy.config.listener.v3.Listener
  2. Create a single Gateway config and allow cross-namespace reference.
  3. Create multiple HTTPRoutes in multiple namespaces with no domain names (i.e. *). Note: a few routes should have a regex pattern (or a global CORS with wildcard match) so that the calls can be watched via bpftrace.
  4. Take a config dump.
  5. Restart one of the pods in any namespace.
  6. Wait until the new pod is back and the endpoint is updated.
  7. Take a second config dump.
  8. Compare the route order in the two config dumps.
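The comparison in the last step can be scripted. Below is a minimal sketch, assuming both dumps are JSON saved from the Envoy admin `/config_dump` endpoint and that routes appear under `dynamic_route_configs` in the `RoutesConfigDump` section. The helper names `route_order` and `compare_order` are hypothetical, and routes without a `name` are identified by their serialized `match` block, which is my own choice for illustration.

```python
import json
import sys

ROUTES_DUMP_TYPE = "type.googleapis.com/envoy.admin.v3.RoutesConfigDump"


def route_order(config_dump):
    """Return (route_config, virtual_host, route) identifiers in dump order."""
    order = []
    for section in config_dump.get("configs", []):
        if section.get("@type") != ROUTES_DUMP_TYPE:
            continue
        for dyn in section.get("dynamic_route_configs", []):
            rc = dyn.get("route_config", {})
            for vhost in rc.get("virtual_hosts", []):
                for route in vhost.get("routes", []):
                    # Fall back to the match block when the route is unnamed.
                    ident = route.get("name") or json.dumps(
                        route.get("match", {}), sort_keys=True
                    )
                    order.append((rc.get("name", ""), vhost.get("name", ""), ident))
    return order


def compare_order(before, after):
    """Return (same_routes, moved): whether both dumps hold the same routes,
    and the positions at which the two orderings differ."""
    a, b = route_order(before), route_order(after)
    same_routes = sorted(a) == sorted(b)
    moved = [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]
    return same_routes, moved


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        same, moved = compare_order(json.load(f1), json.load(f2))
    print(f"same routes: {same}, positions that differ: {len(moved)}")
```

If `same routes` is true while the number of differing positions is non-zero, the restart only reordered the routes and did not change their content.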

How to watch via bpftrace?

You can run the following one-liner on the host machine to count the total Envoy::Regex::Utility::parseRegex calls made after a pod restart.

bpftrace --unsafe -e 'BEGIN{print("Probing Envoy::Regex::Utility::parseRegex call")} uprobe:/proc/<pid>/root/usr/local/bin/envoy:0x2087a00 { @total = count();print("call")}' 2>/dev/null

Note that the production Envoy image has a stripped binary, so you need to calculate the function's offset manually from the corresponding debug image. The offset 0x2087a00 may only work with envoyproxy/envoy:contrib-v1.32.5.

Alternatively, you can use the debug image itself as the proxy image and run bpftrace -l <path>|grep parseRegex to list the mangled symbol.

Environment
Gateway version: v1.2.7
Proxy version/image: envoyproxy/envoy:contrib-v1.32.5

Sample config:

patchpolicy.txt
