Merged
47 commits
db255ec
Adding charts for envoy-gateway
Mar 9, 2026
3f2fa53
Fixing envoy-proxy repo url
Mar 10, 2026
23b43bd
Fixing name of envoy-gateway chart and aliasing
Mar 10, 2026
fdb6a5d
Updating chart version
Mar 13, 2026
247ea21
Adding envoy gateway and routes
Mar 17, 2026
907f8ec
Cleaning up charts to remove unused blocks
Mar 17, 2026
a190f34
Adding missing EG variables
Mar 18, 2026
39bd801
Updating global variables to new format
Mar 18, 2026
cab6c6a
Fixing port for flyteconsole
Mar 19, 2026
5443050
Removing auth-proxy reference in selfhosted
Mar 19, 2026
f330f62
Adding gate check for enabled features
Mar 20, 2026
f495938
Moving to using values.yaml
Mar 20, 2026
f73c6bd
Refactoring charts for envoy gateway
Mar 20, 2026
923f4df
Adding backend policy for http/2 on grpc for selfmanaged/hosted
Mar 20, 2026
e47c926
Adding timeouts and buffer limits to connections
Mar 20, 2026
5485f25
Adding bypass for unprotected endpoint and identity filter
Mar 20, 2026
5859dae
Fixing validation error on timeouts
Mar 23, 2026
7808e35
Adding redis caching for rate limiting
Mar 23, 2026
36b774f
Renaming some configs
Mar 24, 2026
826f473
Fixing usage of dig
Mar 24, 2026
26c9fb4
Refactoring config names
Mar 24, 2026
5486748
Fixing naming of services
Mar 24, 2026
115065b
Removing unused services
Mar 24, 2026
581ca34
Fix redis url
Mar 24, 2026
0307858
Fixing backend traffic policy to merge rate limit and connection tim…
Mar 24, 2026
30de1d8
Bumping up the default value for rps
Mar 25, 2026
5df5c6d
Testing rate limiting
Mar 25, 2026
0df936c
Reverting rate limit rps back to desired value
Mar 25, 2026
5f32f59
Cleaning up rate limit config
Mar 25, 2026
afa9731
Fixing grpc routes for self-managed/hosted to match ingress-nginx
Mar 25, 2026
5ff7c7f
Fixing control plane auth plugins deployment
Mar 26, 2026
eaf5e12
Fixing filter plugins...I hope
Mar 26, 2026
1d7bc3e
Updating the comment in values files
Mar 27, 2026
4a377e3
Cleaning up the values files
Mar 27, 2026
834b5b8
Giving gateway service a consistent name
Mar 27, 2026
afb1c61
Adding self-signed cert and route for handling intra cluster communic…
Mar 27, 2026
a03985d
Removing comment
Mar 27, 2026
1b500fe
Adding loginUrl for redirect
Mar 28, 2026
947ffc9
Removing v2 gating
Mar 30, 2026
3d3d179
Updating tests
Mar 30, 2026
7604013
Updating keep alive settings
Mar 30, 2026
f1a4483
Http2 keepalive is not valid
Mar 30, 2026
552f933
Adding missing rate limiting policy on protected grpc routes
Mar 30, 2026
330abab
Fixing values inconsistency for intracluster
Mar 30, 2026
df97756
Updating readme files
Mar 30, 2026
b981d25
Fixing expected helm charts for test
Apr 3, 2026
fd20e2f
Merge branch 'main' into laura/pii-108-add-gateway-api
aviator-app[bot] Apr 10, 2026
5 changes: 5 additions & 0 deletions charts/controlplane/Chart.yaml
@@ -28,3 +28,8 @@ dependencies:
    version: 80.8.0
    alias: monitoring
    condition: monitoring.enabled
  - name: gateway-helm
    alias: envoy-gateway
    repository: oci://docker.io/envoyproxy
    version: v1.6.4
    condition: envoy-gateway.enabled
40 changes: 34 additions & 6 deletions charts/controlplane/README.md
@@ -20,6 +20,9 @@ helm repo add flyte https://helm.flyte.org
# Add Ingress NGINX Helm repository (if using ingress-nginx)
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx

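# Envoy Gateway (if using Envoy Gateway) is published as an OCI chart. OCI registries
# cannot be added with `helm repo add`; reference the chart directly instead.
helm pull oci://docker.io/envoyproxy/gateway-helm --version v1.6.4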

# Add ScyllaDB Helm repository (if using ScyllaDB)
helm repo add scylla https://scylla-operator-charts.storage.googleapis.com/stable

@@ -52,6 +55,7 @@ Kubernetes: `>= 1.28.0-0`
|------------|------|---------|----------|-------|
| https://helm.flyte.org | flyte-core(flyte) | v1.16.0-b2 | No | Required |
| https://kubernetes.github.io/ingress-nginx | ingress-nginx | 4.12.3 | Yes | Only if `ingress-nginx.enabled: true` |
| oci://docker.io/envoyproxy | gateway-helm(envoy-gateway) | v1.6.4 | Yes | Only if `envoy-gateway.enabled: true`; for selfmanaged deployments install via ArgoCD ApplicationSet instead |
| https://scylla-operator-charts.storage.googleapis.com/stable | scylla-operator | v1.18.1 | Yes | Only if `scylla.enabled: true` |
| https://scylla-operator-charts.storage.googleapis.com/stable | scylla | v1.18.1 | Yes | Only if `scylla.enabled: true` |
| https://prometheus-community.github.io/helm-charts | monitoring(kube-prometheus-stack) | 80.8.0 | Yes | Only if `monitoring.enabled: true` |
@@ -261,17 +265,41 @@ helm upgrade --install union-controlplane unionai/controlplane \
--values values.yaml
```

### Installation with Ingress NGINX
### Ingress Controller

The chart supports two ingress controllers, selected via `global.INGRESS_PROVIDER`:

| Value | Behavior |
|-------|----------|
| `nginx` | Only nginx Ingress objects rendered (default) |
| `envoy` | Only Envoy Gateway API resources rendered (HTTPRoute/GRPCRoute/Gateway) |
| `both` | Both sets rendered simultaneously — use during migration |
Contributor:

Could you add a section somewhere that explains a bit more about what the setup looks like with both enabled? The pre- and post-migration states are pretty clear, but it's a bit fuzzy how ingress will function during the dual-deployment phase.

Contributor Author:

I'll update the comment in a follow-up PR. Essentially, the two aren't meant to run side by side for long. The `both` mode exists so that we can switch back if something goes wrong while we are migrating. There is a flag (set in the cloud repo, in the Terraform) that updates external-dns to use either "ingress" or "httproute"/"grpcroute" as its target, which tells DNS to point at either nginx or Envoy. I have also set up weighted routing so that we can route only a small percentage of requests during testing if we would like.
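
For illustration only, the dual-deployment phase discussed in this thread might be expressed with overrides along these lines. This is a sketch assembled from keys that appear in this PR's README changes; it is not part of the diff. Enabling both controllers as sub-charts at the same time is an assumption, and the external-dns flag and weighted routing mentioned above live outside this chart and are not shown.

```yaml
# Hypothetical migration-phase overrides (assumption: both controllers run as sub-charts).
global:
  INGRESS_PROVIDER: both   # render nginx Ingress objects and Gateway API resources together

ingress-nginx:
  enabled: true            # existing nginx path stays available for rollback

envoy-gateway:
  enabled: true            # gateway-helm sub-chart; keep false if EG is installed separately

envoyGateway:
  gatewayClassName: envoy  # must match the GatewayClass created by the EG install
```

Switching `global.INGRESS_PROVIDER` back to `nginx` would then stop rendering the Gateway API resources, which matches the rollback intent described in the reply.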


If you need ingress support:
#### Installation with Ingress NGINX

```yaml
global:
  INGRESS_PROVIDER: nginx

ingress-nginx:
  enabled: true
```

#### Installation with Envoy Gateway

Envoy Gateway can be installed as a sub-chart (managed deployments) or as a separate Helm release via ArgoCD (selfmanaged deployments — see [Self-Hosted Guides](#alternative-deployment-models)).

For sub-chart installation:

```yaml
global:
  INGRESS_PROVIDER: envoy

envoy-gateway:
  enabled: true # installs gateway-helm as a sub-chart

ingress:
  className: "controlplane"
  secretService: true
envoyGateway:
  gatewayClassName: envoy # must match the GatewayClass created by the EG install
```

## Verification
@@ -302,7 +330,7 @@ helm show values unionai/controlplane
- **Postgres Configuration** (Required): Set `dbHost`, `dbName`, `dbUser`, and `dbPass` for the primary database used by all control plane services except the queue service
- **ScyllaDB Configuration** (Required): Configure `scylla` section for the queue service database. Set `scylla.enabled: true` for embedded cluster or provide `scylla.externalHost` for external ScyllaDB
- **Object Storage**: Configure `bucketName`, `artifactsBucketName`, and `region` for S3-compatible storage
- **Ingress**: Enable and configure ingress under `ingress-nginx` section
- **Ingress**: Set `global.INGRESS_PROVIDER` to `nginx`, `envoy`, or `both`. Enable the relevant controller (`ingress-nginx.enabled` or `envoy-gateway.enabled`) and configure `envoyGateway.gatewayClassName` when using Envoy Gateway

---

41 changes: 37 additions & 4 deletions charts/controlplane/SELFHOSTED_INTRA_CLUSTER_AWS.md
@@ -275,13 +275,23 @@ global:
# override is configured.
```

### TLS Requirements
### Ingress Controller

gRPC requires TLS for HTTP/2 with NGINX. Refer to [values.aws.selfhosted-intracluster.yaml](./values.aws.selfhosted-intracluster.yaml) for example configuration.
The chart supports two ingress controllers, selected via `global.INGRESS_PROVIDER`:

| Value | Behavior |
|-------|----------|
| `nginx` | Only nginx Ingress objects rendered (default) |
| `envoy` | Only Envoy Gateway API resources rendered (HTTPRoute/GRPCRoute/Gateway) |
| `both` | Both sets rendered simultaneously — use during migration |

#### NGINX (default)

TLS is required for gRPC over HTTP/2. Refer to [values.aws.selfhosted-intracluster.yaml](./values.aws.selfhosted-intracluster.yaml) for example configuration.

```yaml
global:
  # Configure namespace and name of the Kubernetes TLS secret.
  INGRESS_PROVIDER: nginx
  TLS_SECRET_NAMESPACE: ""
  TLS_SECRET_NAME: ""

@@ -292,12 +302,35 @@ ingress-nginx:
default-ssl-certificate: "<TLS_SECRET_NAMESPACE>/<TLS_SECRET_NAME>"
```

#### Envoy Gateway

Envoy Gateway is installed as a **separate Helm release** via an ArgoCD ApplicationSet — it is not a sub-chart of the controlplane chart. To enable it:

1. Deploy the Envoy Gateway controller into the cluster (see `cloud/infra/argocd/deploy/manifests/appset-selfmanaged-envoy-gateway.yaml`).
2. Set the ingress provider and gateway class in your overrides:

```yaml
global:
  INGRESS_PROVIDER: envoy # or "both" during parallel rollout

envoyGateway:
  gatewayClassName: controlplane-envoy # must match the GatewayClass created by the EG install
```

The `envoy-gateway.enabled` key controls whether the chart's bundled sub-chart dependency is installed. For selfmanaged deployments this stays `false` because EG is managed separately:

```yaml
envoy-gateway:
  enabled: false # EG is installed via its own ArgoCD ApplicationSet, not as a sub-chart
```

### Service Discovery

Control plane services discover each other via Kubernetes DNS:

- **Flyteadmin**: `flyteadmin.union-cp.svc.cluster.local:81`
- **NGINX Ingress**: `controlplane-nginx-controller.union-cp.svc.cluster.local`
- **Envoy Gateway**: `controlplane-envoy-gateway.union-cp.svc.cluster.local` (when using EG)
- **Dataplane** (for dataproxy): `dataplane-nginx-controller.union.svc.cluster.local`

## Authentication (OIDC/OAuth2)
@@ -403,7 +436,7 @@ flyte:
useAuth: true
```

This enables nginx auth-subrequest validation on protected ingress routes.
This enables auth validation on protected ingress routes (nginx auth-subrequest for the nginx path; the Envoy Gateway path uses an equivalent Go auth filter via EnvoyPatchPolicy).

### Verifying Authentication

41 changes: 37 additions & 4 deletions charts/controlplane/SELFHOSTED_INTRA_CLUSTER_GCP.md
@@ -287,13 +287,23 @@ global:
# override is configured.
```

### TLS Requirements
### Ingress Controller

gRPC requires TLS for HTTP/2 with NGINX. Refer to [values.gcp.selfhosted-intracluster.yaml](./values.gcp.selfhosted-intracluster.yaml) for example configuration.
The chart supports two ingress controllers, selected via `global.INGRESS_PROVIDER`:

| Value | Behavior |
|-------|----------|
| `nginx` | Only nginx Ingress objects rendered (default) |
| `envoy` | Only Envoy Gateway API resources rendered (HTTPRoute/GRPCRoute/Gateway) |
| `both` | Both sets rendered simultaneously — use during migration |

#### NGINX (default)

TLS is required for gRPC over HTTP/2. Refer to [values.gcp.selfhosted-intracluster.yaml](./values.gcp.selfhosted-intracluster.yaml) for example configuration.

```yaml
global:
  # Configure namespace and name of the Kubernetes TLS secret.
  INGRESS_PROVIDER: nginx
  TLS_SECRET_NAMESPACE: ""
  TLS_SECRET_NAME: ""

@@ -304,12 +314,35 @@ ingress-nginx:
default-ssl-certificate: "<TLS_SECRET_NAMESPACE>/<TLS_SECRET_NAME>"
```

#### Envoy Gateway

Envoy Gateway is installed as a **separate Helm release** via an ArgoCD ApplicationSet — it is not a sub-chart of the controlplane chart. To enable it:

1. Deploy the Envoy Gateway controller into the cluster (see `cloud/infra/argocd/deploy/manifests/appset-selfmanaged-envoy-gateway.yaml`).
2. Set the ingress provider and gateway class in your overrides:

```yaml
global:
  INGRESS_PROVIDER: envoy # or "both" during parallel rollout

envoyGateway:
  gatewayClassName: controlplane-envoy # must match the GatewayClass created by the EG install
```

The `envoy-gateway.enabled` key controls whether the chart's bundled sub-chart dependency is installed. For selfmanaged deployments this stays `false` because EG is managed separately:

```yaml
envoy-gateway:
  enabled: false # EG is installed via its own ArgoCD ApplicationSet, not as a sub-chart
```

### Service Discovery

Control plane services discover each other via Kubernetes DNS:

- **Flyteadmin**: `flyteadmin.union-cp.svc.cluster.local:81`
- **NGINX Ingress**: `controlplane-nginx-controller.union-cp.svc.cluster.local`
- **Envoy Gateway**: `controlplane-envoy-gateway.union-cp.svc.cluster.local` (when using EG)
- **Dataplane** (for dataproxy): `dataplane-nginx-controller.union.svc.cluster.local`

## Authentication (OIDC/OAuth2)
@@ -415,7 +448,7 @@ flyte:
useAuth: true
```

This enables nginx auth-subrequest validation on protected ingress routes.
This enables auth validation on protected ingress routes (nginx auth-subrequest for the nginx path; the Envoy Gateway path uses an equivalent Go auth filter via EnvoyPatchPolicy).

### Verifying Authentication

108 changes: 108 additions & 0 deletions charts/controlplane/templates/common/_backendtrafficpolicy.yaml
@@ -0,0 +1,108 @@
{{- define "control-plane-library.backendtrafficpolicy" }}
# BackendTrafficPolicy — configures Envoy→gRPC backend connection settings.
# Replaces nginx grpc_connect_timeout, grpc_read_timeout, grpc_send_timeout.
# Two policies (one per GRPCRoute) so h2c and timeouts are scoped to gRPC traffic only.
#
# requestTimeout applies to unary calls; maxStreamDuration applies to streaming calls.
# "0s" for maxStreamDuration means no limit (equivalent to grpc_read_timeout 604800s on streaming routes).
# Both protected and unprotected GRPCRoutes contain streaming methods so both get the same config.
#
# Rate limit is also included here (when enabled) because route-level BTPs override gateway-level ones,
# so the gateway-level rate-limit BTP below would be suppressed for these two GRPCRoutes without it.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: {{ template "flyte.name" . }}-grpc-protected-h2c
  namespace: {{ template "flyte.namespace" . }}
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: GRPCRoute
      name: {{ template "flyte.name" . }}-grpc-protected
  timeout:
    tcp:
      connectTimeout: "1200s" # grpc_connect_timeout 1200s
    http:
      requestTimeout: "1200s" # grpc_read_timeout 1200s (unary calls)
      maxStreamDuration: "0s" # no limit for streaming (grpc_read_timeout 604800s on streaming routes)
  tcpKeepalive:
    probes: 9
    idleTime: "15s"
    interval: "15s"
  http2: {}
{{- if .Values.envoyGateway.rateLimit.enabled }}
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - sourceCIDR:
                type: Distinct
                value: "0.0.0.0/0"
          limit:
            requests: {{ .Values.envoyGateway.rateLimit.requestsPerUnit | default 100 }}
            unit: {{ .Values.envoyGateway.rateLimit.unit | default "Second" }}
{{- end }}
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: {{ template "flyte.name" . }}-grpc-unprotected-h2c
  namespace: {{ template "flyte.namespace" . }}
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: GRPCRoute
      name: {{ template "flyte.name" . }}-grpc-unprotected
  timeout:
    tcp:
      connectTimeout: "1200s" # grpc_connect_timeout 1200s
    http:
      requestTimeout: "1200s" # grpc_read_timeout 1200s (unary calls)
      maxStreamDuration: "0s" # no limit for WatchExecutionStatusUpdates streaming
  tcpKeepalive:
    probes: 9
    idleTime: "15s"
    interval: "15s"
  http2: {}
{{- if .Values.envoyGateway.rateLimit.enabled }}
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - sourceCIDR:
                type: Distinct
                value: "0.0.0.0/0"
          limit:
            requests: {{ .Values.envoyGateway.rateLimit.requestsPerUnit | default 100 }}
            unit: {{ .Values.envoyGateway.rateLimit.unit | default "Second" }}
{{- end }}
{{- if .Values.envoyGateway.rateLimit.enabled }}
---
# Global per-source-IP rate limit — replaces nginx.ingress.kubernetes.io/limit-rps annotation.
# Requires EG rateLimit backend (envoyproxy/ratelimit + Redis) to be running.
# Enable via envoyGateway.rateLimit.enabled: true once the backend is confirmed healthy.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: {{ template "flyte.name" . }}-global-rate-limit
  namespace: {{ template "flyte.namespace" . }}
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: {{ template "flyte.name" . }}
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - sourceCIDR:
                type: Distinct
                value: "0.0.0.0/0"
          limit:
            requests: {{ .Values.envoyGateway.rateLimit.requestsPerUnit | default 100 }}
            unit: {{ .Values.envoyGateway.rateLimit.unit | default "Second" }}
{{- end }}
{{- end }}
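
As a reading aid (not part of this diff), the values that the template above consumes could be supplied roughly as follows. The key names are taken from the `.Values.envoyGateway.rateLimit.*` references in the template; the numbers shown are simply the template's own fallbacks, and the actual defaults in the chart's `values.yaml` are not visible in this excerpt.

```yaml
# Sketch of overrides read by control-plane-library.backendtrafficpolicy.
envoyGateway:
  rateLimit:
    enabled: true         # gates the rateLimit blocks and the gateway-level policy
    requestsPerUnit: 100  # the template falls back to 100 when this is unset
    unit: Second          # the template falls back to "Second" when this is unset
```

With `enabled: false`, only the two GRPCRoute-scoped policies are rendered and they carry no `rateLimit` section.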
23 changes: 23 additions & 0 deletions charts/controlplane/templates/common/_clienttrafficpolicy.yaml
@@ -0,0 +1,23 @@
{{- define "control-plane-library.clienttrafficpolicy" }}
# ClientTrafficPolicy — configures inbound client connection settings on the Gateway.
# Replaces nginx server-snippet: client_header_timeout, client_body_timeout,
# client_header_buffer_size, and large_client_header_buffers.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: ClientTrafficPolicy
metadata:
  name: {{ template "flyte.name" . }}-client-policy
  namespace: {{ template "flyte.namespace" . }}
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: {{ template "flyte.name" . }}
  timeout:
    http:
      requestReceivedTimeout: "0s" # client_header_timeout 604800
      streamIdleTimeout: "0s" # client_body_timeout 604800
  connection:
    # large_client_header_buffers 64 32k = 2Mi total; mitigates 400 errors from large cookies
    # at the /me auth endpoint (see PE-1101).
    bufferLimit: "2Mi"
{{- end }}
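
The two named templates above only define content; something else has to include them. The excerpt does not show that call site, so the following is a hedged sketch of one way it could look, assuming a hypothetical template file that gates on `global.INGRESS_PROVIDER` as described in the README changes. The file name and the exact gating expression are assumptions, not the PR's implementation.

```yaml
{{- /* Hypothetical call site, e.g. templates/envoy-policies.yaml (name is an assumption). */}}
{{- /* Render the Envoy Gateway policies only when the envoy path is active. */}}
{{- if or (eq .Values.global.INGRESS_PROVIDER "envoy") (eq .Values.global.INGRESS_PROVIDER "both") }}
{{- include "control-plane-library.clienttrafficpolicy" . }}
---
{{- include "control-plane-library.backendtrafficpolicy" . }}
{{- end }}
```

Gating at the call site keeps the library templates free of provider checks, so they render the same way wherever they are included.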