diff --git a/content/deployment/_index.md b/content/deployment/_index.md index 27b9c21ec..d6c909a67 100644 --- a/content/deployment/_index.md +++ b/content/deployment/_index.md @@ -35,6 +35,20 @@ The Self-managed deployment allows you to manage the data plane yourself on clou * The **control plane**, as with all Union.ai deployment options, resides in the Union.ai Amazon Web Services (AWS) account and is administered by Union.ai. However, as mentioned, data separation is maintained between the data plane and the control plane, with no control plane access to the code, input/output, images or logs in the data plane. +## Self-hosted deployment + +For complete data sovereignty, you can host both the control plane and data plane in the same Kubernetes cluster. In this model, all communication stays within your infrastructure via Kubernetes internal networking. + +Self-hosted deployment is distinct from Self-managed deployment: in a self-hosted deployment, you manage both the control plane and the data plane, whereas in a self-managed deployment, Union.ai hosts the control plane and you manage the data plane. + +{{< grid >}} + +{{< link-card target="./selfhosted/_index" icon="server" title="Self-hosted deployment" >}} +Deploy both control plane and data plane in your own Kubernetes cluster +{{< /link-card >}} + +{{< /grid >}} + ## Data plane The data plane runs in your cloud account and VPC. It is composed of the required services to run and monitor workflows: diff --git a/content/deployment/glossary.md b/content/deployment/glossary.md new file mode 100644 index 000000000..1d6de3819 --- /dev/null +++ b/content/deployment/glossary.md @@ -0,0 +1,34 @@ +--- +title: Deployment glossary +weight: 99 +variants: -flyte +union +--- + +# Deployment glossary + +## Deployment models + +**Self-managed deployment** +: {{< key product_name >}} hosts the control plane. You manage the data plane in your own cloud account. Data plane provisioning is handled via `uctl selfserve`. See [Platform deployment](./_index). + +**Self-hosted deployment** +: You host both the control plane and data plane in the same Kubernetes cluster. All communication stays within your infrastructure via Kubernetes internal networking. See [Self-hosted deployment](./selfhosted/_index). + +## Architecture components + +**Control plane** +: The orchestration layer that manages workflow execution state. Includes FlyteAdmin, scheduler, queue service, cache service, and supporting services. In self-managed deployments, the control plane is hosted by {{< key product_name >}}. In self-hosted deployments, you deploy it in the `union-cp` namespace. + +**Data plane** +: The execution layer where your code and data reside. Includes the operator, propeller, and worker pods that run your tasks. Deployed in the `union` namespace. + +**Intra-cluster** +: A deployment topology where the control plane and data plane run in the same Kubernetes cluster and communicate via internal DNS (e.g., `controlplane-nginx-controller.union-cp.svc.cluster.local`) rather than external networking. + +## Cloud provider concepts + +**IRSA** (IAM Roles for Service Accounts) +: AWS mechanism that allows Kubernetes service accounts to assume IAM roles. Used by control plane and data plane pods to access AWS resources (S3, RDS) without static credentials. + +**Workload Identity** +: GCP mechanism that allows Kubernetes service accounts to impersonate GCP service accounts. The GCP equivalent of IRSA. 
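Both mechanisms are wired up through annotations on a Kubernetes service account. A minimal illustrative sketch (the role ARN and service-account email below are hypothetical placeholders):

```yaml
# IRSA (AWS): the pod's service account is allowed to assume an IAM role.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: union-controlplane
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/union-controlplane  # hypothetical role
---
# Workload Identity (GCP): the service account impersonates a GCP service account.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: union-controlplane
  annotations:
    iam.gke.io/gcp-service-account: union-controlplane@my-project.iam.gserviceaccount.com  # hypothetical GSA
```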
diff --git a/content/deployment/selfhosted/_index.md b/content/deployment/selfhosted/_index.md new file mode 100644 index 000000000..7bc6d7112 --- /dev/null +++ b/content/deployment/selfhosted/_index.md @@ -0,0 +1,141 @@ +--- +title: Self-hosted deployment +weight: 6 +variants: -flyte +union +sidebar_expanded: true +mermaid: true +--- + +# Self-hosted deployment + +In a self-hosted deployment, you host both the **control plane** and the **data plane** in the same Kubernetes cluster. This gives you complete control over your {{< key product_name >}} installation with full data sovereignty. + +> [!NOTE] +> Self-hosted deployment is distinct from [self-managed deployment](../selfmanaged/_index), where {{< key product_name >}} hosts the control plane and you manage only the data plane. + +## When to use self-hosted deployment + +Choose self-hosted deployment when: + +- You need full control over both control plane and data plane +- You have strict data locality or sovereignty requirements +- You want to minimize network egress costs +- You are running in an air-gapped or restricted network environment + +Choose [self-managed deployment](../selfmanaged/_index) when: + +- You want {{< key product_name >}} to manage the control plane +- You need {{< key product_name >}}'s managed services and support +- Control plane and data plane are in separate clusters + +## Architecture + +In a self-hosted intra-cluster deployment, the control plane and data plane communicate using Kubernetes internal networking rather than external endpoints. + +```mermaid +graph TB + subgraph cluster["Kubernetes Cluster"] + subgraph cp["Controlplane Namespace"] + cpingress["NGINX Ingress\n(TLS/HTTP2)\nClusterIP"] + admin["Admin"] + identity["Identity"] + services["Services"] + + cpingress --> admin + cpingress --> identity + cpingress --> services + end + + subgraph dp["Dataplane Namespace"] + dpingress["NGINX Ingress\nClusterIP"] + operator["Operator"] + propeller["Propeller"] + clusterresource["Cluster Resource\nSync"] + + dpingress --> operator + dpingress --> propeller + dpingress --> clusterresource + end + + subgraph external["External Resources"] + db["PostgreSQL"] + storage["Object Storage\n(S3 / GCS)"] + end + + dpingress -.->|"Internal DNS"| cpingress + cpingress -.->|"Internal DNS"| dpingress + + admin --> db + identity --> db + services --> db + admin --> storage + operator --> storage + end +``` + +**Key characteristics:** + +- **Simplified networking**: All communication stays within the cluster via Kubernetes DNS +- **No external dependencies**: No internet connectivity required for control plane to data plane communication +- **Cost-effective**: No network egress costs between control plane and data plane +- **Self-signed certificates**: Can use self-signed certificates for intra-cluster TLS +- **Single-tenant mode**: Simplified security model with explicit organization configuration + +## Prerequisites + +### Infrastructure + +- **Kubernetes cluster** (>= 1.28.0) with sufficient resources for both control plane and data plane. Recommended: at least 6 nodes with 8 CPU / 16GB RAM each. +- **PostgreSQL database** (12+), either cloud-managed (RDS, Cloud SQL) or self-hosted. +- **Object storage** (S3 or GCS) for metadata and artifacts. +- **IAM roles** or **service accounts** configured for cloud resource access. 
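Before moving on, you can sanity-check the cluster side of these requirements with standard `kubectl` commands (a quick sketch; adjust the expected node count and sizes to your own sizing):

```shell
# Server version should be >= 1.28.0
kubectl version

# Node count and allocatable CPU/memory against the sizing guidance above
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory
```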
+ +### Tools + +- [Helm](https://helm.sh/docs/intro/install/) 3.18+ +- `kubectl` configured to access your cluster +- `openssl` or `cert-manager` for TLS certificate generation + +### Registry access + +{{< key product_name >}} control plane images are hosted in a private registry. You will receive registry credentials from the {{< key product_name >}} team for your organization. + +## Deployment guides + +Deploy the control plane first, then the data plane. + +{{< grid >}} + +{{< link-card target="./control-plane-aws" icon="server" title="Control plane on AWS" >}} +Deploy the control plane with Amazon RDS and S3 +{{< /link-card >}} + +{{< link-card target="./control-plane-gcp" icon="server" title="Control plane on GCP (Preview)" >}} +Deploy the control plane with Cloud SQL and GCS +{{< /link-card >}} + +{{< link-card target="./data-plane-aws" icon="cpu" title="Data plane on AWS" >}} +Deploy the data plane with S3 and IRSA +{{< /link-card >}} + +{{< link-card target="./data-plane-gcp" icon="cpu" title="Data plane on GCP (Preview)" >}} +Deploy the data plane with GCS and Workload Identity +{{< /link-card >}} + +{{< link-card target="./authentication" icon="lock" title="Authentication" >}} +Configure OIDC/OAuth2 authentication for your deployment +{{< /link-card >}} + +{{< link-card target="./authorization" icon="shield" title="Authorization" >}} +Configure authorization mode (Noop, External, or Union built-in RBAC) +{{< /link-card >}} + +{{< link-card target="./image-builder" icon="package" title="Image builder" >}} +Register the image builder for automatic container image builds +{{< /link-card >}} + +{{< link-card target="./operations" icon="settings" title="Operations" >}} +Operational guides: CI/CD integration, key rotation, and more +{{< /link-card >}} + +{{< /grid >}} diff --git a/content/deployment/selfhosted/authentication.md b/content/deployment/selfhosted/authentication.md new file mode 100644 index 000000000..50b03c140 --- /dev/null +++ b/content/deployment/selfhosted/authentication.md @@ -0,0 +1,560 @@ +--- +title: Authentication +weight: 5 +variants: -flyte +union +--- + +# Authentication + +{{< key product_name >}} self-hosted deployments use [OpenID Connect (OIDC)](https://openid.net/specs/openid-connect-core-1_0.html) for user authentication and [OAuth 2.0](https://tools.ietf.org/html/rfc6749) for service-to-service authorization. + +Unlike serverless and BYOC deployments where {{< key product_name >}} manages authentication for you, **self-hosted deployments require you to create and manage OAuth applications in your own identity provider** (e.g. Okta, Microsoft Entra ID, Google Workspace, or any OIDC-compliant provider). {{< key product_name >}} does not provision or manage these applications — you are responsible for their lifecycle, credential rotation, and access policies. + +> [!NOTE] +> This guide covers authentication for **self-hosted** deployments where you manage both the control plane and data plane. For **self-managed** deployments ({{< key product_name >}}-hosted control plane), authentication is handled automatically via `uctl selfserve provision-dataplane-resources` and `uctl create apikey`. + +## Overview + +Self-hosted authentication requires creating **five OAuth2 client applications** in your own identity provider (plus an optional sixth for CI/CD). 
Each application serves a different authentication flow: + +| # | Application | Type | Grant types | Purpose | +|---|-------------|------|-------------|---------| +| 1 | Browser login | Confidential (web) | `authorization_code`, `refresh_token`, `client_credentials` | Console/web UI login | +| 2 | CLI | Public (native) | `authorization_code`, `refresh_token`, `device_code` | `uctl` / `flytectl` CLI authentication via PKCE | +| 3 | Service-to-service | Confidential (service) | `client_credentials` | Control plane inter-service communication through NGINX | +| 4 | Operator | Confidential (service) | `client_credentials` | Data plane operator, propeller, and cluster-resource-sync authentication to control plane | +| 5 | EAGER | Confidential (service) | `client_credentials` | Task pod authentication (EAGER_API_KEY) | +| 6 | CI/CD _(optional)_ | Confidential (service) | `client_credentials` | Non-interactive workflow deployment from CI/CD pipelines | + +> [!NOTE] +> App 6 (CI/CD) is only needed if you deploy workflows from automated pipelines. See the [CI/CD integration](./operations/cicd) guide for full setup instructions. + +## Identity provider requirements + +You must use an OIDC-compliant identity provider that you manage outside of {{< key product_name >}}. Any standards-compliant provider will work. {{< key product_name >}} uses [Okta](https://www.okta.com/) for its internal deployments, but you can use whichever provider your organization already uses. + +Your identity provider must support: + +1. **OpenID Connect Discovery** — `/.well-known/openid-configuration` or `/.well-known/oauth-authorization-server` endpoint +2. **Authorization Code flow** — for browser and CLI login +3. **Client Credentials flow** — for service-to-service tokens +4. **PKCE** (Proof Key for Code Exchange) — for the CLI public client +5. **Custom scopes** — ability to create an `all` scope (or equivalent) +6. **Custom claims** — ability to emit `sub` and `preferred_username` claims in access tokens. An identity type claim (`identitytype` or equivalent) is recommended for authorization. + +### Authorization server setup + +Create a custom authorization server (or equivalent) in your identity provider. The setup differs by provider: + +{{< tabs >}} +{{< tab "Okta" >}} +{{< markdown >}} +Create a **Custom Authorization Server** in Okta: + +- **Audience**: `https://` (the control plane ingress domain) +- **Default scope**: `all` +- **Metadata URL**: `.well-known/oauth-authorization-server` (Okta-specific, not the standard `openid-configuration`) +- **Claims** (add as access token claims): + - `sub` — Okta populates this natively. For client_credentials tokens, `sub` equals the app's Client ID. + - `identitytype` — set to `"user"` for user tokens, `"app"` for client_credentials tokens + - `preferred_username` — set to the user's login for user tokens, or the app's Client ID for app tokens +{{< /markdown >}} +{{< /tab >}} +{{< tab "Entra ID" >}} +{{< markdown >}} +Register an **App Registration** in Microsoft Entra ID: + +- **App ID URI**: `api://` (this becomes the audience) +- **Metadata URL**: `.well-known/openid-configuration` (standard OIDC) +- **Scopes**: The `/.default` scope is used automatically for client_credentials and CLI flows. No custom scopes are required — browser login uses standard OIDC scopes only. +- **Claims** — configure via the app manifest's `optionalClaims`: + - `sub` — Entra populates this natively. 
For client_credentials tokens, `sub` equals the **Service Principal Object ID** (not the Client ID). + - `idtyp` — add as an optional access token claim **on the browser login app registration (App 1)**, since it is the resource server. Emits `"app"` for client_credentials tokens (maps to the `identitytype` concept). + - `preferred_username` — included by default for user tokens + +> [!WARNING] +> Entra ID uses `sub` = Service Principal Object ID for client_credentials tokens, not the Client ID. When configuring trusted identities for service-to-service auth, use the SP Object ID (found in Enterprise Applications, not App Registrations). + +> [!NOTE] +> Entra ID scope usage by flow: +> - **Browser login** (authorization_code): standard OIDC scopes only (`profile`, `openid`, `offline_access`) — the IdP returns a plain ID token +> - **CLI** (authorization_code + PKCE): `api:///.default` +> - **Service-to-service** (client_credentials): `api:///.default` +{{< /markdown >}} +{{< /tab >}} +{{< tab "Generic OIDC" >}} +{{< markdown >}} +For other OIDC providers (Keycloak, Authentik, Auth0, etc.): + +- **Audience**: `https://` or a custom resource identifier +- **Metadata URL**: Usually `.well-known/openid-configuration` +- **Scopes**: Create an `all` scope (or use your provider's default scope) +- **Claims**: Ensure access tokens include: + - `sub` — a stable identifier for the authenticated principal + - `preferred_username` — display name for identity injection + - An identity type claim is optional but recommended for authorization + +If your IdP cannot emit an `identitytype` claim, see the [identity type claim requirements](#identity-type-claim-requirements) section below. + +If your IdP's client_credentials tokens omit the `sub` claim, configure `subjectClaimNames` to specify a fallback chain (e.g., `["sub", "client_id", "azp"]`). +{{< /markdown >}} +{{< /tab >}} +{{< /tabs >}} + +### Identity type claim requirements + +{{< key product_name >}} uses an identity type claim to distinguish human users from service applications. This distinction is **required for Union (built-in RBAC) authorization** and affects how access control decisions are made. + +Your IdP must emit a claim that maps to the `identitytype` concept, with values that distinguish user tokens from application tokens. The claim name and values are configurable: + +| Provider | Claim name | User value | App value | Configuration | +|----------|-----------|------------|-----------|---------------| +| Okta | `identitytype` | `"user"` | `"app"` | Custom access token claim on authorization server | +| Entra ID | `idtyp` | (not emitted) | `"app"` | Enable via optional claims in app manifest. Map with `identityTypeClaimsForApps: {idtyp: ["app"]}` | +| Generic | varies | varies | varies | Configure `identityTypeClaimsForApps` to map your claim name and values | + +> [!WARNING] +> **Union authorization mode requires identity type resolution.** If your IdP cannot emit any claim that distinguishes users from applications, you must either: +> 1. Set `global.USE_EXTERNAL_IDENTITY: true` — the platform will infer identity type from the authentication context (e.g., whether the token was issued via authorization_code or client_credentials flow). This works for basic cases but may not cover all scenarios. +> 2. 
Use **External authorization mode** instead of Union mode — your external authorization server can determine identity type from the JWT payload, `sub` claim, or any other token attribute directly, without relying on the platform's identity type resolution. +> +> Without identity type resolution, Union authorization cannot distinguish user requests from service account requests, which may result in incorrect access control decisions. + +### Subject claim requirements + +{{< key product_name >}} uses the JWT `sub` claim as the **primary identifier** for all callers — users and service accounts alike. This value is used for: + +- **Authorization decisions** — matching callers to roles and permissions +- **Trusted identity validation** — verifying internal service-to-service callers +- **Audit logging** — recording who performed each action +- **Resource ownership** — the "Owned By" relationship in the console + +> [!WARNING] +> The `sub` claim value must be **stable and unique** per principal. If your IdP returns different `sub` values for the same user across token refreshes, authorization and ownership tracking will break. + +**Your IdP must emit a `sub` claim in all access tokens.** If your IdP's client_credentials tokens use a different claim for the caller identity (or omit `sub` entirely), configure `subjectClaimNames` to specify a fallback chain: + +```yaml +# In flyte.configmap.adminServer.auth.appAuth.externalAuthServer: +subjectClaimNames: + - sub # Standard OIDC subject (tried first) + - client_id # OAuth2 client ID (common fallback) + - azp # Authorized party (alternative) +``` + +The platform tries each claim in order and uses the first non-empty value as the caller's identity. + +> [!NOTE] +> **Provider-specific `sub` values:** +> - **Okta**: `sub` equals the Client ID for client_credentials tokens and the user's Okta ID for user tokens. +> - **Entra ID**: `sub` equals the **Service Principal Object ID** for client_credentials tokens (not the Client ID). Find this in Entra ID > Enterprise Applications > your app > Object ID. +> - When configuring trusted identities for internal services (e.g., `INTERNAL_SUBJECT_ID`), use the value that your IdP places in the `sub` claim — not necessarily the Client ID. + +## Step 1: Create OAuth2 applications + +### Application 1: Browser login (Confidential) + +Used by the web console for user authentication. + +| Property | Value | +|----------|-------| +| Type | Web (confidential client) | +| Grant types | `authorization_code`, `refresh_token`, `client_credentials` | +| Redirect URI | `https:///callback` | +| Post-logout redirect URI | `https:///logout` | +| Scopes | `openid`, `profile`, `offline_access` | + +Note the **Client ID** (used as `OIDC_CLIENT_ID`) and the **Client Secret** (stored in Kubernetes secrets). + +### Application 2: CLI (Public) + +Used by `uctl` and `flytectl` for CLI-based authentication with PKCE. + +| Property | Value | +|----------|-------| +| Type | Native (public client) | +| Grant types | `authorization_code`, `refresh_token`, `device_code` | +| Redirect URIs | `http://localhost:53593/callback`, `http://localhost:12345/callback` | +| PKCE | Required | +| Client authentication | None (PKCE only, no client secret) | + +Note the **Client ID** (used as `CLI_CLIENT_ID`). + +### Application 3: Service-to-service (Confidential) + +Used by control plane services (executions, cluster, identity, etc.) to authenticate with each other through NGINX when OIDC is enabled. 
+ +| Property | Value | +|----------|-------| +| Type | Service (confidential client) | +| Grant types | `client_credentials` | + +Note the **Client ID** (used as `INTERNAL_CLIENT_ID`) and the **Client Secret** (stored in Kubernetes secrets). + +### Application 4: Operator (Confidential) + +Used by data plane services (operator, propeller, cluster-resource-sync) to authenticate to the control plane. + +| Property | Value | +|----------|-------| +| Type | Service (confidential client) | +| Grant types | `client_credentials` | + +Note the **Client ID** (used as `AUTH_CLIENT_ID` in data plane configuration) and the **Client Secret** (stored in Kubernetes secrets). + +### Application 5: EAGER (Confidential) + +Used for task pod authentication. The encoded credentials form the `EAGER_API_KEY`. + +| Property | Value | +|----------|-------| +| Type | Service (confidential client) | +| Grant types | `client_credentials` | + +Note the **Client ID** and **Client Secret** — these are encoded into the EAGER_API_KEY. + +## Step 2: Configure control plane + +Authentication is configured in the `flyte.configmap.adminServer.auth` block in your control plane Helm values. This block defines how the admin service validates tokens, which clients are trusted, and how browser login works. + +You also need to set a few global variables for service-to-service authentication: + +```yaml +global: + INTERNAL_CLIENT_ID: "" # App 3 + AUTH_TOKEN_URL: "" # OAuth2 token endpoint + OIDC_S2S_SCOPE: "" # Leave empty for Okta, set to "api:///.default" for Entra ID +``` + +Then configure the auth block. Select your identity provider below: + +{{< tabs >}} +{{< tab "Okta" >}} +{{< markdown >}} +```yaml +flyte: + configmap: + adminServer: + server: + security: + useAuth: true + auth: + appAuth: + authServerType: External + externalAuthServer: + baseUrl: "https://dev-123456.okta.com/oauth2/default" + metadataUrl: ".well-known/oauth-authorization-server" + allowedAudience: + - "https://" + thirdPartyConfig: + flyteClient: + clientId: "" # App 2 + redirectUri: "http://localhost:53593/callback" + scopes: + - all + userAuth: + openId: + baseUrl: "https://dev-123456.okta.com/oauth2/default" + clientId: "" # App 1 + scopes: + - profile + - openid + - offline_access + cookieSetting: + sameSitePolicy: LaxMode + domain: "" +``` + +Set globals: +```yaml +global: + INTERNAL_CLIENT_ID: "" + AUTH_TOKEN_URL: "https://dev-123456.okta.com/oauth2/default/v1/token" + OIDC_S2S_SCOPE: "" # Okta defaults to "all" +``` +{{< /markdown >}} +{{< /tab >}} +{{< tab "Entra ID" >}} +{{< markdown >}} +```yaml +flyte: + configmap: + adminServer: + server: + security: + useAuth: true + auth: + appAuth: + authServerType: External + externalAuthServer: + baseUrl: "https://login.microsoftonline.com//v2.0" + metadataUrl: ".well-known/openid-configuration" + allowedAudience: + - "api://" + - "" # App 1 Client ID + identityTypeClaimsForApps: + idtyp: + - app + thirdPartyConfig: + flyteClient: + clientId: "" # App 2 + redirectUri: "http://localhost:53593/callback" + scopes: + - "api:///.default" + audience: "api://" + userAuth: + openId: + baseUrl: "https://login.microsoftonline.com//v2.0" + clientId: "" # App 1 + scopes: + - profile + - openid + - offline_access + cookieSetting: + sameSitePolicy: LaxMode + domain: "" + idpQueryParameter: "idp" +``` + +> [!WARNING] +> After creating service apps (Apps 3-5), you must **grant admin consent** for their App Role assignments in the Azure portal (**Enterprise Applications > your app > Permissions > Grant admin consent**) 
or via `az ad app permission admin-consent`. Without admin consent, client_credentials token requests will fail. + +Set globals: +```yaml +global: + INTERNAL_CLIENT_ID: "" + AUTH_TOKEN_URL: "https://login.microsoftonline.com//oauth2/v2.0/token" + OIDC_S2S_SCOPE: "api:///.default" +``` + +> [!NOTE] +> `INTERNAL_SUBJECT_ID` defaults to `INTERNAL_CLIENT_ID` for backward compatibility. For Entra ID, where the token `sub` claim is the Service Principal Object ID (not the Client ID), set `INTERNAL_SUBJECT_ID` to the SP Object ID. Find this in **Entra ID > Enterprise Applications > your app > Object ID**. +{{< /markdown >}} +{{< /tab >}} +{{< tab "Generic OIDC" >}} +{{< markdown >}} +```yaml +flyte: + configmap: + adminServer: + server: + security: + useAuth: true + auth: + appAuth: + authServerType: External + externalAuthServer: + baseUrl: "" + metadataUrl: ".well-known/openid-configuration" + allowedAudience: + - "" + thirdPartyConfig: + flyteClient: + clientId: "" # App 2 + redirectUri: "http://localhost:53593/callback" + scopes: + - all + userAuth: + openId: + baseUrl: "" + clientId: "" # App 1 + scopes: + - profile + - openid + - offline_access + cookieSetting: + sameSitePolicy: LaxMode + domain: "" +``` + +Set globals: +```yaml +global: + INTERNAL_CLIENT_ID: "" + AUTH_TOKEN_URL: "/token" # Your IdP's token endpoint + OIDC_S2S_SCOPE: "" # Set if your IdP requires a specific scope for client_credentials +``` + +If your IdP's client_credentials tokens don't include a `sub` claim, add: +```yaml + subjectClaimNames: + - sub + - client_id + - azp +``` +{{< /markdown >}} +{{< /tab >}} +{{< /tabs >}} + +> [!NOTE] +> Setting `useAuth: true` is required for the `/login`, `/callback`, and `/me` endpoints to register. Without this, auth endpoints will return 404. + +> [!NOTE] +> **Deprecated globals:** `OIDC_BASE_URL`, `OIDC_CLIENT_ID`, and `CLI_CLIENT_ID` are deprecated but still functional. New deployments should use the `auth` block directly as shown above. Existing deployments using these globals will continue to work. + +## Step 3: Create Kubernetes secrets (control plane) + +The control plane needs secrets for the browser login app (App 1) and the service-to-service app (App 3): + +```shell +# Secret for admin service (mounted at /etc/secrets/) +# Note: "flyte-admin-secrets" is the default name expected by the Helm chart +kubectl create secret generic flyte-admin-secrets \ + --from-literal=client_secret='' \ + -n + +# Secret for scheduler (mounted at /etc/secrets/) +# Note: "flyte-secret-auth" is the default name expected by the Helm chart +kubectl create secret generic flyte-secret-auth \ + --from-literal=client_secret='' \ + -n + +# Add service-to-service client secret to the controlplane secrets +kubectl create secret generic \ + --from-literal=pass.txt='' \ + --from-literal=client_secret='' \ + -n --dry-run=client -o yaml | kubectl apply -f - +``` + +> [!NOTE] +> For production, use External Secrets Operator or a similar tool to sync secrets from your cloud provider's secret manager (AWS Secrets Manager, GCP Secret Manager, Azure Key Vault). + +## Step 4: Configure data plane + +Add the operator client ID to your data plane overrides file: + +```yaml +global: + AUTH_CLIENT_ID: "" # App 4 +``` + +Create the data plane auth secret: + +```shell +kubectl create secret generic union-secret-auth \ + --from-literal=client_secret='' \ + -n +``` + +## Step 5: Configure EAGER_API_KEY + +The EAGER_API_KEY is a base64-encoded string containing the EAGER app credentials. 
It enables task pods to authenticate to the control plane. + +Generate the key: + +```shell +# Format: base64(":::") +echo -n ":::" | base64 +``` + +Create the Kubernetes secret in the data plane namespace: + +```shell +kubectl create secret generic \ + --from-literal=='' \ + -n +``` + +> [!NOTE] +> The exact secret name and key depend on your deployment's embedded K8s secret manager configuration. The secret name is typically an MD5 hash of a logical identifier. Contact {{< key product_name >}} support for the exact values for your organization. + +## Step 6: Deploy + +Deploy or upgrade both the control plane and data plane with the updated configurations: + +```shell +# Upgrade control plane +helm upgrade unionai-controlplane unionai/controlplane \ + --namespace \ + -f values..selfhosted-intracluster.yaml \ + -f values.registry.yaml \ + -f values..selfhosted-overrides.yaml \ + --timeout 15m --wait + +# Upgrade data plane +helm upgrade unionai-dataplane unionai/dataplane \ + --namespace \ + -f values..selfhosted-intracluster.yaml \ + -f values..selfhosted-overrides.yaml \ + --timeout 10m --wait +``` + +## Verification + +```shell +# Check admin service logs for auth initialization +kubectl logs -n deploy/ | grep -i auth + +# Test the /me endpoint (should return 401 without a token) +kubectl exec -n deploy/ -- \ + curl -s -o /dev/null -w "%{http_code}" \ + https://..svc.cluster.local/me -k + +# Test CLI login +uctl config init --host https:// +uctl get project + +# Check data plane operator auth +kubectl logs -n -l app.kubernetes.io/name=operator --tail=50 | grep -i "token\|auth" +``` + +## Summary of secrets + +| Secret name | Namespace | Keys | Source | +|-------------|-----------|------|--------| +| `flyte-admin-secrets` (Helm chart default) | `` | `client_secret` | Browser login app (App 1) secret | +| `flyte-secret-auth` (Helm chart default) | `` | `client_secret` | Browser login app (App 1) secret | +| `` | `` | `pass.txt`, `client_secret` | DB password, Service-to-service app (App 3) secret | +| `union-secret-auth` | `` | `client_secret` | Operator app (App 4) secret | +| EAGER secret | `` | varies | EAGER app (App 5) encoded key | + +## Self-hosted vs. self-managed authentication + +| Aspect | Self-hosted | Self-managed | +|--------|------------|--------------| +| OAuth app creation | Manual — create all 5 apps | Automatic — `uctl selfserve provision-dataplane-resources` creates apps 1-3 | +| EAGER_API_KEY | Manual — encode and create secret | Automatic — `uctl create apikey` generates and provisions | +| Control plane auth | Configure via Helm values | Managed by {{< key product_name >}} | +| Data plane auth | Configure `AUTH_CLIENT_ID` and secret | Provisioned by `uctl selfserve` | + +## Troubleshooting + +### Admin service auth endpoints return 404 + +Ensure `useAuth: true` is set under `flyte.configmap.adminServer.server.security`. Without this, the `/login`, `/callback`, and `/me` endpoints are not registered. + +### Token validation fails with "audience mismatch" + +The `allowedAudience` in the admin service configuration must include `https://`. This should match the audience configured on your authorization server. 
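To see which audience a token actually carries, decode its payload locally. A quick sketch using `python3` (the payload is only inspected, not verified; the platform has already validated the signature upstream):

```shell
# Decode the JWT payload (second dot-separated segment) and print its claims, including aud / iss / sub
TOKEN="paste-an-access-token-here"
python3 -c "import base64, json, sys; p = sys.argv[1].split('.')[1]; print(json.dumps(json.loads(base64.urlsafe_b64decode(p + '=' * (-len(p) % 4))), indent=2))" "$TOKEN"
```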
+ +### Data plane cannot authenticate to control plane + +```shell +# Verify AUTH_CLIENT_ID is set +kubectl get configmap -n -o yaml | grep -i auth_client + +# Check that union-secret-auth exists +kubectl get secret union-secret-auth -n \ + -o jsonpath='{.data.client_secret}' | base64 -d + +# Check operator logs +kubectl logs -n -l app.kubernetes.io/name=operator --tail=50 \ + | grep -i "auth\|token\|401" +``` + +### CLI login fails + +Ensure the CLI app (App 2) redirect URIs include `http://localhost:53593/callback` and PKCE is enabled. Test with: + +```shell +uctl config init --host https:// +uctl get project +``` + +### Entra ID: `AADSTS1002012` invalid_scope for service-to-service + +Client_credentials flows in Entra ID require the `/.default` scope. Ensure `OIDC_S2S_SCOPE` is set to `api:///.default` in your globals. + +### Subject not found in token + +If flyteadmin logs show `subject claim not found`, your IdP's client_credentials tokens may not include a `sub` claim. Configure `subjectClaimNames` in the auth block to specify a fallback chain (e.g., `["sub", "client_id"]`). diff --git a/content/deployment/selfhosted/authorization.md b/content/deployment/selfhosted/authorization.md new file mode 100644 index 000000000..98d49aa85 --- /dev/null +++ b/content/deployment/selfhosted/authorization.md @@ -0,0 +1,645 @@ +--- +title: Authorization +weight: 6 +variants: -flyte +union +mermaid: true +--- +# Authorization + +{{< key product_name >}} self-hosted deployments support configurable authorization backends to control who can perform which actions on platform resources. The authorization mode determines how access control decisions are made for API requests from the console, CLI, and SDK. + +Unlike other deployment models where {{< key product_name >}} manages RBAC for you, **self-hosted deployments let you choose the authorization model** that fits your organization's security requirements. + +## Prerequisites + +Authorization builds on top of [authentication]({{< relref "authentication" >}}). Before configuring authorization, ensure: + +1. **Authentication is configured and working** — all five OAuth2 applications are created and the control plane is accepting authenticated requests. +2. **Custom claims are configured on your authorization server:** + +| Claim | Values | Required for | Used for | +|-------|--------|-------------|----------| +| `sub` | User's internal ID or app's client ID | All modes | Primary identity for authorization decisions | +| `identitytype` | `"user"` or `"app"` | Union mode | Distinguishes human users from service accounts. Not strictly required for External mode — your external server can determine identity type from the `sub` claim or JWT payload directly. | +| `preferred_username` | User login or app client ID | All modes | Identity injection ("Owned By" display in the console) | + +3. 
**You understand which OAuth apps generate which identity types:** + +| OAuth App | # | Token `sub` claim | Identity type | Purpose in authorization | +|-----------|---|-------------------|---------------|--------------------------| +| Browser login | 1 | User's internal ID | `user` | End-user console/UI actions | +| CLI | 2 | User's internal ID (interactive) or app's client ID (service credentials) | `user` or `app` | End-user or automated CLI actions | +| Service-to-service | 3 | App's client ID | `app` | Internal platform calls | +| Operator | 4 | App's client ID | `app` | Dataplane → controlplane operations | +| EAGER | 5 | App's client ID | `app` | Task pod operations on behalf of users | + +> [!NOTE] +> Apps 3–5 are internal platform service accounts. Your external authorization server must grant them appropriate permissions for the platform to function. See [Service account permissions](#service-account-permissions) below. + +## Architecture + +All controlplane services route authorization decisions through a centralized authorization component that delegates to the configured backend: + +```mermaid +graph LR + subgraph cp["Controlplane"] + A["Service A"] --> Auth["Authorize()"] + B["Service B"] --> Auth + C["Service C"] --> Auth + D["Service ..."] --> Auth + Auth --> Backend["Backend\n(Noop / Union / External)"] + end +``` + +Each controlplane service forwards `Authorize()` calls and the configured backend returns allow/deny decisions. + +## Authorization modes + +{{< key product_name >}} supports three authorization modes: + +| Mode | Backend | Best for | Enforcement | Configuration | +|------|---------|----------|-------------|---------------| +| **Noop** | None | Isolated or high-trust environments | All requests allowed | Default, no config needed | +| **Union** | {{< key product_name >}} RBAC | Out-of-the-box RBAC, fully integrated with the {{< key product_name >}} console | {{< key product_name >}}-managed policies | Built-in, enable via config | +| **External** | BYO gRPC server | Organizations with existing RBAC/policy systems | Your own policies | Requires external server | + +### Noop (default) + +No authorization enforcement — all authenticated requests are allowed. This is the default mode. + +**When to use:** +- Development and testing environments +- Small teams where all users are trusted +- Initial deployment before configuring authorization +- Environments where authentication alone provides sufficient access control + +**Trade-offs:** +- No access control beyond authentication +- Any authenticated user can perform any action on any resource +- No audit trail of authorization decisions + +### Union (built-in RBAC) — recommended + +{{< key product_name >}}'s built-in authorization engine, **embedded in the controlplane Helm chart**. It deploys automatically when enabled, with no separate chart installation required. Provides role-based access control with predefined roles (Admin, Contributor, Viewer) and policy-based fine-grained permissions. + +> [!NOTE] +> The Helm config accepts both `type: "Union"` (preferred) and `type: "UserClouds"` (legacy name). Both activate the same built-in authorization engine. Use `"Union"` for new deployments. 
+ +**When to use:** +- Production deployments wanting out-of-the-box RBAC with no additional infrastructure +- Organizations that need role management through the {{< key product_name >}} console +- Teams wanting a performant, battle-tested authorization backend with low operational burden + +> [!WARNING] +> **Union mode requires identity type claim resolution.** Your IdP must emit a claim that the platform can map to distinguish user tokens from application tokens (e.g., Okta's `identitytype` or Entra ID's `idtyp`). If your IdP cannot provide this, either set `global.USE_EXTERNAL_IDENTITY: true` or use **External** authorization mode instead. See [Identity type claim requirements]({{< relref "authentication#identity-type-claim-requirements" >}}) in the authentication guide. + +**Trade-offs:** +- Built-in role management (Admin, Contributor, Viewer) with full RBAC — assign users and groups to roles with resource-level granularity +- Zero additional infrastructure — embedded in the controlplane chart, managed by {{< key product_name >}} +- Uses the controlplane database for policy storage — no separate database required +- This is the same authorization engine used by {{< key product_name >}}'s managed deployments + +#### Enabling Union mode + +Set the authorizer type in your controlplane Helm values: + +```yaml +services: + authorizer: + configMap: + authorizer: + type: "Union" # or "UserClouds" (legacy name, same engine) +``` + +No additional infrastructure is needed — the authorization engine is embedded in the controlplane chart and deploys automatically. + +#### Bootstrap configuration + +When Union mode starts for the first time, it bootstraps the authorization database with: +- An **organization** (your deployment's org ID) +- **Domains** (development, staging, production) +- **Projects** to pre-create (optional) +- **Service accounts** with their roles +- **Admin users** (name + subject) who can manage RBAC via the console + +Configure bootstrap in your Helm values: + +```yaml +services: + authorizer: + configMap: + authorizer: + type: "Union" + bootstrap: + organization: "{{ .Values.global.UNION_ORG }}" + domains: + - development + - staging + - production + projects: + - "" # Projects to bootstrap (e.g. "union-health-monitoring") + serviceAccounts: + - clientId: "" # App 3 — sub claim value + name: "service-to-service" + role: "Admin" + - clientId: "" # App 4 — sub claim value + name: "operator" + role: "Admin" + - clientId: "" # App 5 — sub claim value + name: "eager" + role: "Admin" + adminUsers: + - name: "" + subject: "" # The user's sub claim value from your IdP +``` + +> [!WARNING] +> The `clientId` field in `serviceAccounts` must match the **resolved `sub` claim value** from your IdP's client_credentials tokens — not necessarily the OAuth Client ID. For Okta, `sub` equals the Client ID. For Entra ID, `sub` equals the **Service Principal Object ID**. See [Subject claim requirements]({{< relref "authentication#subject-claim-requirements" >}}) in the authentication guide. + +> [!NOTE] +> **All three service accounts (Apps 3, 4, 5) must be bootstrapped with Admin role.** Without this, internal platform operations will fail: +> - **App 3** (service-to-service): Internal controlplane service communication +> - **App 4** (operator): Dataplane registration, heartbeats, cluster management +> - **App 5** (EAGER): Task pod execution, workflow registration + +#### Trusted identity claims + +The controlplane must know which callers are trusted internal services. 
This is configured via `trustedIdentityClaims` in the Helm values: + +```yaml +configMap: + union: + connection: + trustedIdentityClaims: + enabled: true + externalIdentityClaim: "" # The subject value of the internal S2S client + externalIdentityTypeClaim: "app" # Identity type for internal services +``` + +The `externalIdentityClaim` is typically set via the `INTERNAL_SUBJECT_ID` global (defaults to `INTERNAL_CLIENT_ID`). This tells the controlplane: "tokens with this `sub` claim are from our internal S2S service and should be trusted for inter-service communication." + +#### Recommended migration path + +1. **Start with Noop** — deploy with `type: "Noop"` to verify authentication works end-to-end without authorization enforcement +2. **Verify all five OAuth apps** — ensure browser login, CLI, and service-to-service authentication all work +3. **Configure bootstrap** — set `serviceAccounts` with the correct subject values for your IdP, and `adminUsers` with your initial admin +4. **Switch to Union** — change `type: "Union"` and redeploy. The authorizer will bootstrap on first start +5. **Assign roles** — use the {{< key product_name >}} console to assign roles to additional users + +> [!NOTE] +> If you switch from Noop to Union and internal services start failing with permission errors, check the authorizer logs for denied subjects. The most common cause is `clientId` in `serviceAccounts` not matching the actual `sub` claim value from your IdP. + +### External + +Delegates authorization decisions to a BYO (bring-your-own) gRPC server. The external server receives the caller's identity, the requested action, and the target resource, and returns an allow/deny decision. + +> [!WARNING] +> The external authorization server is called on **every API request**. Its latency directly impacts platform response times. Ensure your server can handle the request volume with low latency (<10ms p99 recommended). + +**When to use:** +- Organizations with existing RBAC/policy engines (e.g. OPA, Cedar, custom systems) where a sync with {{< key product_name >}}'s native authorization is undesirable or not possible +- Enterprises requiring authorization integration with internal identity management +- Deployments needing custom authorization logic beyond role-based access + +**Trade-offs:** +- Full control over authorization policies and logic +- Requires building, deploying, and operating an external authorization server +- The external server is on the critical path — its reliability and performance directly impact platform availability +- Higher operational burden than Union mode — you own the server's uptime, scaling, and policy management +- {{< key product_name >}} owns the authorization routing layer; you own the external backend + +> [!NOTE] +> A **fail-open** option (`failOpen: true`) allows requests when the external server is unreachable. This trades security for availability — use with caution in production. + +## Configuration + +Authorization mode is set in the controlplane Helm values under `services.authorizer.configMap.authorizer`. The key configuration fields are: + +- **`type`** — `"Noop"` (default), `"Union"` or `"UserClouds"` (built-in RBAC), or `"External"` (BYO server) +- **`bootstrap`** — initial service accounts, admin users, and organization (Union mode only) +- **`externalClient.grpcConfig.host`** — gRPC target for your external server (External mode only). Uses standard gRPC name resolution (`dns:///`, `unix:///`, etc.) 
+- **`externalClient.grpcConfig.insecure`** — `true` for plaintext, `false` for TLS +- **`externalClient.failOpen`** — `true` to allow requests when the external server is unreachable (default: `false`) + +See [Enabling Union mode](#enabling-union-mode) or [External authorization server contract](#external-authorization-server-contract) below for mode-specific configuration. + +## External authorization server contract + +This section applies only to **External** mode and defines what your authorization server must implement. + +### gRPC contract + +Your server must implement the `AuthorizerService.Authorize` unary RPC: + +```protobuf +service AuthorizerService { + rpc Authorize(AuthorizeRequest) returns (AuthorizeResponse); +} +``` + +**Request fields:** + +| Field | Type | Description | +|-------|------|-------------| +| `identity` | `Identity` | The caller — an `external_identity` containing the subject string and the raw OIDC token (when available) | +| `action` | `Action` enum | The operation being requested | +| `resource` | `Resource` | The target resource (organization, domain, project, or cluster) | +| `organization` | `string` | The organization identifier | + +**Response:** + +| Field | Type | Description | +|-------|------|-------------| +| `allowed` | `bool` | `true` to allow the request, `false` to deny | + +### Identity resolution + +The caller's identity is resolved and forwarded to your server through two channels: + +1. **`AuthorizeRequest.identity` protobuf field** (recommended) — always an `external_identity` containing: + - `subject`: the caller's identity (resolved from `X-User-Subject` for browser/CLI requests, or from the JWT `sub` claim for service-to-service requests) + - `token`: the raw OIDC/JWT token (when available) + + This provides a consistent interface regardless of how the caller authenticated. + +2. **gRPC metadata headers** — the raw JWT/OIDC token is forwarded to your server in the `authorization` metadata header (as `Bearer `). Your server can decode the JWT payload to read claims (`sub`, `identitytype`, `email`, `groups`, etc.) without signature verification — the token has already been validated upstream by the platform. + +> [!NOTE] +> **Token availability by auth flow:** +> - **SDK/CLI (PKCE):** The token arrives via the `authorization` header and is available in both the protobuf `identity.token` field and forwarded metadata. +> - **Browser (cookie-based):** The token is extracted from the encrypted session cookie by the `/me` auth subrequest and forwarded via the `X-User-Token` header. The authorizer normalizes it to the standard `authorization` header before calling the external server, so your server sees a consistent interface on all paths. +> - **Service-to-service:** The token arrives via the `authorization` or `flyte-authorization` header. 
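For orientation, a decoded access-token payload seen by the external server typically looks like the following. The values are purely illustrative; the exact claim names depend on your IdP and the claims configured during [authentication]({{< relref "authentication" >}}) setup:

```json
{
  "sub": "00u1abcd2EFGHijkl345",
  "preferred_username": "jane.doe@example.com",
  "identitytype": "user",
  "aud": "https://union.example.com",
  "iss": "https://dev-123456.okta.com/oauth2/default",
  "exp": 1735689600
}
```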
+ +### Actions + +Your server must handle the following authorization actions: + +| Action | Description | Typical callers | +|--------|-------------|-----------------| +| `ACTION_VIEW_FLYTE_INVENTORY` | View workflows, tasks, launch plans | All users and services | +| `ACTION_VIEW_FLYTE_EXECUTIONS` | View executions and run details | All users and services | +| `ACTION_REGISTER_FLYTE_INVENTORY` | Register workflows, tasks, launch plans | Contributors, operators, EAGER | +| `ACTION_CREATE_FLYTE_EXECUTIONS` | Launch executions | Contributors, operators, EAGER | +| `ACTION_ADMINISTER_PROJECT` | Manage project settings | Admins | +| `ACTION_MANAGE_PERMISSIONS` | Manage user roles and policies | Admins | +| `ACTION_ADMINISTER_ACCOUNT` | Account-level administration | Admins | +| `ACTION_MANAGE_CLUSTER` | Cluster lifecycle operations | Operators (App 4) | +| `ACTION_EDIT_EXECUTION_RELATED_ATTRIBUTES` | Modify execution attributes | Contributors, operators | +| `ACTION_EDIT_CLUSTER_RELATED_ATTRIBUTES` | Modify cluster attributes | Operators | +| `ACTION_EDIT_UNUSED_ATTRIBUTES` | Modify other attributes | Contributors | +| `ACTION_SUPPORT_SYSTEM_LOGS` | Access system logs | Admins | +| `ACTION_VIEW_IDENTITIES` | View user/app identities | Admins | + +### Service account permissions + +Your external authorization server **must** grant appropriate permissions to the internal platform service accounts (OAuth Apps 3–5 from [Authentication]({{< relref "authentication" >}})). **Without these, internal platform operations will fail.** + +| OAuth App | # | Subject (`sub` claim) | Required permissions | +|-----------|---|----------------------|----------------------| +| Service-to-service | 3 | `INTERNAL_CLIENT_ID` value | All actions listed above (this is the internal platform identity) | +| Operator | 4 | `AUTH_CLIENT_ID` value | `MANAGE_CLUSTER`, `VIEW_FLYTE_INVENTORY`, `VIEW_FLYTE_EXECUTIONS`, `CREATE_FLYTE_EXECUTIONS` | +| EAGER | 5 | EAGER app client ID | `VIEW_FLYTE_INVENTORY`, `VIEW_FLYTE_EXECUTIONS`, `REGISTER_FLYTE_INVENTORY`, `CREATE_FLYTE_EXECUTIONS`, `EDIT_EXECUTION_RELATED_ATTRIBUTES`, `EDIT_CLUSTER_RELATED_ATTRIBUTES` | + +> [!WARNING] +> If the operator service account (App 4) is not granted `MANAGE_CLUSTER`, the dataplane will be unable to register with the controlplane or send heartbeats. If the EAGER service account (App 5) is not granted execution permissions, task pods will fail to launch child tasks or register workflow artifacts. + +### Configuring service accounts + +The three internal OAuth apps must be registered in your external server's permission mapping. Their `sub` claims identify the calling application — use the values configured during [authentication setup]({{< relref "authentication" >}}). + +> [!NOTE] +> **Provider-specific subject values:** For Okta, the `sub` claim in client_credentials tokens equals the app's **Client ID**. For Entra ID, the `sub` claim equals the app's **Service Principal Object ID** (found in Entra ID > Enterprise Applications). If you configured `subjectClaimNames` with a fallback chain, the resolved subject may come from a different claim (e.g., `client_id` as fallback). 
+ +To find the client IDs for your deployment, check the controlplane Helm values: + +| OAuth App | # | Helm global variable | Required permissions | +|-----------|---|---------------------|----------------------| +| Service-to-service | 3 | `INTERNAL_CLIENT_ID` | All actions (platform admin) | +| Operator | 4 | `AUTH_CLIENT_ID` | `MANAGE_CLUSTER`, `VIEW_FLYTE_INVENTORY`, `VIEW_FLYTE_EXECUTIONS`, `CREATE_FLYTE_EXECUTIONS` | +| EAGER | 5 | (EAGER app client ID) | `VIEW_FLYTE_INVENTORY`, `VIEW_FLYTE_EXECUTIONS`, `REGISTER_FLYTE_INVENTORY`, `CREATE_FLYTE_EXECUTIONS`, `EDIT_EXECUTION_RELATED_ATTRIBUTES`, `EDIT_CLUSTER_RELATED_ATTRIBUTES` | + +> [!WARNING] +> The platform does **not** bypass external authorization for its own service accounts. Your server must explicitly grant them the required permissions. If the operator (App 4) is denied `MANAGE_CLUSTER`, the dataplane cannot register or heartbeat. If EAGER (App 5) is denied execution permissions, task pods cannot launch child tasks. + +### Reference implementation + +The following is a complete Python example of an external authorization server. It uses a static YAML config to map subjects to roles, and roles to permitted actions. + +**config.yaml** — subject-to-role mapping: + +```yaml +port: 50051 + +# Internal platform service accounts — REQUIRED. +# Use the OAuth app client IDs from your identity provider (the same +# values configured during authentication setup). The server grants +# each the minimum permissions needed for the platform to function. +service_accounts: + internal: "" # App 3: service-to-service (all actions) + operator: "" # App 4: dataplane operator (cluster mgmt) + eager: "" # App 5: task execution (register + launch) +``` + +**server.py** — identity extraction and service account authorization: + +```python +#!/usr/bin/env python3 +"""External authorization server for Union selfhosted deployments. + +Demonstrates how to: +1. Extract identity and token from the Authorize() request +2. Grant required permissions to internal platform service accounts +3. Delegate human user authorization to your own logic +""" + +import argparse +import base64 +import json +import logging +from concurrent import futures + +import grpc +import yaml + +from gen.authorizer import authorizer_pb2_grpc, payload_pb2 +from gen.common import authorization_pb2 + +log = logging.getLogger("authz") + +# Action enum → name for logging +ACTION_NAMES = { + v.number: v.name + for v in authorization_pb2.DESCRIPTOR.enum_types_by_name["Action"].values + if v.number != 0 +} + +# Minimum permissions each internal service account needs. +# These are non-negotiable — without them the platform breaks. +INTERNAL_SERVICE_ACCOUNT_PERMISSIONS = { + # App 3 (INTERNAL_CLIENT_ID): the platform's own identity. + # Used for all internal service-to-service communication. + "internal": set(ACTION_NAMES.keys()), # all actions + # App 4 (AUTH_CLIENT_ID): dataplane operator. + # Registers clusters, sends heartbeats, reads inventory. + "operator": { + authorization_pb2.ACTION_MANAGE_CLUSTER, + authorization_pb2.ACTION_VIEW_FLYTE_INVENTORY, + authorization_pb2.ACTION_VIEW_FLYTE_EXECUTIONS, + authorization_pb2.ACTION_CREATE_FLYTE_EXECUTIONS, + }, + # App 5 (EAGER client ID): task execution runtime. + # Launches child tasks, registers workflows on behalf of users. 
+ "eager": { + authorization_pb2.ACTION_VIEW_FLYTE_INVENTORY, + authorization_pb2.ACTION_VIEW_FLYTE_EXECUTIONS, + authorization_pb2.ACTION_REGISTER_FLYTE_INVENTORY, + authorization_pb2.ACTION_CREATE_FLYTE_EXECUTIONS, + authorization_pb2.ACTION_EDIT_EXECUTION_RELATED_ATTRIBUTES, + authorization_pb2.ACTION_EDIT_CLUSTER_RELATED_ATTRIBUTES, + }, +} + + +def decode_jwt_payload(token: str) -> dict | None: + """Base64-decode the JWT payload (no signature verification needed — + the token is pre-validated by the platform).""" + parts = token.split(".") + if len(parts) != 3: + return None + payload = parts[1] + "=" * ((-len(parts[1])) % 4) + return json.loads(base64.urlsafe_b64decode(payload)) + + +class AuthorizerServicer(authorizer_pb2_grpc.AuthorizerServiceServicer): + + def __init__(self, config: dict): + # Map internal service account client IDs → their permission sets. + # These come from config.yaml and match the OAuth app client IDs + # configured during authentication setup. + self.service_accounts: dict[str, set[int]] = {} + for key, client_id in config.get("service_accounts", {}).items(): + perms = INTERNAL_SERVICE_ACCOUNT_PERMISSIONS.get(key) + if perms and client_id: + self.service_accounts[client_id] = perms + + def Authorize(self, request, context): + # --- 1. Extract identity from the proto request --- + # The platform always sends ExternalIdentity with subject + token. + # Every request is authenticated before reaching authorization, so + # both fields are always populated. + ext_id = request.identity.external_identity + subject = ext_id.subject + proto_token = ext_id.token # Raw JWT from the proto field + + # --- 2. Extract token from gRPC metadata --- + # The same JWT is also forwarded in the "authorization" metadata + # header as "Bearer ". Both channels carry the same token — + # use whichever fits your architecture. + # - proto_token: ready to use (no prefix to strip) + # - metadata_token: standard HTTP authorization header format + metadata = dict(context.invocation_metadata()) + auth_header = metadata.get("authorization", "") + metadata_token = auth_header[7:] if auth_header.lower().startswith("bearer ") else "" + + # --- 3. Decode JWT claims --- + # The token is pre-validated upstream — no signature verification + # needed. Decode to read claims: sub, identitytype, email, groups, + # preferred_username, iss, aud, exp. + claims = decode_jwt_payload(proto_token) + + # --- 4. Extract the action and resource --- + action = request.action + action_name = ACTION_NAMES.get(action, str(action)) + organization = request.organization + + # Resource is a oneof: project, domain, organization, or cluster + resource = request.resource + resource_desc = "" + if resource.HasField("project"): + p = resource.project + domain = p.domain.name if p.HasField("domain") else "?" + resource_desc = f"{organization}/{domain}/{p.name}" + elif resource.HasField("domain"): + resource_desc = f"{organization}/{resource.domain.name}" + elif resource.HasField("cluster"): + resource_desc = f"{organization}/{resource.cluster.name}" + else: + resource_desc = organization + + # --- 5. Authorize internal service accounts --- + # The platform does NOT bypass external authorization for its own + # service accounts. Your server MUST grant them the required + # permissions or the platform will stop functioning. 
+ if subject in self.service_accounts: + allowed = action in self.service_accounts[subject] + log.info( + "%s sub=%s (service-account) action=%s resource=%s", + "ALLOWED" if allowed else "DENIED", subject, action_name, resource_desc, + ) + return payload_pb2.AuthorizeResponse(allowed=allowed) + + # --- 6. Authorize human users --- + # Replace this with your own authorization logic (OPA, Cedar, + # internal RBAC, policy engine, token exchange, etc.) + # Available data: + # - subject: the user's identity (from 'sub' claim) + # - claims: decoded JWT with sub, email, groups, identitytype, etc. + # - proto_token / metadata_token: raw JWT for downstream use + # - action: the Action enum value being requested + # - resource: the target resource (project, domain, cluster, org) + log.info("subject=%s action=%s resource=%s", subject, action_name, resource_desc) + if claims: + log.info( + " JWT: sub=%s type=%s iss=%s exp=%s", + claims.get("sub"), claims.get("identitytype"), + claims.get("iss"), claims.get("exp"), + ) + + allowed = True # TODO: implement your authorization logic + return payload_pb2.AuthorizeResponse(allowed=allowed) + + +if __name__ == "__main__": + logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s") + parser = argparse.ArgumentParser() + parser.add_argument("--config", default="config.yaml") + args = parser.parse_args() + with open(args.config) as f: + config = yaml.safe_load(f) + + server = grpc.server(futures.ThreadPoolExecutor(max_workers=4)) + authorizer_pb2_grpc.add_AuthorizerServiceServicer_to_server( + AuthorizerServicer(config), server + ) + port = config.get("port", 50051) + server.add_insecure_port(f"[::]:{port}") + server.start() + log.info("External AuthZ server listening on port %d", port) + server.wait_for_termination() +``` + +**Proto definitions:** The server requires generated Python code from Union's authorization protobuf definitions. Contact {{< key product_name >}} support for access to the `.proto` files, or use [buf](https://buf.build) to generate them: + +```shell +pip install grpcio grpcio-tools protobuf pyyaml +buf generate # using buf.gen.yaml pointing to Union's IDL +python server.py --config config.yaml +``` + +> [!NOTE] +> This reference implementation is intended for testing and development. Production implementations should integrate with your organization's identity and policy management systems. + +## Observability + +The controlplane exposes Prometheus metrics for monitoring authorization decisions and backend health. These are included in the controlplane Grafana dashboard under the **Authorizer** row. 
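Before wiring up dashboards, you can confirm the authorizer is exposing these metrics by hitting its metrics endpoint directly. A rough sketch; the metrics port and the standard `/metrics` path are assumptions, so adjust them to your deployment (metric names are listed in the table below):

```shell
# Port-forward to the authorizer and look for the authz_* series
# (port 9090 is an assumption; use the metrics port configured for your deployment)
kubectl -n $CONTROLPLANE_NAMESPACE port-forward deployment/authorizer 9090:9090 &
curl -s localhost:9090/metrics | grep -E '^(authz_allowed|authz_denied|authorize_duration)'
```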
+ +### Key metrics + +| Metric | Type | Description | +|--------|------|-------------| +| `authz_allowed{action}` | Counter | Allowed decisions by action type | +| `authz_denied{action}` | Counter | Denied decisions by action type | +| `authorize_duration` | Histogram | End-to-end Authorize() latency | +| `authorize_errors_total{error_source}` | Counter | Errors by source (backend, identity_resolution) | +| `authz_type_info{type}` | Gauge | Active authorization mode | +| `external:errors{grpc_code}` | Counter | External backend errors by gRPC status code | +| `external:authorize_duration` | Histogram | External backend call latency | +| `external:fail_open_activated` | Counter | Fail-open bypass events | +| `external:connection_state` | Gauge | gRPC connection state to external backend | + +### Alerts + +When [alerting is enabled]({{< relref "monitoring#alerting" >}}), the following authorization-specific alerts are available: + +| Alert | Severity | Condition | +|-------|----------|-----------| +| `UnionCPAuthorizerExternalErrors` | Warning | External backend errors >0.1/s for 5 minutes | +| `UnionCPAuthorizerFailOpenActive` | Critical | Fail-open is actively bypassing authorization | +| `UnionCPAuthorizerHighDenyRate` | Warning | Authorization deny rate exceeds 50% for 10 minutes | + +## Verification + +After configuring authorization, verify it's working: + +1. **Check the authorization component is running:** + +```shell +kubectl get pods -n -l app.kubernetes.io/name=authorizer +``` + +2. **Verify the authorization mode in logs:** + +```shell +kubectl logs -n deployment/authorizer | grep "Authz client config" +# Expected: Authz client config: type=External (or Noop, UserClouds) +``` + +3. **For External mode, verify connectivity:** + +```shell +kubectl logs -n deployment/authorizer | grep "external authorization" +# Expected: Initializing an external authorization proxy service with endpoint ... +``` + +4. **Verify from the console:** Navigate to the {{< key product_name >}} console and confirm you can view projects and runs without errors. + +5. **Verify from the CLI:** Trigger a workflow execution to confirm the non-browser flow works: + +```shell +uctl get project +uctl create execution --project --domain development --launch-plan +``` + +6. **For External mode, verify service account access:** Monitor the external server logs for requests from the internal platform service accounts (Apps 3, 4, 5). Ensure all are receiving `ALLOWED` decisions. + +## Troubleshooting + +### All requests denied + +- **Check service account mappings** — the most common cause is that the internal platform service accounts (Apps 3, 4, 5) are not granted permissions in the external server. Check the external server logs for `DENIED` decisions with service account subjects. +- Check that the external authorization server is running and reachable +- Verify the `grpcConfig.host` endpoint is correct (use `dns:///` prefix for DNS-based resolution) +- Temporarily set `failOpen: true` to confirm the issue is with the external backend + +### Dataplane cannot register or heartbeat + +The operator (App 4) needs `ACTION_MANAGE_CLUSTER` permission. Check: + +```shell +kubectl logs -n deployment/authorizer | grep "MANAGE_CLUSTER" +``` + +If you see denied decisions for the operator's client ID, add it to your external server's permission configuration. + +### Workflows fail to launch child tasks + +The EAGER service account (App 5) needs `ACTION_CREATE_FLYTE_EXECUTIONS` and `ACTION_REGISTER_FLYTE_INVENTORY`. 
Check: + +```shell +kubectl logs -n deployment/authorizer | grep "" +``` + +### "Owned By: Unknown" in the console + +The `preferred_username` claim is not configured in your identity provider. See [Authentication — Authorization server setup]({{< relref "authentication#authorization-server-setup" >}}). + +### Authorization component crashlooping + +- Check logs: `kubectl logs -n deployment/authorizer` +- Verify the `type` field is a valid value (`Noop`, `External`, or `UserClouds` for Union RBAC) +- Ensure the `externalClient.grpcConfig.host` is set when using `External` mode + +### High latency on API calls + +- Check `external:authorize_duration` metrics in the Grafana dashboard +- The authorization backend is on the critical path — external backend latency directly impacts API response times +- Consider reducing `perRetryTimeout` or setting `maxRetries: 0` for fail-fast behavior + +### Connection errors to external backend + +- Check `external:errors{grpc_code}` metrics for the failure mode: + - `Unavailable`: Network connectivity issue — verify the service endpoint and port + - `DeadlineExceeded`: Timeout — the external server is too slow to respond + - `Internal`/`Unknown`: Application error in the external server +- Use `insecure: true` for plaintext connections within the cluster +- Use `insecureSkipVerify: true` only for testing with self-signed certificates diff --git a/content/deployment/selfhosted/control-plane-aws.md b/content/deployment/selfhosted/control-plane-aws.md new file mode 100644 index 000000000..d25e5e48a --- /dev/null +++ b/content/deployment/selfhosted/control-plane-aws.md @@ -0,0 +1,245 @@ +--- +title: Control plane on AWS +weight: 1 +variants: -flyte +union +--- + +# Control plane on AWS + +This guide covers deploying the {{< key product_name >}} control plane in an AWS environment as part of a [self-hosted deployment](./_index). + +## Prerequisites + +In addition to the [general prerequisites](./_index#prerequisites), you need: + +1. **Amazon RDS** PostgreSQL instance (12+) +2. **S3 buckets** for control plane metadata and artifacts storage +3. **IAM roles** configured with [IRSA](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html) for control plane services and artifacts + +## Installation + +### Step 1: Install prerequisites + +#### Install ScyllaDB CRDs (if using embedded ScyllaDB) + +```shell +cd helm-charts/charts/controlplane +./scripts/install-scylla-crds.sh +``` + +#### Add Helm repositories + +```shell +helm repo add unionai https://unionai.github.io/helm-charts/ +helm repo add flyte https://helm.flyte.org +helm repo update +``` + +### Step 2: Create registry image pull secret + +Create the registry secret in the control plane namespace: + +```shell +kubectl create namespace + +kubectl create secret docker-registry union-registry-secret \ + --docker-server="registry.unionai.cloud" \ + --docker-username="" \ + --docker-password="" \ + -n +``` + +> [!NOTE] +> The registry username typically follows the format `robot$`. +> Note the backslash escape (`\$`) before the `$` character in the username when running in a shell. +> Contact {{< key product_name >}} support if you haven't received your registry credentials. + +### Step 3: Generate TLS certificates + +gRPC requires TLS for HTTP/2 with NGINX. You can use self-signed certificates for intra-cluster communication. 
+ +{{< tabs >}} +{{< tab "OpenSSL (self-signed)" >}} + +```shell +openssl req -x509 -nodes -days 365 -newkey rsa:2048 \ + -keyout controlplane-tls.key \ + -out controlplane-tls.crt \ + -subj "/CN=..svc.cluster.local" + +kubectl create secret tls controlplane-tls-cert \ + --key controlplane-tls.key \ + --cert controlplane-tls.crt \ + -n +``` + +{{< /tab >}} +{{< tab "cert-manager (recommended)" >}} + +For production deployments, use cert-manager with a self-signed `ClusterIssuer` or your organization's CA. See the `extraObjects` section in [`values.aws.selfhosted-intracluster.yaml`](https://github.com/unionai/helm-charts/blob/main/charts/controlplane/values.aws.selfhosted-intracluster.yaml) for an example configuration. + +{{< /tab >}} +{{< /tabs >}} + +### Step 4: Create database password secret + +```shell +kubectl create secret generic \ + --from-literal=pass.txt='' \ + -n +``` + +> [!NOTE] +> The secret must contain a key named `pass.txt` with the database password. +> The default secret name is set in your Helm values. + +### Step 5: Download values files + +```shell +curl -O https://raw.githubusercontent.com/unionai/helm-charts/main/charts/controlplane/values.aws.selfhosted-intracluster.yaml + +curl -O https://raw.githubusercontent.com/unionai/helm-charts/main/charts/controlplane/values.registry.yaml +``` + +Create an overrides file `values.aws.selfhosted-overrides.yaml`: + +```yaml +global: + AWS_REGION: "us-east-1" + DB_HOST: "my-rds-instance.abcdef.us-east-1.rds.amazonaws.com" + DB_NAME: "unionai" + DB_USER: "unionai" + BUCKET_NAME: "my-company-cp-flyte" + ARTIFACTS_BUCKET_NAME: "my-company-cp-artifacts" + ARTIFACT_IAM_ROLE_ARN: "arn:aws:iam::123456789012:role/union-artifacts" + FLYTEADMIN_IAM_ROLE_ARN: "arn:aws:iam::123456789012:role/union-flyteadmin" + UNION_ORG: "my-company" +``` + +To enable authentication, add the OIDC configuration to this file. See the [Authentication](./authentication) guide. + +### Step 6: Install control plane + +```shell +helm upgrade --install unionai-controlplane unionai/controlplane \ + --namespace \ + --create-namespace \ + -f values.aws.selfhosted-intracluster.yaml \ + -f values.registry.yaml \ + -f values.aws.selfhosted-overrides.yaml \ + --timeout 15m \ + --wait +``` + +**Values file layers (applied in order):** + +1. [`values.aws.selfhosted-intracluster.yaml`](https://github.com/unionai/helm-charts/blob/main/charts/controlplane/values.aws.selfhosted-intracluster.yaml) — AWS infrastructure defaults (database, storage, networking) +2. [`values.registry.yaml`](https://github.com/unionai/helm-charts/blob/main/charts/controlplane/values.registry.yaml) — Registry configuration and image pull secrets +3. `values.aws.selfhosted-overrides.yaml` — Your environment-specific overrides + +### Step 7: Verify installation + +```shell +# Check pod status +kubectl get pods -n + +# Verify services are running +kubectl get svc -n + +# Check admin service logs +kubectl logs -n deploy/ --tail=50 + +# Test internal connectivity +kubectl exec -n deploy/ -- \ + curl -k https://..svc.cluster.local +``` + +All pods should be in `Running` state and internal connectivity should succeed. + +> [!NOTE] +> Replace `` with your Helm release namespace (the namespace you used during `helm install`). Replace `` and `` with the actual deployment names from `kubectl get deploy -n `. 
+ +## Key configuration + +### Single-tenant mode + +Self-hosted deployments use single-tenant mode with an explicit organization: + +```yaml +global: + UNION_ORG: "my-company" +``` + +### TLS + +Configure the namespace and name of the Kubernetes TLS secret: + +```yaml +global: + TLS_SECRET_NAMESPACE: "" + TLS_SECRET_NAME: "controlplane-tls-cert" + +ingress-nginx: + controller: + extraArgs: + default-ssl-certificate: "/controlplane-tls-cert" +``` + +### Service discovery + +Control plane services discover each other via Kubernetes DNS: + +- **Admin service**: `..svc.cluster.local:81` +- **NGINX Ingress**: `..svc.cluster.local` +- **Data plane** (for dataproxy): `..svc.cluster.local` + +## Next steps + +1. [Deploy the data plane](./data-plane-aws) +2. [Configure authentication](./authentication) + +## Troubleshooting + +### Control plane pods not starting + +```shell +kubectl describe pod -n +kubectl top nodes +kubectl get secret -n +``` + +### TLS/Certificate errors + +```shell +kubectl get secret controlplane-tls-cert -n +kubectl get secret controlplane-tls-cert -n \ + -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout +kubectl logs -n deploy/ +``` + +### Database connection failures + +```shell +# Verify credentials +kubectl get secret -n \ + -o jsonpath='{.data.pass\.txt}' | base64 -d + +# Test connectivity +kubectl run -n test-db --image=postgres:14 --rm -it -- \ + psql -h -U -d +``` + +### Data plane cannot connect to control plane + +```shell +# Verify service endpoints +kubectl get svc -n | grep -E 'admin\|nginx-controller' + +# Test DNS resolution from data plane namespace +kubectl run -n test-dns --image=busybox --rm -it -- \ + nslookup ..svc.cluster.local + +# Check network policies +kubectl get networkpolicies -n +kubectl get networkpolicies -n +``` diff --git a/content/deployment/selfhosted/control-plane-gcp.md b/content/deployment/selfhosted/control-plane-gcp.md new file mode 100644 index 000000000..ecc1b446f --- /dev/null +++ b/content/deployment/selfhosted/control-plane-gcp.md @@ -0,0 +1,266 @@ +--- +title: Control plane on GCP +weight: 2 +variants: -flyte +union +--- + +# Control plane on GCP + +This guide covers deploying the {{< key product_name >}} control plane in a GCP environment as part of a [self-hosted deployment](./_index). + +> [!NOTE] +> Self-hosted intra-cluster deployment is currently officially supported on **AWS** only. +> GCP support is in preview and additional cloud providers are coming soon. +> For production deployments, see [Control plane on AWS](./control-plane-aws). + +## Prerequisites + +In addition to the [general prerequisites](./_index#prerequisites), you need: + +1. **Cloud SQL** PostgreSQL instance (12+) +2. **GCS buckets** for control plane metadata and artifacts storage +3. 
**GCP service accounts** configured with [Workload Identity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity) for control plane services and artifacts + +## Installation + +### Step 1: Install prerequisites + +#### Install ScyllaDB CRDs (if using embedded ScyllaDB) + +```shell +cd helm-charts/charts/controlplane +./scripts/install-scylla-crds.sh +``` + +#### Add Helm repositories + +```shell +helm repo add unionai https://unionai.github.io/helm-charts/ +helm repo add flyte https://helm.flyte.org +helm repo update +``` + +### Step 2: Create registry image pull secret + +Create the registry secret in the control plane namespace: + +```shell +kubectl create namespace + +kubectl create secret docker-registry union-registry-secret \ + --docker-server="registry.unionai.cloud" \ + --docker-username="" \ + --docker-password="" \ + -n +``` + +> [!NOTE] +> The registry username typically follows the format `robot$`. +> Note the backslash escape (`\$`) before the `$` character in the username when running in a shell. +> Contact {{< key product_name >}} support if you haven't received your registry credentials. + +### Step 3: Generate TLS certificates + +gRPC requires TLS for HTTP/2 with NGINX. You can use self-signed certificates for intra-cluster communication. + +{{< tabs >}} +{{< tab "OpenSSL (self-signed)" >}} + +```shell +openssl req -x509 -nodes -days 365 -newkey rsa:2048 \ + -keyout controlplane-tls.key \ + -out controlplane-tls.crt \ + -subj "/CN=..svc.cluster.local" + +kubectl create secret tls controlplane-tls-cert \ + --key controlplane-tls.key \ + --cert controlplane-tls.crt \ + -n +``` + +{{< /tab >}} +{{< tab "cert-manager (recommended)" >}} + +For production deployments, use cert-manager with a self-signed `ClusterIssuer` or your organization's CA. See the `extraObjects` section in [`values.gcp.selfhosted-intracluster.yaml`](https://github.com/unionai/helm-charts/blob/main/charts/controlplane/values.gcp.selfhosted-intracluster.yaml) for an example configuration. + +{{< /tab >}} +{{< /tabs >}} + +### Step 4: Create database password secret + +```shell +kubectl create secret generic \ + --from-literal=pass.txt='' \ + -n +``` + +> [!NOTE] +> The secret must contain a key named `pass.txt` with the database password. +> The default secret name is set in your Helm values. + +### Step 5: Download values files + +```shell +curl -O https://raw.githubusercontent.com/unionai/helm-charts/main/charts/controlplane/values.gcp.selfhosted-intracluster.yaml + +curl -O https://raw.githubusercontent.com/unionai/helm-charts/main/charts/controlplane/values.registry.yaml +``` + +Create an overrides file `values.gcp.selfhosted-overrides.yaml`: + +```yaml +global: + GCP_REGION: "us-central1" + DB_HOST: "10.247.0.3" + DB_NAME: "unionai" + DB_USER: "unionai" + BUCKET_NAME: "my-company-cp-flyte" + ARTIFACTS_BUCKET_NAME: "my-company-cp-artifacts" + ARTIFACT_IAM_ROLE_ARN: "artifacts@my-project.iam.gserviceaccount.com" + FLYTEADMIN_IAM_ROLE_ARN: "flyteadmin@my-project.iam.gserviceaccount.com" + UNION_ORG: "my-company" + GOOGLE_PROJECT_ID: "my-gcp-project" +``` + +To enable authentication, add the OIDC configuration to this file. See the [Authentication](./authentication) guide. 
+ +### Step 6: Install control plane + +```shell +helm upgrade --install unionai-controlplane unionai/controlplane \ + --namespace \ + --create-namespace \ + -f values.gcp.selfhosted-intracluster.yaml \ + -f values.registry.yaml \ + -f values.gcp.selfhosted-overrides.yaml \ + --timeout 15m \ + --wait +``` + +**Values file layers (applied in order):** + +1. [`values.gcp.selfhosted-intracluster.yaml`](https://github.com/unionai/helm-charts/blob/main/charts/controlplane/values.gcp.selfhosted-intracluster.yaml) — GCP infrastructure defaults (database, storage, networking) +2. [`values.registry.yaml`](https://github.com/unionai/helm-charts/blob/main/charts/controlplane/values.registry.yaml) — Registry configuration and image pull secrets +3. `values.gcp.selfhosted-overrides.yaml` — Your environment-specific overrides + +### Step 7: Verify installation + +```shell +# Check pod status +kubectl get pods -n + +# Verify services are running +kubectl get svc -n + +# Check admin service logs +kubectl logs -n deploy/ --tail=50 + +# Test internal connectivity +kubectl exec -n deploy/ -- \ + curl -k https://..svc.cluster.local +``` + +All pods should be in `Running` state and internal connectivity should succeed. + +> [!NOTE] +> Replace `` with your Helm release namespace (the namespace you used during `helm install`). Replace `` and `` with the actual deployment names from `kubectl get deploy -n `. + +## Key configuration + +### Single-tenant mode + +Self-hosted deployments use single-tenant mode with an explicit organization: + +```yaml +global: + UNION_ORG: "my-company" +``` + +### TLS + +Configure the namespace and name of the Kubernetes TLS secret: + +```yaml +global: + TLS_SECRET_NAMESPACE: "" + TLS_SECRET_NAME: "controlplane-tls-cert" + +ingress-nginx: + controller: + extraArgs: + default-ssl-certificate: "/controlplane-tls-cert" +``` + +### Service discovery + +Control plane services discover each other via Kubernetes DNS: + +- **Admin service**: `..svc.cluster.local:81` +- **NGINX Ingress**: `..svc.cluster.local` +- **Data plane** (for dataproxy): `..svc.cluster.local` + +## Next steps + +1. [Deploy the data plane](./data-plane-gcp) +2. 
[Configure authentication](./authentication) + +## Troubleshooting + +### Control plane pods not starting + +```shell +kubectl describe pod -n +kubectl top nodes +kubectl get secret -n +``` + +### TLS/Certificate errors + +```shell +kubectl get secret controlplane-tls-cert -n +kubectl get secret controlplane-tls-cert -n \ + -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout +kubectl logs -n deploy/ +``` + +### Database connection failures + +```shell +# Verify credentials +kubectl get secret -n \ + -o jsonpath='{.data.pass\.txt}' | base64 -d + +# Test connectivity +kubectl run -n test-db --image=postgres:14 --rm -it -- \ + psql -h -U -d +``` + +### Workload Identity issues + +```shell +# Verify service account annotations +kubectl get sa -n -o yaml | grep iam.gke.io/gcp-service-account + +# Check IAM bindings +gcloud iam service-accounts get-iam-policy + +# Verify pod can authenticate +kubectl exec -n deploy/ -- \ + curl -H "Metadata-Flavor: Google" \ + http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email +``` + +### Data plane cannot connect to control plane + +```shell +# Verify service endpoints +kubectl get svc -n | grep -E 'admin\|nginx-controller' + +# Test DNS resolution from data plane namespace +kubectl run -n test-dns --image=busybox --rm -it -- \ + nslookup ..svc.cluster.local + +# Check network policies +kubectl get networkpolicies -n +kubectl get networkpolicies -n +``` diff --git a/content/deployment/selfhosted/data-plane-aws.md b/content/deployment/selfhosted/data-plane-aws.md new file mode 100644 index 000000000..15165469f --- /dev/null +++ b/content/deployment/selfhosted/data-plane-aws.md @@ -0,0 +1,129 @@ +--- +title: Data plane on AWS +weight: 3 +variants: -flyte +union +--- + +# Data plane on AWS + +This guide covers deploying the {{< key product_name >}} data plane in the same cluster as your control plane, as part of a [self-hosted deployment](./_index). + +> [!NOTE] +> Deploy the [control plane](./control-plane-aws) first before proceeding with data plane installation. + +## Prerequisites + +In addition to the [general prerequisites](./_index#prerequisites): + +1. **{{< key product_name >}} control plane** deployed in the same cluster +2. **S3 buckets** for data plane metadata storage +3. **IAM roles** configured with [IRSA](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html) for backend and worker service accounts +4. 
**Network connectivity** between data plane and control plane namespaces + +## Installation + +### Step 1: Install data plane CRDs + +```shell +helm upgrade --install unionai-dataplane-crds unionai/dataplane-crds \ + --namespace \ + --create-namespace +``` + +### Step 2: Download values file + +```shell +curl -O https://raw.githubusercontent.com/unionai/helm-charts/main/charts/dataplane/values.aws.selfhosted-intracluster.yaml +``` + +Create an overrides file `values.aws.selfhosted-overrides.yaml`: + +```yaml +global: + CLUSTER_NAME: "prod-us-east-1" + ORG_NAME: "my-company" + METADATA_BUCKET: "my-company-dp-metadata" + FAST_REGISTRATION_BUCKET: "my-company-dp-metadata" + AWS_REGION: "us-east-1" + BACKEND_IAM_ROLE_ARN: "arn:aws:iam::123456789012:role/union-backend" + WORKER_IAM_ROLE_ARN: "arn:aws:iam::123456789012:role/union-worker" + CONTROLPLANE_INTRA_CLUSTER_HOST: "..svc.cluster.local" + QUEUE_SERVICE_HOST: "..svc.cluster.local:80" + CACHESERVICE_ENDPOINT: "..svc.cluster.local:89" +``` + +If authentication is enabled on the control plane, also set `AUTH_CLIENT_ID`. See the [Authentication](./authentication) guide. + +### Step 3: Install data plane + +```shell +helm upgrade --install unionai-dataplane unionai/dataplane \ + --namespace \ + --create-namespace \ + -f values.aws.selfhosted-intracluster.yaml \ + -f values.aws.selfhosted-overrides.yaml \ + --timeout 10m \ + --wait +``` + +**Values file layers (applied in order):** + +1. [`values.aws.selfhosted-intracluster.yaml`](https://github.com/unionai/helm-charts/blob/main/charts/dataplane/values.aws.selfhosted-intracluster.yaml) — AWS infrastructure defaults (storage, networking, intra-cluster communication) +2. `values.aws.selfhosted-overrides.yaml` — Your environment-specific overrides + +### Step 4: Verify installation + +```shell +# Check that data plane pods are running +kubectl get pods -n + +# Verify connectivity to control plane +kubectl logs -n -l app.kubernetes.io/name=operator --tail=50 | grep "connection" + +# Check service DNS resolution +kubectl exec -n deploy/unionai-dataplane-operator -- \ + nslookup ..svc.cluster.local +``` + +## Key differences from self-managed deployment + +| Feature | Self-managed | Self-hosted (intra-cluster) | +|---|---|---| +| Control plane location | External ({{< key product_name >}}-managed) | Same Kubernetes cluster | +| Network path | Internet or VPN | Kubernetes internal networking | +| Authentication | OAuth2 via `uctl selfserve` | OAuth2 with [manual setup](./authentication) | +| TLS certificates | Trusted CA certificates | Can use self-signed certificates | +| Ingress type | LoadBalancer (external) | ClusterIP (internal) | + +## Troubleshooting + +### Cannot resolve control plane services + +```shell +# Check DNS resolution from data plane namespace +kubectl run -n test-dns --image=busybox --rm -it -- \ + nslookup ..svc.cluster.local + +# Verify the service exists +kubectl get svc -n | grep nginx-controller +``` + +### Connection refused errors + +```shell +# Verify control plane services are running +kubectl get svc -n +kubectl get pods -n + +# Check network policies +kubectl get networkpolicies -n +kubectl get networkpolicies -n +``` + +### Certificate verification errors + +If using self-signed certificates, ensure `insecureSkipVerify: true` is set in `values.aws.selfhosted-intracluster.yaml`. Verify the `_U_INSECURE_SKIP_VERIFY` environment variable is set in task pods. + +### Authentication errors + +See the [Authentication troubleshooting](./authentication#troubleshooting) section. 
diff --git a/content/deployment/selfhosted/data-plane-gcp.md b/content/deployment/selfhosted/data-plane-gcp.md new file mode 100644 index 000000000..67d0ce2ad --- /dev/null +++ b/content/deployment/selfhosted/data-plane-gcp.md @@ -0,0 +1,163 @@ +--- +title: Data plane on GCP +weight: 4 +variants: -flyte +union +--- + +# Data plane on GCP + +This guide covers deploying the {{< key product_name >}} data plane in the same cluster as your control plane, as part of a [self-hosted deployment](./_index). + +> [!NOTE] +> Self-hosted intra-cluster deployment is currently officially supported on **AWS** only. +> GCP support is in preview and additional cloud providers are coming soon. +> For production deployments, see [Data plane on AWS](./data-plane-aws). + +> [!NOTE] +> Deploy the [control plane](./control-plane-gcp) first before proceeding with data plane installation. + +## Prerequisites + +In addition to the [general prerequisites](./_index#prerequisites): + +1. **{{< key product_name >}} control plane** deployed in the same cluster +2. **GCS buckets** for data plane metadata storage +3. **GCP service accounts** configured with [Workload Identity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity) for backend and worker service accounts +4. **Network connectivity** between data plane and control plane namespaces + +## Installation + +### Step 1: Install data plane CRDs + +```shell +helm upgrade --install unionai-dataplane-crds unionai/dataplane-crds \ + --namespace \ + --create-namespace +``` + +### Step 2: Download values file + +```shell +curl -O https://raw.githubusercontent.com/unionai/helm-charts/main/charts/dataplane/values.gcp.selfhosted-intracluster.yaml +``` + +Create an overrides file `values.gcp.selfhosted-overrides.yaml`: + +```yaml +global: + CLUSTER_NAME: "prod-us-central1" + ORG_NAME: "my-company" + METADATA_BUCKET: "my-company-dp-metadata" + FAST_REGISTRATION_BUCKET: "my-company-dp-metadata" + GCP_REGION: "us-central1" + GOOGLE_PROJECT_ID: "my-gcp-project" + BACKEND_IAM_ROLE_ARN: "union-backend@my-project.iam.gserviceaccount.com" + WORKER_IAM_ROLE_ARN: "union-worker@my-project.iam.gserviceaccount.com" + CONTROLPLANE_INTRA_CLUSTER_HOST: "..svc.cluster.local" + QUEUE_SERVICE_HOST: "..svc.cluster.local:80" + CACHESERVICE_ENDPOINT: "..svc.cluster.local:89" +``` + +If authentication is enabled on the control plane, also set `AUTH_CLIENT_ID`. See the [Authentication](./authentication) guide. + +### Step 3: Install data plane + +```shell +helm upgrade --install unionai-dataplane unionai/dataplane \ + --namespace \ + --create-namespace \ + -f values.gcp.selfhosted-intracluster.yaml \ + -f values.gcp.selfhosted-overrides.yaml \ + --timeout 10m \ + --wait +``` + +**Values file layers (applied in order):** + +1. [`values.gcp.selfhosted-intracluster.yaml`](https://github.com/unionai/helm-charts/blob/main/charts/dataplane/values.gcp.selfhosted-intracluster.yaml) — GCP infrastructure defaults (storage, networking, intra-cluster communication) +2. 
`values.gcp.selfhosted-overrides.yaml` — Your environment-specific overrides + +### Step 4: Verify installation + +```shell +# Check that data plane pods are running +kubectl get pods -n + +# Verify connectivity to control plane +kubectl logs -n -l app.kubernetes.io/name=operator --tail=50 | grep "connection" + +# Check service DNS resolution +kubectl exec -n deploy/unionai-dataplane-operator -- \ + nslookup ..svc.cluster.local +``` + +## Key differences from self-managed deployment + +| Feature | Self-managed | Self-hosted (intra-cluster) | +|---|---|---| +| Control plane location | External ({{< key product_name >}}-managed) | Same Kubernetes cluster | +| Network path | Internet or VPN | Kubernetes internal networking | +| Authentication | OAuth2 via `uctl selfserve` | OAuth2 with [manual setup](./authentication) | +| TLS certificates | Trusted CA certificates | Can use self-signed certificates | +| Ingress type | LoadBalancer (external) | ClusterIP (internal) | + +## Troubleshooting + +### Cannot resolve control plane services + +```shell +# Check DNS resolution from data plane namespace +kubectl run -n test-dns --image=busybox --rm -it -- \ + nslookup ..svc.cluster.local + +# Verify the service exists +kubectl get svc -n | grep nginx-controller +``` + +### Connection refused errors + +```shell +# Verify control plane services are running +kubectl get svc -n +kubectl get pods -n + +# Check network policies +kubectl get networkpolicies -n +kubectl get networkpolicies -n +``` + +### Workload Identity issues + +```shell +# Verify service account annotations +kubectl get sa -n -o yaml | grep iam.gke.io/gcp-service-account + +# Check IAM bindings +gcloud iam service-accounts get-iam-policy +gcloud iam service-accounts get-iam-policy + +# Verify pod can authenticate +kubectl exec -n deploy/unionai-dataplane-operator -- \ + curl -H "Metadata-Flavor: Google" \ + http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email +``` + +### GCS access issues + +```shell +# Verify bucket exists +gsutil ls -b gs:// + +# Check bucket IAM permissions +gsutil iam get gs:// +``` + +Backend service accounts need `roles/storage.objectAdmin` or equivalent on the metadata bucket. + +### Certificate verification errors + +If using self-signed certificates, ensure `insecureSkipVerify: true` is set in `values.gcp.selfhosted-intracluster.yaml`. Verify the `_U_INSECURE_SKIP_VERIFY` environment variable is set in task pods. + +### Authentication errors + +See the [Authentication troubleshooting](./authentication#troubleshooting) section. diff --git a/content/deployment/selfhosted/image-builder.md b/content/deployment/selfhosted/image-builder.md new file mode 100644 index 000000000..9a6007829 --- /dev/null +++ b/content/deployment/selfhosted/image-builder.md @@ -0,0 +1,309 @@ +--- +title: Image builder +weight: 8 +variants: -flyte +union +--- + +# Image builder + +The image builder enables {{< key product_name >}} to automatically build container images for your tasks when using the `flyte.Image` API. In self-hosted deployments, the image builder requires a one-time registration step after initial deployment. 
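+
+For context, tasks opt into the image builder by declaring an image spec with the `flyte.Image` API instead of a pre-built image URI. The snippet below is a minimal sketch — the registry host and environment name are placeholders, and the exact `TaskEnvironment` constructor arguments may vary by SDK version:
+
+```python
+import flyte
+
+# Sketch: derive a task image from a base image in a registry that your
+# buildkit pods can reach (placeholder hostname).
+image = flyte.Image.from_base("your-registry.example.com/python:3.12-slim")
+
+# Assumed constructor: tasks in this environment use the declared image spec,
+# which is built by the build-image task registered below.
+env = flyte.TaskEnvironment(name="my-env", image=image)
+```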
+ +## Prerequisites + +- A running self-hosted deployment with both control plane and data plane healthy +- The `flyte` CLI installed (`pip install flyte` or `uv pip install flyte`) +- CLI access configured for your self-hosted environment (see [Authentication](./authentication)) +- Image builder enabled in your data plane Helm values (`imageBuilder.enabled: true`) +- An account with permission to register tasks in the `system` project + +## Register the build-image task + +The `build-image` task must be registered in the `system/production` project before users can build images. This step must be repeated each time you upgrade to a new {{< key product_name >}} version. + +### Step 1: Determine your appVersion + +Find the `appVersion` from your deployed controlplane Helm chart: + +```shell +helm list -n -o json | jq '.[0].app_version' +``` + +Or check the chart directly: + +```shell +grep appVersion charts/controlplane/Chart.yaml +``` + +This returns a version like `2026.3.4`. + +### Step 2: Create the task definition file + +Create a file named `build_image_task.py` with the following contents: + +```python +import os + +from kubernetes.client import ( + V1PodSpec, + V1Container, + V1EnvVar, + V1VolumeMount, + V1EnvVarSource, + V1ObjectFieldSelector, + V1ConfigMapKeySelector, + V1Volume, + V1ProjectedVolumeSource, + V1VolumeProjection, + V1ServiceAccountTokenProjection, + V1ConfigMapVolumeSource, + V1KeyToPath, +) + +import flyte +from flyte.extras import ContainerTask + + +# Statically assigned name intended to match Union operator static name. +_DEFAULT_CONFIGMAP_NAME = "build-image-config" +_STORAGE_YAML_KEY = "storage.yaml" +_CONFIG_DIR = "/etc/union/config" + +# The config map storing build image configuration. +config_map_name = os.getenv("CONFIG_MAP_NAME", _DEFAULT_CONFIGMAP_NAME) + +log_level = os.getenv("LOG_LEVEL", "5") # Default to Warn + +union_image_name_prefix = "public.ecr.aws/g1m2l3c1/imagebuilder-staging" + +app_version = os.getenv("APP_VERSION", None) +if app_version is None: + raise ValueError("APP_VERSION environment variable must be set") + +build_image_task = ContainerTask( + name="build-image", + cache=flyte.Cache(behavior="auto"), + image=f"{union_image_name_prefix}/build-image:{app_version}", + inputs={"spec": str, "context": str, "target_image": str}, + outputs={"fully_qualified_image": str}, + pod_template=flyte.PodTemplate( + primary_container_name="main", + pod_spec=V1PodSpec( + containers=[ + V1Container( + name="main", + image_pull_policy="Always", + termination_message_policy="FallbackToLogsOnError", + volume_mounts=[ + V1VolumeMount( + mount_path="/var/run/secrets/union/registry", + name="registry-token", + read_only=True, + ), + V1VolumeMount( + name="config-volume", + mount_path=f"{_CONFIG_DIR}/{_STORAGE_YAML_KEY}", + sub_path=_STORAGE_YAML_KEY, + ), + ], + env=[ + V1EnvVar( + name="ORGANIZATION", + value_from=V1EnvVarSource( + field_ref=V1ObjectFieldSelector( + field_path="metadata.labels['organization']" + ) + ), + ), + V1EnvVar( + name="UNION_BUILDKIT_URI", + value_from=V1EnvVarSource( + config_map_key_ref=V1ConfigMapKeySelector( + name=config_map_name, + key="buildkit-uri", + ) + ), + ), + V1EnvVar( + name="UNION_DEFAULT_REPOSITORY", + value_from=V1EnvVarSource( + config_map_key_ref=V1ConfigMapKeySelector( + name=config_map_name, + key="default-repository", + ) + ), + ), + V1EnvVar( + name="UNION_REGISTRY_AUTHENTICATION_TYPE", + value_from=V1EnvVarSource( + config_map_key_ref=V1ConfigMapKeySelector( + name=config_map_name, + 
key="authentication-type", + ) + ), + ), + V1EnvVar( + name="UNION_IMAGE_NAME_PREFIX", + value=union_image_name_prefix, + ), + V1EnvVar( + name="FLYTE_INTERNAL_OPTIMIZE_IMAGE", + value_from=V1EnvVarSource( + config_map_key_ref=V1ConfigMapKeySelector( + name=config_map_name, + key="enable-image-optimization", + optional=True, + ) + ), + ), + ], + ), + ], + volumes=[ + V1Volume( + name="registry-token", + projected=V1ProjectedVolumeSource( + sources=[ + V1VolumeProjection( + service_account_token=V1ServiceAccountTokenProjection( + audience="registry", + expiration_seconds=7200, + path="token", + ) + ) + ] + ), + ), + V1Volume( + name="config-volume", + config_map=V1ConfigMapVolumeSource( + name=config_map_name, + items=[ + V1KeyToPath( + key=_STORAGE_YAML_KEY, + path=_STORAGE_YAML_KEY, + ) + ], + ), + ), + ], + ), + ), + command=[ + "imagebuild", + "--logger.formatter.type=text", + f"--logger.level={log_level}", + "--context", + "{{.inputs.context}}", + "--frontend", + f"{union_image_name_prefix}/frontend-v2:{app_version}", + "--remote-outputs-prefix", + "{{.outputPrefix}}", + "--spec", + "{{.inputs.spec}}", + "--target-image", + "{{.inputs.target_image}}", + "--optimize", + ], +) + +build_image_task_env = flyte.TaskEnvironment.from_task( + "build_image_task", build_image_task +) +``` + +### Step 3: Register the task + +From the directory containing `build_image_task.py`, run: + +```shell +APP_VERSION= \ +uv run flyte --config deploy \ + --version "" \ + --project system --domain production \ + build_image_task.py build_image_task_env +``` + +Replace: +- `` with the appVersion from Step 1 (e.g. `2026.3.4`) +- `` with the path to your CLI config file + +### Step 4: Verify + +Confirm the task is registered: + +```shell +flyte --config get task -p system -d production +``` + +You should see `build-image` listed. + +## Updating + +When you upgrade your self-hosted deployment to a new appVersion, repeat the registration steps with the new version. The SDK always uses the latest registered version. + +## Restricted network environments + +If your buildkit pods do not have egress to public container registries or the internet (e.g. due to network policies, firewall rules, or air-gapped infrastructure), additional configuration is required. + +### Container image access + +The image builder needs to pull three images during a build: + +| Image | Purpose | Default source | +|-------|---------|---------------| +| `frontend-v2` | Buildkit gateway frontend | `public.ecr.aws/...` (Union) | +| `build-image` | Build task executor | `public.ecr.aws/...` (Union) | +| Base image (e.g. `python:3.12-slim`) | User's base image | Docker Hub or custom registry | + +If buildkit cannot reach public registries, you must mirror these images to an internal registry that buildkit can access. Update the `union_image_name_prefix` in your `build_image_task.py` to point to your internal registry, and use `Image.from_base()` with an image URI from your internal registry. + +> [!NOTE] +> Starting with version `2026.3.7`, the `uv` package manager binary is baked into the `frontend-v2` image. Previous versions pulled `ghcr.io/astral-sh/uv` as a separate image during builds, which required additional mirroring. + +### Python version constraints + +When using a custom base image via `Image.from_base()`, the Python version in your base image must match the Python version in your image specification. 
The image builder does not download Python interpreters from the internet — it uses only the Python already installed in the base image. + +If there is a version mismatch, the build will fail with: + +``` +error: No interpreter found for Python 3.13 in managed installations or search path +hint: A managed Python download is available for Python 3.13, but Python downloads are set to 'never' +``` + +To resolve this, ensure your `python_version` matches what is installed in your base image: + +```python +# Base image has Python 3.12 installed +image = Image.from_base("your-registry.example.com/python:3.12-slim") +``` + +### AWS VPC endpoints + +For AWS deployments where pods cannot reach public endpoints, configure VPC interface endpoints for ECR so that buildkit can push and pull images through private network paths: + +- `com.amazonaws..ecr.api` — ECR API (authentication, image metadata) +- `com.amazonaws..ecr.dkr` — ECR Docker registry (image push/pull) +- `com.amazonaws..s3` — S3 Gateway endpoint (ECR stores image layers in S3) + +The ECR interface endpoints must have private DNS enabled and a security group that allows inbound traffic from the VPC CIDR blocks (including any secondary CIDRs used by EKS pod networking). + +### Package index access + +The image builder runs `pip install` or `uv pip install` to install Python packages during builds. If your buildkit pods cannot reach `pypi.org`, you must configure a private package index. You can do this by including a `pip.conf` or setting the `PIP_INDEX_URL` environment variable in your base image: + +```dockerfile +# In your custom base image Dockerfile +ENV PIP_INDEX_URL=https://your-artifactory.example.com/api/pypi/pypi-remote/simple +ENV PIP_TRUSTED_HOST=your-artifactory.example.com +``` + +## Troubleshooting + +### "remote image builder is not enabled" + +This error from the SDK means the `build-image` task is not registered in `system/production`. Follow the registration steps above. + +### Build task fails with permission errors + +Ensure the worker IAM role has permissions to push images to the container registry configured in your data plane Helm values (`imageBuilder.defaultRepository`). diff --git a/content/deployment/selfhosted/monitoring.md b/content/deployment/selfhosted/monitoring.md new file mode 100644 index 000000000..16e6c84d2 --- /dev/null +++ b/content/deployment/selfhosted/monitoring.md @@ -0,0 +1,376 @@ +--- +title: Monitoring +weight: 7 +variants: -flyte +union +mermaid: true +--- +# Monitoring + +{{< key product_name >}} provides built-in monitoring with Prometheus, Grafana dashboards, alerting rules, and SLO tracking. The monitoring stack is deployed and configured through the Helm charts. + +## Architecture + +### Self-hosted intra-cluster + +In a self-hosted deployment, the controlplane and dataplane share a single Kubernetes cluster. The controlplane namespace runs Prometheus, Grafana, and AlertManager. Prometheus scrapes metrics from services in both namespaces. 
+ +```mermaid +graph LR + subgraph cluster["Kubernetes Cluster"] + subgraph cp["Controlplane Namespace"] + prom["Prometheus\nGrafana\nAlertManager"] + cpsvc["CP Services\nServiceMonitor\nPrometheusRule\nDashboard CM"] + end + + subgraph dp["Dataplane Namespace"] + dpsvc["Operator\nExecutor\nPropeller"] + dpmon["ServiceMonitor\nPrometheusRule\nDashboard CM"] + static["Static Prometheus\n(Union features)"] + end + + prom -- scrapes --> dpsvc + prom -- scrapes --> cpsvc + end +``` + +### Separate controlplane and dataplane clusters + +When the controlplane and dataplane run in separate clusters, each cluster can run its own monitoring stack independently. The dataplane chart includes the same Prometheus, Grafana, and alerting capabilities. + +```mermaid +graph LR + subgraph cpcluster["Controlplane Cluster"] + cpprom["Prometheus\nGrafana\nAlertManager"] + cpstuff["CP Services\nServiceMonitor\nPrometheusRule\nDashboard CM"] + end + + subgraph dpcluster["Dataplane Cluster"] + dpprom["Prometheus\nGrafana\nAlertManager"] + dpstuff["Operator · Executor · Propeller\nServiceMonitor\nPrometheusRule\nDashboard CM"] + end + + cpprom -- scrapes --> cpstuff + dpprom -- scrapes --> dpstuff +``` + +## Dashboards + +{{< key product_name >}} ships two pre-built Grafana dashboards delivered as ConfigMaps. They are defined in the Helm charts: + +- [controlplane chart](https://github.com/unionai/helm-charts/tree/main/charts/controlplane) — `union-controlplane-overview` +- [dataplane chart](https://github.com/unionai/helm-charts/tree/main/charts/dataplane) — `union-dataplane-overview` + +### Controlplane Overview + +| Row | What it shows | +| -------------- | ----------------------------------------------------------------------------- | +| SLOs | Service availability, error budget, ingress success rate, ingress latency p99 | +| Health | Service availability, pod restarts, handler panics, Connect error rate | +| Ingress | Request rate by path, error rate, latency percentiles, active connections | +| Connect / gRPC | Per-service request rate and errors, CacheService gRPC | +| FlyteAdmin | Active executions, event rates, endpoint latency, auth decisions | +| Executions | Execution lifecycle, assignment duration, workqueue operations | +| Queue | Scheduler throughput, queue lengths, dispatcher operations, worker capacity | +| Cluster | Heartbeat rate, cluster health, managed cluster cache | +| CacheService | Cache hit/miss rate, reservation contention | +| Authorizer | Allow/deny rate, authorize latency | +| Data Proxy | Cache rates, image read latency, secret proxy errors | +| Usage | Billable usage reports, message pipeline | +| Infrastructure | CPU, memory, and pod restarts by service | + +### Dataplane Overview + +| Row | What it shows | +| -------------- | --------------------------------------------------------------------------------- | +| SLOs | Service availability, error budget, execution success rate, propeller latency p99 | +| Health | Service availability, pod restarts, handler panics, active workflows | +| Union Operator | Work queue operations, heartbeat latency, config sync, billing | +| Executor (V2) | Active actions, capacity, evaluator latency, system failures | +| Propeller (V1) | Round time, success/error rate, workflow updates, event recording | +| gRPC Client | DP→CP request rate, errors, latency | +| Infrastructure | CPU, memory, and pod restarts by service | + +### Adding custom dashboards + +Create a ConfigMap with the `grafana_dashboard` label in any namespace. 
The Grafana sidecar discovers it automatically: + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: my-custom-dashboard + labels: + grafana_dashboard: "1" +data: + my-dashboard.json: | + { ... Grafana dashboard JSON ... } +``` + +## Service Level Objectives (SLOs) + +The SLO row at the top of each dashboard provides at-a-glance visibility into platform health. These panels are always visible — no configuration needed. + +### What the SLOs measure + +| SLO | What it represents | Controlplane | Dataplane | +| ------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------ | +| **Service Availability** | Are all deployments running their desired replica count? Measures infrastructure health — pods that are down, crashlooping, or pending reduce this metric. | Deployment availability across all CP services | Deployment availability across all DP services | +| **Success Rate** | Are API requests and task executions completing without errors? This is the primary indicator of whether the platform is functioning correctly for users. | Ingress success rate (non-5xx responses) — measures what SDK and API callers experience | Execution success rate (combined V1 propeller round success + V2 executor task completion) | +| **Latency** | Are requests being served within acceptable time? High latency degrades user experience even when success rate is high. | Ingress p99 latency — the worst-case response time callers experience | Propeller round p99 — the worst-case time to process one workflow reconciliation | +| **Error Budget** | How much room is left before the availability target is breached? Derived from the success rate and the configured availability target (default 99.9%). When the budget reaches zero, reliability is below target. | Based on ingress success rate vs target | Based on execution success rate vs target | + +### Enabling SLO recording rules + +The SLO dashboard panels show basic metrics by default. For error budget tracking, enable the SLO recording rules: + +```yaml +monitoring: + slos: + enabled: true + targets: + availability: 0.999 # 99.9% — adjust to your requirements + latencyP99: 5 # seconds — adjust to your requirements +``` + +The recording rules pre-compute success rates and error budget remaining as Prometheus metrics. These are recommended starting points — tune the targets based on your traffic patterns and performance baseline. + +## Alerting + +{{< key product_name >}} includes two layers of alerting that you can enable independently. + +### Operational alerts + +Operational alerts detect basic infrastructure failures — services that are down, containers that are crashlooping, or code panics. Enable them in your values: + +```yaml +monitoring: + alerting: + enabled: true +``` + +| Alert | Severity | Fires when | +| --------------- | -------- | ------------------------------------------------- | +| ServiceDown | critical | Any deployment has 0 available replicas for 5 min | +| HighRestartRate | warning | A container restarts more than 5 times in 1 hour | +| HandlerPanic | critical | Any service handler panic in the last hour | + +These alerts fire on both the controlplane and dataplane. 
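+
+To spot-check that the operational alert rules were created after enabling this flag, list the PrometheusRule resources in both namespaces (the namespace names below are placeholders for your release namespaces):
+
+```shell
+kubectl get prometheusrules -n <controlplane-namespace>
+kubectl get prometheusrules -n <dataplane-namespace>
+```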
+ +### SLO-based alerts + +SLO alerts track error budget consumption and latency against configurable targets. These are provided as recommended starting points — adjust the targets and thresholds to match your operational requirements. + +```yaml +monitoring: + slos: + enabled: true + alerting: + enabled: true + targets: + availability: 0.999 # 99.9% — adjust to your requirements + latencyP99: 5 # seconds — adjust to your requirements +``` + +| Alert | Severity | Fires when | +| -------------------- | -------- | --------------------------------------- | +| HighErrorBudgetBurn | warning | Error budget more than 50% consumed | +| ErrorBudgetExhausted | critical | Error budget fully consumed | +| LatencySLOBreach | warning | p99 latency exceeding target for 10 min | + +> [!NOTE] +> The default SLO targets (99.9% availability, 5s p99 latency) are starting points. Every deployment has different traffic patterns and performance characteristics. Review the SLO dashboard panels after enabling to understand your baseline, then tune the targets to values that are meaningful for your environment. + +### Configuring notifications + +By default, alerts are evaluated and visible in Grafana but do not send notifications. To receive notifications when alerts fire: + +1. Open Grafana at `https:///grafana` +2. Navigate to **Alerting → Contact points** +3. Click **Add contact point** +4. Select your notification channel (Slack, PagerDuty, email, etc.) and configure it +5. Under **Alerting → Notification policies**, route alerts to your contact point + +Alternatively, configure AlertManager receivers directly in your Helm values: + +```yaml +monitoring: + alertmanager: + config: + route: + receiver: my-slack + receivers: + - name: my-slack + slack_configs: + - api_url: "https://hooks.slack.com/services/..." + channel: "#alerts" +``` + +## Configuration + +### ServiceMonitors and PrometheusRules + +{{< key product_name >}} creates ServiceMonitors, PrometheusRules, and dashboard ConfigMaps independently of the kube-prometheus-stack subchart. These resources are controlled by their own flags: + +```yaml +monitoring: + # ServiceMonitor CRDs for Union services. + # Discovered by any Prometheus Operator in the cluster. + serviceMonitors: + enabled: true + + # PrometheusRule CRDs with recording rules. + # Alerting rules require monitoring.alerting.enabled. + prometheusRules: + enabled: true + + # Dashboard ConfigMaps discovered by Grafana sidecar. + dashboards: + enabled: true + label: grafana_dashboard + labelValue: "1" +``` + +These flags default to `true` and work regardless of whether `monitoring.enabled` is set. This is useful when you bring your own Prometheus or Grafana — {{< key product_name >}} resources are created without deploying the full kube-prometheus-stack. + +### Dashboard label configuration + +If your Grafana sidecar uses a different label, configure it: + +```yaml +monitoring: + dashboards: + label: my-custom-label + labelValue: "true" +``` + +## Accessing Grafana + +When the kube-prometheus-stack subchart is enabled (`monitoring.enabled: true`), Grafana is deployed in the controlplane namespace and served at: + +``` +https:///grafana +``` + +Authentication is handled by the same ingress auth gate as other controlplane services. No separate Grafana credentials are needed. + +> [!NOTE] +> Grafana is part of the optional kube-prometheus-stack subchart. 
If you use your own Grafana instance, set `monitoring.grafana.enabled: false` and configure your Grafana to discover the dashboard ConfigMaps using the `grafana_dashboard` label. + +## Customization + +### Remote write + +Forward metrics to an external time-series database (Amazon Managed Prometheus, Grafana Cloud, Thanos) while keeping the full local Prometheus: + +```yaml +monitoring: + prometheus: + prometheusSpec: + remoteWrite: + - url: "https://aps-workspaces..amazonaws.com/workspaces//api/v1/remote_write" + sigv4: + region: +``` + +This runs Prometheus in fan-out mode — metrics are stored locally and forwarded to the remote backend. Recording rules, alerting, and Grafana all continue to work against the local Prometheus. + +### Using your own Prometheus + +If you already run Prometheus, scrape {{< key product_name >}} services directly. All services expose metrics on port 10254 at `/metrics`. + +#### ServiceMonitor + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: union-services +spec: + selector: + matchLabels: + platform.union.ai/prometheus-group: "union-services" + namespaceSelector: + matchNames: + - controlplane + - dataplane + endpoints: + - port: debug + path: /metrics + interval: 30s +``` + +#### Static scrape config + +```yaml +scrape_configs: + - job_name: union-services + kubernetes_sd_configs: + - role: endpoints + namespaces: + names: [controlplane, dataplane] + relabel_configs: + - source_labels: [__meta_kubernetes_service_label_platform_union_ai_prometheus_group] + regex: union-services + action: keep + - source_labels: [__meta_kubernetes_endpoint_port_name] + regex: debug + action: keep +``` + +## Managed Prometheus examples + +The following examples show how to replace the local Prometheus with a managed Prometheus service for durable storage and scalable query. In each case, Prometheus runs in **agent mode** — it only scrapes and forwards metrics, with no local TSDB. + +### Amazon Managed Prometheus (AMP) + +For AWS deployments where a single Prometheus instance may not scale with high-burst workloads, switch to PrometheusAgent mode with AMP as the backend. + +```yaml +monitoring: + prometheus: + enabled: true + agentMode: true + serviceAccount: + create: true + annotations: + eks.amazonaws.com/role-arn: "" + prometheusSpec: + remoteWrite: + - url: "https://aps-workspaces..amazonaws.com/workspaces//api/v1/remote_write" + sigv4: + region: + queueConfig: + maxSamplesPerSend: 1000 + maxShards: 200 + capacity: 2500 + alertmanager: + enabled: false + grafana: + sidecar: + datasources: + defaultDatasourceEnabled: false + serviceAccount: + create: true + annotations: + eks.amazonaws.com/role-arn: "" + grafana.ini: + auth: + sigv4_auth_enabled: true + additionalDataSources: + - name: AMP + type: prometheus + url: "https://aps-workspaces..amazonaws.com/workspaces//" + access: proxy + isDefault: true + jsonData: + sigV4Auth: true + sigV4Region: + httpMethod: POST +``` + +This requires two IRSA roles: +- **Prometheus write**: `aps:RemoteWrite` permission on the AMP workspace +- **Grafana read**: `aps:QueryMetrics`, `aps:GetMetricMetadata`, `aps:GetSeries`, `aps:GetLabels` permissions on the AMP workspace + +> [!NOTE] +> PrometheusAgent cannot evaluate recording or alerting rules. PrometheusRule CRDs are deployed but inert in agent mode. Dashboard panels that rely on raw metrics (Health, Ingress, Connect, Infrastructure rows) work normally. 
SLO panels that depend on recording rules (`union:cp:slo:*`, `union:dp:slo:*`) will show no data unless you configure [AMP Ruler](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-ruler.html) to evaluate those rules server-side. The PrometheusRule template files in the Helm charts (`templates/monitoring/prometheusrule.yaml`) contain the rule definitions in standard Prometheus format and can be uploaded directly to AMP Ruler. diff --git a/content/deployment/selfhosted/operations/_index.md b/content/deployment/selfhosted/operations/_index.md new file mode 100644 index 000000000..c42f062e0 --- /dev/null +++ b/content/deployment/selfhosted/operations/_index.md @@ -0,0 +1,17 @@ +--- +title: Operations +weight: 9 +variants: -flyte +union +--- + +# Operations + +Guides for operating a {{< key product_name >}} self-hosted deployment after initial setup. + +{{< grid >}} + +{{< link-card target="./cicd" icon="git-merge" title="CI/CD integration" >}} +Deploy workflows from CI/CD pipelines using non-interactive authentication +{{< /link-card >}} + +{{< /grid >}} diff --git a/content/deployment/selfhosted/operations/cicd.md b/content/deployment/selfhosted/operations/cicd.md new file mode 100644 index 000000000..da0203a9e --- /dev/null +++ b/content/deployment/selfhosted/operations/cicd.md @@ -0,0 +1,244 @@ +--- +title: CI/CD integration +weight: 1 +variants: -flyte +union +--- + +# CI/CD integration + +This guide covers how to authenticate `flyte deploy` from a CI/CD pipeline (GitHub Actions, Jenkins, GitLab CI, etc.) against a self-hosted {{< key product_name >}} deployment. + +In serverless and BYOC deployments, `flyte create api-key` mints an API key automatically. Self-hosted deployments don't have access to the identity service that backs this command. Instead, you create a dedicated OAuth application in your identity provider and encode its credentials as an API key manually. + +For `flyte deploy` usage, flags, and workflow examples, see the [CI/CD deployments]({{< docs_home union v2 >}}/user-guide/project-patterns/cicd/) guide. + +## Prerequisites + +- [Authentication](../authentication) is configured and working (Apps 1-5) +- The `flyte` CLI is installed (`pip install flyte` or `uv pip install flyte`) +- You have admin access to your identity provider to create a new OAuth application + +## Step 1: Create a CI/CD OAuth application + +Create a new confidential (service) application in your identity provider. This is the same type of application as the service-to-service app (App 3) documented in the [authentication guide](../authentication), but dedicated to CI/CD so you can manage its lifecycle and permissions independently. + +{{< tabs >}} +{{< tab "Okta" >}} +{{< markdown >}} +1. In the Okta Admin Console, go to **Applications > Create App Integration** +2. Select **API Services** (machine-to-machine) +3. Name it descriptively (e.g., `union-cicd` or `union-jenkins`) +4. After creation, note the **Client ID** and **Client Secret** +5. Go to your custom authorization server (**Security > API > Authorization Servers**) +6. Under **Access Policies**, ensure the CI/CD app is allowed the `client_credentials` grant with the `all` scope + +> [!NOTE] +> If you want per-team or per-project CI/CD keys, create separate OAuth apps for each and assign different access policies. +{{< /markdown >}} +{{< /tab >}} +{{< tab "Entra ID" >}} +{{< markdown >}} +1. In the Azure portal, go to **Microsoft Entra ID > App registrations > New registration** +2. Name it descriptively (e.g., `union-cicd`) +3. 
Set **Supported account types** to **Single tenant** +4. No redirect URI is needed — this app uses client credentials only +5. After creation, go to **Certificates & secrets > New client secret** and save the secret value +6. Go to the **Union API app registration** (the one with "Expose an API" configured — typically App 1): + - Under **Expose an API > Authorized client applications**, add the CI/CD app's Client ID + - Under **App roles**, ensure an `all` role exists +7. Back on the CI/CD app registration: + - Go to **API permissions > Add a permission > My APIs** + - Select the Union API app and grant the `all` Application permission +8. **Grant admin consent**: Go to **Enterprise Applications > CI/CD app > Permissions > Grant admin consent for \<your tenant\>** + +> [!WARNING] +> Without admin consent, client_credentials token requests will fail with an `AADSTS` error. This is the most common setup issue. +{{< /markdown >}} +{{< /tab >}} +{{< tab "Generic OIDC" >}} +{{< markdown >}} +1. Create a new **confidential client** in your identity provider +2. Enable the `client_credentials` grant type +3. Assign the appropriate scope (typically `all` or the scope configured on your authorization server) +4. Note the **Client ID** and **Client Secret** + +If your provider requires explicit audience configuration, set the audience to match the `allowedAudience` configured in your control plane Helm values. +{{< /markdown >}} +{{< /tab >}} +{{< /tabs >}} + +## Step 2: Build the API key + +Encode the credentials as a base64 string in the format `<domain>:<client-id>:<client-secret>:` — note the **trailing colon**: + +```shell +echo -n "<domain>:<client-id>:<client-secret>:" | base64 +``` + +For example: + +```shell +echo -n "union.example.com:abc123:secret456:" | base64 +# dW5pb24uZXhhbXBsZS5jb206YWJjMTIzOnNlY3JldDQ1Njo= +``` + +The four fields are: +1. **Domain** — your control plane ingress domain (without `https://`) +2. **Client ID** — from the OAuth app you just created +3. **Client secret** — from the OAuth app you just created +4. **Organization** — leave empty for self-hosted (the trailing colon is still required) + +## Step 3: Store in your CI secret manager + +Add the base64 string to your CI system's secret store and expose it as the `FLYTE_API_KEY` environment variable: + +{{< tabs >}} +{{< tab "GitHub Actions" >}} +{{< markdown >}} +1. Go to **Settings > Secrets and variables > Actions > New repository secret** +2. Name: `FLYTE_API_KEY` +3. Value: the base64 string from Step 2 + +In your workflow: +```yaml +- name: Deploy workflows + env: + FLYTE_API_KEY: ${{ secrets.FLYTE_API_KEY }} + run: flyte deploy ... +``` +{{< /markdown >}} +{{< /tab >}} +{{< tab "Jenkins" >}} +{{< markdown >}} +1. Go to **Manage Jenkins > Credentials > Add Credentials** +2. Kind: **Secret text** +3. Secret: the base64 string from Step 2 +4. ID: `flyte-api-key` + +In your declarative Jenkinsfile: +```groovy +pipeline { + agent any + environment { + FLYTE_API_KEY = credentials('flyte-api-key') + } + stages { + stage('Deploy') { + steps { + sh 'flyte deploy ...' + } + } + } +} +``` +{{< /markdown >}} +{{< /tab >}} +{{< tab "GitLab CI" >}} +{{< markdown >}} +1. Go to **Settings > CI/CD > Variables > Add variable** +2. Key: `FLYTE_API_KEY` +3. Value: the base64 string from Step 2 +4. Check **Mask variable** and **Protect variable** + +In your `.gitlab-ci.yml`: +```yaml +deploy: + script: + - flyte deploy ... +``` + +The `FLYTE_API_KEY` variable is automatically available to all jobs. 
+{{< /markdown >}} +{{< /tab >}} +{{< /tabs >}} + +## Step 4: Configure `flyte deploy` + +Create a `config.yaml` in your repository pointing at your self-hosted deployment: + +```yaml +admin: + endpoint: dns:///<your-domain> + insecure: false # Set to true if using self-signed certificates +image: + builder: remote # Or "local" if you pre-build images +task: + project: <project> + domain: <domain> +``` + +When `FLYTE_API_KEY` is set, the CLI uses it for authentication automatically — it overrides any other auth mode configured in `config.yaml` (including `ExternalCommand`-based SSO flows). No config changes are needed to switch between interactive and CI authentication. + +## Step 5: Test + +Verify the credentials work before wiring them into your pipeline: + +```shell +# 1. Test token acquisition (replace <token-endpoint> with your IdP's token endpoint) +curl -s -X POST "<token-endpoint>" \ + -d "grant_type=client_credentials" \ + -d "client_id=<client-id>" \ + -d "client_secret=<client-secret>" \ + -d "scope=<scope>" | jq .access_token +``` + +{{< tabs >}} +{{< tab "Okta" >}} +{{< markdown >}} +```shell +curl -s -X POST "https://<okta-domain>/oauth2/<authorization-server-id>/v1/token" \ + -d "grant_type=client_credentials" \ + -d "client_id=<client-id>" \ + -d "client_secret=<client-secret>" \ + -d "scope=all" | jq .access_token +``` +{{< /markdown >}} +{{< /tab >}} +{{< tab "Entra ID" >}} +{{< markdown >}} +```shell +curl -s -X POST "https://login.microsoftonline.com/<tenant-id>/oauth2/v2.0/token" \ + -d "grant_type=client_credentials" \ + -d "client_id=<client-id>" \ + -d "client_secret=<client-secret>" \ + -d "scope=api://<union-api-app-client-id>/.default" | jq .access_token +``` +{{< /markdown >}} +{{< /tab >}} +{{< tab "Generic OIDC" >}} +{{< markdown >}} +```shell +curl -s -X POST "<issuer-url>/token" \ + -d "grant_type=client_credentials" \ + -d "client_id=<client-id>" \ + -d "client_secret=<client-secret>" \ + -d "scope=all" | jq .access_token +``` +{{< /markdown >}} +{{< /tab >}} +{{< /tabs >}} + +If you receive a valid JWT, test the full flow: + +```shell +export FLYTE_API_KEY="<api-key>" +flyte deploy --config config.yaml --copy-style none --version test-$(date +%s) \ + --project <project> --domain <domain> path/to/tasks.py +``` + +## Permissions and RBAC + +The CI/CD app's access depends on your [authorization](../authorization) configuration: + +- **Noop mode**: The app has full access to all projects and domains +- **External authorization**: Configure your external authz service to grant the CI/CD app's identity appropriate permissions +- **Union RBAC**: Create a role scoped to the target project/domain and bind it to the CI/CD app's identity + +For teams sharing a cluster, create **separate OAuth apps per team or per repository** so that one team's CI key cannot deploy to another team's project. See the [CI/CD deployments]({{< docs_home union v2 >}}/user-guide/project-patterns/cicd/#key-scope-and-rotation) guide for more on permission scoping. + +## Key rotation + +Rotate CI/CD credentials on a regular schedule (90 days recommended): + +1. Create a new client secret in your identity provider (don't delete the old one yet) +2. Re-encode with the new secret: `echo -n "<domain>:<client-id>:<new-client-secret>:" | base64` +3. Update the `FLYTE_API_KEY` secret in your CI system +4. Verify a deploy succeeds with the new key +5. Delete the old client secret from your identity provider
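+ +Before deleting the old secret, it can help to confirm which client ID a stored key actually encodes, especially when several pipelines share a cluster. The following is a minimal sketch that assumes only the four-field `domain:client-id:client-secret:organization` layout described in Step 2 and that no field contains a literal `:`; it deliberately avoids printing the secret itself: + +```shell +# Decode the key currently exposed to the pipeline and print the non-secret fields. +# Field order per Step 2: domain : client-id : client-secret : organization +# (some BSD/macOS base64 builds spell the decode flag -D instead of -d) +echo -n "$FLYTE_API_KEY" | base64 -d | awk -F: '{ printf "domain: %s\nclient_id: %s\norg: %s\n", $1, $2, $4 }' +``` + +If the printed client ID still matches the old OAuth app, the pipeline has not yet picked up the rotated key.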