diff --git a/content/security/_index.md b/content/security/_index.md index 5d90def83..fb67c8988 100644 --- a/content/security/_index.md +++ b/content/security/_index.md @@ -7,25 +7,48 @@ top_menu: true # Security -Union.ai provides a production-grade workflow orchestration platform built on Flyte, designed for AI/ML and data-intensive workloads. -Security is foundational to Union.ai’s architecture, not an afterthought. -This document provides a comprehensive overview of Union.ai’s security practices, architecture, and compliance posture for enterprise security professionals evaluating the platform. - -Union.ai’s security model is built on several core principles: - -* **Data residency:** Customer data is stored and computed only within the customer's data plane. The Union.ai control plane stores only orchestration metadata—no task inputs, outputs, code, logs, secrets, or container images. -* **Architectural isolation:** A strict separation between the Union-hosted control plane and the customer-hosted data plane ensures that the blast radius of any control plane compromise does not extend to customer data. -* **Outbound only connectivity:** The Cloudflare Tunnel connecting the control plane to the data plane is outbound-only from the customer’s network, requiring no inbound firewall rules. All communication uses mutual TLS (mTLS) and is authenticated using the customer's Auth / SSO. -* **Compliance:** Union.ai is SOC 2 Type II certified for Security, Availability, and Integrity, with practices aligned to ISO 27001 and GDPR standards. Union is designed to meet HIPAA compliance requirements for handling Protected Health Information (PHI) and maintains CIS 1.4 AWS certification while pursuing CIS 3.0 certification (in progress). The Union.ai trust portal can be found at [trust.union.ai](https://trust.union.ai) -* **Defense in depth:** Multiple layers of encryption, authentication, authorization, and network segmentation protect data throughout its lifecycle. 
-* **Human / operational isolation:** Union.ai personnel access the customer's control plane UI only through authenticated, RBAC-controlled channels. Personnel do not have IAM credentials for customer cloud accounts and cannot directly access customer data stores, secrets, or compute infrastructure. In BYOC deployments, Union.ai additionally has [K8s cluster management access](./byoc-differences#human-access-to-customer-environments). +This section provides a comprehensive overview of Union.ai's security architecture, practices, and compliance posture for enterprise security professionals evaluating the platform. +Beyond describing the security model, it provides concrete verification steps so that reviewers can independently confirm each claim against a running system. + +## Overview + +**[Architecture](./architecture/_index)** +The system is divided into a control plane hosted by Union.ai and a data plane hosted on the customer's infrastructure. +The only connections between the two planes are outbound-only routes from the customer data plane to the control plane. +Consequently, no inbound firewall rules are required on the customer's network. + +**[Data protection](./data-protection/_index)** +Bulk customer data items (files, DataFrames, code bundles, container images) are stored in the customer's data plane and never enter the control plane. +Smaller inline data items (structured task inputs/outputs, secret values during creation, log streams) pass through control plane memory only transiently; they are not persisted there. +The control plane does persist orchestration and task metadata, but that metadata is always encrypted at rest. + +**[Identity and access](./identity-and-access/_index)** +Authentication is performed via OIDC/SSO, API keys, and service accounts. +Role-based access control enforces least privilege. +Union.ai personnel cannot access customer data or secrets.
+ +**[Threat model](./threat-model)** +Potential threats and their mitigations are analyzed: control plane compromise, tunnel interception, and presigned URL leakage scenarios are examined, along with the architectural design and security controls that mitigate each risk. +The goal is to demonstrate that even in worst-case scenarios, customer data remains protected. + +**[Compliance and governance](./compliance/_index)** +Union.ai is SOC 2 Type II certified for Security, Availability, and Processing Integrity, with practices aligned to ISO 27001 and CIS benchmarks. +The platform is designed to meet HIPAA requirements. +Details are available in the Public Trust Center at [trust.union.ai](https://trust.union.ai). +The section also covers organizational security practices, vulnerability management, and the shared responsibility model. ## Deployment models -Union.ai offers two deployment models, both sharing the same control plane / data plane architecture and security controls described in this document. +Union.ai offers two deployment models, both sharing the same control plane / data plane architecture and security controls described in this section. -In **Self-Managed** deployments, the customer operates their data plane independently; Union.ai has zero access to the customer's infrastructure, with the Cloudflare tunnel as the only connection. +In **BYOC** deployments, Union.ai manages the data plane in the customer's cloud account via private connectivity (PrivateLink/PSC). +Union.ai handles upgrades, monitoring, and provisioning, while maintaining strict separation from customer data, secrets, and logs.
+In **Self-managed** deployments, the customer operates their data plane independently. +The customer is responsible for all aspects of data plane management, including upgrades, monitoring, and provisioning. +Union.ai has no access to the customer's infrastructure; the Cloudflare Tunnel and gRPC connections are the only pathways between Union.ai and the customer's network +(and even then, only outbound from the customer to Union.ai). -The core security architecture—encryption, RBAC, tenant isolation, presigned URL data access, and audit logging—is identical across both models. Sections where operational responsibilities differ are noted inline. [BYOC deployment differences](./byoc-differences) provides a detailed comparison. +For details, see [Deployment models](./architecture/deployment-models). diff --git a/content/security/architecture/_index.md b/content/security/architecture/_index.md new file mode 100644 index 000000000..05405f998 --- /dev/null +++ b/content/security/architecture/_index.md @@ -0,0 +1,26 @@ +--- +title: Architecture +weight: 1 +variants: -flyte +union +sidebar_expanded: true +--- + +# Architecture + +Union.ai's security architecture rests on a foundational division between the Union.ai-hosted control plane, which orchestrates execution, and the customer-hosted data plane, where all computation occurs and all customer data resides. The two planes are connected by outbound-only routes that require no inbound firewall rules on the customer side. + +In the BYOC model, Union.ai manages the data plane over a private connection. In the self-managed model, the customer manages the data plane themselves. In both cases, the same security controls apply, and the same [data residency guarantees](../data-protection/classification-and-residency) hold.
+ +This section covers: + +* **[Two-plane separation](./two-plane-separation)**: The division between the Union.ai-hosted control plane and the customer-hosted data plane is the foundation of the security architecture. + +* **[Control plane](./control-plane)**: The control plane is the Union.ai-hosted orchestration component. It stores only orchestration and task metadata, which is encrypted at rest. Bulk data is referenced via signed URIs only; the data itself never touches the control plane. + +* **[Data plane](./data-plane)**: The data plane runs entirely within the customer's cloud account. All computation occurs here and all customer data resides here. It uses workload identity federation (IRSA / Workload Identity / Azure Workload Identity) instead of static credentials, so no long-lived access keys are stored on the data plane. + +* **[Network architecture](./network)**: The data plane initiates all connections to the control plane via two outbound-only routes. There is no inbound attack surface on the customer's network, and therefore no inbound firewall rules are required. + +* **[Private connectivity (BYOC)](./private-connectivity)**: In the BYOC model, Union.ai manages the customer's Kubernetes cluster via PrivateLink, Private Service Connect, or Azure Private Link. The Kubernetes API is never exposed to the public internet. + +* **[Deployment models](./deployment-models)**: Self-managed and BYOC share the same two-plane architecture and security controls, differing only in who operates the data plane's Kubernetes cluster.
diff --git a/content/security/architecture/control-plane.md b/content/security/architecture/control-plane.md new file mode 100644 index 000000000..fb13ccec7 --- /dev/null +++ b/content/security/architecture/control-plane.md @@ -0,0 +1,39 @@ +--- +title: Control plane +weight: 2 +variants: -flyte +union +--- + +# Control plane + +The control plane is the Union.ai-hosted component that orchestrates task execution, manages user access, and provides the web interface. It runs on AWS infrastructure managed by Union.ai and is covered by Union.ai's SOC 2 Type II certification. + +## What it does and does not store + +The control plane stores the information required for orchestration: + +- **Orchestration metadata**: Identifiers, action state (phase, timestamps, cluster assignment), user profiles, and scheduling configuration. +- **Task and run definitions**: Each run submission includes a full TaskSpec (container image, typed interface, resource requirements, security context) and a RunSpec (environment variables, labels, annotations). Trigger specs carry default input values for scheduled runs. +- **Error and event information**: Error messages from task executions (which may contain customer data from Python tracebacks), Kubernetes event messages, and per-attempt plugin state. + +The control plane does not store: + +- **Bulk customer data payloads**: When it references such data, it stores only URIs pointing to objects in the customer's object store (for example, `s3://customer-bucket/org/project/domain/run/action/output.pb`). + +For the full classification of what is and isn't stored in the control plane, the sensitive fields that may appear in task definitions, and how inline data (structured I/O, secret values during creation, log streams) transits control plane memory without being persisted, see [Data classification and residency](../data-protection/classification-and-residency).
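
To make the storage boundary concrete, a persisted run record could look like the following sketch (all field names here are hypothetical, shown for illustration only): bulk outputs appear solely as a URI into the customer's object store, never as the data itself.

```json
{
  "run_id": "org/project/domain/run-abc123",
  "phase": "SUCCEEDED",
  "cluster": "customer-cluster-1",
  "started_at": "2024-01-01T00:00:00Z",
  "outputs_uri": "s3://customer-bucket/org/project/domain/run-abc123/a0/output.pb"
}
```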
+ +## Infrastructure + +The control plane runs on AWS with multi-AZ redundancy to ensure high availability. It uses managed cloud database services for orchestration metadata, task/run definitions, execution events, and error messages. All backends are encrypted at rest and isolated within a VPC with restricted security groups that permit access only from control plane application services. See [Encryption](../data-protection/encryption) for at-rest encryption details by data type. + +TLS terminates at the edge, and all internal communication occurs over encrypted channels. Automated backups run on a defined schedule with point-in-time recovery capability. Union.ai maintains disaster recovery procedures and applies security patches on a regular cadence. The SOC 2 Type II report covers the availability, security, and operational controls of this infrastructure. + +## Capabilities + +The control plane exposes the following capabilities: + +- **API and UI gateway** -- an authenticated HTTPS API and web console for users, the SDK, and the CLI. All requests are subject to authentication and RBAC enforcement before any orchestration logic runs. +- **Scheduling and execution tracking** -- schedules TaskActions across registered data plane clusters and records execution state (phase transitions, timestamps, errors) reported back from the data plane. +- **Cluster registry** -- maintains the inventory of registered data plane clusters and their health, and routes orchestration traffic accordingly. +- **Data gateway** -- proxies structured task inputs and outputs between clients and the data plane object store, streams execution logs from the data plane to clients, and brokers presigned URL signing requests for bulk data access. See [Data flow](../data-protection/data-flow) for what these pathways carry and how data is handled in transit. 
+ diff --git a/content/security/architecture/data-plane.md b/content/security/architecture/data-plane.md new file mode 100644 index 000000000..83b0f805d --- /dev/null +++ b/content/security/architecture/data-plane.md @@ -0,0 +1,192 @@ +--- +title: Data plane +weight: 3 +variants: -flyte +union +mermaid: true +--- + +# Data plane + +The data plane runs entirely within the customer's cloud account on a Kubernetes cluster. It is where all computation occurs and where customer data is stored at rest (see [Data classification and residency](../data-protection/classification-and-residency)). In the self-managed model, the customer operates the data plane independently. In the BYOC model, Union.ai manages the Kubernetes cluster on the customer's behalf, but it still runs in the customer's cloud account. See [Deployment models](./deployment-models) for the differences. + +## Components + +The data plane consists of several components, each handling a specific aspect of task execution and data management. + +**Executor** is a Kubernetes controller that watches for TaskAction custom resources created by the control plane. When a TaskAction appears, the Executor reconciles its lifecycle: creating task pods, monitoring their status, and reporting state transitions back to the control plane. If connectivity to the control plane is lost, in-flight pods continue running and state reconciles when the connection is restored. + +**Object Store Service** handles data access operations on the customer's object store. It signs presigned URLs for bulk data (files, directories, DataFrames, code bundles, and reports) and serves object read/write operations used by the control plane for structured task I/O. + +**Log Provider** serves task logs through two channels. For running tasks, it streams live logs from the Kubernetes API. For completed tasks, it retrieves logs from the cloud provider's log aggregator (CloudWatch, Cloud Logging, or Azure Monitor). 
There is no content filtering or redaction; any sensitive data (secrets, PII, stack traces) that applications write to stdout/stderr is included in the stream unmodified. + +**Image Builder** uses Buildkit running on the customer's Kubernetes cluster to build container images from user-submitted `Image` specifications. Source code and built images never leave the customer's infrastructure. Base images are pulled from customer-configured registries, and built images are pushed to the customer's container registry (ECR, GCR, or ACR). + +**Tunnel Service** maintains the outbound-only encrypted Cloudflare Tunnel from the data plane to the control plane. This service initiates the tunnel (no inbound ports required), performs health checks and heartbeats, and automatically reconnects if the connection drops. + +In addition to the tunnel, the data plane operator establishes a separate outbound gRPC connection (TLS) to the regional control plane endpoint for orchestration RPCs (cluster registration, action lifecycle, event reporting, catalog and artifact lookups, admin RPCs). Both channels are outbound-initiated; see [Network architecture](./network) for what each carries. + +**Apps & Serving** provides model and application serving capabilities using Knative with a Kourier gateway. All serving infrastructure runs within the customer's cluster. Authentication is enforced on all endpoints by default (SSO for browser access, API keys for programmatic access), with an option to allow anonymous access on specific endpoints. See [Apps & Serving security](#apps--serving-security) below for details. + +For how each of these pathways handles data in transit, see [Data flow](../data-protection/data-flow). + +## Object store layout + +Each data plane cluster uses two object store buckets: a **metadata bucket** for execution metadata and a **fast-registration bucket** for rapid code deployment artifacts. 
Within these buckets, objects are organized by namespace: `org/project/domain/run-name/action-name/`. This layout provides isolation: IAM policies and bucket policies can scope access to specific organizational boundaries. + +## Kubernetes security + +The data plane enforces several layers of Kubernetes security to protect workloads and limit blast radius. + +**Workload identity federation** eliminates the need for static cloud credentials on the data plane. See [IAM and workload identity](#iam-and-workload-identity) below for details. + +**Kubernetes RBAC** restricts what each service account can do within the cluster. Platform components have scoped permissions for their specific functions, and task pods run under service accounts with minimal privileges. + +**Network policies** control pod-to-pod communication within the cluster, limiting lateral movement in the event of a container compromise. + +**Resource quotas and limit ranges** prevent any single workload from consuming all cluster resources, providing both stability and a degree of isolation between tenants and projects. + +**Pod security contexts** enforce non-root execution for platform components, reducing the impact of container escape vulnerabilities. + +## Container security + +When a user defines an `Image` specification, source code is uploaded to the customer's object store via presigned URL and fetched by the builder; it never transits through the control plane. + +Base images are pulled from registries configured by the customer, allowing the use of hardened or pre-approved base images. Customers can apply their own image tagging conventions, vulnerability scanning policies, and registry access controls. + +Task pods mount code bundles via presigned URLs with limited time-to-live (TTL). These URLs expire after a short window, limiting the exposure if a URL is intercepted. 
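
Reviewers can check the TTL claim directly: SigV4-style presigned URLs (the scheme S3 uses) carry their validity window in the query string via the `X-Amz-Date` and `X-Amz-Expires` parameters, so the expiry of a captured URL can be computed offline. A minimal sketch in Python (the URL shown is a fabricated example):

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse, parse_qs

def presigned_url_expired(url: str, now: datetime) -> bool:
    """Return True if a SigV4 presigned URL is past its validity window."""
    qs = parse_qs(urlparse(url).query)
    # X-Amz-Date is the signing time; X-Amz-Expires is the TTL in seconds.
    issued = datetime.strptime(qs["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
    ttl = timedelta(seconds=int(qs["X-Amz-Expires"][0]))
    return now > issued + ttl

url = ("https://customer-bucket.s3.amazonaws.com/org/project/code.tgz"
       "?X-Amz-Date=20240101T000000Z&X-Amz-Expires=600&X-Amz-Signature=abc")

print(presigned_url_expired(url, datetime(2024, 1, 1, 0, 5, tzinfo=timezone.utc)))   # False: inside the 10-minute window
print(presigned_url_expired(url, datetime(2024, 1, 1, 0, 11, tzinfo=timezone.utc)))  # True: past expiry
```

Note that expiry only bounds the exposure window; the actual enforcement happens server-side when the object store validates the signature.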
+ +## IAM and workload identity + +The data plane uses two IAM roles to separate platform-level and user-level access: + +**adminflyterole** is used by platform services (Executor, Object Store Service, Log Provider). It has read/write access to the object store buckets, access to the secrets manager for retrieving user-defined secrets, and read access to persisted logs. This role is bound to platform service accounts via workload identity federation. + +**userflyterole** is used by task pods (the containers running user code). It has read/write access to the object store buckets for reading inputs and writing outputs. It does not have access to the secrets manager or platform-level resources. + +Both roles use cloud-native workload identity federation: IRSA (IAM Roles for Service Accounts) on AWS, Workload Identity on GCP, and Azure Workload Identity on Azure. No static credentials are created, stored, or rotated. The Kubernetes service account annotations bind each pod to the appropriate IAM role automatically. + +## Apps & Serving security + +App and serving traffic flows entirely within the customer's infrastructure. No application code, data, or serving requests pass through the control plane. + +Inbound traffic reaches the serving endpoints through Cloudflare, which provides DDoS protection, before routing to the Kourier ingress gateway running in the customer's cluster. Authentication is enforced by default on all endpoints: browser-based access uses SSO, and programmatic access uses API keys. Individual endpoints can be configured for anonymous access when required (for example, public-facing model endpoints). + +RBAC controls govern which users and service accounts can deploy applications and access specific endpoints, scoped per project. All serving infrastructure (Knative, Kourier, and the Union Operator) runs within the customer's Kubernetes cluster. 
In the BYOC model, Union.ai manages the lifecycle of this serving infrastructure (upgrades, scaling, configuration), but the infrastructure itself resides in the customer's account. + +## Verification + +### Components + +**Reviewer focus:** Confirm that the described components are running in the customer's cluster and match the documented architecture. + +**How to verify:** + +1. List data plane pods and deployments: + + ```bash + kubectl get pods -n union + kubectl get deployments -n union -o wide + ``` + + Confirm that the Executor, Object Store Service, Tunnel Service, and other components are present. + +2. Inspect a specific component: + + ```bash + kubectl describe pod <pod-name> -n union + ``` + + Verify the container image, service account, and resource configuration match expectations. + +### Kubernetes security + +**Reviewer focus:** Confirm that Kubernetes RBAC, network policies, resource quotas, and pod security contexts are in place and correctly scoped. + +**How to verify:** + +1. Review cluster role bindings for Union components: + + ```bash + kubectl get clusterrolebindings | grep union + ``` + +2. Check network policies across namespaces: + + ```bash + kubectl get networkpolicies -A + ``` + +3. Verify resource quotas: + + ```bash + kubectl get resourcequotas -A + ``` + +4. Inspect pod security contexts: + + ```bash + kubectl get pods -n <namespace> -o jsonpath='{.items[0].spec.securityContext}' + ``` + + Confirm `runAsNonRoot: true` or equivalent non-root settings on platform pods. + +### Container security + +**Reviewer focus:** Confirm that image builds execute entirely within the customer's infrastructure and that built images never leave the customer's registry. + +**How to verify:** + +1. Trigger an image build by submitting a workflow with an `Image` specification. + +2.
Observe the build pod: + + ```bash + kubectl get pods -n union | grep build + kubectl logs <build-pod-name> -n union + ``` + + Confirm that the build pulls base images from the customer's configured registry and pushes the result to the customer's container registry. + +3. Verify the image in the customer's registry: + + ```bash + aws ecr describe-images --repository-name <repository-name> --image-ids imageTag=<image-tag> + ``` + + (Or the equivalent `gcloud` / `az` command for GCP/Azure.) + +### IAM and workload identity + +**Reviewer focus:** Confirm that the two IAM roles exist with the documented permissions, that workload identity federation is in use, and that no static credentials are present. + +**How to verify:** + +1. Inspect the IAM roles and their policies: + + ```bash + aws iam get-role --role-name adminflyterole + aws iam list-role-policies --role-name adminflyterole + aws iam list-attached-role-policies --role-name adminflyterole + + aws iam get-role --role-name userflyterole + aws iam list-role-policies --role-name userflyterole + aws iam list-attached-role-policies --role-name userflyterole + ``` + + Confirm that `adminflyterole` has object store, secrets manager, and log access. Confirm that `userflyterole` has only object store access. + +2. Verify workload identity annotations on service accounts: + + ```bash + kubectl get sa -n union -o yaml | grep role-arn + ``` + + Each service account should have an annotation binding it to the appropriate IAM role via IRSA (or the equivalent for GCP/Azure). + +3. Confirm no static credentials exist: + + ```bash + kubectl get secrets -n union -o name | grep -i aws + ``` + + There should be no secrets containing static AWS access keys. Workload identity federation eliminates the need for them.
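
For reference, an IRSA-bound service account manifest on AWS takes the following shape (the service account name and account ID below are illustrative; the annotation key `eks.amazonaws.com/role-arn` is the standard IRSA binding):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: object-store-service   # illustrative component name
  namespace: union
  annotations:
    # Binds pods using this service account to the platform IAM role
    # via workload identity federation; no static keys are involved.
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/adminflyterole
```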
diff --git a/content/security/architecture/deployment-models.md b/content/security/architecture/deployment-models.md new file mode 100644 index 000000000..c7fe30f89 --- /dev/null +++ b/content/security/architecture/deployment-models.md @@ -0,0 +1,95 @@ +--- +title: Deployment models +weight: 6 +variants: -flyte +union +--- + +# Deployment models + +Union.ai supports two deployment models: **BYOC** (Bring Your Own Cloud) and **Self-managed**. Both models share the same fundamental [two-plane separation](./two-plane-separation): the control plane is hosted by Union.ai, and the data plane runs in the customer's cloud account. They differ in who operates the data plane's Kubernetes cluster. + +## Common properties + +Regardless of deployment model, both BYOC and Self-managed share the same core security properties: + +- The same control plane / data plane architecture described in [Two-plane separation](./two-plane-separation) +- Encryption in transit (TLS 1.2+) and cloud-provider native encryption at rest (see [Encryption](../data-protection/encryption)) +- RBAC for user and service account authorization +- Tenant isolation via Kubernetes namespaces and IAM scoping +- Audit logging of administrative and user actions +- Outbound-only [network connectivity](./network) (Cloudflare Tunnel and direct gRPC) + +The key difference is operational: in BYOC, Union.ai manages the Kubernetes cluster within the customer's cloud account. In Self-managed, the customer operates the cluster entirely on their own. + +## BYOC + +In the BYOC model, Union.ai manages the Kubernetes cluster within the customer's cloud account. The cluster runs in the customer's VPC, uses the customer's IAM roles, and stores data in the customer's object store, but Union.ai handles the operational burden of running the cluster. + +The Kubernetes API endpoint is private-only, accessible through [PrivateLink, Private Service Connect, or Azure Private Link](./private-connectivity). 
Union.ai accesses the cluster exclusively through this private connection for management operations. + +Union.ai manages: + +- Kubernetes cluster provisioning and lifecycle +- Kubernetes version upgrades +- Node pool configuration and scaling +- Helm chart deployments and updates for Union.ai components +- The monitoring stack (Prometheus, Grafana, Fluent Bit) +- Serving infrastructure (Kourier, Knative, Union Operator) +- Data plane component patching and updates + +The customer retains ownership and control of: + +- The cloud account and its IAM policies +- VPC configuration and network architecture +- Object storage buckets and their access policies +- Any additional infrastructure outside the managed cluster + +Union.ai is responsible for the availability and security of the managed Kubernetes cluster. The customer is responsible for the availability and security of the surrounding cloud account infrastructure (VPC, IAM, object storage). Union.ai assumes the cluster-level third-party dependency risk: if a Kubernetes vulnerability requires patching, Union.ai handles it. + +## Self-managed + +In the Self-managed model, the customer operates the data plane independently. Union.ai has zero access to the data plane infrastructure. The only connections between the control plane and the data plane are two outbound-only channels initiated by the data plane: a Cloudflare Tunnel and a direct gRPC connection. See [Network architecture](./network) for details. + +The customer provisions all IAM roles, configures network policies, manages Kubernetes versions and upgrades, and handles all patching of data plane components. The customer is solely responsible for data plane availability, security hardening, and compliance of the data plane infrastructure. + +This model provides maximum isolation and control. 
It is appropriate for organizations that have the Kubernetes operational expertise to manage the cluster and prefer to eliminate any third-party access to their data plane infrastructure. + +## Availability and resilience + +The control plane runs on AWS with multi-AZ redundancy and automated failover. Availability is covered by Union.ai's SOC 2 Type II certification, and specific SLA commitments are defined in customer contracts. + +A critical resilience property of the architecture is that **in-flight workflows continue running during control plane outages**. The Executor is a Kubernetes controller: once a task pod is created, it runs independently of the control plane. If either outbound channel drops or the control plane becomes unavailable, running task pods are unaffected. When connectivity is restored, the Executor reconciles state with the control plane, and the execution history is updated. New workflow submissions require control plane availability, but existing work is not interrupted. + +For data plane availability, the responsibility depends on the deployment model. In the self-managed model, the customer is solely responsible for data plane availability. In the BYOC model, Union.ai is responsible for the availability of the managed Kubernetes cluster, while the customer remains responsible for the underlying cloud account resources. + +## Verification + +### Availability and resilience + +**Reviewer focus:** Confirm that in-flight workflows survive control plane connectivity loss and that state reconciles upon reconnection. + +**How to verify:** + +1. Start a long-running workflow (e.g., a task with a `sleep` of several minutes). + +2. Simulate a connectivity disruption to the control plane by scaling down the Tunnel Service: + + ```bash + kubectl scale deployment <tunnel-deployment> -n union --replicas=0 + ``` + +3. Verify that task pods continue running: + + ```bash + kubectl get pods -n <task-namespace> + ``` + + The task pods should remain in `Running` state and continue their work.
+ +4. Restore connectivity: + + ```bash + kubectl scale deployment <tunnel-deployment> -n union --replicas=1 + ``` + +5. Check the Union.ai UI or query the API to confirm that the execution state reconciled correctly. The execution should show as completed (or progressed) with accurate timestamps, not as failed or lost. diff --git a/content/security/architecture/network.md b/content/security/architecture/network.md new file mode 100644 index 000000000..c292f043a --- /dev/null +++ b/content/security/architecture/network.md @@ -0,0 +1,142 @@ +--- +title: Network architecture +weight: 4 +variants: -flyte +union +--- + +# Network architecture + +The network architecture reinforces the [two-plane separation](./two-plane-separation) with an outbound-only connectivity model. The data plane initiates all connections to the control plane over two distinct outbound channels: a Cloudflare Tunnel and a direct gRPC connection. There are no inbound firewall rules, no VPN tunnels, and no listening services on the customer's network that Union.ai can reach. + +## Outbound-only model + +All network connections between the data plane and the control plane are initiated by the data plane. The customer's network requires only standard outbound HTTPS access to Cloudflare edge nodes. No inbound firewall rules, port forwarding, or VPN configuration are needed. + +This model eliminates the inbound attack surface entirely. There are no listening services on the customer's network for an attacker to discover through port scanning or exploit through service vulnerabilities. The customer's network perimeter remains unaffected by the Union.ai integration. Firewall management is simplified to a single rule: permit outbound HTTPS, which most enterprise networks already allow. + +The trust model is customer-initiated: the data plane decides when and whether to connect, and the customer can sever either channel at any time by blocking outbound traffic or shutting down the data plane operator.
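
Where egress is also enforced inside the cluster, this outbound-only posture can be expressed as a Kubernetes NetworkPolicy. The sketch below is illustrative only: the namespace, pod selector, and any cluster-internal exceptions (object store endpoints, metrics, and so on) must be adapted to the actual deployment.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-outbound-https-only   # illustrative name
  namespace: union
spec:
  podSelector: {}        # applies to all pods in the namespace
  policyTypes:
    - Egress             # deny all egress except the rules below
  egress:
    # DNS resolution, required before any outbound HTTPS connection
    - ports:
        - protocol: UDP
          port: 53
    # Outbound HTTPS toward the Cloudflare edge / control plane endpoints
    - ports:
        - protocol: TCP
          port: 443
```

To narrow the HTTPS rule further, a `to.ipBlock` clause can restrict destinations to Cloudflare's published CIDR ranges.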
+ +## Cloudflare Tunnel + +The Cloudflare Tunnel is an outbound-only encrypted connection from the customer's cluster to the Cloudflare edge network, which then routes to the Union.ai control plane. It is initiated by a `cloudflared` sidecar in the data plane and lets the control plane reach data plane services without any inbound firewall rules. + +All traffic through the tunnel is encrypted using a layered transport: TLS with mutual authentication (X.509 client certificates), Cloudflare Access service tokens for application-layer authentication, and Cloudflare Tunnel encryption for the network path. Tunnel tokens are rotated automatically: the data plane operator periodically polls the control plane and picks up updated tokens when issued. + +The Tunnel Service in the data plane maintains this connection with health checks and heartbeats, and automatically reconnects if the connection drops. State reconciliation occurs upon reconnection, so no data is lost during brief connectivity interruptions. + +Once the tunnel is established (outbound from the data plane), it carries bidirectional traffic over the open session. 
All traffic is encrypted in transit: + +- **Structured task inputs and outputs**: protobuf payloads proxied between clients and the data plane object store on run submission and result retrieval +- **Log streams**: execution log content streamed from the data plane through the control plane to clients +- **Secret values**: secret values during create/update operations, relayed to the data plane secrets backend +- **Presigned URL signing requests**: metadata-only requests brokered to generate time-limited data access URLs +- **Apps & Serving ingress**: end-user requests routed to model and application endpoints in the customer's cluster +- **Health checks**: bidirectional health and liveness signals + +Bulk customer data (files, directories, DataFrames, code bundles, and reports) does not traverse the tunnel; it transfers directly between clients and the customer's object store via presigned URLs. Container images also bypass the tunnel: they are pulled by Kubernetes from the customer's container registry over standard HTTPS. For payload size limits, in-memory handling, and how each pathway is encrypted at every hop, see [Data flow](../data-protection/data-flow). + +## Direct gRPC connection + +In addition to the Cloudflare Tunnel, the data plane maintains a separate outbound gRPC connection over TLS to the regional control plane endpoint. The data plane operator establishes and multiplexes orchestration RPCs over this connection. Like the tunnel, it is outbound-initiated by the data plane and requires no inbound firewall rules. 
+ +This channel carries: + +- **Cluster registration**: the data plane registers itself with the control plane on startup and keeps the registration current +- **Action lifecycle**: TaskAction polling, scheduling decisions, and reconciliation +- **Event reporting**: execution events, phase transitions, and status updates from the data plane to the control plane +- **Catalog and artifact lookups**: artifact registry, run metadata, and task definition reads +- **Admin RPCs**: project, domain, and identity queries + +The connection terminates at the Cloudflare edge for the regional `*.unionai.cloud` / `*.union.ai` hostname, which then routes to the hosted control plane. All traffic is encrypted with TLS 1.2+. + +## Regional endpoints + +Union.ai provides control plane endpoints in multiple regions. Customers select the region closest to their data plane deployment to minimize latency. Region selection also has data residency implications -- see [Data classification and residency](../data-protection/classification-and-residency#data-residency). + +| Region | Location | +|---|---| +| US East | us-east-2 | +| US West | us-west-2 | +| Europe West 1 | eu-west-1 | +| Europe West 2 | eu-west-2 | +| Europe Central | eu-central-1 | + +Each region has its own dedicated control plane endpoint hostname. + +## Egress configuration + +For customers with strict egress controls, outbound traffic can be limited to Cloudflare's published CIDR blocks. These blocks can be further restricted to specific Cloudflare regions to minimize the allowed egress surface. Cloudflare publishes its IP ranges at [cloudflare.com/ips](https://www.cloudflare.com/ips/). + +## Communication paths + +All communication paths in the system use encryption. No unencrypted communication paths exist. 
+ +| Path | Protocol | Encryption | +|---|---|---| +| Client to Control Plane | HTTPS | TLS 1.2+ | +| Data Plane ↔ Control Plane (outbound-initiated by data plane) | Cloudflare Tunnel | mTLS | +| Data Plane → Control Plane (outbound-initiated by data plane) | gRPC over TLS | TLS 1.2+ | +| Client to Object Store | HTTPS (presigned URL) | TLS 1.2+ (cloud provider enforced) | +| Fluent Bit to Log Aggregator | Cloud provider SDK | TLS (cloud-native) | +| Task Pods to Object Store | Cloud provider SDK | TLS (cloud-native) | +| Union.ai to Customer Kubernetes API (BYOC only) | PrivateLink / PSC | TLS (private connectivity) | + +For details on the BYOC private management connection, see [Private connectivity (BYOC)](./private-connectivity). + +## Verification + +### Outbound-only model + +**Reviewer focus:** Confirm that no inbound firewall rules or listening services exist on the customer's network for Union.ai traffic, and that all connections are outbound-initiated. + +**How to verify:** + +1. Review the security group or firewall rules attached to the data plane cluster's nodes. Confirm that no inbound rules reference Union.ai IP ranges or allow inbound traffic from external sources for orchestration purposes: + + ```bash + # AWS example + aws ec2 describe-security-groups --group-ids <security-group-id> \ + --query 'SecurityGroups[].IpPermissions' + ``` + +2. Confirm that no Kubernetes services in the `union` namespace expose inbound LoadBalancer or NodePort services to the control plane: + + ```bash + kubectl get svc -n union + ``` + + Services should be `ClusterIP` type or, if `LoadBalancer`, should serve only Apps & Serving endpoints, not control plane connectivity. + +3. Review VPC Flow Logs to confirm that connections to Cloudflare are outbound-initiated. All flows to Cloudflare IP ranges should show the data plane node as the source. + +4. (Optional) Run a port scan from an external host against the data plane nodes to confirm no Union.ai-related services are reachable.
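The flow-log review in step 3 can be partially scripted. The sketch below (not an official tool) flags ACCEPTed flows where an external source reached an in-cluster address; the node CIDR `10.0.0.0/16`, the sample records, and the field positions (AWS VPC Flow Log v2 default format: `srcaddr` is field 4, `dstaddr` field 5, `action` field 13) are all assumptions to adapt to your environment:

```shell
# Illustrative only: the node CIDR and the sample records are hypothetical;
# substitute a real flow log export.
cat > /tmp/sample-flow-log.txt <<'EOF'
2 123456789012 eni-0abc 10.0.1.15 198.41.192.77 44321 443 6 10 840 1690000000 1690000060 ACCEPT OK
2 123456789012 eni-0abc 203.0.113.9 10.0.1.15 51515 8443 6 3 180 1690000000 1690000060 ACCEPT OK
EOF

# Print ACCEPTed flows where an external source reached an in-cluster address.
# For an outbound-only deployment this should return nothing against real logs.
awk '$4 !~ /^10\./ && $5 ~ /^10\./ && $13 == "ACCEPT"' /tmp/sample-flow-log.txt
```

Against real logs from an outbound-only deployment the filter should print nothing; the sample's second record is a deliberately planted inbound flow, so the sketch prints exactly that line.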
+ +### Cloudflare Tunnel and direct gRPC + +**Reviewer focus:** Confirm that bulk data (files, DataFrames, code bundles) bypasses the tunnel via presigned URLs, and that structured task I/O and log streams transit the tunnel as documented above (encrypted in transit, not persisted). Confirm that the direct gRPC connection from the data plane operator to the regional control plane endpoint is outbound-initiated. + +**How to verify:** + +1. Inspect tunnel pod logs: + + ```bash + kubectl logs -n union <tunnel-pod-name> + ``` + + Logs should show health checks, connection establishment, and message exchanges. + +2. Analyze VPC Flow Logs for traffic patterns. Bulk data transfers (files, DataFrames, code bundles) should flow directly between task pods and the customer's object store endpoints (S3/GCS/Azure Blob), not through Cloudflare IPs. Structured task I/O and log streams will flow through the tunnel as documented. + +3. Use browser developer tools (Network tab) in the Union.ai UI to confirm that binary output artifacts are fetched via presigned URLs (resolving to the customer's storage domain), while structured outputs are fetched via the control plane API. + +### Egress configuration + +**Reviewer focus:** Confirm that egress can be restricted to Cloudflare CIDR blocks without breaking functionality. + +**How to verify:** + +1. Apply egress rules that allow outbound traffic only to Cloudflare CIDR blocks (and the customer's cloud provider endpoints for object store and logging). + +2. Verify that the tunnel connection establishes successfully and workflows execute normally. + +3. Attempt to reach an endpoint outside the allowed egress list from a task pod to confirm the restriction is effective.
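For step 1, the allow-list itself can be assembled and sanity-checked before any firewall change. A minimal sketch -- the inline ranges are a small sample of Cloudflare's published list, and the commented-out `curl` shows where the live list (cloudflare.com/ips-v4) would be fetched:

```shell
# Fetch the live list in practice; commented out so the sketch runs offline.
# curl -sf https://www.cloudflare.com/ips-v4 > /tmp/egress-allowlist.txt
cat > /tmp/egress-allowlist.txt <<'EOF'
173.245.48.0/20
103.21.244.0/22
198.41.128.0/17
EOF

# Coarse IPv4-CIDR format check (octet ranges not validated) to catch
# malformed entries before they reach firewall tooling.
if grep -Evq '^([0-9]{1,3}\.){3}[0-9]{1,3}/[0-9]{1,2}$' /tmp/egress-allowlist.txt; then
  echo "malformed entry found" >&2
else
  echo "allow-list format OK"
fi
```

Remember to append the cloud provider endpoints (object store, logging) that the data plane also needs before applying the rules.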
diff --git a/content/security/architecture/private-connectivity.md b/content/security/architecture/private-connectivity.md new file mode 100644 index 000000000..fe8644aa5 --- /dev/null +++ b/content/security/architecture/private-connectivity.md @@ -0,0 +1,50 @@ +--- +title: Private connectivity (BYOC) +weight: 5 +variants: -flyte +union +--- + +# Private connectivity (BYOC) + +In the BYOC deployment model, Union.ai maintains a private management connection to the customer's Kubernetes cluster. This connection uses the cloud provider's native private connectivity service: AWS PrivateLink, GCP Private Service Connect, or Azure Private Link, depending on the customer's cloud platform. + +This private connection is used exclusively for cluster management operations: Kubernetes version upgrades, node pool provisioning and scaling, Helm chart deployments and updates, and health monitoring. It provides Union.ai with the access needed to manage the Kubernetes cluster without exposing the Kubernetes API to the public internet. + +The private management connection does **not** carry customer data or orchestration traffic. Customer data and orchestration RPCs flow through the outbound channels described in [Network architecture](./network). The private connectivity path handles only infrastructure management operations. + +By keeping the Kubernetes API endpoint private, this design aligns with several compliance controls, including ISO 27001 A.5.15 (Access control), A.8.20 (Networks security), A.8.22 (Segregation of networks), and CIS Controls v8 Control 12 (Network infrastructure management). The Kubernetes API is never reachable from the public internet. + +For details on the self-managed alternative (where no private management connection exists because the customer operates the data plane independently), see [Deployment models](./deployment-models). 
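The "never reachable from the public internet" property can be spot-checked by resolving the API server hostname and confirming every returned address is private. A sketch using standard tools -- `API_HOST` is a stand-in (set it to the server address that `kubectl cluster-info` reports; `localhost` just keeps the sketch runnable offline), and the prefix match is a coarse textual check, not full CIDR arithmetic:

```shell
API_HOST="localhost"   # stand-in; use your real Kubernetes API hostname

# Resolve every address and strip those in private/loopback ranges
# (10/8, 172.16/12, 192.168/16, 127/8, ::1). Whatever survives is public.
PUBLIC=$(getent ahosts "$API_HOST" | awk '{print $1}' | sort -u | \
  grep -Ev '^(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.|127\.|::1)' || true)

if [ -z "$PUBLIC" ]; then
  echo "all resolved addresses are private"
else
  echo "public addresses found: $PUBLIC" >&2
fi
```

A passing check here complements, but does not replace, the unreachability test from an external host described below in the verification steps.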
+ +## Verification + +### Private management connection + +**Reviewer focus:** Confirm that the Kubernetes API endpoint is accessible only via the private connectivity service and is not exposed to the public internet. + +**How to verify:** + +1. In the AWS console (or equivalent for GCP/Azure), navigate to VPC > Endpoints and confirm that a PrivateLink (or Private Service Connect / Azure Private Link) endpoint exists connecting to the customer's Kubernetes API: + + ```bash + # AWS example + aws ec2 describe-vpc-endpoints --filters Name=service-name,Values=*eks* + ``` + + For GCP, use `gcloud compute service-attachments list` and `gcloud compute forwarding-rules list`. For Azure, use `az network private-endpoint list` and `az network private-link-service list`. + +2. Verify that the Kubernetes API resolves to a private IP or hostname: + + ```bash + kubectl cluster-info + ``` + + The server address should be a private IP (e.g., `10.x.x.x`) or a private DNS name, not a public endpoint. + +3. Attempt to reach the Kubernetes API from outside the VPC to confirm it is unreachable: + + ```bash + curl -k https://<kubernetes-api-endpoint>/healthz + ``` + + This should time out or be refused when run from a host outside the customer's VPC or private connectivity path. diff --git a/content/security/architecture/two-plane-separation.md b/content/security/architecture/two-plane-separation.md new file mode 100644 index 000000000..e415ebcb7 --- /dev/null +++ b/content/security/architecture/two-plane-separation.md @@ -0,0 +1,24 @@ +--- +title: Two-plane separation +weight: 1 +variants: -flyte +union +--- + +# Two-plane separation + +Union.ai's architecture is divided into two distinct planes: a **control plane** hosted by Union.ai on AWS, and a **data plane** that runs on the customer's own Kubernetes cluster within their cloud account. + +## Control plane + +The control plane handles workflow orchestration, user management, and the web interface.
It stores only the metadata required for these functions; bulk customer data is referenced by URI rather than stored inline. See [Control plane](./control-plane) for components and infrastructure. + +## Data plane + +The data plane is where all computation and data handling occurs. It runs entirely within the customer's cloud account, on infrastructure the customer controls (or, in the BYOC model, infrastructure that Union.ai manages on the customer's behalf within the customer's account). It is protected by the customer's IAM policies. See [Data plane](./data-plane) for components, Kubernetes security, and IAM. + +## Blast radius + +This separation limits the blast radius of a control plane security incident. A compromised control plane would only expose what is stored or proxied through it: task metadata in its databases, inline data transiting memory during active requests, and log stream content. It could not expose bulk customer data, which is accessed directly on the data plane via presigned URLs. + +For the full classification of what data lives in each plane, what transits control plane memory, and how each pathway is protected, see [Data classification and residency](../data-protection/classification-and-residency). For network paths between the planes, see [Network architecture](./network). + diff --git a/content/security/aws-iam-roles.md b/content/security/aws-iam-roles.md deleted file mode 100644 index 6037a185b..000000000 --- a/content/security/aws-iam-roles.md +++ /dev/null @@ -1,22 +0,0 @@ ---- -title: AWS IAM roles -weight: 18 -variants: -flyte +union ---- - -# AWS IAM roles - -In self-managed deployments, the customer provisions these roles using Union.ai's documentation and templates. In BYOC deployments, [Union.ai provisions them](./byoc-differences#iam-role-provisioning).
- -| Plane | Service Account | Purpose | K8s Namespace | IAM Role ARN Pattern | Bound To | S3 Access | -| --- | --- | --- | --- | --- | --- | --- | -| Control Plane | `flyteadmin` | Orchestration metadata management, namespace provisioning, presigned URL generation for code upload/download | union | `arn:aws:iam:::role/adminflyterole` | FlyteAdmin (workflow admin service) | Generates presigned URLs for customer S3 buckets (does not directly read/write data) | -| Data Plane | `clustersync-system` | Synchronizes K8s namespaces, RBAC roles, service accounts, resource quotas, and config across the cluster | union | `adminflyterole` (data plane admin) | ClusterResourceSync controller | No direct S3 access | -| Data Plane | `executor` | Receives task assignments via tunnel, creates task pods, manages pod lifecycle, reports status back to control plane | union | `adminflyterole` (data plane admin) | Node Executor (TaskAction controller) | R/W to metadata bucket and fast-registration bucket for staging task inputs/outputs | -| Data Plane | `proxy-system` | Monitors events, Flyte workflows, pod logs; streams data back to control plane via tunnel | union | `adminflyterole` (data plane admin) | Proxy Service | Read-only access to metadata bucket for proxying presigned URL requests | -| Data Plane | `operator-system` | Cluster operations, health monitoring, config management, image builder orchestration, tunnel management | union | `adminflyterole` (data plane admin) | Union Operator | R/W to metadata bucket for operator state and config | -| Data Plane | `flytepropeller-system` | K8s operator managing FlyteWorkflow CRDs, pod creation, workflow lifecycle execution | union | `adminflyterole` (data plane admin) | FlytePropeller (workflow engine) | R/W to metadata bucket for workflow data (inputs, outputs, offloaded data) | -| Data Plane | `flytepropeller-webhook-system` | Mutating admission webhook that injects secrets into task pods at creation time | union | `adminflyterole` 
(data plane admin) | FlytePropeller Webhook | No direct S3 access (handles secrets injection only) | -| Data Plane | `clusterresource-template` (per-namespace) | Executes user workflow tasks; reads inputs, writes outputs to S3 | Per-workspace namespace | `userflyterole` (data plane user) | Task Pods (user workloads) | R/W to metadata bucket for task inputs/outputs, code bundles, artifacts | - -For BYOC-specific deployment concerns, see [BYOC deployment differences](./byoc-differences). diff --git a/content/security/byoc-differences.md b/content/security/byoc-differences.md deleted file mode 100644 index f4afd5870..000000000 --- a/content/security/byoc-differences.md +++ /dev/null @@ -1,147 +0,0 @@ ---- -title: BYOC deployment differences -weight: 13 -variants: -flyte +union ---- - -# BYOC deployment differences - -Union.ai's BYOC (Bring Your Own Cloud) deployment shares the same control plane / data plane architecture, encryption, RBAC, tenant isolation, and audit logging as the self-managed deployment. The key difference is that **Union.ai manages the Kubernetes cluster** in the customer's cloud account, rather than the customer managing it independently. - -This page consolidates all security-relevant differences between BYOC and self-managed deployments. 
- -## Overview - -| Aspect | Self-Managed | BYOC | -| --- | --- | --- | -| Data plane operator | Customer | Union.ai | -| K8s cluster management | Customer | Union.ai (via PrivateLink/PSC) | -| K8s API exposure | Customer-controlled | Private only (never public Internet) | -| Union.ai infrastructure access | None (Cloudflare tunnel only) | K8s cluster management only | -| Data/secrets/logs access by Union.ai | None | None | -| Upgrade responsibility | Customer | Union.ai | -| Monitoring responsibility | Customer | Union.ai + customer | - -## Network architecture - -In addition to the Cloudflare Tunnel (which operates identically in both models), Union.ai maintains a **private management connection** to the customer's Kubernetes cluster in BYOC deployments. This connection uses cloud-native private connectivity: - -| Cloud Provider | Technology | -| --- | --- | -| AWS | AWS PrivateLink | -| GCP | GCP Private Service Connect | -| Azure | Azure Private Link | - -This connection is used exclusively for cluster management operations (upgrades, provisioning, health monitoring) and does not carry customer data. The Kubernetes API endpoint is never exposed to the public Internet. - -This means BYOC has an additional communication path not present in self-managed deployments: - -| Communication Path | Protocol | Encryption | -| --- | --- | --- | -| Union.ai → Customer K8s API | PrivateLink / PSC | TLS (private connectivity) | - -This satisfies ISO 27001 A.5.15 (access control), CIS v8 4.4 (restrict administrative access), and CIS v8 12.11 (segment administration interfaces) requirements. - -## Human access to customer environments - -In self-managed deployments, Union.ai personnel access only the customer's control plane tenant. They have zero access to the customer's data plane infrastructure. 
- -In BYOC deployments, Union.ai support and engineering personnel additionally have **authenticated access to the customer's Kubernetes cluster** for operational purposes: - -* Cluster upgrades -* Node pool provisioning -* Helm chart updates -* Health monitoring and troubleshooting - -This access is via cloud-native private connectivity (PrivateLink/PSC) and is scoped to K8s cluster management. Union.ai personnel still **cannot** access: - -* Customer object stores -* Secrets backends -* Container registries -* Log aggregators - -All cluster management actions are logged. Union.ai is implementing **just-in-time (JIT) access controls** to replace persistent support access with time-bound, customer-authorized grants. - -The scope of "administrative operations" also differs: in self-managed, these are limited to control plane API calls (cluster configuration, namespace provisioning). In BYOC, they extend to direct K8s cluster management through the PrivateLink/PSC connection. - -## Secrets management - -The default secrets backend differs by deployment model: - -* **Self-managed:** Kubernetes Secrets (K8s etcd) is the default -* **BYOC:** A cloud-native secrets backend (AWS Secrets Manager, GCP Secret Manager, or Azure Key Vault) is the default, for managed integration with the provisioning workflow - -All four backends remain available as options in both models. The security properties (write-only API, runtime-only consumption, in-memory relay) are identical. - -## Infrastructure management - -In self-managed deployments, the customer manages their own Kubernetes clusters, including provisioning, configuration, version management, node pools, and security patching. 
- -In BYOC deployments, Union.ai manages the Kubernetes cluster in the customer's cloud account: - -* **Cluster provisioning and configuration** -* **Kubernetes version management and upgrades** -* **Node pool health and autoscaler configuration** -* **Helm chart updates for platform components** -* **Monitoring stack deployment and maintenance** (Prometheus, Grafana, Fluent Bit) -* **Serving infrastructure lifecycle** (Kourier gateway, Knative, Union Operator) - -The customer retains responsibility for their cloud account's underlying infrastructure (VPC, IAM policies, object storage configuration). - -### IAM role provisioning - -The same two IAM roles (`adminflyterole` and `userflyterole`) exist in both models. In self-managed, the customer provisions them using Union.ai's documentation and templates. In BYOC, Union.ai provisions these roles as part of cluster setup. - -### Data plane patching - -In self-managed, the customer is responsible for all data plane patching (K8s version, platform components, monitoring stack). In BYOC, Union.ai manages data plane updates, including Kubernetes version, helm charts, and platform components. The control plane is updated independently in both models. - -## Availability and resilience - -Control plane availability is identical across both models (AWS multi-AZ, managed PostgreSQL with automated failover, SOC 2 Type II coverage). - -The difference is in data plane availability: - -* **Self-managed:** The customer is solely responsible for data plane availability, including Kubernetes cluster operations, node pool management, upgrades, and monitoring. Union.ai's availability commitment covers only the control plane. -* **BYOC:** Union.ai is responsible for data plane cluster availability, including Kubernetes version management, node pool health, autoscaler configuration, and monitoring stack uptime. The customer retains responsibility for their cloud account's underlying availability (VPC, IAM, object storage SLAs). 
Union.ai's operational SLA for BYOC cluster management is defined in the customer contract. - -In both models, in-flight workflows continue executing during control plane outages. The operational difference is that in BYOC, Union.ai's monitoring detects control plane connectivity issues; in self-managed, the customer must detect these independently. - -## Third-party dependency risk - -In self-managed, the customer owns all data plane dependencies. Union.ai's dependency risk scope is limited to the control plane and Cloudflare tunnel. - -In BYOC, Union.ai assumes operational responsibility for cluster-level dependencies and their associated risk mitigation: - -* Kubernetes version -* Helm charts -* Monitoring stack (Prometheus, Grafana, Fluent Bit) -* Serving infrastructure (Kourier, Knative) - -Union.ai's vendor management program, covered under the SOC 2 Type II audit, includes periodic evaluation of these dependencies. - -## Shared responsibility model - -The shared responsibility model shifts in BYOC for data plane operations: - -| Responsibility Area | Self-Managed | BYOC | -| --- | --- | --- | -| Control plane security | Union.ai | Union.ai | -| Data plane K8s cluster | Customer | Union.ai | -| Cloud account (VPC, IAM) | Customer | Customer | -| Data encryption at rest | Customer (CMK optional) | Customer (CMK optional) | -| Network security (tunnel) | Union.ai (tunnel) + Customer (firewall/VPC) | Union.ai (tunnel + PrivateLink) + Customer (VPC) | -| IAM role provisioning | Customer | Union.ai | -| Secrets management | Customer (backend selection + values) | Union.ai (default backend) + Customer (values) | -| Application-level access control | Customer (role assignment) | Customer (role assignment) | -| Compliance documentation | Union.ai (SOC 2, Trust Center) + Customer | Union.ai (SOC 2, Trust Center) + Customer | - -## HIPAA and compliance - -Union.ai's HIPAA compliance support applies equally to both deployment models. 
The architecture ensures that all customer data -- including any PHI -- remains exclusively in the customer's own cloud infrastructure regardless of who manages the K8s cluster. The control plane stores only orchestration metadata and never persists PHI. - -## Contact and resources - -* Trust Center: [trust.union.ai](https://trust.union.ai) -* SOC 2 Type II Report: Available upon request -* Security Inquiries: Contact your Union.ai account representative or visit [trust.union.ai](https://trust.union.ai) diff --git a/content/security/compliance-and-certifications.md b/content/security/compliance-and-certifications.md deleted file mode 100644 index b35288ad4..000000000 --- a/content/security/compliance-and-certifications.md +++ /dev/null @@ -1,83 +0,0 @@ ---- -title: Compliance and certifications -weight: 7 -variants: -flyte +union ---- - -# Compliance and certifications - -## Certifications overview - -Union.ai maintains a rigorous certification program validated by independent third-party auditors. -Full details at the [Union.ai Trust Center](https://trust.union.ai/). - -| Standard | Certification | Status | -| --- | --- | --- | -| SOC 2 Type II | Security, Availability, Integrity | Certified | -| SOC 2 Type I | Security, Availability, Integrity | Certified | -| HIPAA | Health data privacy and security | Compliant* | -| CIS 1.4 AWS | Restricted access benchmark | Certified | -| CIS 3.0 | Security benchmark | In progress | - -- * Union is designed to meet HIPAA compliance requirements for handling Protected Health Information (PHI). -- The SOC 2 Type II audit was conducted over a 12-week period and is available upon request. -Key areas covered include protection against unauthorized access (Security), system availability commitments and disaster recovery (Availability), and complete, valid, accurate, and timely processing (Processing Integrity). -- Union.ai uses Vanta for continuous compliance monitoring and automated control assessments. 
- -## Standards compliance - -In addition to certifications, Union.ai complies with the following standard control frameworks through its private data plane architecture: - -| Framework | Control | Description | -| --- | --- | --- | -| ISO 27001 A.5.15 | Access control | Restricts access to network services and management interfaces; management endpoints not exposed to public Internet | -| ISO 27001 A.8.20 | Network security | Segregation and protection of networks; management interfaces on dedicated, private channels | -| ISO 27001 A.8.28 | Secure configuration | Minimizes public exposure of management plane by default | -| ISO 27001 A.8.21 | Cryptography | TLS encryption with minimized exposure of sensitive channels | -| ISO 27001 A.5.23 | Cloud service security | Cloud services configured securely with mitigated public exposure risks | -| CIS v8 4.4 | Administrative access | Administrative interfaces not exposed to Internet; VPN/bastion required | -| CIS v8 12.11 | Segment admin interfaces | Separation of administrative interfaces from public access | -| CIS v8 13.2 | Boundary protections | Management plane endpoints behind strong network segmentation | - -## HIPAA compliance - -Union.ai is designed to support HIPAA compliance requirements, enabling healthcare and life sciences organizations to process protected health information (PHI) within their data planes. -Because all customer data—including any PHI—remains exclusively in the customer’s own cloud infrastructure, Union.ai’s architecture inherently supports HIPAA’s data protection requirements. -The control plane stores only orchestration metadata and never persists PHI. - -## GDPR alignment - -Union.ai’s architecture inherently supports GDPR through its data residency model. -For EU-region data planes, all customer data remains within the European Union. -The control plane stores only orchestration metadata, and where error messages may contain user-generated content, this is documented and scoped. 
- -## Trust Center - -Union.ai maintains a public Trust Center at trust.union.ai (powered by Vanta), providing real-time transparency into the company’s security controls, compliance status, and security practices. -The Trust Center provides up-to-date information on certifications, downloadable resources (SOC 2 reports upon request), and over 70 verified security controls organized across five categories: - -| Control Category | Controls | Key Controls Include | -| --- | --- | --- | -| Infrastructure Security | 17 controls | Encryption key access restricted, unique account authentication enforced, production application/database/OS/network access restricted, intrusion detection, log management, network segmentation, firewalls reviewed and utilized, network hardening standards | -| Organizational Security | 13 controls | Asset disposal procedures, production inventory, portable media encryption, anti-malware, code of conduct, confidentiality agreements, password policy, MDM, security awareness training | -| Product Security | 5 controls | Data encryption at rest, control self-assessments, penetration testing, data transmission encryption, vulnerability/system monitoring | -| Internal Security Procedures | 35 controls | BC/DR plans established and tested, cybersecurity insurance, change management, SDLC, incident response tested, risk assessments, vendor management, board oversight, whistleblower policy | -| Data and Privacy | 3 controls | Data retention procedures, customer data deleted upon leaving, data classification policy | - -## Shared responsibility model - -Union.ai operates under a shared responsibility model: - -| Responsibility Area | Union.ai | Customer | -| --- | --- | --- | -| Control plane security | Full ownership | N/A | -| Data plane infrastructure | Guidance and tooling | Provisioning and maintenance | -| Data encryption at rest | Default cloud encryption | Optional CMK configuration | -| Network security (tunnel) | Tunnel management | Firewall 
and VPC configuration | -| IAM roles and policies | Role templates and documentation | Role creation and binding | -| Secrets management | API and relay infrastructure | Backend selection and secret values | -| Application-level access control | RBAC framework | Role assignment and policy | -| Compliance documentation | SOC 2 report, Trust Center | Customer-specific attestations | - -> [!NOTE] -> In BYOC deployments, shared responsibilities shift for data plane infrastructure and IAM roles. See [BYOC deployment differences: Shared responsibility model](./byoc-differences#shared-responsibility-model). diff --git a/content/security/compliance/_index.md b/content/security/compliance/_index.md new file mode 100644 index 000000000..e7d2ebd0f --- /dev/null +++ b/content/security/compliance/_index.md @@ -0,0 +1,20 @@ +--- +title: Compliance and governance +weight: 5 +variants: -flyte +union +sidebar_expanded: true +--- + +# Compliance and governance + +Union.ai maintains industry-recognized certifications and aligns its security practices with established frameworks. The platform's architecture (with strict data residency, tenant isolation, and control plane / data plane separation) inherently supports compliance requirements across regulated industries. + +This section covers: + +* [Certifications and Trust Center](./certifications): Summary of all certifications, SOC 2 Type II detail, and the Trust Center. +* [HIPAA compliance](./hipaa): How Union.ai supports HIPAA requirements for Protected Health Information. +* [GDPR alignment](./gdpr): Data residency and the EU-region deployment model. +* [Standards compliance](./standards): ISO 27001 and CIS benchmark control mappings. +* [Shared responsibility model](./shared-responsibility): Responsibility allocation for self-managed and BYOC deployments.
+* [Organizational security](./organizational-security): Employee security lifecycle, governance controls, and the security development lifecycle. +* [Vulnerability management](./vulnerability-management): Vulnerability assessment, patch management, incident response, and third-party dependency risk. diff --git a/content/security/compliance/certifications.md b/content/security/compliance/certifications.md new file mode 100644 index 000000000..37bf4ba3d --- /dev/null +++ b/content/security/compliance/certifications.md @@ -0,0 +1,55 @@ +--- +title: Certifications and Trust Center +weight: 1 +variants: -flyte +union +--- + +# Certifications and Trust Center + +## Certifications overview + +Union.ai maintains the following certifications and compliance standards, validated by independent third-party auditors with continuous compliance monitoring via Vanta. + +| Standard | Status | +|---|---| +| SOC 2 Type II (Security, Availability, Processing Integrity) | Certified | +| SOC 2 Type I (Security, Availability, Processing Integrity) | Certified | +| HIPAA | Compliant (designed to meet requirements) | +| CIS 1.4 AWS (restricted access benchmark) | Certified | +| CIS 3.0 | In progress | + +## SOC 2 Type II + +The 12-week audit covers three trust service criteria: Security (protection against unauthorized access), Availability (system availability and disaster recovery), and Processing Integrity (complete, valid, accurate, and timely data processing). + +The audit scope includes control plane infrastructure and operations, tenant isolation controls (org-scoped primary keys, service-layer query gating), employee security lifecycle (background checks, access provisioning, termination checklists), incident response procedures, vendor management program, and business continuity and disaster recovery plans. 
+ +Union.ai maintains 73 verified controls across 5 categories, continuously monitored via Vanta: + +| Category | Controls | Examples | +|---|---|---| +| Infrastructure Security | 17 | Encryption key access, unique account auth, production access restrictions, intrusion detection, log management, network segmentation, firewall review, network hardening | +| Organizational Security | 13 | Asset disposal, production inventory, portable media encryption, anti-malware, code of conduct, confidentiality agreements, password policy, MDM, security awareness training | +| Product Security | 5 | Data encryption at rest, control self-assessments, penetration testing, data transmission encryption, vulnerability/system monitoring | +| Internal Security Procedures | 35 | BC/DR plans, cybersecurity insurance, change management, SDLC, incident response, risk assessments, vendor management, board oversight, whistleblower policy | +| Data and Privacy | 3 | Data retention, customer data deleted upon leaving, data classification policy | + +The SOC 2 Type II report is available upon request. + +## Trust Center + +Union.ai maintains a public Trust Center at [trust.union.ai](https://trust.union.ai) (powered by Vanta) with real-time transparency into security controls, compliance status, and security practices. The Trust Center provides up-to-date certification information and access to request SOC 2 reports. All 73 verified controls are visible through the Trust Center. + +## Verification + +### Certifications + +**Reviewer focus:** Confirm that certifications are current and that the Trust Center provides real-time visibility into control status. + +**How to verify:** + +1. Visit [trust.union.ai](https://trust.union.ai) and review the current certification status. + +2. Request the SOC 2 Type II report and walk through the control categories relevant to specific security questions. + +This is audit-only verification. Certifications are validated by independent third-party auditors. 
diff --git a/content/security/compliance/gdpr.md b/content/security/compliance/gdpr.md new file mode 100644 index 000000000..bc3d4bb44 --- /dev/null +++ b/content/security/compliance/gdpr.md @@ -0,0 +1,27 @@ +--- +title: GDPR alignment +weight: 3 +variants: -flyte +union +--- + +# GDPR alignment + +Union.ai's architecture inherently supports GDPR through its data residency model. For EU-region data planes, all customer data remains within the European Union. Union.ai operates control plane endpoints in EU West-1 (Ireland), EU West-2 (London), and EU Central (Frankfurt), ensuring organizations can deploy with full EU data residency. + +The control plane stores orchestration metadata and task definitions (encrypted at rest), which may include fields such as environment variables and default input values. No bulk customer data payloads are stored in the control plane. Inline data (structured task I/O, secret values during creation, and log streams) transits control plane memory during request processing but is not persisted. This transit occurs through the control plane region, so customers should select a control plane region consistent with their data residency requirements. When both planes are in EU regions (EU West-1, EU West-2, or EU Central), all data (both at rest and in transit) stays within the EU. + +For details on how data residency is enforced architecturally, see [Data classification and residency](../data-protection/classification-and-residency) and [Two-plane separation](../architecture/two-plane-separation). + +## Verification + +### GDPR alignment + +**Reviewer focus:** Confirm that all customer data remains within the EU for EU-region deployments and that the control plane does not store data payloads. + +**How to verify:** + +1. Follow the data residency verification steps described in [Data classification and residency](../data-protection/classification-and-residency#data-residency). All data resides in EU-region infrastructure. + +2. 
Confirm that the control plane endpoints are deployed in EU regions (EU West-1, EU West-2, EU Central). + +3. This is architectural verification. The [two-plane separation](../architecture/two-plane-separation) ensures bulk data never leaves the customer's cloud account. diff --git a/content/security/compliance/hipaa.md b/content/security/compliance/hipaa.md new file mode 100644 index 000000000..9eab555a7 --- /dev/null +++ b/content/security/compliance/hipaa.md @@ -0,0 +1,29 @@ +--- +title: HIPAA compliance +weight: 2 +variants: -flyte +union +--- + +# HIPAA compliance + +Union.ai supports HIPAA compliance for organizations processing protected health information (PHI). The architectural separation between control plane and data plane described in [Two-plane separation](../architecture/two-plane-separation) is the foundation of that support. + +Bulk PHI (files, DataFrames) is stored and processed only in the customer's data plane and never enters the control plane. Structured task inputs/outputs and log streams that may contain PHI transit control plane memory during request processing (encrypted in transit; plaintext in memory; never persisted, cached, or logged). + +Task definitions stored in the control plane databases may contain fields such as environment variables and default input values. If those fields contain PHI, that PHI is persisted (encrypted at rest) in Union.ai infrastructure. Error messages from task executions are also persisted in the control plane and may contain customer data from tracebacks. There is no log content filtering or redaction, so PHI written to stdout/stderr flows through control plane memory unmodified. Organizations should evaluate whether task definitions or error messages in their workflows could contain PHI and scope their BAA accordingly. + +All data is encrypted at rest and in transit. RBAC policies restrict access to authorized users. All API requests are authenticated and logged with identity, operation, and timestamp.
These guarantees apply equally to self-managed and BYOC deployments. + +## Verification + +### HIPAA compliance + +**Reviewer focus:** Confirm that PHI remains exclusively in the customer's infrastructure and that appropriate safeguards are in place. + +**How to verify:** + +1. All data residency demonstrations apply. PHI stays in the customer's infrastructure. Run the verification steps described in [Data classification and residency](../data-protection/classification-and-residency) to confirm data residency. + +2. Confirm Business Associate Agreement (BAA) availability with Union.ai sales. + +3. Map Union.ai's controls to HIPAA safeguard categories (Administrative, Physical, Technical) to validate coverage. diff --git a/content/security/compliance/organizational-security.md b/content/security/compliance/organizational-security.md new file mode 100644 index 000000000..c4cbe001e --- /dev/null +++ b/content/security/compliance/organizational-security.md @@ -0,0 +1,35 @@ +--- +title: Organizational security +weight: 6 +variants: -flyte +union +--- + +# Organizational security + +## Employee security lifecycle + +Union.ai conducts background checks for all employees with production system access, verified through the SOC 2 audit. Security awareness training is required within 30 days of hire and annually, monitored via Vanta. Confidentiality agreements are signed by all employees and contractors. A code of conduct is acknowledged by all personnel, with violations subject to disciplinary action. + +Access management follows documented procedures for provisioning, modification, and revocation. Termination checklists ensure complete access revocation when employees depart. Annual performance evaluations are conducted. Least-privilege access to internal systems is enforced with regular access reviews. + +## Governance + +Formal security roles and responsibilities are defined with a documented organizational structure and reporting relationships. 
Board-level oversight is maintained: senior management briefs the board on security and risk at least annually. + +Information security policies are documented and reviewed at least annually. A whistleblower policy provides an anonymous communication channel for reporting concerns. Third-party vendors are evaluated and monitored through the vendor management program, and the sub-processor list is available via the [Trust Center](https://trust.union.ai). Business continuity and disaster recovery plans are aligned with SOC 2 requirements. + +## Security development lifecycle + +Secure coding guidelines are enforced through mandatory code review. Automated security testing is integrated into CI/CD pipelines. Dependency scanning and vulnerability management cover all software components. Infrastructure-as-code with version-controlled security configurations ensures that infrastructure changes are auditable and reproducible. + +Regular third-party penetration testing validates the effectiveness of security controls. Documented incident response procedures include escalation paths and post-incident review. + +## Verification + +### Organizational security + +**Reviewer focus:** Confirm that organizational security practices are documented and independently verified. + +**How to verify:** + +These are organizational practices verified through the SOC 2 Type II audit and [Trust Center](https://trust.union.ai) continuous monitoring. This is audit-only verification. These practices cannot be demonstrated through product features. 
diff --git a/content/security/compliance/shared-responsibility.md b/content/security/compliance/shared-responsibility.md new file mode 100644 index 000000000..b8a6fae3a --- /dev/null +++ b/content/security/compliance/shared-responsibility.md @@ -0,0 +1,46 @@ +--- +title: Shared responsibility model +weight: 5 +variants: -flyte +union +--- + +# Shared responsibility model + +## Self-managed + +In self-managed deployments, the customer owns and operates the data plane infrastructure, while Union.ai manages the control plane. The following table defines the responsibility boundary. + +| Area | Union.ai | Customer | +|---|---|---| +| Control plane security | Full ownership | N/A | +| Data plane infrastructure | Guidance and tooling | Provisioning and maintenance | +| Data encryption at rest | Default cloud encryption | Optional CMK configuration | +| Network security (tunnel) | Tunnel management | Firewall and VPC configuration | +| IAM roles and policies | Role templates and documentation | Role creation and binding | +| Secrets management | API and relay infrastructure | Backend selection and secret values | +| Access control | RBAC framework | Role assignment and policy | +| Compliance documentation | SOC 2 report, Trust Center | Customer-specific attestations | + +## BYOC shifts + +In BYOC deployments, Union.ai assumes additional operational responsibility for the data plane Kubernetes cluster while the customer retains ownership of the cloud account. 
+ +| Area | Self-managed | BYOC | +|---|---|---| +| Data plane K8s cluster | Customer | Union.ai | +| Cloud account (VPC, IAM) | Customer | Customer | +| IAM role provisioning | Customer | Union.ai | +| Secrets management | Customer (backend + values) | Union.ai (default backend) + Customer (values) | +| Network security | Union.ai (tunnel) + Customer (firewall/VPC) | Union.ai (tunnel + PrivateLink) + Customer (VPC) | + +For details on how the deployment model affects security controls, see [Deployment models](../architecture/deployment-models). + +## Verification + +### Shared responsibility model + +**Reviewer focus:** Confirm that the responsibility boundaries are clearly defined and that the BYOC model correctly reflects the shifted responsibilities. + +**How to verify:** + +This is a reference table for risk assessment, not a claim requiring active proof. Use it to map security questions to the responsible party for each deployment model. diff --git a/content/security/compliance/standards.md b/content/security/compliance/standards.md new file mode 100644 index 000000000..5be55dafa --- /dev/null +++ b/content/security/compliance/standards.md @@ -0,0 +1,37 @@ +--- +title: Standards compliance +weight: 4 +variants: -flyte +union +--- + + + +# Standards compliance + +Union.ai aligns with ISO 27001 and CIS control frameworks through its private data plane architecture. The private connectivity model described in [Private connectivity](../architecture/private-connectivity) directly addresses the management interface controls in both frameworks. 
+ +| Framework | Control | Description | +|---|---|---| +| ISO 27001 A.5.15 | Access control | Restricts access to network services and management interfaces; management endpoints not exposed to the public internet | +| ISO 27001 A.5.23 | Information security for use of cloud services | Cloud services configured securely with mitigated public exposure risks | +| ISO 27001 A.8.20 | Networks security | Segregation and protection of networks; management interfaces on dedicated private channels | +| ISO 27001 A.8.22 | Segregation of networks | Management plane separated from public networks | +| ISO 27001 A.8.24 | Use of cryptography | TLS encryption with minimized exposure of sensitive channels | +| CIS Controls v8, Control 12 | Network infrastructure management | Administrative interfaces not exposed to the public internet; management endpoints behind network segmentation | +| CIS Controls v8, Control 13 | Network monitoring and defense | Traffic filtering between network segments; boundary protections on management plane endpoints | + +Union.ai also holds CIS 1.4 AWS certification and is pursuing CIS 3.0. + +## Verification + +### Standards compliance + +**Reviewer focus:** Confirm that the private connectivity architecture satisfies the referenced ISO 27001 and CIS controls. + +**How to verify:** + +1. The private connectivity architecture described in [Private connectivity](../architecture/private-connectivity) is itself the demonstration of these controls: management interfaces are not exposed to the public internet. + +2. The [Trust Center](https://trust.union.ai) provides continuous monitoring of compliance status. + +3. This is architectural and audit-only verification.
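As a complementary client-side spot-check of the cryptography controls above (ISO 27001 A.8.24), a reviewer can pin a TLS floor when probing platform endpoints. The sketch below is illustrative Python using only the standard library; it is not a Union.ai tool, it assumes a TLS 1.2 minimum on the client side, and the commented hostname is a placeholder, not a real endpoint.

```python
import ssl

# Build a client context that refuses anything below TLS 1.2.
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_2

# Certificate validation and hostname checking remain enabled
# (these are the secure defaults of create_default_context()).
assert ctx.verify_mode == ssl.CERT_REQUIRED
assert ctx.check_hostname

# A reviewer could then wrap a socket to an endpoint under review
# (hostname below is a placeholder):
#
# import socket
# with socket.create_connection(("cp.example.invalid", 443)) as sock:
#     with ctx.wrap_socket(sock, server_hostname="cp.example.invalid") as tls:
#         print(tls.version())  # negotiated protocol, e.g. TLS 1.2 or 1.3
```

If the handshake fails with this context, the endpoint is offering a protocol version below the pinned floor.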
diff --git a/content/security/compliance/vulnerability-management.md b/content/security/compliance/vulnerability-management.md new file mode 100644 index 000000000..d1d6b2c0f --- /dev/null +++ b/content/security/compliance/vulnerability-management.md @@ -0,0 +1,49 @@ +--- +title: Vulnerability management +weight: 7 +variants: -flyte +union +--- + +# Vulnerability management + +## Vulnerability assessment + +Union.ai's vulnerability management program includes dependency analysis and automated CVE alerts for software dependencies, container image scanning for platform and customer-facing components, and periodic third-party penetration testing. + +## Patch management + +Union.ai follows a risk-based approach to patching. Critical vulnerabilities (CVSS 9.0+) are prioritized for immediate remediation. High-severity vulnerabilities are addressed within defined SLA windows. + +The control plane is updated independently of customer data planes, so security patches can be applied without customer-side changes. In self-managed deployments, the customer handles data plane patching. In BYOC deployments, Union.ai manages data plane patching on the customer's behalf. + +## Incident response + +Union.ai maintains documented incident response procedures aligned with SOC 2 Type II requirements, including defined escalation paths, communication protocols, containment procedures, and post-incident review processes. The control plane's stateless handling of customer data limits the potential impact of any control plane incident. See [Two-plane separation](../architecture/two-plane-separation) for details. + +## Third-party dependency risk + +Union.ai relies on a small number of critical and operational dependencies, each with specific mitigations. 
+ +| Dependency | Tier | Role | Mitigation | +|---|---|---|---| +| Cloudflare | Critical | Cross-plane connectivity (Tunnel and gRPC ingress) | mTLS, outbound-only, health monitoring, auto-reconnection | +| AWS (control plane) | Critical | CP infrastructure | Multi-AZ, automated failover, encryption at rest and in transit | +| Customer cloud provider | Critical | DP infrastructure | Customer-managed; Union.ai provides guidance and tooling | +| Vanta | Operational | Compliance monitoring | Independent SOC 2 audit validates controls | +| Okta | Operational | OIDC authentication | Standard OAuth2/OIDC; API keys and service accounts as fallback | + +The vendor management program is covered under SOC 2 Type II with periodic evaluation. A formal dependency risk assessment is available upon request. In self-managed deployments, the customer owns all data plane dependencies. In BYOC deployments, Union.ai assumes cluster-level dependency risk. + +## Verification + +### Vulnerability management + +**Reviewer focus:** Confirm that Union.ai has an active vulnerability management program with defined SLAs and that third-party dependencies are evaluated and monitored. + +**How to verify:** + +1. The SOC 2 Type II report and [Trust Center](https://trust.union.ai) cover vulnerability management and vendor assessment controls. + +2. Penetration test results are available on request. + +This is audit-only verification. diff --git a/content/security/components-architecture.md b/content/security/components-architecture.md deleted file mode 100644 index f22289bc1..000000000 --- a/content/security/components-architecture.md +++ /dev/null @@ -1,135 +0,0 @@ ---- -title: Compute and control plane components -weight: 11 -variants: -flyte +union -mermaid: true ---- - -# Compute and control plane components - -This section provides a detailed reference for each security-relevant component running on the data plane and/or control plane. 
-Understanding these components is essential for enterprise security teams conducting architecture reviews. - -## Component architecture - -The diagram below shows the major components in both planes and how they communicate. -All cross-plane traffic flows through the Cloudflare Tunnel—an outbound-only, mTLS-encrypted connection initiated from the data plane. -No inbound ports are opened on the customer’s cluster. - -```mermaid -graph TB - subgraph CP["Control plane (Union.ai hosted — AWS)"] - Admin["Admin
(UI & API gateway)"] - QueueSvc["Queue Service
(schedules TaskActions)"] - StateSvc["State Service
(receives state transitions)"] - ClusterSvc["Cluster Service
(cluster health & DNS reconciliation)"] - DataProxy["DataProxy
(streaming relay for logs & metrics)"] - end - - subgraph Tunnel["Cloudflare Tunnel (outbound-only, mTLS)"] - direction LR - TunnelEdge(["Cloudflare edge"]) - end - - subgraph DP["Data plane (customer hosted — customer cloud account)"] - TunnelSvc["Tunnel Service
(maintains outbound tunnel connection)"] - Executor["Executor
(Kubernetes controller — runs task pods)"] - ObjStore["Object Store Service
(presigned URL generation)"] - LogProvider["Log Provider
(live K8s logs + cloud log aggregator)"] - ImageBuilder["Image Builder
(Buildkit — on-cluster image builds)"] - - subgraph Apps["Apps & Serving"] - Kourier["Kourier gateway
(Envoy — auth + routing)"] - Knative["Knative Services
(app containers)"] - end - - Executor -->|"submit and watch"| Pods["Task pods
(customer workloads)"] - Pods -->|"read/write via IAM"| ObjBucket[("Object store
(metadata + fast-reg buckets)")] - ObjStore -->|"signs URLs using admin IAM role"| ObjBucket - LogProvider -->|"live: K8s API
completed: CloudWatch / Cloud Logging / Azure Monitor"| Pods - Kourier --> Knative - end - - Admin -->|"ConnectRPC / HTTPS"| User(["Client
(browser / CLI / SDK)"]) - User -->|"presigned URL — direct fetch"| ObjBucket - - CP <-->|"Cloudflare Tunnel"| TunnelEdge - TunnelEdge <-->|"outbound-initiated from data plane"| TunnelSvc - TunnelSvc --- Executor - TunnelSvc --- ObjStore - TunnelSvc --- LogProvider - TunnelSvc --- Apps - - QueueSvc -->|"TaskAction"| Executor - Executor -->|"state transitions (ConnectRPC)"| StateSvc - LogProvider -->|"streamed relay — never persisted"| DataProxy - ClusterSvc -->|"health checks & DNS"| TunnelSvc -``` - -**Key relationships:** - -| From | To | What flows | -| --- | --- | --- | -| Queue Service | Executor | TaskAction custom resources (orchestration instructions) | -| Executor | State Service | Phase transitions (Queued → Running → Succeeded/Failed) | -| Executor | Task pods | Pod lifecycle management | -| Task pods | Object store | Task inputs/outputs via IAM role (workload identity) | -| Object Store Service | Object store | Presigned URL generation using admin IAM role | -| Log Provider | DataProxy | Log streams relayed in memory — optionally persisted on customer storage | -| Cluster Service | Tunnel Service | Health checks and DNS record reconciliation | -| Tunnel Service | Cloudflare edge | Single outbound-only mTLS connection covering all data-plane services | - -## Executor - -The Executor is a Kubernetes controller that runs on the customer’s data plane. -It is the core component responsible for translating orchestration instructions into actual workload execution. -The Executor watches for `TaskAction` custom resources created by the Queue Service, reconciles each `TaskAction` through its lifecycle (`Queued`, `Initializing`, `Running`, `Succeeded`/`Failed`), reports state transitions back to the control plane’s State Service via `ConnectRPC` through the Cloudflare tunnel, and creates and manages Kubernetes pods for task execution. - -The Executor runs entirely within the customer’s cluster. 
-It accesses the customer’s object store and secrets using IAM roles bound to its Kubernetes service account via workload identity federation. -At no point does the Executor communicate directly with external services outside the customer’s cloud account (except through the Cloudflare tunnel to the control plane). - -## Apps and serving - -- Apps and Serving enables customers to deploy long-running web applications — Streamlit dashboards, FastAPI services, notebooks, and inference endpoints — directly on the customer's data plane. -- Apps run as Knative Services within tenant-scoped Kubernetes namespaces, with the Union Operator managing the full lifecycle including autoscaling and scale-to-zero. -- No application code, data, or serving traffic passes through the Union control plane. -- Inbound traffic routes through Cloudflare for DDoS protection to a Kourier gateway (Union's Envoy fork) running on the customer's cluster, which enforces authentication against the control plane before forwarding to the app container. -- Browser access uses SSO; programmatic access requires a Union API key. -- All endpoints require authentication by default, with optional per-app anonymous access. -- Union's RBAC controls which users can deploy and access apps per project, and resource quotas constrain consumption. -- The load balancer, serving infrastructure, and app containers all run within the customer's cluster, maintaining the same data residency guarantees as workflow execution. -- In BYOC deployments, Union.ai manages the [serving infrastructure lifecycle](./byoc-differences#infrastructure-management). - -## Object store service - -The Object Store Service runs on the data plane and provides the signing capabilities that enable the presigned URL security model. -Its key operations include: -- `CreateSignedURL` (generates presigned URLs using the customer’s IAM credentials via the admin role). 
-- `CreateUploadLocation` (generates presigned `PUT` URLs for fast registration with `Content-MD5` integrity verification) -- `Presign` (generic presigning for arbitrary object store keys) -- `Get`/`Put` (direct object store read/write used internally by platform services). - -Two object store buckets are provisioned per data plane cluster: a metadata bucket for task inputs, outputs, reports, and intermediate data, and a "fast-registration" bucket for code bundles uploaded during task registration. -Object layout follows a hierarchical pattern: org/project/domain/run-name/action-name, providing natural namespace isolation. - -## Log provider - -The Log Provider runs on the data plane and serves task logs from two sources. -For live tasks, logs are streamed directly from the Kubernetes API (pod stdout/stderr) in real time. -For completed tasks, logs are read from the cloud log aggregator (CloudWatch, Cloud Logging, or Azure Monitor) after pod termination. -Union also supports persisting logs in object storage. -Log lines include structured metadata: timestamp, message content, and originator classification (user vs. system). -This structured approach enables security teams to distinguish between application-generated logs and platform-generated logs for audit purposes. - -## Image builder - -When enabled, the Image Builder runs on the data plane and uses Buildkit to construct container images without exposing source code or built artifacts outside the customer’s infrastructure. -The build process pulls the base image from a customer-approved registry (public or private), accesses user code via a presigned URL with a limited time-to-live, builds the container image with specified layers (pip packages, apt packages, custom commands, UV/Poetry projects), and pushes the built image to the customer’s container registry (ECR, GCR, ACR, or others). -Source code and built images never leave the customer’s infrastructure during the build process. 
- -## Tunnel service - -The Tunnel Service maintains the Cloudflare Tunnel connection between the data plane and control plane. -It is responsible for initiating and maintaining the outbound-only encrypted connection, performing periodic health checks and heartbeats, and reconnecting automatically in case of network disruption. -The Cluster Service on the control plane performs periodic reconciliation to ensure tunnel health and DNS records are current. diff --git a/content/security/data-protection.md b/content/security/data-protection.md deleted file mode 100644 index 4d9ffdf46..000000000 --- a/content/security/data-protection.md +++ /dev/null @@ -1,65 +0,0 @@ ---- -title: Data protection -weight: 2 -variants: -flyte +union ---- - -# Data protection - -## Data classification - -Union.ai maintains a rigorous data classification framework. -Every data type handled by the platform is classified by residency and access pattern: - -| Data Type | Classification | Residency | Transits Control Plane? 
| -| --- | --- | --- | --- | -| Task inputs/outputs | Customer Data | Customer object store | No — direct via presigned URL | -| Code bundles | Customer Data | Customer object store | No — direct via presigned URL | -| Container images | Customer Data | Customer registry | No — stays in customer infra | -| Reports (HTML) | Customer Data | Customer object store | No — direct via presigned URL | -| Task logs | Customer Data | Customer log aggregator | Relayed in-memory (not stored) | -| Secrets | Customer Data | Customer secrets backend | Relayed during create (not stored) | -| Observability metrics | Customer Data | Customer data plane | Relayed in-memory (not stored) | -| Task definitions | Orchestration Metadata | Control plane DB | Yes — metadata only | -| Run/action metadata | Orchestration Metadata | Control plane DB | Yes | -| User identity/RBAC | Platform Metadata | Control plane DB | Yes | -| Cluster records | Platform Metadata | Control plane DB | Yes | - -## Encryption at rest - -All data at rest is encrypted using cloud-provider native encryption: - -| Storage | Encryption Standard | Key Management | -| --- | --- | --- | -| Object Store (S3/GCS/Azure Blob) | Cloud-provider default (SSE-S3, Google-managed, Azure SSE) | Cloud provider managed; CMK supported | -| Container Registry | Cloud-provider encryption | Cloud provider managed | -| Secrets Backend (cloud) | Cloud-provider encryption | Cloud secrets manager | -| Secrets Backend (K8s) | `etcd` encryption | K8s cluster-level encryption | -| ClickHouse | Encrypted EBS/persistent disk | Cloud provider managed | -| Control Plane PostgreSQL | AWS RDS encryption | AES-256; AWS KMS managed | - -## Encryption in transit - -Union.ai enforces encryption for all data in transit. -No unencrypted communication paths exist in the platform architecture. - -- All client-to-control-plane communication uses TLS 1.2 or higher. -- All control-plane-to-data-plane communication uses mutual TLS via Cloudflare Tunnel. 
-- All client-to-object-store communication (via presigned URLs) uses HTTPS, enforced by cloud providers. -- All internal data plane communication uses cloud-native TLS. - -## Data residency and sovereignty - -Union.ai’s architecture provides strong data residency guarantees: - -### Data plane - -* All customer data resides in the customer’s own cloud account and region -* Customers choose the region for their data plane deployment - -### Control plane - -* Union.ai hosts your control plane in these supported regions: US West, US East, EU West-1 (Ireland), EU West-2 (London), EU Central, with more being added -* No customer data is replicated to or cached in Union.ai infrastructure. See [Data classification](#data-classification) for more detail on data classification and handling. - -For organizations operating under GDPR or other data residency regulations, Union.ai’s EU-region data planes ensure all customer data remains within the European Union. diff --git a/content/security/data-protection/_index.md b/content/security/data-protection/_index.md new file mode 100644 index 000000000..2fedf585e --- /dev/null +++ b/content/security/data-protection/_index.md @@ -0,0 +1,26 @@ +--- +title: Data protection +weight: 2 +variants: -flyte +union +sidebar_expanded: true +--- + +# Data protection + +Union.ai protects customer data through a classification framework, residency guarantees, and cloud-native encryption. All customer data is encrypted both at rest and in transit. + +The platform uses three data access patterns: + +- **Presigned URLs** for bulk data (files, DataFrames, code bundles), which bypass the control plane entirely. +- **Inline proxy** for structured task I/O and secret values, which transits control plane memory: encrypted in transit, held as plaintext only for the duration of request handling, and never persisted. +- **Streaming relays** for logs and metrics, which transit control plane memory and are not persisted.
+ +This section covers: + +* [Data classification and residency](./classification-and-residency): How data is classified, where it resides, and multi-cloud region support. +* [Data flow](./data-flow): Presigned URL and streaming relay patterns, and what data appears in the UI. +* [Encryption](./encryption): Encryption at rest and in transit across all storage and communication paths. +* [Secrets management](./secrets): Write-only API design, backends, and secret lifecycle. +* [Workflow data flow](./workflow-data-flow): Security controls at each stage of the workflow lifecycle. +* [Multi-cloud support](./multi-cloud): Supported cloud providers and consistent security guarantees. +* [Logging and audit](./logging-and-audit): Task logging, observability metrics, and audit trails. diff --git a/content/security/data-protection/classification-and-residency.md b/content/security/data-protection/classification-and-residency.md new file mode 100644 index 000000000..c44710206 --- /dev/null +++ b/content/security/data-protection/classification-and-residency.md @@ -0,0 +1,114 @@ +--- +title: Data classification and residency +weight: 1 +variants: -flyte +union +--- + +# Data classification and residency + +## Data classification + +Every data type in the Union.ai platform is classified by its residency and access pattern. This classification determines where data is stored and how it is accessed. + +| Classification | Data types | At rest | In transit | Enters control plane memory? 
| +|---|---|---|---|---| +| Bulk Customer Data | Files, directories, DataFrames, code bundles, container images, reports | Customer infrastructure (S3 SSE / GCS / Azure SSE) | HTTPS via presigned URL | **No**: never enters control plane | +| Inline Customer Data | Structured task inputs/outputs, secret values (during creation), execution log streams | Customer infrastructure (S3 SSE / GCS / Azure SSE; cloud secret managers) | TLS (client→CP) + TLS+mTLS+tunnel (CP→DP) | **Yes**: plaintext in memory, not persisted/cached/logged | +| Orchestration Metadata | Task definitions (including env vars, default values, SQL, pod specs), run/action state, error messages, trigger specs | Control plane databases (AES-256/KMS) | TLS (API) + TLS (gRPC events) | **Yes**: read from DB into memory for API responses | +| Platform Metadata | User identity/RBAC records, cluster records | Control plane databases (AES-256/KMS) | TLS (API) | **Yes**: read from DB into memory for API responses | + +**Bulk customer data** (files, directories, DataFrames, code bundles, container images, and reports) is stored exclusively in the customer's infrastructure and never enters the control plane. These objects are accessed via presigned URLs. + +**Inline customer data** (structured task inputs and outputs, secret values during creation/update, and execution log streams) is stored at rest in the customer's infrastructure but transits control plane memory during request processing. This data is encrypted in transit (TLS + Cloudflare Tunnel), exists as plaintext in control plane memory only for the duration of each request, and is not persisted, cached, or logged in the control plane. + +**Orchestration metadata** is stored in the control plane databases (encrypted at rest). 
This includes task definitions, which contain structural information (container image references, typed interfaces) and fields that may be customer-sensitive: environment variables, default input literal values, SQL query statements, Kubernetes pod specs, plugin configuration, and config key-value pairs. Error messages from task executions (which may contain data from Python tracebacks) are also stored. A full task definition (TaskSpec) is stored on every run submission. + +## Data residency + +All customer data resides in the customer's own cloud account and region. The customer chooses the region for their data plane deployment, and all data plane resources (object storage, container registry, secrets backend, log aggregator, and compute) are provisioned within that region. + +The control plane is available in the following regions: US West (us-west-2), US East (us-east-2), EU West-1 (Ireland), EU West-2 (London), and EU Central (eu-central-1). No bulk customer data is replicated to or cached in Union.ai infrastructure. Inline data (structured task I/O, secret values during creation, log streams) transits control plane memory during request processing but is not persisted. This transit occurs through the control plane region, so customers should select a control plane region consistent with their data residency requirements. For EU-deployed data planes using an EU control plane region, all data (both at rest and in transit) stays within the EU, supporting GDPR data residency requirements. + +For details on the architectural separation that enforces these residency guarantees, see [Two-plane separation](../architecture/two-plane-separation). + +## Verification + +### Data classification + +**Reviewer focus:** Confirm that each data type resides where the classification table claims. Verify that bulk data is in the customer's infrastructure, and that task definitions in the control plane contain only expected fields. 
+ +**How to verify:** + +The task definition schema is derived from the open-source Flyte protobuf definitions in the [flyte-sdk repository](https://github.com/flyteorg/flyte-sdk). Review the `TaskTemplate` and `RunSpec` protobuf schemas and compare them to the field enumeration in the classification table above to confirm that the fields stored match the documented classifications. + +Then run a workflow with recognizable data (e.g., a known string or file), and verify the location of each data type: + +1. **Inputs/outputs**: confirm they are in the customer's object store: + + ```bash + aws s3 ls s3://<bucket>/<org>/<project>/<domain>/<run-name>/<action-name>/ + ``` + +2. **Code bundle**: confirm it is in the customer's object store: + + ```bash + aws s3 ls s3://<bucket>/<org>/<project>/<domain>/code-bundles/ + ``` + +3. **Container image**: confirm it is in the customer's container registry: + + ```bash + aws ecr describe-images --repository-name <repository> --region <region> + ``` + +4. **Logs**: confirm they are in the customer's log aggregator: + + ```bash + aws logs get-log-events --log-group-name <log-group> --log-stream-name <log-stream> + ``` + +5. **Secrets**: confirm they are in the customer's secrets backend: + + ```bash + aws secretsmanager list-secrets --region <region> + ``` + +6. **Task definition**: confirm it contains the expected fields, stored in the control plane: + + ```bash + uctl get task <task-name> -o json + ``` + + The response will contain resource requirements, typed interfaces, container image references, and potentially sensitive fields (environment variables, default values, etc.) as documented in [Control plane](../architecture/control-plane). Bulk data content should not appear inline. + +7. **Run metadata**: confirm it contains metadata and URI references, stored in the control plane: + + ```bash + uctl get execution <run-name> -o json + ``` + + The response should contain phase, timestamps, URIs, error messages, and task definition fields. Bulk data content should not appear inline.
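The field-level check in step 6 can be rehearsed offline before touching a live deployment. The snippet below inspects a fabricated, abbreviated task-definition JSON whose field names follow the open-source Flyte `TaskTemplate` shape; treat the structure as illustrative, and substitute real `uctl get task` output in practice:

```shell
# Fabricated, abbreviated task-definition JSON in the general shape of the
# Flyte TaskTemplate schema (real output comes from `uctl get task <task-name> -o json`).
taskdef='{"id": {"name": "my_task"}, "container": {"image": "123456789.dkr.ecr.us-west-2.amazonaws.com/img:v1", "env": [{"key": "MODE", "value": "prod"}]}}'

# The definition should carry an image reference and (potentially sensitive)
# env vars -- but no bulk data content such as file or DataFrame bytes.
echo "$taskdef" | grep -o '"image": "[^"]*"'
echo "$taskdef" | grep -c '"env"'
```

The same two greps run against live output confirm that the stored definition references artifacts by location rather than embedding their content.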
+ +### Data residency + +**Reviewer focus:** Confirm that all data plane resources reside in the customer's chosen region and that no customer data is stored outside that region. + +**How to verify:** + +1. Confirm the object store region: + + ```bash + aws s3api get-bucket-location --bucket <bucket-name> + ``` + + The output should match the customer's chosen deployment region. + +2. Verify all data plane resources in the cloud console. Compute, storage, registry, secrets, and log aggregator should all be in the same region. + +3. Confirm the cluster region via the Union.ai API: + + ```bash + uctl get cluster + ``` + + The cluster region should match the customer's chosen deployment region. diff --git a/content/security/data-protection/data-flow.md b/content/security/data-protection/data-flow.md new file mode 100644 index 000000000..a741d0aab --- /dev/null +++ b/content/security/data-protection/data-flow.md @@ -0,0 +1,138 @@ +--- +title: Data flow +weight: 2 +variants: -flyte +union +--- + +# Data flow + +Union.ai uses three distinct patterns to move data between the data plane and clients: presigned URLs for bulk data, an inline proxy for structured task I/O, and streaming relays for live data. All three patterns encrypt data in transit. The patterns differ in whether data enters control plane memory. + +- **Presigned URL pattern (bulk data -- never enters control plane).** The client (SDK or UI) connects directly to the customer's object store over HTTPS using a presigned URL. The object store encrypts the data at rest. The control plane is not on the data path. + +- **Inline proxy pattern (structured I/O -- transits control plane).** The client sends data to the control plane over TLS. The control plane proxies it to the data plane operator through the Cloudflare Tunnel (TLS + mTLS), which writes it to the customer's object store (encrypted at rest). The data exists as plaintext in control plane memory only for the duration of the request and is not persisted.
+ +- **Streaming relay pattern (logs -- transits control plane).** The data plane streams data to the control plane through the Cloudflare Tunnel (TLS + mTLS), and the control plane forwards it to the client over TLS. The data passes through control plane memory as plaintext but is not persisted. + +## Presigned URL pattern + +For bulk data -- files (`flyte.io.File`), directories (`flyte.io.Dir`), DataFrames, code bundles, and reports -- the control plane proxies signing requests to the data plane, which generates time-limited presigned URLs using customer-managed IAM credentials. The client then uploads or downloads data directly to the customer's object store. The data content never enters the control plane; only the signing metadata passes through. This model eliminates the need for the control plane to hold persistent cloud IAM credentials. + +- **Client to object store:** encrypted via HTTPS, using a presigned URL direct to customer storage. +- **At rest:** encrypted by the cloud provider (S3 SSE, GCS encryption, or Azure SSE). +- **Control plane involvement:** none. The control plane generates or relays the signed URL only; the data content never enters the control plane. + +## Inline proxy pattern + +For structured task inputs and outputs (protobuf literals such as ints, strings, lists, dicts, and small serialized objects), the control plane acts as a proxy. + +On run submission, the SDK sends structured inputs to the control plane (up to 10 MiB). The control plane proxies the full payload through its memory to the data plane object store via the Cloudflare Tunnel. + +On result retrieval, the control plane fetches both inputs and outputs from the data plane object store and returns them to the client (up to 20 MiB). + +The data is encrypted in transit (TLS on both sides), exists as plaintext in control plane memory for the duration of each request, and is not persisted, cached, or logged. 
The same pattern applies to secret values during create/update operations, which are relayed through the control plane to the data plane's secrets backend. + +The distinction between presigned URLs and the inline proxy is by data type, not by size: binary artifacts always use presigned URLs; structured protobuf literals always use the inline proxy. + +**Encryption at each phase (run submission):** + +| Phase | Encrypted? | Details | +|-------|------------|---------| +| Client → Control Plane | **Yes** | TLS 1.2+. Wire format: protobuf binary | +| In Control Plane | **Plaintext in memory** | Deserialized protobuf, hashed for cache key, then re-serialized. Not persisted, cached, or logged | +| Control Plane → Data Plane | **Yes** | TLS + mTLS + Cloudflare Tunnel. Wire format: protobuf JSON | +| At rest (data plane object store) | **Yes** | S3 SSE / GCS encryption / Azure SSE | + +**Encryption at each phase (result retrieval):** + +| Phase | Encrypted? | Details | +|-------|------------|---------| +| At rest (data plane object store) | **Yes** | S3 SSE / GCS encryption / Azure SSE | +| Data Plane → Control Plane | **Yes** | TLS + mTLS + Cloudflare Tunnel | +| In Control Plane | **Plaintext in memory** | Full inputs and outputs deserialized. Not persisted, cached, or logged | +| Control Plane → Client | **Yes** | TLS 1.2+ | + +The following controls are applied to every presigned URL: + +- **TTL enforcement**: each URL expires after a default of 1 hour, configurable to shorter durations. +- **Single-object scope**: each URL grants access to exactly one object. +- **Operation specificity**: each URL is locked to a single operation (GET or PUT). +- **Transport encryption**: URLs are transmitted only over TLS. +- **No URL logging**: presigned URLs are not persisted in control plane logs or databases. 
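The TTL and single-object controls above are visible directly in the URL's query string. The snippet below parses a fabricated example of an S3 presigned URL (the bucket name, object key, and signature are made up; real URLs carry the same `X-Amz-*` parameters):

```shell
# Fabricated example of an S3 presigned URL, as it might appear in the
# browser's Network tab; all names and values here are illustrative.
url='https://example-bucket.s3.us-west-2.amazonaws.com/outputs/result.pb?X-Amz-Expires=3600&X-Amz-Signature=abc123&X-Amz-SignedHeaders=host'

# TTL enforcement: X-Amz-Expires is the URL lifetime in seconds.
ttl=$(echo "$url" | grep -o 'X-Amz-Expires=[0-9]*' | cut -d= -f2)
echo "TTL: ${ttl}s"

# Direct-to-storage: the host should be the customer's bucket domain,
# never a Union.ai domain.
host=$(echo "$url" | awk -F/ '{print $3}')
echo "Host: $host"
```

Running the same two extractions against a URL captured from a live session confirms the configured TTL and that the request bypasses the control plane.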
+ +Because presigned URLs are bearer tokens (possession alone grants access), Union.ai recommends treating them with the same care as short-lived credentials and configuring the shortest practical TTL for your use case. + +## Streaming relay pattern + +For logs and observability metrics, the control plane acts as a stateless relay. It streams data from the data plane through the Cloudflare Tunnel to the client in real time. The data passes through the control plane's memory as plaintext, encrypted in transit on both network hops. It is never written to disk, cached, or stored. Once the stream completes, no trace of the data remains in the control plane. There is no content filtering or redaction in the log streaming pipeline. Any sensitive data (secrets, PII, credentials) that user code writes to stdout/stderr will flow through control plane memory unmodified. + +| Phase | Encrypted? | Details | +|-------|------------|---------| +| Data plane log source → DP operator | **Yes** | Linkerd mTLS (pod-to-pod) or cloud SDK TLS (CloudWatch / Cloud Logging / Azure Monitor) | +| Data Plane → Control Plane | **Yes** | TLS + mTLS + Cloudflare Tunnel. Wire format: protobuf streaming | +| In Control Plane | **Plaintext in memory** | Each log message deserialized for byte counting. Not persisted, cached, or logged. No content filtering | +| Control Plane → Client | **Yes** | TLS 1.2+ (streaming) | +| At rest (data plane log backend) | **Yes** | CloudWatch (AES-256/KMS), Cloud Logging (Google-managed), or Azure Monitor (Microsoft-managed) | + +## Data in the UI + +The Union.ai web console displays information from multiple sources. 
The following table shows where each UI field originates, where that data is stored, and how the browser retrieves it: + +| Field | Source | Access method | +|---|---|---| +| Task names (function/module names) | Control Plane | CP API | +| User names | IDP (cached in CP memory) | IDP / CP | +| Inputs/outputs (structured) | Data plane object store via CP proxy | Cloudflare Tunnel (transits CP memory) | +| Logs (live) | Data plane K8s | Cloudflare Tunnel | +| Logs (persisted) | Data plane log aggregator (CloudWatch / Cloud Logging / Azure Monitor) | Cloudflare Tunnel | +| K8s events | Data plane K8s | Cloudflare Tunnel | +| Reports (HTML) | Data plane object store | Signed URL, rendered in browser iframe | +| Code explorer | Data plane object store | Signed URL, JS downloads and unzips bundle | +| Timeline timestamps | Control Plane | CP API | +| Errors | Control Plane | CP API | + +Fields sourced from the control plane include orchestration metadata and task definitions, which may contain potentially sensitive fields such as environment variables and default values (see [Data classification and residency](./classification-and-residency) for the full enumeration). Structured inputs/outputs are proxied through control plane memory via the inline proxy pattern before reaching the client. Fields sourced directly from the data plane via presigned URLs (reports, code bundles) bypass the control plane entirely. Error messages served from the control plane database may contain customer data from Python tracebacks. + +For details on the underlying network architecture, see [Network architecture](../architecture/network). + +## Verification + +### Presigned URLs + +**Reviewer focus:** Confirm that presigned URLs point to the customer's object store, expire as documented, and are scoped to a single object. + +**How to verify (browser-based):** + +1. Open the Union.ai UI and navigate to a completed task's outputs. +2. 
Open browser developer tools (Network tab) and observe the request when viewing output data. +3. The presigned URL should resolve to the customer's S3/GCS/Azure Blob domain (not a Union.ai domain), contain an expiry parameter, and reference a single object key. +4. Copy the presigned URL and wait 1 hour. Paste it into the browser. It should return a 403 (TTL expired). +5. Modify the object key in the URL and retry immediately. It should return a 403 (signature invalid, confirming single-object scope). + +### Streaming relay + +**Reviewer focus:** Confirm that logs and metrics streamed through the control plane are not persisted. + +**How to verify:** + +Proving non-persistence is inherently difficult. The best available evidence: + +1. Inspect the control plane proxy pod configuration: + + ```bash + kubectl describe pod <proxy-pod> -n <namespace> + ``` + + The pod should have no persistent volumes mounted and no database connection environment variables. + +2. The SOC 2 Type II audit covers the non-persistence control for streaming relays. Request the current report from Union.ai. + +3. (Advanced) Compare log content flowing through the tunnel against control plane state before and after streaming. There should be no delta in stored data. + +### UI data sources + +**Reviewer focus:** Confirm that the UI retrieves customer data exclusively through presigned URLs or tunnel relays, not from the control plane. + +**How to verify:** + +Walk through each panel of the Union.ai UI with browser developer tools open. For each data element, the Network tab shows whether the request goes to the CP API (metadata) or to a presigned URL / tunnel relay endpoint (customer data). Every field in the table above should match its documented access method. This verification is fully self-service.
diff --git a/content/security/data-protection/encryption.md b/content/security/data-protection/encryption.md new file mode 100644 index 000000000..66ef561b6 --- /dev/null +++ b/content/security/data-protection/encryption.md @@ -0,0 +1,142 @@ +--- +title: Encryption +weight: 3 +variants: -flyte +union +--- + +# Encryption + +Union.ai encrypts all data at rest and in transit across every storage and communication path in the platform. Transit encryption uses TLS for all communication paths, with mutual TLS (mTLS) layered through the Cloudflare Tunnel for cross-plane traffic. At-rest encryption is provided by cloud provider services (S3 SSE, GCS encryption, Azure SSE) for customer-side storage, and by managed cloud database services (AES-256/KMS) for the control plane. Data that transits control plane memory (structured task I/O, secret values during creation, log streams) is encrypted on every network hop but exists as plaintext in process memory during request handling. + +## Encryption at rest + +| Storage | Standard | Key Management | +|---|---|---| +| Object Store (S3/GCS/Azure Blob) | Cloud-provider default (SSE-S3, Google-managed, Azure SSE) | Cloud provider managed; CMK supported | +| Container Registry | Cloud-provider encryption | Cloud provider managed | +| Secrets Backend (cloud) | Cloud-provider encryption | Cloud secrets manager | +| Secrets Backend (K8s) | etcd encryption | K8s cluster-level | +| Observability metrics store (per-cluster) | Encrypted persistent disk | Cloud provider managed | +| Control plane databases | Managed cloud database service | AES-256; cloud KMS managed | + +All data at rest is encrypted using cloud-provider native encryption. Each storage backend uses the default encryption mechanism provided by the underlying cloud service, ensuring that encryption is always active without requiring additional configuration. 
For object stores (S3, GCS, Azure Blob Storage), customers can bring their own customer-managed keys (CMK) to gain full control over key rotation, access policies, and revocation. Control plane databases use AES-256 encryption managed through the cloud KMS, consistent with Union.ai's SOC 2 Type II controls. + +## Encryption in transit + +All communication paths in the Union.ai platform are encrypted using TLS: + +- **Client to control plane**: all API and UI traffic uses TLS 1.2 or higher. +- **Data plane to control plane**: two outbound-only channels (Cloudflare Tunnel with mTLS, and direct gRPC over TLS 1.2+). See [Network architecture](../architecture/network). +- **Client to object store**: presigned URLs always use HTTPS, enforced by the cloud provider. +- **Internal data plane communication**: uses cloud-native TLS for inter-service traffic. + +No unencrypted communication paths exist in the platform. The combination of TLS at the edge, mutual TLS through the tunnel, TLS on the gRPC channel, and HTTPS for presigned URLs ensures end-to-end encryption for all data in transit. Data content is never logged at any log level. (Note: if debug logging is enabled in the control plane, authentication credentials, not data content, may be logged in plaintext during request header propagation.) + +For details on the cross-plane channels, see [Network architecture](../architecture/network). + +## Data protection summary + +The following table summarizes the encryption state for each data category across all phases: + +| Data category | In transit | At rest | Enters control plane memory? | Persisted in control plane? 
| +|---|---|---|---|---| +| **Files, directories, DataFrames** | HTTPS (presigned URL) | S3 SSE / GCS / Azure SSE | No | No | +| **Code bundles** | HTTPS (presigned URL) | S3 SSE / GCS / Azure SSE | No | No | +| **Container images** | HTTPS (registry pull) | ECR/GCR/ACR encryption | No | No | +| **Inter-task I/O** (in-cluster) | Cloud SDK TLS | S3 SSE / GCS / Azure SSE | No | No | +| **Structured task inputs** (run submission) | TLS + TLS/mTLS/tunnel | S3 SSE / GCS / Azure SSE | Yes (plaintext, transient) | No | +| **Structured task I/O** (retrieval) | TLS + TLS/mTLS/tunnel | S3 SSE / GCS / Azure SSE | Yes (plaintext, transient) | No | +| **Secret values** (create/update) | TLS + TLS/mTLS/tunnel | ASM/GCP SM/AKV/etcd encryption | Yes (plaintext, transient) | No | +| **Secret values** (get/list/delete) | TLS | ASM/GCP SM/AKV/etcd encryption | No (metadata only) | No | +| **Secret values** (runtime injection) | Linkerd mTLS / Kubernetes API | Secret backend encryption | No (data plane only) | No | +| **Execution logs** (streaming) | TLS + TLS/mTLS/tunnel | CloudWatch / Cloud Logging / Azure Monitor | Yes (plaintext, transient) | No | +| **Task definitions** (TaskSpec) | TLS | Control plane database (AES-256/KMS) | Yes (read from DB) | **Yes** (encrypted at rest) | +| **Run/trigger specs** | TLS | Control plane database (AES-256/KMS) | Yes (read from DB) | **Yes** (encrypted at rest) | +| **Error messages** | TLS (gRPC) | Control plane database (storage-level) | Yes (read from DB) | **Yes** | +| **Execution metadata** (phase, timestamps) | TLS (gRPC) | Control plane database (AES-256/KMS) | Yes (read from DB) | **Yes** (encrypted at rest) | + +"Transient" means the data exists in process memory only for the duration of a single request and is not written to disk, cache, or logs. For details on each data flow pattern, see [Data flow](./data-flow). 
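The at-rest column of the table can be spot-checked per object. The snippet below parses a fabricated `aws s3api head-object` response (the key ARN and field values are illustrative); a live check would run the command against a real bucket and key:

```shell
# Fabricated example of an `aws s3api head-object` response; a live check
# would run: aws s3api head-object --bucket <bucket> --key <key>
response='{"ServerSideEncryption": "aws:kms", "SSEKMSKeyId": "arn:aws:kms:us-west-2:111122223333:key/example"}'

# Encrypted objects report an SSE algorithm (AES256 or aws:kms); an object
# stored without server-side encryption would lack this field entirely.
sse=$(echo "$response" | grep -o '"ServerSideEncryption": "[^"]*"' | cut -d'"' -f4)
echo "SSE algorithm: $sse"
```

An `aws:kms` value here, combined with an `SSEKMSKeyId` pointing at a customer-managed key, ties the object directly to the key-authority check described in the verification steps.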
+ +## Verification + +### Encryption at rest + +**Reviewer focus:** Confirm that all storage backends are encrypted and verify key management configuration. + +**How to verify:** + +1. Check object store encryption: + + ```bash + aws s3api get-bucket-encryption --bucket <bucket-name> + ``` + + The output should show SSE configuration (SSE-S3, SSE-KMS, or SSE-C depending on configuration). + +2. Check Kubernetes storage classes for ClickHouse volumes: + + ```bash + kubectl get sc -o yaml + ``` + + Encrypted storage classes should be configured for ClickHouse persistent volumes. + +3. For customer-managed keys, verify key configuration: + + ```bash + aws kms describe-key --key-id <key-id> + ``` + +This verification is fully self-service. + +### Customer-managed key authority + +**Reviewer focus:** Confirm that bulk data is unreadable without the customer's encryption keys, regardless of who holds a presigned URL. + +**How to verify:** + +1. From a successful workflow run, capture a presigned URL for an output artifact (Union.ai UI, browser developer tools → Network tab, or via `uctl`). + +2. Confirm the URL fetches the artifact: + + ```bash + curl -o /tmp/output "<presigned-url>" + ``` + +3. Disable the customer-managed key that encrypts the bucket: + + ```bash + aws kms disable-key --key-id <key-id> + ``` + + On GCP, disable the CMEK key version protecting the bucket. On Azure, disable the customer-managed key in Key Vault. + +4. Re-fetch the same URL (or issue a fresh one). The request should fail with `KMS.DisabledException` (AWS) or the equivalent. Re-enable the key to restore access. + +The presigned URL itself is unchanged and still authentic. The data behind it is opaque without the customer's key -- proof that customer-controlled keys are the final gate on bulk data access, independent of who can issue URLs. + +### Encryption in transit + +**Reviewer focus:** Confirm that all communication paths use TLS and that no plaintext channels exist. + +**How to verify:** + +1. 
Verify TLS on the control plane endpoint: + + ```bash + openssl s_client -connect <control-plane-endpoint>:443 + ``` + + Confirm TLS version (1.2 or higher) and cipher suite in the output. + +2. Check the browser lock icon when accessing the Union.ai UI for certificate details. + +3. Confirm that all presigned URLs use HTTPS by inspecting any presigned URL generated by the platform. + +4. Check the Cloudflare Tunnel pod logs for TLS handshake confirmation: + + ```bash + kubectl logs <tunnel-pod> -n <namespace> + ``` + +This verification is fully self-service. diff --git a/content/security/data-protection/logging-and-audit.md b/content/security/data-protection/logging-and-audit.md new file mode 100644 index 000000000..24e17afa7 --- /dev/null +++ b/content/security/data-protection/logging-and-audit.md @@ -0,0 +1,60 @@ +--- +title: Logging and audit +weight: 7 +variants: -flyte +union +--- + +# Logging and audit + +## Task logging + +Logs are collected by Fluent Bit (deployed as a DaemonSet on the data plane) and shipped to the customer's cloud-native log service: CloudWatch Logs (AWS), Cloud Logging (GCP), or Azure Monitor (Azure). Live logs are streamed directly from the Kubernetes API while a task is running. Persisted logs are read from the cloud log aggregator after a pod terminates. + +Log data is not persisted in the control plane. It is streamed as a stateless pass-through relay, encrypted in transit on both network hops (client-to-CP and DP-to-CP), and exists as plaintext in control plane memory only during each streaming request. Persisted logs (fetched from CloudWatch, Cloud Logging, or Azure Monitor for completed executions) also transit the control plane via the same streaming proxy path. There is no content filtering or redaction at any layer of the log pipeline. Any sensitive data (secrets, PII, stack traces) that user code writes to stdout/stderr flows through control plane memory unmodified. Log lines include structured metadata: timestamp, message content, and originator classification. 
For details on how log data flows through the system, see [Data flow](./data-flow#streaming-relay-pattern). + +## Observability metrics + +A per-cluster instance (Prometheus and/or ClickHouse) stores time-series observability metrics including resource utilization and cost data. Queries are proxied through the control plane to the customer's instance. Metrics data never leaves the customer's infrastructure. In BYOC deployments, Union.ai deploys and manages the monitoring stack. + +## Audit trail + +Every API request is authenticated, with the identity context captured. Run and action lifecycle events are recorded with timestamps, phases, and responsible identities. RBAC changes and user management operations are logged. Secret creation and management operations are tracked, though values are never logged. Cluster state changes and tunnel health events are recorded. Error information is preserved per attempt, enabling forensic analysis of failures. + +## Verification + +### Task logging + +**Reviewer focus:** Confirm that task logs are stored in the customer's cloud log service and that the control plane does not persist log data. + +**How to verify:** + +1. Run a task that writes known log output. + +2. Find the log in the customer's cloud log service: + + ```bash + aws logs get-log-events --log-group-name <log-group> --log-stream-name <log-stream> + ``` + +3. Open the Union.ai UI task logs panel and use browser developer tools (Network tab) to verify that log data comes through the tunnel. + +4. Confirm Fluent Bit is running on every node: + + ```bash + kubectl get ds -n union | grep fluent + ``` + +This verification is fully self-service. + +### Audit trail + +**Reviewer focus:** Confirm that all operations are logged with identity, operation, and timestamp, and that the audit trail is complete and queryable. + +**How to verify:** + +Audit data is available from the following sources: + +- Control plane execution and lifecycle events: `uctl get execution --all -o json`, filtered by time window. 
+- Authentication events: the configured identity provider's audit log (e.g., Okta). +- Cluster operations: the Kubernetes audit log on the data plane. +- Cloud IAM activity: CloudTrail (AWS), Cloud Audit Logs (GCP), or Azure Monitor activity logs. diff --git a/content/security/data-protection/multi-cloud.md b/content/security/data-protection/multi-cloud.md new file mode 100644 index 000000000..65dae0e7d --- /dev/null +++ b/content/security/data-protection/multi-cloud.md @@ -0,0 +1,33 @@ +--- +title: Multi-cloud support +weight: 6 +variants: -flyte +union +--- + +# Multi-cloud support + +Union.ai supports data plane deployments on AWS, GCP, and Azure. Each cloud provider uses its native services for storage, secrets, logging, and container registry, while the platform enforces consistent security guarantees across all three. + +## Supported services + +| Cloud | Object Store | Secrets Backend | Log Aggregator | Container Registry | +|---|---|---|---|---| +| AWS | S3 | K8s Secrets / AWS Secrets Manager | CloudWatch Logs | ECR | +| GCP | GCS | K8s Secrets / GCP Secret Manager | Cloud Logging | GCR / Artifact Registry | +| Azure | Azure Blob Storage | K8s Secrets / Azure Key Vault | Azure Monitor | ACR | + +Union Implementation Services supports additional cloud providers and on-premises deployments through case-by-case engagement. + +## Consistent security guarantees + +Regardless of cloud provider, Union.ai enforces the same security model: control plane / data plane separation, presigned URLs for bulk data access (bypassing the control plane), inline proxying for structured task I/O (transiting control plane memory, not persisted), outbound-only [cross-plane connectivity](../architecture/network) (Cloudflare Tunnel and direct gRPC), RBAC-based access control, and encryption at rest and in transit. 
Cloud-specific implementations (IAM roles, encryption services, log aggregators, and secrets managers) are abstracted by the platform while maintaining native integration with each provider's security services. A workflow running on AWS receives the same separation guarantees as one running on GCP or Azure; only the underlying cloud primitives differ. Data residency at rest is maintained within the customer's chosen cloud and region. + +For details on the data flow patterns that apply across all clouds, see [Data flow](./data-flow). For encryption specifics by storage type, see [Encryption](./encryption). For secrets backend options, see [Secrets management](./secrets). + +## Verification + +### Multi-cloud support + +**Reviewer focus:** Confirm that the services listed for each cloud provider are accurate. + +**How to verify:** This is a factual reference table, not a claim requiring active demonstration. Verify against the Union.ai deployment documentation and cloud provider configurations for each supported deployment. diff --git a/content/security/data-protection/secrets.md b/content/security/data-protection/secrets.md new file mode 100644 index 000000000..b9764845b --- /dev/null +++ b/content/security/data-protection/secrets.md @@ -0,0 +1,96 @@ +--- +title: Secrets management +weight: 4 +variants: -flyte +union +--- + +# Secrets management + +Union.ai's secrets management system stores secret values at rest exclusively within the customer's infrastructure, with a write-only API design that eliminates an entire class of exfiltration attacks. During secret creation and update, the value transits control plane memory (encrypted in transit, plaintext in memory, not persisted or logged) on its way to the data plane's secrets backend. Get, List, and Delete operations never expose secret values. + +## Core design + +Secret values are stored exclusively within the customer's infrastructure. 
The secrets API is write-only by design: there is no API to read back secret values. The `GetSecret` RPC returns only the secret's metadata (name, scope, creation time, cluster presence status), never the value itself. This means that even if an attacker compromises a user account or the control plane API, they cannot retrieve secret values through the API. The value simply is not available through any API endpoint. + +## Backends + +| Backend | Storage Location | Default | +|---|---|---| +| Kubernetes Secrets | K8s etcd on customer cluster | Self-managed default | +| AWS Secrets Manager | AWS-managed service | BYOC default (AWS) | +| GCP Secret Manager | GCP-managed service | BYOC default (GCP) | +| Azure Key Vault | Azure-managed service | BYOC default (Azure) | + +All four backends are available regardless of deployment model. The choice of backend is a deployment configuration on the data plane operator. Each backend integrates with its cloud provider's native encryption and access control mechanisms. + +## Secret lifecycle + +**Creation:** When a user creates a secret via the UI or CLI, the value is sent to the control plane over TLS, relayed through the Cloudflare Tunnel (encrypted) to the data plane's secrets backend, and stored encrypted at rest in the customer's secret manager (AWS Secrets Manager, GCP Secret Manager, Azure Key Vault, or K8s Secrets). The value exists as plaintext in control plane memory only during this relay and is never written to disk, database, cache, or logs on the control plane. Only the secret identifier is logged. Once the relay completes, no trace of the value remains in the control plane (though Go's garbage collector does not zero deallocated memory, so the plaintext may persist in heap until reused). + +| Phase | Encrypted? | Details | +|-------|------------|---------| +| Client → Control Plane | **Yes** | TLS 1.2+. Wire format: protobuf binary | +| In Control Plane | **Plaintext in memory** | Deserialized Go struct. 
Not persisted, cached, or logged | +| Control Plane → Data Plane | **Yes** | TLS + mTLS + Cloudflare Tunnel. Wire format: protobuf JSON | +| In Data Plane (operator) | **Plaintext in memory** | Briefly held before writing to secret backend | +| At rest (secret backend) | **Yes** | AWS Secrets Manager (AES-256/KMS), GCP Secret Manager (Google-managed or CMEK), Azure Key Vault (HSM-backed), or K8s etcd encryption | + +**Consumption:** When a task pod is created, the Executor configures it to mount the requested secrets from the backend as environment variables or files. The value is read by the data plane's secrets backend and injected into the pod. It never leaves the customer's infrastructure during this process. The control plane is not involved in secret consumption at runtime. + +**Scoping:** Secrets can be scoped at organization, project, or domain level. Only task pods running within the appropriate scope can access the corresponding secrets. This ensures that teams working in different projects cannot access each other's secrets, even within the same data plane cluster. + +For details on how secrets flow during workflow execution, see [Workflow data flow](./workflow-data-flow). + +## Verification + +### Write-only API + +**Reviewer focus:** Confirm that no API endpoint returns secret values and that the write-only design holds. + +**How to verify:** + +1. Create a test secret: + + ```bash + uctl create secret --name test-secret --value "s3cr3t-value" --project myproject + ``` + +2. Attempt to read it back: + + ```bash + uctl get secret --name test-secret --project myproject + ``` + + The output should show name, scope, creation time, and cluster status. There should be **no value field** in the response. + +3. Try every API endpoint that touches secrets. None should return the value. + +4. Check the protobuf definition in the open-source Flyte repository. `GetSecretResponse` has no value field. The write-only design is enforced at the protocol level. + +5. 
Verify the secret exists in the customer's secrets backend by checking the cloud secrets manager console directly. The value should be present there.
+
+This verification is fully self-service and works immediately. Note that the write-only design is enforced at the protocol level: Union.ai's API structurally cannot return secret values, regardless of the caller's privilege level.
+
+### Secret lifecycle
+
+**Reviewer focus:** Confirm that secret values transit the control plane only in-memory during creation and are consumed entirely within the data plane at runtime.
+
+**How to verify:**
+
+- **Creation path:** Inspect the control plane proxy pod logs during secret creation:
+
+  ```bash
+  kubectl logs <proxy-pod> -n <namespace>
+  ```
+
+  The logs should show the relay operation but not the secret value.
+
+- **Consumption path:** Inspect the task pod configuration:
+
+  ```bash
+  kubectl describe pod <task-pod> -n <namespace>
+  ```
+
+  The pod should show secrets mounted from the customer's backend (e.g., AWS Secrets Manager volume mounts or environment variable references).
+
+- **Scoping:** Create a secret in project A, then run a task in project B that attempts to access it. The task should fail. Run the same task in project A, and it should succeed.
diff --git a/content/security/data-protection/workflow-data-flow.md b/content/security/data-protection/workflow-data-flow.md
new file mode 100644
index 000000000..3d826296d
--- /dev/null
+++ b/content/security/data-protection/workflow-data-flow.md
@@ -0,0 +1,77 @@
+---
+title: Workflow data flow
+weight: 5
+variants: -flyte +union
+---
+
+# Workflow data flow
+
+This page traces the security-relevant data movements at each stage of the workflow lifecycle: registration, execution, and result retrieval. For the underlying classification of bulk vs. inline data and what each pathway carries, see [Data classification and residency](./classification-and-residency) and [Data flow](./data-flow).
+ +## Task deployment and run creation + +When a run is created, the SDK serializes the full task specification (container image, resource requirements, typed interface, and all configuration) and sends it inline to the control plane along with the structured task inputs (up to 10 MiB). The code bundle is uploaded directly to the customer's object store via a presigned PUT URL. Code never touches the control plane. The task specification is stored in the control plane databases. Binary input artifacts (files, directories, DataFrames) are uploaded directly to the customer's object store via presigned URLs. The control plane stores only the input URI, then enqueues the action to the data plane. + +The Executor on the data plane creates a pod that reads inputs from and writes outputs back to the customer's object store. During workflow execution, inter-task I/O flows directly between task pods and the object store via IAM, with no control plane involvement. Secrets required by the task are injected into pods from the customer's secrets backend at runtime; secret values do not traverse the control plane during execution. (They do traverse the control plane during initial secret creation and updates -- see [Secrets management](./secrets).) The control plane receives phase transitions, status updates, and error messages from the Executor. Error messages may contain customer data from Python tracebacks. + +## Result retrieval + +Once a run completes, its results are accessible through three channels, each with different data flow characteristics: + +**Binary outputs, reports, and code bundles** are accessed via presigned URLs. The data flows directly from the customer's object store to the client and does not pass through the control plane. 
+
+**Structured outputs** (protobuf literals) are retrieved through the control plane, which fetches the full output payload from the data plane object store (encrypted in transit, plaintext in control plane memory during the request, not persisted). Both structured inputs and outputs are returned together, up to 20 MiB total.
+
+**Logs** are streamed from the data plane through the Cloudflare Tunnel as a stateless relay. The control plane forwards the stream as plaintext in memory (encrypted in transit) without persisting or filtering the content.
+
+**Metadata** (run status, phase transitions, timestamps, and error messages) is served directly from the control plane database.
+
+For details on the data flow patterns, see [Data flow](./data-flow).
+
+## Verification
+
+### End-to-end data flow
+
+**Reviewer focus:** Confirm the data separation model at every stage: bulk data stays in the customer's infrastructure (presigned URLs), inline data transits the control plane transiently (not persisted), and task definitions stored in the control plane contain only expected fields.
+
+**How to verify:**
+
+**Step 1: Deployment and code bundle**
+
+Run a workflow and observe where the code bundle is stored:
+
+```bash
+union run --remote my_task.py main
+```
+
+Verify that the code bundle is in the customer's bucket:
+
+```bash
+aws s3 ls s3://<bucket>/org/project/domain/code-bundles/
+```
+
+The code bundle `.tgz` file should appear in the customer's own object store.
+
+**Step 2: Execution**
+
+Inspect the task pod to confirm it reads from customer infrastructure:
+
+```bash
+kubectl describe pod <task-pod> -n <namespace>
+```
+
+The pod description should show volumes mounted from customer S3, secrets from the customer's backend, and a non-root security context.
+
+**Step 3: Retrieval**
+
+Open browser developer tools (Network tab) and view the task's outputs in the Union.ai UI.
Binary output artifacts (files, DataFrames) should be fetched via presigned URLs pointing to the customer's S3/GCS/Azure Blob endpoint. Structured outputs (protobuf literals) are fetched via the inline proxy through the control plane. Separately, confirm that the control plane API returns metadata and URI references:
+
+```bash
+uctl get execution <execution-id> -o json
+```
+
+The response should contain phase, timestamps, URIs, and task definition fields. Bulk data content should not appear inline.
+
+**Step 4: Negative proof**
+
+Search control plane audit logs for a recognizable data string used in the workflow (for example, a unique canary value passed as a task input). It should not appear. If VPC Flow Logs are enabled, bulk data transfers should flow directly between task pods and the customer's object store. Structured task I/O and log streams will transit the Cloudflare Tunnel as documented in [Data flow](./data-flow).
diff --git a/content/security/data-residency-summary.md b/content/security/data-residency-summary.md
deleted file mode 100644
index 17290fef4..000000000
--- a/content/security/data-residency-summary.md
+++ /dev/null
@@ -1,22 +0,0 @@
----
-title: Data residency summary
-weight: 14
-variants: -flyte +union
----
-
-# Data residency summary
-
-| Data | Stored In | Accessed Via | Transits Control Plane?
| -| --- | --- | --- | --- | -| Task definitions (spec metadata) | Control plane DB | ConnectRPC | Yes — metadata only | -| Run metadata (phase, timestamps) | Control plane DB | ConnectRPC | Yes | -| Action metadata (phase, attempts) | Control plane DB | ConnectRPC | Yes | -| Task inputs/outputs | Customer object store | Presigned URL | No — direct client ↔ object store | -| Code bundles | Customer object store | Presigned URL | No — direct client ↔ object store | -| Reports (HTML) | Customer object store | Presigned URL | No — direct client ↔ object store | -| Container images | Customer container registry | Pulled by K8s | No — stays in customer infra | -| Task logs | Customer log aggregator | Streamed via tunnel | Relayed in-memory (not stored) | -| Secrets | Customer secrets backend | Injected at runtime | Relayed during create (not stored) | -| Observability metrics | Customer ClickHouse | Proxied via DataProxy | Relayed in-memory (not stored) | -| User identity / RBAC | Control plane DB | ConnectRPC | Yes | -| Cluster state | Control plane DB | Internal | Yes | diff --git a/content/security/identity-and-access-management.md b/content/security/identity-and-access-management.md deleted file mode 100644 index 9a624490e..000000000 --- a/content/security/identity-and-access-management.md +++ /dev/null @@ -1,88 +0,0 @@ ---- -title: Identity and access management -weight: 3 -variants: -flyte +union ---- - -# Identity and access management - -## Authentication - -Union.ai supports three authentication methods to accommodate different usage patterns: - -| Method | Identity Type | Credentials | Use Case | -| --- | --- | --- | --- | -| OIDC (Okta) | Human user | Browser SSO | UI access, initial CLI login | -| API Keys | Human user (delegated) | Static bearer token | CI/CD scripts, simple automation | -| Service Accounts | Application identity | OAuth2 client_id + client_secret → short-lived token | Production pipelines, multi-service systems | - -API keys are issued per 
user and inherit the user’s RBAC permissions. -They can be created and revoked via the UI or CLI. -Service accounts are provisioned through the Identity Service, creating OAuth2 applications with distinct, auditable identities independent of any human user. - -## Authorization (RBAC) - -Union.ai implements a policy-based Role-Based Access Control (RBAC) system with three built-in role types. - -| Role | Capabilities | Typical Assignment | -| --- | --- | --- | -| Admin | Full access: manage users, clusters, secrets, projects, and all runs | Platform administrators, security team leads | -| Contributor | Create/abort runs, register tasks, manage secrets within assigned projects | ML engineers, data scientists, DevOps | -| Viewer | Read-only access to runs, actions, logs, reports | Stakeholders, auditors, read-only consumers | -| Custom Policies | Custom policies bind roles (built-in or custom) to resources scoped at org-wide, domain, or project+domain level using composable YAML bindings via `uctl` | Giving contributor access to a specific project's development and staging domains, but only viewer access in production | - -RBAC policies are enforced at the service layer. -Every API request is authenticated and authorized against the user’s role assignments before any data access occurs. -Users have the ability to create custom policies to further refine access control. - -## Organization isolation - -Union.ai enforces tenant isolation at multiple architectural layers to ensure that no customer can access another customer's data or metadata, even within the shared control plane. - -### Database-layer isolation - -Every record in the control plane PostgreSQL database is scoped by organization (org). The org identifier is part of the primary key or unique index on all tenant-scoped tables, including actions, tasks, runs, executions, and RBAC bindings. 
All database queries are gated by the org context extracted from the caller's authenticated token at the service layer, before any SQL is executed. This ensures that a query can only return records belonging to the caller's organization. Cross-organization access is explicitly denied: there is no API or internal path that permits querying across org boundaries. While Union.ai does not currently use PostgreSQL row-level security (RLS) policies, the application-layer enforcement is uniform and independently verifiable through the SOC 2 Type II audit. - -### Data plane isolation - -Each customer's data plane runs in a dedicated Kubernetes cluster within the customer's own cloud account. There is no shared compute infrastructure between customers. Customer workloads, data, secrets, container images, and logs are physically isolated in separate cloud accounts with separate IAM boundaries. No other customer's workloads can execute on or access another customer's cluster. - -### Control plane service isolation - -Within the control plane, all service-to-service calls carry the authenticated org context. The identity service extracts org membership from the OIDC token, and this context is propagated through every downstream service call via request headers. Kubernetes namespaces on the data plane are provisioned per-project within each org, providing namespace-level resource isolation (resource quotas, RBAC bindings, network policies) even within a single customer's cluster. - -### Isolation verification - -Tenant isolation controls are covered by Union.ai's SOC 2 Type II audit scope. The combination of org-scoped primary keys, service-layer query gating, and physically separate data planes provides defense-in-depth against cross-tenant data access. - -## Human access to customer environments - -Union.ai maintains controls governing how its personnel interact with customer environments. 
- -### Current access model - -Union.ai support and engineering personnel may access a customer's Union.ai tenant (the control plane UI and API for that organization) for the purposes of onboarding, troubleshooting, and operational support. This access is authenticated through the same OIDC/SSO mechanisms as customer users and is subject to RBAC policies. Personnel access the customer's tenant, not the customer's data plane infrastructure directly. Union.ai personnel do not have IAM credentials for the customer's cloud account and cannot directly access the customer's object stores, secrets backends, or container registries. - -> [!NOTE] -> In BYOC deployments, Union.ai personnel additionally have K8s cluster management access. See [BYOC deployment differences: Human access](./byoc-differences#human-access-to-customer-environments) for details. - -### Access scope and limitations - -When Union.ai personnel access a customer's tenant, they can view orchestration metadata (workflow definitions, run status, scheduling configuration), view logs relayed through the tunnel (but cannot access the customer's log aggregator directly), and perform administrative operations (cluster configuration, namespace provisioning) as authorized by the customer's RBAC policy. Personnel cannot read secret values (the API is write-only for values), cannot access raw data in the customer's object stores (presigned URLs are generated per-request and are not retained), and cannot access the customer's cloud account or IAM roles. In BYOC deployments, administrative operations [extend to direct K8s cluster management](./byoc-differences#human-access-to-customer-environments). - -### Audit trail - -All access by Union.ai personnel to customer tenants is authenticated and logged. API requests include the identity of the caller, the operation performed, and a timestamp. - -> [!NOTE] -> In BYOC deployments, Union.ai personnel have additional K8s cluster access for operational management. 
See [BYOC deployment differences: Human access](./byoc-differences#human-access-to-customer-environments) for full details. - -## Least privilege principle - -Union.ai enforces the principle of least privilege across all system components: - -* IAM roles on the data plane are scoped to minimum required permissions -* Two IAM roles per data plane: admin role (for platform services) and user role (for task pods) -* IAM roles are bound to Kubernetes service accounts via cloud-native workload identity federation -* Presigned URLs grant single-object, operation-specific, time-limited access -* Service accounts receive only the permissions needed for their specific function diff --git a/content/security/identity-and-access/_index.md b/content/security/identity-and-access/_index.md new file mode 100644 index 000000000..3f45470f0 --- /dev/null +++ b/content/security/identity-and-access/_index.md @@ -0,0 +1,17 @@ +--- +title: Identity and access +weight: 3 +variants: -flyte +union +sidebar_expanded: true +--- + +# Identity and access + +Union.ai provides a layered identity and access management system that controls how users and applications authenticate, what resources they can access, and how tenant isolation is enforced. Access control spans two distinct domains: in-product authentication and authorization (RBAC, SSO, API keys) and infrastructure-level access to the customer's cloud environment. + +This section covers: + +* [Authentication](./authentication): OIDC, API keys, service accounts, and SSO configuration. +* [Role-based access control](./rbac): Built-in roles, custom policies, enforcement, and the least-privilege principle. +* [Tenant isolation](./tenant-isolation): Database-layer, data plane, and service-level isolation between customers. +* [Human access controls](./human-access): How Union.ai personnel access customer environments in self-managed and BYOC deployments. 
diff --git a/content/security/identity-and-access/authentication.md b/content/security/identity-and-access/authentication.md new file mode 100644 index 000000000..1fed7abec --- /dev/null +++ b/content/security/identity-and-access/authentication.md @@ -0,0 +1,54 @@ +--- +title: Authentication +weight: 1 +variants: -flyte +union +--- + +# Authentication + +## Authentication methods + +Union.ai supports three authentication methods, each designed for a different use case. + +| Method | Identity Type | Credentials | Use Case | +|---|---|---|---| +| OIDC | Human user | Browser SSO | UI access, initial CLI login | +| API Keys | Human user (delegated) | Static bearer token | CI/CD scripts, simple automation | +| Service Accounts | Application identity | OAuth2 client_id + client_secret -> short-lived token | Production pipelines, multi-service systems | + +API keys are issued per user and inherit the user's RBAC permissions. They can be created and revoked via the UI or CLI. + +Service accounts are provisioned by the platform, creating OAuth2 applications with distinct, auditable identities independent of any human user. + +## Single sign-on + +Union.ai uses OAuth2 / OIDC for SSO. Customers can configure any OIDC or SAML 2.0 compliant identity provider (Google Workspace, Microsoft Entra ID, Okta, etc.). SSO provides centralized identity management where the user lifecycle is managed in the customer's IdP. MFA enforcement is delegated to the customer's IdP, so the customer's existing MFA policies apply without additional configuration. Session management is inherited from the IdP configuration, and all authentication events are logged with caller identity. + +## Verification + +### SSO and credential lifecycle + +**Reviewer focus:** Confirm that SSO redirects to the customer's IdP, that MFA is enforced when configured, and that API keys and service accounts can be created, used, and revoked. + +**How to verify:** + +1. SSO: Log in. 
The browser redirects to the customer's IdP, and an MFA prompt appears if configured.
+
+2. API key: Create a key, use it in a script, then revoke it:
+
+   ```bash
+   uctl create api-key <key-name>
+   # Use the key in a script to authenticate
+   uctl delete api-key <key-name>
+   # Confirm the revoked key is rejected
+   ```
+
+3. Service account: Create a service account and confirm it has a distinct identity:
+
+   ```bash
+   uctl create service-account <account-name>
+   ```
+
+   Show the OAuth2 token exchange and confirm the service account appears as a distinct identity in the audit log.
+
+This verification is fully self-service.
diff --git a/content/security/identity-and-access/human-access.md b/content/security/identity-and-access/human-access.md
new file mode 100644
index 000000000..e3c980764
--- /dev/null
+++ b/content/security/identity-and-access/human-access.md
@@ -0,0 +1,73 @@
+---
+title: Human access controls
+weight: 4
+variants: -flyte +union
+---
+
+# Human access controls
+
+## Self-managed
+
+In self-managed deployments, Union.ai personnel access only the Union.ai-hosted control plane infrastructure. They have zero access to the customer's data plane. This access uses standard OIDC/SSO and RBAC.
+
+## BYOC
+
+In BYOC deployments, Union.ai personnel additionally have authenticated Kubernetes cluster access for operational purposes: upgrades, node pool provisioning, Helm chart updates, health monitoring, and troubleshooting. This access uses cloud-native private connectivity (PrivateLink/PSC) and is scoped to Kubernetes cluster management. All cluster management actions are logged.
+
+## Customer-side support access (optional)
+
+Separately from BYOC Kubernetes cluster management, Union.ai offers an optional support service where customers can grant Union.ai staff access to the customer's tenant for troubleshooting. This is available for both self-managed and BYOC deployments.
+ +When requested, Union.ai support personnel are granted access through the same RBAC framework used by the customer's own users. The customer creates a role binding for Union.ai staff, scoped to the specific projects, domains, and permission level appropriate for the troubleshooting engagement. This access can be time-limited so that it expires automatically after the support engagement concludes. + +This service is entirely optional. Customers must explicitly request it and configure the RBAC grants themselves. Union.ai staff cannot self-provision this access. The access is subject to the same authentication (OIDC/SSO), authorization (RBAC policies), and audit logging as any other user in the customer's organization. + +This is distinct from BYOC Kubernetes cluster management access (described above), which is infrastructure-level access for platform operations. Customer-side support access operates at the application level: viewing runs, inspecting logs, diagnosing task failures, and reviewing configuration. It does not grant Kubernetes cluster access, IAM role access, or direct access to the customer's cloud account. + +## Access scope + +When Union.ai personnel are granted access to a customer's tenant (in BYOC, or via the optional support service in self-managed), they CAN: view orchestration metadata, view logs relayed through the tunnel, perform administrative operations as authorized by the customer's RBAC policy, and (in BYOC) manage the Kubernetes cluster. + +Personnel CANNOT: read secret values (the API is write-only), access bulk data in customer object stores (presigned URLs are per-request and not retained), or access the customer's cloud account, IAM roles, object stores, secrets backends, container registries, or log aggregators. 
Personnel with control plane infrastructure access could in principle observe inline data transiting control plane memory during request processing (structured task I/O, log streams, secret values during create/update), but this data is transient (not persisted, not logged, not cached) and is inherent to any pass-through proxy architecture.
+
+All access by Union.ai personnel is authenticated and logged with caller identity, operation performed, and timestamp.
+
+## Verification
+
+### Human access controls
+
+**Reviewer focus:** Confirm that Union.ai personnel access is appropriately scoped for each deployment model and that no path exists to access customer data or secrets.
+
+**How to verify:**
+
+Self-managed: Union.ai has no IAM roles, no VPN, no SSH keys, and no kubectl access to the customer's cluster. Both outbound channels (Cloudflare Tunnel and direct gRPC) are initiated FROM the customer's data plane. Union.ai cannot initiate connections TO the customer's infrastructure.
+
+BYOC:
+
+1. Inspect the operator service account permissions:
+
+   ```bash
+   kubectl auth can-i --list --as=system:serviceaccount:<namespace>:<operator-service-account>
+   ```
+
+   This shows exactly what Union.ai can do on the cluster.
+
+2. Review the Kubernetes audit log and CloudTrail for Union.ai personnel access history.
+
+3. Write-only secrets: even when logged into the customer's tenant, personnel cannot read secret values.
+
+4. Presigned URLs are per-request and ephemeral. The underlying data is fetched from the customer's S3/GCS/Azure Blob, not from any Union.ai storage.
+
+Customer-side support access:
+
+1. Confirm that no Union.ai support user exists in the customer's tenant unless explicitly provisioned by the customer.
+
+2. If support access has been granted, verify the RBAC binding:
+
+   ```bash
+   uctl get policy
+   ```
+
+   The Union.ai support user should appear with the scoped role and time limit configured by the customer.
+
+3. After the time limit expires, repeat the query. The binding should no longer be active.
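+
+A time-limited support binding of this kind might be expressed as a policy document along roughly these lines. This is a hypothetical sketch: the field names and overall schema are illustrative only, not the actual `uctl` policy format, so consult the RBAC policy reference for the real syntax.
+
+```yaml
+# Hypothetical policy binding (illustrative field names, not the real schema):
+# grants a Union.ai support identity read-only access to a single
+# project/domain and expires automatically.
+policy:
+  name: union-support-viewer
+  bindings:
+    - role: viewer
+      principal: support@union.ai   # identity provisioned by the customer
+      scope:
+        project: myproject
+        domain: production
+      expires: "2026-01-01T00:00:00Z"   # binding becomes inactive after this time
+```
+
+Because the expiry is part of the binding itself, no manual cleanup is needed after the support engagement ends, which matches the automatic-expiry behavior verified in step 3 above.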
diff --git a/content/security/identity-and-access/rbac.md b/content/security/identity-and-access/rbac.md new file mode 100644 index 000000000..c220969c1 --- /dev/null +++ b/content/security/identity-and-access/rbac.md @@ -0,0 +1,68 @@ +--- +title: Role-based access control +weight: 2 +variants: -flyte +union +--- + +# Role-based access control + +## Built-in roles + +Union.ai provides three user-assignable roles with progressively broader permissions. + +| Role | Capabilities | +|---|---| +| Admin | Full access: manage users, clusters, secrets, projects, all runs | +| Contributor | Create/abort runs, deploy tasks, manage secrets within assigned projects | +| Viewer | Read-only: runs, actions, logs, reports | + +Additional internal system roles exist for platform operations but are not user-visible or user-assignable. + +## Custom policies + +Custom policies bind roles (built-in or custom) to resources scoped at org-wide, domain, or project+domain level using composable YAML bindings via `uctl`. This allows organizations to define fine-grained access policies that match their team structure and security requirements. + +## Enforcement + +Every API request is authenticated and authorized against the user's role assignments before any data access occurs. Enforcement happens at the service layer. Internal-only services (data plane object store proxy, data plane logs proxy) rely on network-level isolation rather than per-request authorization checks, on the basis that they are reachable only from within the service mesh. + +## Least privilege + +Union.ai enforces least privilege across all components. IAM roles on the data plane are scoped to minimum required permissions. Each data plane has two IAM roles: an admin role for platform services and a user role for task pods. IAM roles are bound via cloud-native workload identity federation, eliminating static credentials entirely. Presigned URLs grant single-object, operation-specific, time-limited access. 
Service accounts receive only the permissions needed for their specific function. + +## Verification + +### RBAC enforcement + +**Reviewer focus:** Confirm that each role enforces the expected permissions and that custom policies correctly scope access to specific projects and domains. + +**How to verify:** + +1. Create three test users with Admin, Contributor, and Viewer roles. + +2. Log in as Viewer and confirm restricted operations are denied: + + ```bash + uctl create run ... # Expect denied + uctl create secret ... # Expect denied + uctl get executions # Expect success + ``` + +3. Log in as Contributor scoped to project A: + + ```bash + uctl create run --project B ... # Expect denied + uctl create run --project A ... # Expect success + ``` + +4. Create a custom policy scoping a user to project X, development domain only. Attempt to access the production domain. Expect denied. + +5. Display all active policy bindings: + + ```bash + uctl get policy + ``` + +6. For Union.ai employee access: the customer creates an RBAC policy for Union.ai support, scoped to viewer only and time-limited. + +This verification is fully self-service. diff --git a/content/security/identity-and-access/tenant-isolation.md b/content/security/identity-and-access/tenant-isolation.md new file mode 100644 index 000000000..2fd7c114c --- /dev/null +++ b/content/security/identity-and-access/tenant-isolation.md @@ -0,0 +1,45 @@ +--- +title: Tenant isolation +weight: 3 +variants: -flyte +union +--- + +# Tenant isolation + +## Database-layer isolation + +Every record in the control plane databases is scoped by organization. The org identifier is part of the primary key on all tenant-scoped tables. The service layer gates every query by org context, derived from the caller's authenticated token, before any data access occurs. The primary org identity comes from the request hostname subdomain (not user-supplied input). 
The standard cross-org authorization check blocks cross-org calls by default, and unrecognized services receive a default-deny response. + +## Data plane isolation + +Each customer's data plane runs in a dedicated Kubernetes cluster within the customer's own cloud account. There is no shared compute infrastructure between customers. Customer workloads, data, secrets, container images, and logs are physically isolated in separate cloud accounts with separate IAM boundaries. + +## Control plane service isolation + +All service-to-service calls within the control plane carry the authenticated org context. The identity service extracts org membership from the OIDC token, and this context is propagated through every downstream service call via request headers. Kubernetes namespaces on the data plane are provisioned per-project within each org, providing namespace-level resource isolation including resource quotas, RBAC bindings, and network policies. + +## Defense in depth + +Tenant isolation controls are covered by the SOC 2 Type II audit scope. The combination of org-scoped primary keys, service-layer query gating, and physically separate data planes provides defense-in-depth against cross-tenant data access. + +## Verification + +### Tenant isolation + +**Reviewer focus:** Confirm that customers cannot access other tenants' data through any path, and that the isolation model is architecturally enforced rather than relying solely on application logic. + +**How to verify:** + +1. Data plane isolation is architectural: each customer has their own cluster in their own cloud account. This is verifiable by inspecting the infrastructure directly. + +2. Database isolation: all API responses include org context. Confirm that only the customer's org is returned: + + ```bash + uctl get executions -o json | jq '.org' + ``` + + This should always return only the customer's org identifier. + +3. The SOC 2 Type II audit specifically covers tenant isolation controls. + +4. 
The protobuf definitions and SDK code are open source, so the org context enforcement path can be traced through the codebase. diff --git a/content/security/infrastructure-security.md b/content/security/infrastructure-security.md deleted file mode 100644 index 8f8e5ce21..000000000 --- a/content/security/infrastructure-security.md +++ /dev/null @@ -1,71 +0,0 @@ ---- -title: Infrastructure security -weight: 5 -variants: -flyte +union ---- - -# Infrastructure security - -## Kubernetes security - -The data plane runs on customer-managed Kubernetes clusters. Union supports the following security measures: - -> [!NOTE] -> In BYOC deployments, Union.ai manages the K8s cluster. See [BYOC deployment differences: Infrastructure management](./byoc-differences#infrastructure-management). - -* Workload identity federation for pod-level IAM role binding (no static credentials) -* Kubernetes RBAC for service account permissions within the cluster -* Network policies for pod-to-pod communication isolation -* Resource quotas and limit ranges to prevent resource abuse -* Pod security contexts enforcing non-root execution where applicable - -A complete list of data plane permissions appears in **[Kubernetes RBAC: data plane](./kubernetes-rbac-data-plane)** - -## Container security - -Union.ai’s container security model ensures that code execution is isolated and controlled: - -* Image Builder runs on the customer’s cluster using Buildkit, ensuring source code and built images never leave customer infrastructure -* Base images are pulled from customer-approved registries (public or private) -* Built images are pushed to the customer’s container registry (ECR/GCR/ACR) -* Task pods mount code bundles via presigned URLs with limited TTL -* Container images follow customer-defined tagging and scanning policies - -## IAM and workload identity - -Two IAM roles are provisioned per data plane, each with narrowly scoped permissions. 
In BYOC deployments, [Union.ai provisions these roles](./byoc-differences#iam-role-provisioning); in self-managed, the customer provisions them. - -| Role | Permissions | Assumed By | Mechanism | -| --- | --- | --- | --- | -| Admin Role (`adminflyterole`) | R/W to object store buckets, secrets manager access, persisted logs read | Platform services: Executor, Object Store Service, DataProxy | Workload identity federation | -| User Role (`userflyterole`) | R/W to object store buckets | Task pods (user workloads) | Workload identity via K8s service account annotation | - -These roles use cloud-native workload identity federation (IAM Roles for Service Accounts on AWS, Workload Identity on GCP, Azure Workload Identity on Azure), eliminating the need for static credential storage. - -## Control plane infrastructure - -The Union.ai control plane is hosted on AWS with enterprise-grade infrastructure security: - -* Managed PostgreSQL (AWS RDS) with AES-256 encryption at rest -* Network isolation via VPC with restricted security groups -* TLS termination at the edge for all incoming connections -* Automated backups and disaster recovery procedures -* Infrastructure-as-code deployment with version-controlled configurations -* Automated patch management and security updates - -## Availability, response time, and resilience - -Union.ai's architecture separates the availability characteristics of the control plane and data plane, providing resilience even during partial outages. - -### Control plane availability - -The Union.ai control plane runs on AWS with multi-AZ redundancy, managed PostgreSQL (RDS) with automated failover, and continuous monitoring. Union.ai's SOC 2 Type II audit covers availability as a trust service criterion. The control plane is designed for high availability, with automated recovery and health monitoring. Specific SLA targets are defined in customer contracts and are available upon request. 
- -### Data plane resilience during control plane outages - -Because the data plane runs entirely within the customer's Kubernetes cluster, in-flight workflows continue executing even if the control plane becomes temporarily unavailable. The Executor, which manages pod lifecycle, operates as a Kubernetes controller on the customer's cluster and does not require real-time connectivity to the control plane to continue running pods that have already been scheduled. State transitions will be reconciled when connectivity is restored. However, new workflow submissions and scheduling operations require control plane availability. - -The customer is solely responsible for data plane availability, including Kubernetes cluster operations, node pool management, upgrades, and monitoring. Union.ai's availability commitment covers only the control plane. In-flight workflows continue executing independently during control plane outages. - -> [!NOTE] -> In BYOC deployments, availability responsibilities shift — Union.ai manages data plane cluster availability. See [BYOC deployment differences: Availability and resilience](./byoc-differences#availability-and-resilience). 
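A minimal model of this reconcile-on-reconnect behavior, with all names hypothetical and no claim to match the Executor's actual implementation:

```python
from collections import deque

class TransitionReporter:
    """Queues workflow state transitions while the control plane is
    unreachable and reconciles (flushes) them once connectivity
    returns. Pod execution itself never depends on this reporter."""

    def __init__(self):
        self.pending = deque()
        self.control_plane_up = False

    def record(self, run_id: str, phase: str) -> None:
        self.pending.append((run_id, phase))
        if self.control_plane_up:
            self.flush()

    def flush(self) -> list:
        delivered = []
        while self.pending:
            # In the real system this would be sent over the tunnel.
            delivered.append(self.pending.popleft())
        return delivered

reporter = TransitionReporter()
reporter.record("run-1", "RUNNING")  # control plane down: queued locally
reporter.control_plane_up = True
reconciled = reporter.flush()        # [('run-1', 'RUNNING')]
```

The key property is that recording a transition never blocks the workload; delivery is deferred until the tunnel is healthy again.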
diff --git a/content/security/kubernetes-rbac-control-plane.md b/content/security/kubernetes-rbac-control-plane.md deleted file mode 100644 index 9801082e7..000000000 --- a/content/security/kubernetes-rbac-control-plane.md +++ /dev/null @@ -1,24 +0,0 @@ ---- -title: "Kubernetes RBAC: Control plane" -weight: 16 -variants: -flyte +union ---- - -# Kubernetes RBAC: Control plane - -**All roles are ClusterRole** - -| Role Name | Purpose | API Groups | Resources | Verbs | -| --- | --- | --- | --- | --- | -| `flyteadmin` | Full control over K8s resources for workflow orchestration, namespace provisioning, RBAC setup for workspaces | ""(core) `flyte.lyft.com rbac.authorization.k8s.io` | `configmaps flyteworkflows namespaces pods resourcequotas roles rolebindings secrets services serviceaccounts spark-role limitranges` | *(all) | -| `scyllacluster-edit` | Aggregated admin/edit role for ScyllaDB cluster management (control plane database) | `scylla.scylladb.com` | `scyllaclusters scylladbmonitorings scylladbdatacenters scylladbclusters scylladbmanagerclusterregistrations scylladbmanagertasks` | `create patch update delete deletecollection` | -| `scylladb:controller:aggregate-to-operator` | ScyllaDB operator controller - manages ScyllaDB cluster lifecycle for the control plane database | ""(core) `apps policy scylla.scylladb.com networking.k8s.io batch` | `events nodes endpoints persistentvolumeclaims pods services configmaps secrets statefulsets deployments daemonsets jobs poddisruptionbudgets serviceaccounts scyllaclusters scyllaoperatorconfigs nodeconfigs ingresses` | `get list watch create update delete patch` | -| `scylla-operator:webhook` | ScyllaDB webhook server for admission control of ScyllaDB resources | `admissionregistration.k8s.io scylla.scylladb.com` | `validatingwebhookconfigurations mutatingwebhookconfigurations scyllaclusters nodeconfigs scyllaoperatorconfigs scylladbdatacenters scylladbclusters scylladbmanagertasks` | `get list watch create update patch 
delete` | -| `console-clusterrole` | Read-only access for Union Console UI to display namespaces, workflows, and pod logs | ""(core) `flyte.lyft.com` | `namespaces flyteworkflows pods pods/log` | `get list watch` | -| `authorizer-clusterrole` | Authorizer service reads namespaces for authorization decisions | ""(core) | `namespaces` | `get list watch` | -| `cluster-clusterrole` | Cluster management service monitors cluster state for health and capacity | ""(core) `apps` | `namespaces nodes replicasets deployments` | `get list watch` | -| `dataproxy-clusterrole` | DataProxy service reads secrets for presigned URL generation and data relay configuration | ""(core) | `secrets` | `get list watch` | -| `executions-clusterrole` | Executions service reads workflow state for execution management and status tracking | ""(core) `flyte.lyft.com` | `namespaces configmaps flyteworkflows` | `get list watch` | -| `queue-clusterrole` | Queue service reads namespaces for task queue routing | ""(core) | `namespaces` | `get list watch` | -| `run-scheduler-clusterrole` | Run Scheduler reads namespaces to determine scheduling scope for workflows | ""(core) | `namespaces` | `get list watch` | -| `usage-clusterrole` | Usage tracking service reads namespaces for resource usage aggregation | ""(core) | `namespaces` | `get list watch` | diff --git a/content/security/kubernetes-rbac-data-plane.md b/content/security/kubernetes-rbac-data-plane.md deleted file mode 100644 index 2046dad62..000000000 --- a/content/security/kubernetes-rbac-data-plane.md +++ /dev/null @@ -1,33 +0,0 @@ ---- -title: "Kubernetes RBAC: Data plane" -weight: 17 -variants: -flyte +union ---- - -# Kubernetes RBAC: Data plane - -## Union core services (data plane) - -| Role Name | Purpose | Kind | API Groups | Scope | Resources | Verbs | -| --- | --- | --- | --- | --- | --- | --- | -| `clustersync-resource` | Synchronizes K8s resources across namespaces: creates per-workspace namespaces, RBAC bindings, service accounts, and 
resource quotas | ClusterRole | ""(core) `rbac.authorization.k8s.io` | Cluster-wide | `configmaps namespaces pods resourcequotas roles rolebindings secrets services serviceaccounts clusterrolebindings` | *(all) | -| `union-executor` | Node Executor: creates/manages task pods, handles FlyteWorkflow and TaskAction CRDs, manages all plugin resource types (Spark, Ray, etc.) | ClusterRole | ""(core) *(all) `apiextensions.k8s.io flyte.lyft.com` | Cluster-wide | `pods (RO) events *(all plugin objects) customresourcedefinitions flyteworkflows/* taskactions/*` | `get list watch create update delete patch` | -| `proxy-system` | Read-only monitoring: streams workflow events, pod logs, and resource utilization data back to control plane via tunnel | ClusterRole | "*" | Cluster-wide | `events flyteworkflows pods/log pods rayjobs resourcequotas` | `get list watch` | -| `operator-system` | Union Operator: manages FlyteWorkflow lifecycle, cluster-level configuration, health monitoring, node management | ClusterRole | `flyte.lyft.com` *(all) | Cluster-wide | `flyteworkflows flyteworkflows/finalizers resourcequotas pods configmaps podtemplates secrets namespaces nodes` | `get list watch create update delete patch post deletecollection` | -| `flytepropeller-role` | FlytePropeller workflow engine: creates task pods, manages FlyteWorkflow CRDs, handles all plugin resource types, enforces resource limits | ClusterRole | ""(core) *(all) `apiextensions.k8s.io flyte.lyft.com` | Cluster-wide | `pods (RO) events *(all plugin objects) customresourcedefinitions flyteworkflows/* limitranges` | `get list watch create update delete patch` | -| `flytepropeller-webhook-role` | Admission webhook: intercepts pod creation to inject secrets from the secrets backend into task containers | ClusterRole | "*" | Cluster-wide | `mutatingwebhookconfigurations secrets pods replicasets/finalizers` | `get create update patch` | -| `proxy-system-secret` | Manages proxy service secrets within the union namespace 
for tunnel authentication and configuration | Role | "*" | union namespace | `secrets` | `get list create update delete` | -| `operator-system` (ns) | Operator manages its own secrets and deployments within the union namespace | Role | "*" | union namespace | `secrets deployments` | `get list watch create update` | -| `union-operator-admission` | Webhook admission controller reads/creates TLS secrets for webhook serving certificates | Role | ""(core) | union namespace | `secrets` | `get create` | - -## Observability and monitoring - -| Role Name | Purpose | Kind | API Groups | Scope | Resources | Verbs | -| --- | --- | --- | --- | --- | --- | --- | -| `release-name-fluentbit` | Fluent Bit log collector: reads pod metadata to tag and route container logs to CloudWatch/Cloud Logging | ClusterRole | ""(core) | Cluster-wide | `namespaces pods` | `get list watch` | -| `opencost` | OpenCost: read-only access to all cluster resources for cost attribution and resource usage tracking | ClusterRole | ""(core) `extensions apps batch autoscaling storage.k8s.io` | Cluster-wide | `configmaps deployments nodes pods services resourcequotas replicationcontrollers limitranges PVCs PVs namespaces endpoints daemonsets replicasets statefulsets jobs storageclasses` | `get list watch` | -| `release-name-kube-state-metrics` | KSM: exports K8s object metrics for Prometheus monitoring dashboards | ClusterRole | ""(core) `extensions apps batch autoscaling policy networking.k8s.io certificates.k8s.io discovery.k8s.io storage.k8s.io admissionregistration.k8s.io` | Cluster-wide | `certificatesigningrequests configmaps cronjobs daemonsets deployments endpoints HPAs ingresses jobs leases limitranges namespaces networkpolicies nodes PVCs PVs pods replicasets replicationcontrollers resourcequotas secrets services statefulsets storageclasses validatingwebhookconfigurations volumeattachments endpointslices` | `list watch` | -| `release-name-grafana-clusterrole` | Grafana: reads `configmaps`/`secrets` 
for dashboard definitions and data source configuration | ClusterRole | ""(core) | Cluster-wide | `configmaps secrets` | `get watch list` | -| `union-operator-prometheus` | Prometheus: scrapes metrics from all cluster services and nodes for monitoring | ClusterRole | ""(core) `discovery.k8s.io networking.k8s.io` | Cluster-wide | `nodes nodes/metrics services endpoints pods endpointslices ingresses`; `nonResourceURLs`: `/metrics /metrics/cadvisor` | `get list watch` | -| `prometheus-operator` | Prometheus Operator: manages the full Prometheus monitoring stack lifecycle, CRDs, and configurations | ClusterRole | `monitoring.coreos.com apps extensions` (core) `networking.k8s.io policy admissionregistration.k8s.io storage.k8s.io` | Cluster-wide | `alertmanagers prometheuses thanosrulers servicemonitors podmonitors prometheusrules probes scrapeconfigs prometheusagents statefulsets daemonsets deployments configmaps secrets pods services endpoints namespaces ingresses PDBs webhookconfigs storageclasses` | *(all) | -| `release-name-dcgm-exporter` | DCGM Exporter: reads node/pod metadata for GPU metrics labeling (optional, for GPU workloads) | ClusterRole | ""(core) | Cluster-wide | `nodes pods` | `get list watch` | diff --git a/content/security/logging-monitoring-and-audit.md b/content/security/logging-monitoring-and-audit.md deleted file mode 100644 index 4b5a9bd9a..000000000 --- a/content/security/logging-monitoring-and-audit.md +++ /dev/null @@ -1,43 +0,0 @@ ---- -title: Logging, monitoring, and audit -weight: 6 -variants: -flyte +union ---- - -# Logging, monitoring, and audit - -## Task logging - -Logs are collected by `fluentbit` (deployed as a `DaemonSet` on the data plane) and shipped to the customer’s cloud-native log service: - -| Cloud Provider | Log Service | Integration | -| --- | --- | --- | -| AWS | CloudWatch Logs | Fluent Bit → CloudWatch | -| GCP | Cloud Logging (Stackdriver) | Fluent Bit → Cloud Logging | -| Azure | Azure Monitor / Log Analytics | Fluent 
Bit → Azure Monitor | - -The data plane log provider serves logs from two sources: live logs streamed directly from the Kubernetes API while a task is running, and persisted logs read from the cloud log aggregator after a pod terminates. -Log data is never stored in the control plane—it is streamed from the customer’s data plane through the Cloudflare tunnel and relayed to the client as a stateless pass-through. - -## Observability metrics - -A per-cluster instance (Prometheus and/or ClickHouse) stores time-series observability metrics including resource utilization and cost data. -Queries are proxied through the DataProxy service to the customer’s instance. -Metrics data never leaves the customer’s infrastructure. In BYOC deployments, Union.ai [deploys and manages the monitoring stack](./byoc-differences#infrastructure-management). - -## Audit trail - -Union.ai maintains comprehensive audit capabilities: - -* Every API request is authenticated, and the identity context is captured -* Run and action lifecycle events are recorded with timestamps, phases, and responsible identities -* RBAC changes and user management operations are logged -* Secret creation and management operations are tracked (values are never logged) -* Cluster state changes and tunnel health events are recorded -* Error information is preserved per attempt, enabling forensic analysis of failures - -## Incident response - -Union.ai maintains documented incident response procedures aligned with SOC 2 Type II requirements. -These include defined escalation paths, communication protocols, containment procedures, and post-incident review processes. -The control plane’s stateless handling of customer data limits the potential impact of any control plane incident. 
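The stateless pass-through described under task logging can be modeled as a generator that forwards chunks without buffering or persisting them (an illustrative sketch, not the relay service's actual code):

```python
from typing import Iterator

def relay_log_stream(upstream: Iterator[bytes]) -> Iterator[bytes]:
    """Stateless relay: forward each chunk as it arrives.

    Nothing is written to disk and nothing accumulates in memory;
    once a chunk is yielded to the client, the relay holds no
    reference to it.
    """
    for chunk in upstream:
        yield chunk  # pass through unmodified, never persisted

# Usage: the client consumes chunks as the data plane produces them.
chunks = relay_log_stream(iter([b"line 1\n", b"line 2\n"]))
```

Because the relay is a generator, backpressure propagates naturally: the data plane side is only read as fast as the client consumes.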
diff --git a/content/security/multi-cloud-and-region-support.md b/content/security/multi-cloud-and-region-support.md deleted file mode 100644 index e518df2e8..000000000 --- a/content/security/multi-cloud-and-region-support.md +++ /dev/null @@ -1,29 +0,0 @@ ---- -title: Multi-cloud and region support -weight: 9 -variants: -flyte +union ---- - -# Multi-cloud and region support - -Union.ai supports data plane deployments across multiple cloud providers and regions, ensuring that organizations can meet their specific infrastructure and regulatory requirements. - -## Supported cloud providers - -| Cloud Provider | Object Store | Secrets Backend | Log Aggregator | Container Registry | -| --- | --- | --- | --- | --- | -| AWS | S3 | K8s Secrets / AWS Secrets Manager | CloudWatch Logs | ECR | -| GCP | GCS | K8s Secrets / GCP Secret Manager | Cloud Logging | GCR / Artifact Registry | -| Azure | Azure Blob Storage | K8s Secrets / Azure Key Vault | Azure Monitor | ACR | - -Union Implementation Services supports additional cloud providers and on-premises deployments through a case-by-case engagement. - -## Supported regions - -Union.ai currently operates control planes in the following regions, with additional regions being added: **US West, US East, EU West, and EU Central**. -Customers choose the region for their data plane deployment, ensuring that all customer data remains within the selected geographic region. - -## Consistent security across clouds - -Regardless of the cloud provider selected, Union.ai enforces consistent security guarantees through its architecture: the same control plane/data plane separation, the same presigned URL model, the same tunnel-based connectivity, the same RBAC framework, and the same encryption standards. -Cloud-specific implementations (IAM roles, encryption services, log aggregators) are abstracted by the platform while maintaining native integration with each provider’s security services. 
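The cross-cloud abstraction described above can be pictured as one presigning interface with per-provider implementations behind it; this is a hypothetical sketch, and the URL shapes and class names are illustrative only:

```python
from typing import Protocol

class ObjectStore(Protocol):
    """One presigning interface, regardless of cloud provider."""
    def presign_get(self, bucket: str, key: str, ttl_seconds: int) -> str: ...

class S3Store:
    def presign_get(self, bucket: str, key: str, ttl_seconds: int) -> str:
        # A real implementation would call the AWS SDK; URL shape is illustrative.
        return f"https://{bucket}.s3.amazonaws.com/{key}?X-Amz-Expires={ttl_seconds}"

class GCSStore:
    def presign_get(self, bucket: str, key: str, ttl_seconds: int) -> str:
        return f"https://storage.googleapis.com/{bucket}/{key}?X-Goog-Expires={ttl_seconds}"

def issue_download_link(store: ObjectStore, bucket: str, key: str) -> str:
    # Same controls on every cloud: single object, GET only, limited TTL.
    return store.presign_get(bucket, key, ttl_seconds=3600)
```

Callers depend only on the interface, so the security controls (single-object scope, operation specificity, TTL) are enforced uniformly while each provider supplies its native signing mechanism.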
diff --git a/content/security/organizational-security-practices.md b/content/security/organizational-security-practices.md deleted file mode 100644 index aeb13d58b..000000000 --- a/content/security/organizational-security-practices.md +++ /dev/null @@ -1,49 +0,0 @@ ---- -title: Organizational and physical security practices -weight: 10 -variants: -flyte +union ---- - -# Organizational \and physical security practices - -Union.ai maintains organizational security controls to protect people, facilities, and endpoint devices. -These controls are independently verified through SOC 2 Type II audits and continuously monitored via the Vanta Trust Center (trust.union.ai). - -## Employee security lifecycle - -**Verified controls** (source: Trust Center, SOC 2 Type II audit) - -| Control | Description | Verification | -| --- | --- | --- | -| Background checks | All employees with access to production systems undergo background checks prior to onboarding | SOC 2 Type II | -| Security awareness training | Required within 30 days of hire and annually thereafter for all employees | Trust Center (passing) | -| Confidentiality agreements | Signed by all employees and contractors during onboarding | Trust Center (passing) | -| Code of conduct | Acknowledged by all employees and contractors; violations subject to disciplinary action | Trust Center (passing) | -| Access provisioning | Documented procedures for granting, modifying, and revoking user access | Trust Center (passing) | -| Termination checklists | Access revoked for terminated employees via formal checklist process | Trust Center (passing) | -| Performance evaluations | Managers complete evaluations for direct reports at least annually | Trust Center (passing) | -| Least-privilege access | Internal systems follow least-privilege; regular access reviews conducted | SOC 2 Type II | - -## Governance & organizational controls - -| Control | Description | Verification | -| --- | --- | --- | -| Defined security roles | Formal 
roles and responsibilities for design, implementation, and monitoring of security controls | Trust Center (passing) | -| Organizational structure | Documented org chart with reporting relationships | Trust Center (passing) | -| Board-level oversight | Board or relevant subcommittee briefed by senior management on security and risk at least annually | Trust Center (passing) | -| Information security policies | Policies and procedures documented and reviewed at least annually | Trust Center (passing) | -| Whistleblower policy | Formalized policy with anonymous communication channel for reporting violations | Trust Center (passing) | -| Vendor management | Third-party vendors and sub-processors evaluated and monitored; sub-processor list available via Trust Center | SOC 2 Type II | -| Business continuity | BC/DR plans aligned with SOC 2 | SOC 2 Type II | - -## Security development lifecycle - -* **Secure coding:** Guidelines enforced through mandatory code review processes -* **Automated security testing:** Integrated into CI/CD pipelines -* **Dependency scanning:** Vulnerability scanning and management for all software dependencies -* **Infrastructure-as-code:** Version-controlled security configurations -* **Penetration testing:** Regular third-party security assessments -* **Incident response:** Documented procedures aligned with SOC 2 Type II, including defined escalation paths and post-incident review - -> [!NOTE] -> All controls marked as “passing” are continuously monitored via Vanta and verified through the Union.ai Trust Center at trust.union.ai. The SOC 2 Type II audit report is available upon request. 
diff --git a/content/security/presigned-url-data-types.md b/content/security/presigned-url-data-types.md deleted file mode 100644 index 3eee5f889..000000000 --- a/content/security/presigned-url-data-types.md +++ /dev/null @@ -1,14 +0,0 @@ ---- -title: Presigned URL data types -weight: 15 -variants: -flyte +union ---- - -# Presigned URL data types - -| Data Type | Access Method | Direction | -| --- | --- | --- | -| Task inputs/outputs | Presign via ObjectStore service | Download (GET) | -| Code bundles (TGZ) | CreateDownloadLinkV2 | Download (GET) | -| Reports (HTML) | CreateDownloadLinkV2 | Download (GET) | -| Fast registration uploads | CreateUploadLocation | Upload (PUT) | diff --git a/content/security/secrets-management.md b/content/security/secrets-management.md deleted file mode 100644 index 6ef9fbbfd..000000000 --- a/content/security/secrets-management.md +++ /dev/null @@ -1,48 +0,0 @@ ---- -title: Secrets management -weight: 4 -variants: -flyte +union ---- - -# Secrets management - -Union.ai provides enterprise-grade secrets management with a security-first design that ensures secret values never leave the customer’s infrastructure during normal operations. - -## Secrets architecture - -The data plane supports four configurable secrets backends: - -| Backend | Storage Location | Default? | -| --- | --- | --- | -| Kubernetes Secrets | K8s `etcd` on the customer cluster | Yes (default for self-managed) | -| AWS Secrets Manager | AWS-managed service | Optional | -| GCP Secret Manager | GCP-managed service | Optional | -| Azure Key Vault | Azure-managed service | Optional | - -In all cases, secrets are stored within the customer’s infrastructure. -The choice of backend is a deployment configuration on the data plane operator. - -> [!NOTE] -> In BYOC deployments, the default secrets backend differs. See [BYOC deployment differences: Secrets management](./byoc-differences#secrets-management). 
- -## Secret lifecycle - -### Creation - -When a user creates a secret via the UI or CLI, the request is relayed through the Cloudflare tunnel to the data plane’s secrets backend. -The secret value transits the control plane in-memory during this relay but is never written to disk or database on the control plane. - -### Consumption - -When a task pod is created, the Executor configures it to mount the requested secrets from the secrets backend (as environment variables or files). -The secret value is read by the data plane’s secrets backend and injected into the pod—it never leaves the customer’s infrastructure during this process. - -### Write-only API - -> [!NOTE] -> Security by Design: There is no API to read back secret values. The GetSecret RPC returns only the secret’s metadata (name, scope, creation time, cluster presence status)—never the value itself. Secret values can only be consumed by task pods at runtime. This eliminates an entire class of secret exfiltration attacks. - -## Secret scoping - -Secrets can be scoped at multiple levels (organization, project, domain) to provide granular access control. -Only task pods running within the appropriate scope can access the corresponding secrets. diff --git a/content/security/security-architecture.md b/content/security/security-architecture.md deleted file mode 100644 index 4c8ee104e..000000000 --- a/content/security/security-architecture.md +++ /dev/null @@ -1,134 +0,0 @@ ---- -title: Security architecture -weight: 1 -variants: -flyte +union ---- - -# Security architecture - -Union.ai’s security architecture is founded on the principle of strict separation between orchestration (control plane) and execution (data plane). -This architectural decision ensures that customer data remains within the customer’s own cloud infrastructure at all times. 
- -## Control plane / data plane separation - -The control plane and data plane serve fundamentally different purposes and handle different types of data: - -### Control plane (Union.ai hosted) - -The control plane is responsible for workflow orchestration, user management, and providing the web interface. -It runs within Union.ai’s AWS account and stores only orchestration metadata in a managed PostgreSQL database. -This metadata includes task definitions (image references, resource requirements, typed interfaces), run and action metadata (identifiers, phase, timestamps, error information), user identity and RBAC records, cluster configuration and health records, and trigger/schedule definitions. -The control plane never stores customer data payloads. -It stores only references (URIs) to data in the customer’s object store, no data. -When data must be surfaced to a client, the control plane either proxies a signing request to generate a presigned URL or relays a data stream from the data plane without persisting it. - -**See comprehensive list of control plane roles and permissions in [Kubernetes RBAC: control plane](./kubernetes-rbac-control-plane).** - -### Data plane (customer hosted) - -The data plane runs inside the customer’s own cloud account on their own Kubernetes cluster. 
-All customer data resides here, including: - -| Data Type | Storage Technology | Access Pattern | -| --- | --- | --- | -| Task inputs/outputs | Object Store | Read/write by task pods via IAM roles | -| Code bundles (TGZ) | Object Store (fast-registration bucket) | Write via presigned URL; read by task pods and presigned URL by the browser | -| Container images | Container Registry | Built on-cluster; pulled by K8s | -| Task logs | Cloud Log Aggregator + live K8s API | Streamed via tunnel (never stored in CP) | -| Secrets | K8s Secrets, Vault, or Cloud Secrets Manager | Injected into pods at runtime | -| Observability metrics | Prometheus (in-cluster / customer managed) | Proxied queries via DataProxy | -| Reports (HTML) | Object Store (S3/GCS/Azure Blob) | Accessed by the browser via presigned URL | -| Cluster events | K8s API (ephemeral) | Live from K8s API | - -**See comprehensive list of data plane roles and permissions in [Kubernetes RBAC: data plane](./kubernetes-rbac-data-plane).** - -## Network architecture - -Network security is enforced through multiple layers: - -![Network security](../_static/images/security/network-security.png) - -> [!NOTE] -> In BYOC deployments, Union.ai additionally maintains a private management connection to the customer's K8s cluster. See [BYOC deployment differences: Network architecture](./byoc-differences#network-architecture) for details. - -### Cloudflare tunnel (outbound-only) - -The data plane connects to the control plane via a Cloudflare Tunnel—an outbound-only encrypted connection initiated from the customer’s cluster. 
-This architecture provides several critical security benefits: - -* No inbound firewall rules are required on the customer’s network -* All traffic through the tunnel uses mutual TLS (mTLS) encryption -* The Tunnel Service performs periodic health checks and state reconciliation -* Connection is initiated outward to Cloudflare’s edge network, from the data plane, which then connects to the control plane - -### Control plane tunnel (outbound only) - -The data plane reaches out to the control plane to establish a bidirectional, encrypted and authenticated, outbound-only tunnel. -Union.ai operates regional control plane endpoints: - -| Area | Region | Endpoint | -| --- | --- | --- | -| US | us-east-2 | hosted.unionai.cloud | -| US | us-west-2 | us-west-2.unionai.cloud | -| Europe | eu-west-1 | eu-west-1.unionai.cloud | -| Europe | eu-west-2 | eu-west-2.unionai.cloud | -| Europe | eu-central-1 | eu-central-1.unionai.cloud | - -In locked-down environments, networking teams can limit egress access to published Cloudflare CIDR blocks, and further restrict to specific regions in coordination with the Union networking team. - -### Communication paths - -| Communication Path | Protocol | Encryption | -| --- | --- | --- | -| Client → Control Plane | ConnectRPC (gRPC-Web) over HTTPS | TLS 1.2+ | -| Control Plane ↔ Data Plane | Cloudflare Tunnel (outbound-initiated) | mTLS | -| Client → Object Store (presigned URL) | HTTPS | TLS 1.2+ (cloud provider enforced) | -| Fluent Bit → Log Aggregator | Cloud provider SDK | TLS (cloud-native) | -| Task Pods → Object Store | Cloud provider SDK | TLS (cloud-native) | - -> [!NOTE] -> BYOC deployments add a PrivateLink/PSC management path between Union.ai and the customer's K8s API. See [BYOC deployment differences: Network architecture](./byoc-differences#network-architecture). 
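The egress restriction mentioned above reduces to a CIDR allowlist check, sketched here with placeholder blocks (substitute the CIDR list Cloudflare actually publishes):

```python
import ipaddress

# Placeholder blocks for illustration; use Cloudflare's published CIDR list.
ALLOWED_EGRESS = [
    ipaddress.ip_network(c) for c in ("198.51.100.0/24", "203.0.113.0/24")
]

def egress_allowed(dest_ip: str) -> bool:
    """True if the destination falls inside an allowed CIDR block."""
    addr = ipaddress.ip_address(dest_ip)
    return any(addr in net for net in ALLOWED_EGRESS)

egress_allowed("198.51.100.7")  # inside an allowed block
egress_allowed("192.0.2.1")     # outside: would be blocked at the firewall
```

The same logic, expressed as firewall or security-group rules, lets a locked-down network permit only the outbound tunnel traffic.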
-
-## Data flow architecture
-
-Union.ai implements two primary data access patterns, both designed to keep customer data out of the control plane:
-
-### Presigned URL pattern
-
-For task inputs, outputs, code bundles, and reports, the control plane proxies signing requests to the data plane, which generates time-limited presigned URLs using customer-managed credentials.
-The client fetches data directly from the customer’s object store—the data never transits the control plane.
-Presigned URLs generated on the data plane are single-object scope, operation-specific (GET or PUT), time-limited (default 1 hour maximum), and transport-encrypted at every hop.
-
-Union.ai applies several controls:
-
-* **TTL enforcement** — URLs expire after a configurable window (default 1 hour, configurable shorter)
-* **Single-object scope** — each URL grants access to exactly one object, not a bucket or prefix
-* **Operation specificity** — each URL is locked to a single operation (GET or PUT)
-* **Transport encryption** — URLs are transmitted only over TLS-encrypted channels
-* **No URL logging** — presigned URLs are not persisted in control plane logs or databases
-
-Organizations with stricter requirements can configure shorter TTLs. The presigned URL model was chosen because it eliminates the need for the control plane to hold persistent cloud IAM credentials, which would represent a larger and more persistent attack surface than time-limited bearer URLs.
-
-### Streaming relay pattern
-
-For logs and observability metrics, the control plane acts as a stateless relay—streaming data from the data plane through the Cloudflare tunnel to the client in real time.
-The data passes through the control plane’s memory as a TLS encrypted stream with a termination point in the cloud.
-It is never written to disk, cached, or stored.
-
-### Execution flow diagram
-
-![Execution flow](../_static/images/security/execution-flow.png)
-
-### Data in the UI
-
-| Field | What is it? | Where is it stored? | How is it retrieved? |
-| --- | --- | --- | --- |
-| Task names | Python function and module names | Control Plane | CP API |
-| Users’ names | First and last names of users on the platform | IDP | Cached in memory in CP, otherwise retrieved directly from IDP |
-| Inputs/Outputs | Primitive inputs/outputs returned by tasks (e.g. return 5) | Dataplane’s S3 bucket | Cloudflare Tunnel |
-| Logs | Runtime logs written by the task code/SDK | Dataplane K8s for live logs, dataplane S3/Cloudwatch/Stackdriver for persistent logs | Cloudflare Tunnel |
-| K8s Events | Pod autoscaling events explaining whether a node is found or the cluster needs to scale up… etc. | Dataplane K8s | Cloudflare Tunnel |
-| Report | Reports produced by the task code in HTML | Dataplane’s S3 bucket | A signed URL is generated through the tunnel, then the browser renders it in iframe |
-| Code explorer | Code bundled when the task was kicked off, that contains the task code and surrounding dependencies/functions it calls| Dataplane’s S3 bucket | A signed URL is generated through the tunnel, then JS in the browser downloads and unzips the bundle to render |
-| Timeline timestamps | Showing when did a task start, when it moved from queued to running to completed | Control Plane | CP API |
-| Errors | Showing the failure message written into stderr or raised exceptions for a task attempt | Control Plane | CP API |
diff --git a/content/security/threat-model.md b/content/security/threat-model.md
new file mode 100644
index 000000000..4bb5674d0
--- /dev/null
+++ b/content/security/threat-model.md
@@ -0,0 +1,29 @@
+---
+title: Threat model
+weight: 4
+variants: -flyte +union
+---
+
+# Threat model
+
+This page enumerates the principal threat scenarios considered in Union.ai's security architecture, with pointers to the canonical pages where each scenario's controls and verification live.
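Several of the scenarios below turn on the scoping of presigned URLs. As a self-contained illustration (a toy HMAC scheme with hypothetical names, not any cloud provider's actual signing algorithm), the following sketch shows why a leaked URL grants exactly one operation on exactly one object, and only until its expiry:

```python
# Toy sketch of presigned-URL scoping (illustrative only; NOT a cloud
# provider's real signing scheme, and all names here are hypothetical).
# An HMAC signature binds one object key, one operation, and an expiry
# timestamp, so possession of the URL grants exactly that and nothing more.
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SIGNING_KEY = b"data-plane-only-secret"  # held only by the data plane

def presign(op: str, key: str, ttl: int = 3600, now: Optional[float] = None) -> str:
    """Mint a URL valid for one operation on one object for `ttl` seconds."""
    expires = int((time.time() if now is None else now) + ttl)
    msg = f"{op}:{key}:{expires}".encode()
    sig = hmac.new(SIGNING_KEY, msg, hashlib.sha256).hexdigest()
    return f"https://bucket.example/{key}?" + urlencode(
        {"op": op, "expires": expires, "sig": sig}
    )

def allowed(url: str, op: str, key: str, now: Optional[float] = None) -> bool:
    """Validate a presented URL against the requested operation and object."""
    q = parse_qs(urlparse(url).query)
    expires = int(q["expires"][0])
    msg = f"{q['op'][0]}:{key}:{expires}".encode()
    expected = hmac.new(SIGNING_KEY, msg, hashlib.sha256).hexdigest()
    return (
        hmac.compare_digest(q["sig"][0], expected)           # signature not forged
        and q["op"][0] == op                                 # single operation
        and (time.time() if now is None else now) < expires  # bounded TTL
    )
```

A URL minted for GET on one key fails validation when replayed as a PUT, presented against any other key, or used after its expiry; these are the same three exposure bounds enumerated in the presigned URL leakage scenario below.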
+
+## Control plane compromise
+
+A compromised control plane would expose orchestration metadata, task definitions, error messages, and inline data transiting memory during active requests. It would not expose bulk customer data, secret values, or any path to initiate connections to customer data planes. The architectural properties that limit this blast radius are described in [Two-plane separation](./architecture/two-plane-separation#blast-radius), and the full classification of what does and does not reside in the control plane is in [Data classification and residency](./data-protection/classification-and-residency).
+
+## Cross-plane network interception
+
+Cross-plane traffic uses two outbound-only encrypted channels: a Cloudflare Tunnel (TLS + mTLS + Cloudflare Access tokens) and a direct gRPC connection (TLS 1.2+). All payloads are encrypted in transit, and no intermediate caching or storage occurs on either channel. See [Network architecture](./architecture/network) for the channel design and [Encryption](./data-protection/encryption) for the per-data-type encryption matrix.
+
+## Presigned URL leakage
+
+A leaked presigned URL exposes a single object for the duration of its TTL (1 hour by default, configurable shorter) and is locked to a single operation (GET or PUT). It cannot enumerate other objects or be replayed beyond its scope. The full enumeration of presigned URL controls is in [Data flow](./data-protection/data-flow#presigned-url-pattern). Customer-managed encryption keys provide an additional kill-switch; see [Customer-managed key authority](./data-protection/encryption#customer-managed-key-authority).
+
+## Secret exfiltration
+
+The secrets API is write-only by design: no API endpoint returns a secret value, regardless of the caller's privileges. Compromising the control plane API or a privileged user account therefore does not yield secret values. See [Secrets management](./data-protection/secrets) for the lifecycle and write-only design.
+
+## Cross-tenant data access
+
+Tenant isolation is enforced at multiple layers: org-scoped primary keys in the control plane databases, service-layer query gating before any data access, and physically separate data planes in different cloud accounts. See [Tenant isolation](./identity-and-access/tenant-isolation).
diff --git a/content/security/vulnerability-and-risk-management.md b/content/security/vulnerability-and-risk-management.md
deleted file mode 100644
index 366da34cc..000000000
--- a/content/security/vulnerability-and-risk-management.md
+++ /dev/null
@@ -1,76 +0,0 @@
----
-title: Vulnerability and risk management
-weight: 12
-variants: -flyte +union
----
-
-# Vulnerability and risk management
-
-## Vulnerability assessment
-
-Union.ai maintains a comprehensive vulnerability management program that includes dependency analysis and automated alerts for known CVEs in software dependencies, container image scanning for both platform and customer-facing components, and periodic third-party penetration testing to identify potential attack vectors.
-
-## Patch management
-
-Union.ai follows a risk-based approach to patch management.
-Critical vulnerabilities (CVSS 9.0+) are prioritized for immediate remediation, while high-severity vulnerabilities are addressed within defined SLA windows.
-The control plane is updated independently of customer data planes, ensuring that security patches can be applied rapidly without requiring customer-side changes. The customer is responsible for data plane patching (K8s version, platform components, monitoring stack).
-
-> [!NOTE]
-> In BYOC deployments, Union.ai manages data plane patching. See [BYOC deployment differences: Data plane patching](./byoc-differences#data-plane-patching).
-
-## Threat modeling
-
-Union.ai’s architecture has been designed with the following threat model considerations:
-
-### Control plane compromise
-
-In the event of a control plane compromise, an attacker would gain access to orchestration metadata only.
-They would not obtain customer data payloads, secret values, code bundles, container images, or log content.
-The attacker could not initiate connections to customer data planes (outbound-only tunnel).
-Presigned URLs are generated on the data plane, so the attacker could not generate data access URLs.
-
-### Tunnel interception
-
-The Cloudflare Tunnel uses mTLS, making man-in-the-middle attacks infeasible.
-Even if an attacker could intercept tunnel traffic, customer data flowing through the tunnel (logs, secret creation requests) is encrypted in transit and is not cached or stored at any intermediate point.
-
-### Presigned URL leakage
-
-If a presigned URL were leaked, the exposure is limited to a single object for a maximum of one hour (default configuration).
-URLs grant only the specific operation requested (GET or PUT) and cannot be used to enumerate or access other objects.
-Organizations can configure shorter expiration times to further reduce this risk window.
-Because presigned URLs are bearer tokens—possession alone grants access with no additional auth—Union.ai recommends that customers treat presigned URLs with the same care as short-lived credentials and configure the shortest practical TTL for their use case.
-
-## Security architecture benefits
-
-Union.ai’s architectural decisions provide inherent security benefits that reduce overall risk exposure:
-
-| Architectural Decision | Security Benefit | Risk Mitigated |
-| --- | --- | --- |
-| Control plane stores no customer data | Minimizes blast radius of CP compromise | Data breach from CP attack |
-| Outbound-only tunnel | No inbound attack surface on customer network | Network intrusion via open ports |
-| Presigned URLs for data access | No persistent data access credentials | Credential theft / lateral movement |
-| Write-only secrets API | Cannot exfiltrate secrets via API | Secret leakage via API abuse |
-| Workload identity federation | No static credentials on data plane | Static credential compromise |
-| Per-org database scoping | Enforces tenant isolation at data layer | Cross-tenant data access |
-| Cloud-native encryption | Leverages provider-managed encryption | Data at rest exposure |
-
-## Third-party dependency risk
-
-Union.ai's architecture depends on a set of core third-party services. This section provides a risk-tier classification of these dependencies and the mitigations in place for each.
-
-| Dependency | Tier | Role | Mitigation |
-| --- | --- | --- | --- |
-| Cloudflare | Critical | Tunnel connectivity between control plane and data plane | mTLS encryption, outbound-only architecture, health monitoring, automatic reconnection |
-| AWS (control plane) | Critical | Hosts control plane infrastructure (RDS, EKS, S3) | Multi-AZ redundancy, automated failover, encryption at rest and in transit |
-| Customer cloud provider | Critical | Hosts data plane infrastructure | Customer-managed; Union.ai provides guidance and tooling |
-| Vanta | Operational | Continuous compliance monitoring | Independent SOC 2 audit validates controls |
-| Okta | Operational | Identity provider for OIDC authentication | Standard OAuth2/OIDC; API keys and service accounts provide fallback |
-
-Union.ai's vendor management program, covered under the SOC 2 Type II audit, includes periodic evaluation of third-party providers. A formal dependency risk assessment document is available upon request for customers conducting in-depth supply chain reviews.
-
-The customer owns all data plane dependencies. Union.ai's dependency risk scope is limited to the control plane and Cloudflare tunnel.
-
-> [!NOTE]
-> In BYOC deployments, Union.ai assumes responsibility for cluster-level dependencies. See [BYOC deployment differences: Third-party dependency risk](./byoc-differences#third-party-dependency-risk).
diff --git a/content/security/workflow-execution-security.md b/content/security/workflow-execution-security.md
deleted file mode 100644
index 32bd643e2..000000000
--- a/content/security/workflow-execution-security.md
+++ /dev/null
@@ -1,33 +0,0 @@
----
-title: Workflow execution security
-weight: 8
-variants: -flyte +union
----
-
-# Workflow execution security
-
-This section traces the security controls applied at each stage of a workflow’s lifecycle, from registration through execution and result retrieval.
-
-## Task registration
-
-* SDK serializes the task specification (container image reference, resource requirements, typed interface) into a protobuf message
-* Code bundle is uploaded directly to the customer’s object store via presigned PUT URL—the code never touches the control plane
-* Only the specification metadata (including the object store URI) is stored in the control plane database
-
-## Run creation and execution
-
-* Input data is serialized and uploaded to the customer’s object store; only the input URI is stored in the control plane
-* The control plane enqueues the action to the data plane via the Cloudflare tunnel
-* The Executor (a Kubernetes controller on the data plane) creates a pod that reads inputs from the customer’s object store and writes outputs back to it
-* Secrets are injected into pods from the customer’s secrets backend—they never traverse the control plane during runtime
-
-## Result retrieval
-
-* Outputs, reports, and code bundles are accessed via presigned URLs—the data flows directly from the customer’s object store to the client
-* Logs are streamed from the data plane through the Cloudflare tunnel as a stateless relay
-* Metadata (run status, phase, errors) is served from the control plane database
-
-## Data flow summary
-
-> [!NOTE]
-> At every stage of the workflow lifecycle, customer data (code, inputs, outputs, images, secrets) stays within the customer’s infrastructure or travels directly between the client and the customer’s object store. Logs are relayed through the tunnel but never stored. The control plane handles only orchestration metadata.