From c169d49dc381c11026a6cc31a8afccecd78b720d Mon Sep 17 00:00:00 2001 From: Kyle Hazen Date: Fri, 24 Apr 2026 15:32:25 -0700 Subject: [PATCH 1/4] Add resource management and multi-team scaling guide --- content/user-guide/project-patterns/_index.md | 4 + .../project-patterns/resource-management.md | 181 ++++++++++++++++++ 2 files changed, 185 insertions(+) create mode 100644 content/user-guide/project-patterns/resource-management.md diff --git a/content/user-guide/project-patterns/_index.md b/content/user-guide/project-patterns/_index.md index 9d4faa0d9..39365fd53 100644 --- a/content/user-guide/project-patterns/_index.md +++ b/content/user-guide/project-patterns/_index.md @@ -26,4 +26,8 @@ How to structure Flyte projects with uv, from single-package setups to multi-tea How to deploy a Flyte project from CI. Uses GitHub Actions as the reference, but the building blocks — API key, `flyte deploy`, commit-pinned versions — translate to any runner. {{< /link-card >}} +{{< link-card target="resource-management" title="Resource management and multi-team scaling" >}} +Projects, domains, quotas, RBAC, and secrets — the primitives to set up before you have ten teams and a noisy-neighbor problem. +{{< /link-card >}} + {{< /grid >}} diff --git a/content/user-guide/project-patterns/resource-management.md b/content/user-guide/project-patterns/resource-management.md new file mode 100644 index 000000000..b881e255f --- /dev/null +++ b/content/user-guide/project-patterns/resource-management.md @@ -0,0 +1,181 @@ +--- +title: Resource management and multi-team scaling +weight: 4 +variants: -flyte +union +--- + +# Resource management and multi-team scaling + +This guide covers the foundational primitives Union provides for multi-tenancy — projects, domains, quotas, task-level resources, RBAC, and secrets — and the patterns that work best as you scale to multiple teams. It also outlines what's changing in the v2 quota model so you can plan accordingly. + +Teams that set these up well early avoid most of the noisy-neighbor, cache-bleed, and resource-starvation problems that surface later. + +## Project-domain structure + +The combination of **project × domain** is Union's atomic unit of isolation. Each pair gets its own Kubernetes namespace, its own quotas, its own RBAC, and its own secrets. + +### One project per team or ML product + +Every independent team or ML product should have its own Union project. Projects are isolated from one another by default, though you can reference workflows or tasks across projects to reuse generalizable resources. + +The tradeoff worth flagging: cross-project task reuse is possible, but it requires advance coordination and shared coding standards. Don't reach for it casually — the coupling it creates is easy to underestimate. + +### Domains are environments, not teams + +Domains are orthogonal to projects. They represent distinct environments — typically development, staging, and production — and enable dedicated configurations, permissions, secrets, cached execution history, and resource allocations for each environment. + +A production domain in particular ensures a clean slate, so cached executions from development don't produce unexpected behavior in production runs. + +## Resource quotas + +### Set quotas per project-domain pair + +Quotas should be configured for each project-domain pair, not globally. This ensures workflows can't exceed designated limits and prevents any single project or domain from impacting resources available to others. 
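These quotas are enforced as ordinary Kubernetes `ResourceQuota` objects, one per project-domain namespace. As a quick sanity check of the effective values, here is a minimal sketch, assuming the one-namespace-per-pair model described above and the conventional `<project>-<domain>` namespace naming (check your data plane if namespaces are templated differently):

```bash
# Inspect the ResourceQuota backing the team-alpha/development pair.
# The <project>-<domain> namespace name is an assumption, not a guarantee.
kubectl describe resourcequota -n team-alpha-development
```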
Configure via `uctl` with a YAML attribute file:

```yaml
domain: development
project: team-alpha
attributes:
  projectQuotaCpu: "500"
  projectQuotaMemory: "2Ti"
```

Apply it with:

```bash
uctl update cluster-resource-attribute --attrFile cra.yaml
```

Verify with:

```bash
uctl get cluster-resource-attribute -p <project> -d <domain>
```

### GPU quotas need explicit setup

`projectQuotaGpu` exists in Union BYOC but is not in Flyte OSS. If any team runs GPU workloads, work with Union to set GPU quotas explicitly.

Without GPU quotas, you risk starvation across teams: Propeller won't queue executions; it dispatches them to Kubernetes immediately, and pods then sit pending while the execution shows as "running."

## Task-level resources

### Always declare explicit requests and limits

Pass a `(request, limit)` tuple to `flyte.Resources` for each resource dimension you want to bound:

```python
import flyte

env = flyte.TaskEnvironment(
    name="my_env",
    resources=flyte.Resources(cpu=("4", "8"), memory=("16Gi", "32Gi")),
)

@env.task
def my_task():
    ...
```

If you attempt to execute a workflow with unsatisfiable resource requests, the execution fails immediately rather than queueing forever. This fail-fast behavior is a Union-specific improvement over silent Kubernetes pending — but it requires that the node types you request are physically available in the data plane.

### Be explicit about ephemeral storage

The `disk` default is zero, which means a task pod will consume node storage as needed. A pod can be evicted if the node runs short on storage. Any team doing heavy data processing should always set `disk` explicitly:

```python
env = flyte.TaskEnvironment(
    name="my_env",
    resources=flyte.Resources(cpu="4", memory="16Gi", disk="50Gi"),
)
```

## RBAC and secrets

### Scope roles to project-domain pairs

Through Role-Based Access Control, users can be assigned roles — such as contributor or admin — scoped to specific project-domain pairs.

A reasonable default policy:

- **Development domain**: contributor access for everyone on the team
- **Production domain**: restricted to CI/CD service accounts and admins only

### Scope secrets as narrowly as possible

Union supports secrets at the project-domain level, ensuring API keys, tokens, and other sensitive material are only accessible within the workflows that need them. Avoid global-scoped secrets — always scope to the narrowest project-domain that requires access.

## Multi-team scaling patterns

### Establish naming conventions early

Once you have ten or more projects, discoverability degrades quickly. A `<team>-<purpose>` pattern (for example, `ml-training`, `data-etl`, `inference-serving`) makes quota management, RBAC, and billing attribution substantially easier.

### Put shared utility tasks in a dedicated project

If multiple teams need to share preprocessing tasks or model wrappers, create a `shared-utils` or `platform` project rather than duplicating code. This requires governance around versioning and backward compatibility, but it scales better than copy-paste.

### Use cluster assignment for multi-cluster deployments

The cluster assignment matchable attribute forces matching executions to consistently run on a specific Kubernetes cluster in multi-cluster deployments. Without an explicit assignment, cluster selection is random.
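As a concrete sketch of an explicit assignment, an execution cluster label can be set per project-domain with an attribute file, mirroring the quota workflow above. All values below are placeholders, and the subcommand assumes `uctl` mirrors the equivalent `flytectl` command:

```yaml
# ecl.yaml: route team-alpha's production executions to a labeled cluster.
domain: production
project: team-alpha
value: gpu-cluster
```

```bash
uctl update execution-cluster-label --attrFile ecl.yaml
```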
If you have GPU clusters alongside CPU-only clusters, configure execution cluster labels per project-domain so GPU workloads land on the right nodes. Random assignment is fine for homogeneous setups, but a poor default once cluster heterogeneity exists.

### Treat production as a managed service

Each `<project>/production` pair should have its own quota budget and change-management process. Quota changes in production should go through review rather than ad-hoc CLI updates.

## What's changing in v2

The v2 quota system is being modernized substantially.

### Current state

V2 quotas still largely follow the v1 model, but with infrastructure changes underneath. The v2 UI for quota management is not yet available — quota configuration today is done through the v1 UI.

### Architectural changes already underway

- **Decoupled from namespaces.** V2 moves away from the strict "one project equals one namespace" model. Multiple projects can share namespaces, with Union providing its own quota enforcement layer above Kubernetes.
- **Queue-based scheduling.** V2 introduces a queue construct in the control plane that enables priority-based scheduling and more flexible resource management.

### Planned for the first half of 2026

**Enhanced quota model:**

- More granular quotas beyond project-domain level
- Team-level quotas spanning multiple namespaces
- GPU-class-specific quotas (different limits for different GPU types)
- Hierarchical quota structures

**Priority and preemption:**

- Workflows assignable to different priorities
- Higher-priority workloads can preempt lower-priority ones
- Fair-scheduling algorithms in the spirit of Slurm

**Flexible enforcement:**

Union will rely less on Kubernetes ResourceQuotas and more on its own enforcement mechanisms, enabling more dynamic and intelligent resource allocation.

### Migration

The v2 quota system is designed to be backward compatible. Existing v1 quotas will continue to work during the transition. The enhanced capabilities — priority scheduling, hierarchical quotas, GPU-class quotas — are landing in the first half of 2026.
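The queue and priority primitives aren't exposed yet, but the intended scheduling semantics are easy to picture. Here is a toy, Union-agnostic sketch of priority dispatch with preemption; every name in it is illustrative and nothing below corresponds to a Union or Flyte API:

```python
import heapq

class ToyQueue:
    """Illustrative only: lower number means higher priority."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.pending: list[tuple[int, str]] = []   # min-heap of (priority, run)
        self.running: list[tuple[int, str]] = []

    def submit(self, priority: int, run: str) -> None:
        heapq.heappush(self.pending, (priority, run))
        self._dispatch()

    def _dispatch(self) -> None:
        # Fill free capacity highest-priority-first.
        while self.pending and len(self.running) < self.capacity:
            self.running.append(heapq.heappop(self.pending))
        # Preempt: evict the lowest-priority running item whenever a
        # strictly higher-priority item is still waiting.
        while (self.pending and self.running
               and self.pending[0][0] < max(self.running)[0]):
            victim = max(self.running)   # largest priority number loses
            self.running.remove(victim)
            heapq.heappush(self.pending, victim)
            self.running.append(heapq.heappop(self.pending))

queue = ToyQueue(capacity=1)
queue.submit(priority=5, run="batch-backfill")
queue.submit(priority=1, run="prod-retrain")   # preempts the backfill
print(queue.running)                           # [(1, 'prod-retrain')]
```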
+ +--- + +## Quick reference + +| Decision | Recommendation | +|---|---| +| Team isolation | One project per team or ML product | +| Environments | Use domains (dev / staging / prod) | +| Quota scope | Per project-domain pair, never global | +| GPU workloads | Set `projectQuotaGpu` explicitly via Union | +| Task resources | Always declare `cpu` and `memory` as request/limit tuples | +| Ephemeral storage | Set `disk` explicitly for data-heavy tasks | +| Production access | CI/CD service accounts + admins only | +| Secrets | Scope to narrowest project-domain | +| Multi-cluster | Use cluster assignment, not random routing | +| Naming | `-` once you exceed ~10 projects | From b97cca3edb5ea689d959296df1564f9a5af320c3 Mon Sep 17 00:00:00 2001 From: Peeter Piegaze <1153481+ppiegaze@users.noreply.github.com> Date: Mon, 27 Apr 2026 14:26:16 +0200 Subject: [PATCH 2/4] Update content/user-guide/project-patterns/resource-management.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- content/user-guide/project-patterns/resource-management.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/content/user-guide/project-patterns/resource-management.md b/content/user-guide/project-patterns/resource-management.md index b881e255f..1a3f8758a 100644 --- a/content/user-guide/project-patterns/resource-management.md +++ b/content/user-guide/project-patterns/resource-management.md @@ -133,12 +133,12 @@ The v2 quota system is being modernized substantially. ### Current state -V2 quotas still largely follow the v1 model, but with infrastructure changes underneath. The v2 UI for quota management is not yet available — quota configuration today is done through the v1 UI. +v2 quotas still largely follow the v1 model, but with infrastructure changes underneath. The v2 UI for quota management is not yet available — quota configuration today is done through the v1 UI. ### Architectural changes already underway -- **Decoupled from namespaces.** V2 moves away from the strict "one project equals one namespace" model. Multiple projects can share namespaces, with Union providing its own quota enforcement layer above Kubernetes. -- **Queue-based scheduling.** V2 introduces a queue construct in the control plane that enables priority-based scheduling and more flexible resource management. +- **Decoupled from namespaces.** v2 moves away from the strict "one project equals one namespace" model. Multiple projects can share namespaces, with Union providing its own quota enforcement layer above Kubernetes. +- **Queue-based scheduling.** v2 introduces a queue construct in the control plane that enables priority-based scheduling and more flexible resource management. 
### Planned for the first half of 2026

From 8ec22492737c5c3814aef40b680f5de1e4de92ce Mon Sep 17 00:00:00 2001
From: Daniel Sola
Date: Thu, 30 Apr 2026 11:12:08 -0700
Subject: [PATCH 3/4] daniels edits
---
 .../project-patterns/resource-management.md | 116 +++++++++---------
 1 file changed, 56 insertions(+), 60 deletions(-)

diff --git a/content/user-guide/project-patterns/resource-management.md b/content/user-guide/project-patterns/resource-management.md
index 1a3f8758a..02916169c 100644
--- a/content/user-guide/project-patterns/resource-management.md
+++ b/content/user-guide/project-patterns/resource-management.md
@@ -6,26 +6,26 @@ variants: -flyte +union

# Resource management and multi-team scaling

-This guide covers the foundational primitives Union provides for multi-tenancy — projects, domains, quotas, task-level resources, RBAC, and secrets — and the patterns that work best as you scale to multiple teams. It also outlines what's changing in the v2 quota model so you can plan accordingly.
+This guide covers the foundational primitives Union provides for multi-tenancy — projects, domains, quotas, task-level resources, RBAC, and secrets — and the patterns that work best as you scale to multiple teams.

Teams that set these up well early avoid most of the noisy-neighbor, cache-bleed, and resource-starvation problems that surface later.

## Project-domain structure

-The combination of **project × domain** is Union's atomic unit of isolation. Each pair gets its own Kubernetes namespace, its own quotas, its own RBAC, and its own secrets.
+The combination of **project × domain** is Union's primary unit of isolation. Each pair gets its own quota budget. RBAC and secrets are flexible: they can be scoped narrowly to a project-domain pair, or broadened across projects, across domains, or organization-wide depending on how you want to share access.

### One project per team or ML product

Every independent team or ML product should have its own Union project. Projects are isolated from one another by default, though you can reference workflows or tasks across projects to reuse generalizable resources.

-The tradeoff worth flagging: cross-project task reuse is possible, but it requires advance coordination and shared coding standards. Don't reach for it casually — the coupling it creates is easy to underestimate.
-
### Domains are environments, not teams

Domains are orthogonal to projects. They represent distinct environments — typically development, staging, and production — and enable dedicated configurations, permissions, secrets, cached execution history, and resource allocations for each environment.

A production domain in particular ensures a clean slate, so cached executions from development don't produce unexpected behavior in production runs.

+A common pattern is to split clusters and networking across domains as well — for example, a dedicated production cluster with stricter network controls, separate from the cluster that development and staging share. See [multi-cluster and multi-cloud](../../deployment/byoc/multi-cluster) for how this maps to underlying cloud accounts.
+
## Resource quotas

### Set quotas per project-domain pair

@@ -54,24 +54,22 @@ Verify with:
uctl get cluster-resource-attribute -p <project> -d <domain>
```

-### GPU quotas need explicit setup
-
-`projectQuotaGpu` exists in Union BYOC but is not in Flyte OSS. If any team runs GPU workloads, work with Union to set GPU quotas explicitly.
+### Why quotas matter -Without GPU quotas, you risk starvation across teams: Propeller won't queue executions, it dispatches them to Kubernetes immediately, and pods then sit pending while the execution shows as "running." +Without quotas, projects can starve each other for shared resources. Runs that exceed available capacity are still dispatched to the cluster, and pods sit pending while the execution shows as "running." Quotas turn that silent contention into an explicit, fail-fast signal teams can act on. ## Task-level resources -### Always declare explicit requests and limits +### Declare resources on the task environment -Pass a `(request, limit)` tuple to `flyte.Resources` for each resource dimension you want to bound: +Set resources on a `flyte.TaskEnvironment` (or override per task) using `flyte.Resources`: ```python import flyte env = flyte.TaskEnvironment( name="my_env", - resources=flyte.Resources(cpu=("4", "8"), memory=("16Gi", "32Gi")), + resources=flyte.Resources(cpu="4", memory="16Gi", disk="50Gi"), ) @env.task @@ -79,33 +77,33 @@ def my_task(): ... ``` -If you attempt to execute a workflow with unsatisfiable resource requests, the execution fails immediately rather than queueing forever. This fail-fast behavior is a Union-specific improvement over silent Kubernetes pending — but it requires that the node types you request are physically available in the data plane. +If a task's resource request exceeds your project-domain quota, the execution fails immediately rather than queueing forever. That's the behavior you want — but it means teams should know what their quota is before sizing tasks. Coordinate with whoever owns quota configuration so requests stay within budget, or so the budget gets raised intentionally. ### Be explicit about ephemeral storage -The `disk` default is zero, which means a task pod will consume node storage as needed. A pod can be evicted if the node runs short on storage. Any team doing heavy data processing should always set `disk` explicitly: - -```python -env = flyte.TaskEnvironment( - name="my_env", - resources=flyte.Resources(cpu="4", memory="16Gi", disk="50Gi"), -) -``` +The `disk` default is zero, which means a task pod will consume node storage as needed. A pod can be evicted if the node runs short on storage. Any team doing heavy data processing should always set `disk` explicitly. ## RBAC and secrets -### Scope roles to project-domain pairs +### Roles vs policies + +Union splits access control into two concepts: + +- **Roles** are named sets of actions (for example, "can register workflows", "can launch executions"). They describe *what* a principal can do. +- **Policies** bind roles to a scope — a specific project-domain pair, a whole domain (across all projects), a whole project (across all domains), or the entire organization. They describe *where* the role applies. -Through Role-Based Access Control, users can be assigned roles — such as contributor or admin — scoped to specific project-domain pairs. +This split means you don't have to define roles per project-domain pair. A single "Contributor" role can be bound by one policy to `team-alpha/development`, and by another policy to *every* `production` domain across the organization. Pick the binding scope that matches the access you actually want to grant. 
-A reasonable default policy:
+A reasonable default:

-- **Development domain**: contributor access for everyone on the team
-- **Production domain**: restricted to CI/CD service accounts and admins only
+- **Development domains**: bind contributor roles broadly so everyone on the team can register and run workflows.
+- **Production domains**: restrict to CI/CD service accounts and admins only.
+
+See [user management](../user-management) for the full walkthrough on creating roles, policies, and assignments.

### Scope secrets as narrowly as possible

-Union supports secrets at the project-domain level, ensuring API keys, tokens, and other sensitive material are only accessible within the workflows that need them. Avoid global-scoped secrets — always scope to the narrowest project-domain that requires access.
+Union supports secrets at the project-domain level, ensuring API keys, tokens, and other sensitive material are only accessible within the workflows that need them. Like RBAC, secrets can also be scoped more broadly when shared across projects or domains — but default to the narrowest scope the consuming workflows actually need.

## Multi-team scaling patterns

@@ -115,53 +113,49 @@ Once you have ten or more projects, discoverability degrades quickly. A `<team>-
### Put shared utility tasks in a dedicated project

-If multiple teams need to share preprocessing tasks or model wrappers, create a `shared-utils` or `platform` project rather than duplicating code. This requires governance around versioning and backward compatibility, but it scales better than copy-paste.
-
-### Use cluster assignment for multi-cluster deployments
-
-The cluster assignment matchable attribute forces matching executions to consistently run on a specific Kubernetes cluster in multi-cluster deployments. Without an explicit assignment, cluster selection is random.
+If multiple teams need to share preprocessing tasks or model wrappers, create a `shared-utils` or `platform` project rather than duplicating code. Other teams can then reference them through the [remote tasks API](../task-programming/remote-tasks) without pulling in the implementation:

```python
import flyte.remote

shared_preprocess = flyte.remote.Task.get(
    "shared-utils.preprocess",
    auto_version="latest",
)
```

+This requires governance around versioning and backward compatibility, but it scales better than copy-paste.

+### Use cluster assignment for multi-cluster deployments

+The cluster assignment matchable attribute pins matching executions to a specific Union cluster in multi-cluster deployments.
Without an explicit assignment, cluster selection is random — fine for homogeneous setups, but a poor default once cluster heterogeneity exists (for example, GPU clusters alongside CPU-only clusters).
+
+Set the assignment per project-domain with `uctl`:

```yaml
# cpa.yaml
domain: production
project: team-alpha
clusterPoolName: gpu-pool
```

```bash
uctl update cluster-pool-attributes --attrFile cpa.yaml
```

+See [`uctl update cluster-pool-attributes`](../../api-reference/uctl-cli/uctl-update/uctl-update-cluster-pool-attributes) for the full reference.

### Treat production as a managed service

Each `<project>/production` pair should have its own quota budget and change-management process. Quota changes in production should go through review rather than ad-hoc CLI updates.

+The [Union Terraform provider](../../deployment/terraform) is a good fit for this: it lets you manage projects, roles, policies, and access assignments declaratively, so production configuration lives in version control and changes go through PR review like any other infrastructure change.

## What's coming next

The next major step in scheduling is the **queue** construct — a control-plane primitive that lets you submit work into named queues with priority levels. Higher-priority work can preempt lower-priority work, and fair-share scheduling decides what runs when capacity is contested. This moves resource arbitration off raw quotas and onto something closer to a Slurm-style scheduler, which scales better for teams running mixed-criticality workloads on shared clusters.

If you're planning ahead for multi-team scaling, the project-domain and quota patterns described above remain the right foundation — queues will sit on top of them rather than replace them.

---

## Quick reference

@@ -172,10 +166,12 @@ The v2 quota system is designed to be backward compatible.
Existing v1 quotas wi
| Team isolation | One project per team or ML product |
| Environments | Use domains (dev / staging / prod) |
| Quota scope | Per project-domain pair, never global |
-| GPU workloads | Set `projectQuotaGpu` explicitly via Union |
-| Task resources | Always declare `cpu` and `memory` as request/limit tuples |
+| Task resources | Declare `cpu`, `memory`, and `disk` on `flyte.Resources` and stay within your quota |
| Ephemeral storage | Set `disk` explicitly for data-heavy tasks |
+| RBAC | Bind roles via policies at the scope you actually need (project-domain, domain, project, or org) |
| Production access | CI/CD service accounts + admins only |
| Secrets | Scope to narrowest project-domain |
| Multi-cluster | Use cluster assignment, not random routing |
+| Shared tasks | Put in a dedicated project, target via `flyte.remote.Task` |
+| Production config | Manage with the Union Terraform provider |
| Naming | `<team>-<purpose>` once you exceed ~10 projects |

From 594ae3e70933ebf4f56df256cc5545db18d4d31d Mon Sep 17 00:00:00 2001
From: Daniel Sola
Date: Thu, 30 Apr 2026 11:28:58 -0700
Subject: [PATCH 4/4] fix terraform link and clarify disk default
---
 content/user-guide/project-patterns/resource-management.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/user-guide/project-patterns/resource-management.md b/content/user-guide/project-patterns/resource-management.md
index 02916169c..533f03331 100644
--- a/content/user-guide/project-patterns/resource-management.md
+++ b/content/user-guide/project-patterns/resource-management.md
@@ -81,7 +81,7 @@ If a task's resource request exceeds your project-domain quota, the execution fa

### Be explicit about ephemeral storage

-The `disk` default is zero, which means a task pod will consume node storage as needed. A pod can be evicted if the node runs short on storage. Any team doing heavy data processing should always set `disk` explicitly.
+By default, `disk` is unset, so no ephemeral-storage request or limit is applied. A task pod can still consume node storage as needed, and it may be evicted if the node comes under storage pressure. Any team doing heavy data processing should always set `disk` explicitly.

## RBAC and secrets

@@ -149,7 +149,7 @@ See [`uctl update cluster-pool-attributes`](../../api-reference/uctl-cli/uctl-up

Each `<project>/production` pair should have its own quota budget and change-management process. Quota changes in production should go through review rather than ad-hoc CLI updates.

-The [Union Terraform provider](../../deployment/terraform) is a good fit for this: it lets you manage projects, roles, policies, and access assignments declaratively, so production configuration lives in version control and changes go through PR review like any other infrastructure change.
+The [Union Terraform provider](../../deployment/terraform/_index) is a good fit for this: it lets you manage projects, roles, policies, and access assignments declaratively, so production configuration lives in version control and changes go through PR review like any other infrastructure change.

## What's coming next