Add resource management and multi-team scaling guide #950
Open: k1sauce wants to merge 7 commits into main from add-resource-management-guide
+181
−0
Commits (7):
c169d49 Add resource management and multi-team scaling guide (k1sauce)
a4d139a Merge branch 'main' into add-resource-management-guide (ppiegaze)
b97cca3 Update content/user-guide/project-patterns/resource-management.md (ppiegaze)
2c3aa2e Merge branch 'main' into add-resource-management-guide (ppiegaze)
8ec2249 daniels edits (dansola)
594ae3e fix terraform link and clarify disk default (dansola)
2b632ee Merge branch 'main' into add-resource-management-guide (dansola)
181 changes: 181 additions & 0 deletions
content/user-guide/project-patterns/resource-management.md
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,181 @@ | ||||||
| --- | ||||||
| title: Resource management and multi-team scaling | ||||||
| weight: 4 | ||||||
| variants: -flyte +union | ||||||
| --- | ||||||
|
|
||||||
| # Resource management and multi-team scaling | ||||||
|
|
||||||
| This guide covers the foundational primitives Union provides for multi-tenancy — projects, domains, quotas, task-level resources, RBAC, and secrets — and the patterns that work best as you scale to multiple teams. It also outlines what's changing in the v2 quota model so you can plan accordingly. | ||||||
|
|
||||||
| Teams that set these up well early avoid most of the noisy-neighbor, cache-bleed, and resource-starvation problems that surface later. | ||||||
|
|
||||||
| ## Project-domain structure | ||||||
|
|
||||||
| The combination of **project × domain** is Union's atomic unit of isolation. Each pair gets its own Kubernetes namespace, its own quotas, its own RBAC, and its own secrets. | ||||||
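As a rough illustration of how the project × domain grid expands, the sketch below enumerates every pair for a hypothetical set of teams. The project names and the `<project>-<domain>` namespace naming are assumptions made for illustration only; check how your deployment actually maps pairs to namespaces.

```python
from itertools import product

# Hypothetical project names; the domains are the three typical environments.
PROJECTS = ["team-alpha", "team-beta"]
DOMAINS = ["development", "staging", "production"]


def isolation_units(projects, domains):
    """Return one (project, domain, namespace) triple per project-domain pair."""
    # Assumption: each namespace is named "<project>-<domain>".
    return [(p, d, f"{p}-{d}") for p, d in product(projects, domains)]


units = isolation_units(PROJECTS, DOMAINS)
for project, domain, namespace in units:
    print(f"{project:<12} {domain:<12} {namespace}")
```

Two projects across three domains already produce six isolation units, each of which needs its own quota, RBAC, and secrets decisions.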
|
|
||||||
| ### One project per team or ML product | ||||||
|
|
||||||
| Every independent team or ML product should have its own Union project. Projects are isolated from one another by default, though you can reference workflows or tasks across projects to reuse generalizable resources. | ||||||
|
|
||||||
| The tradeoff worth flagging: cross-project task reuse is possible, but it requires advance coordination and shared coding standards. Don't reach for it casually — the coupling it creates is easy to underestimate. | ||||||
|
|
||||||
| ### Domains are environments, not teams | ||||||
|
|
||||||
| Domains are orthogonal to projects. They represent distinct environments — typically development, staging, and production — and enable dedicated configurations, permissions, secrets, cached execution history, and resource allocations for each environment. | ||||||
|
|
||||||
| A production domain in particular ensures a clean slate, so cached executions from development don't produce unexpected behavior in production runs. | ||||||
|
|
||||||
| ## Resource quotas | ||||||
|
|
||||||
| ### Set quotas per project-domain pair | ||||||
|
|
||||||
| Quotas should be configured for each project-domain pair, not globally. This ensures workflows can't exceed designated limits and prevents any single project or domain from impacting resources available to others. | ||||||
|
|
||||||
| Configure via `uctl` with a YAML attribute file: | ||||||
|
|
||||||
| ```yaml | ||||||
| domain: development | ||||||
| project: team-alpha | ||||||
| attributes: | ||||||
|   projectQuotaCpu: "500" | ||||||
|   projectQuotaMemory: 2Ti | ||||||
| ``` | ||||||
|
|
||||||
| Apply it with: | ||||||
|
|
||||||
| ```bash | ||||||
| uctl update cluster-resource-attribute --attrFile cra.yaml | ||||||
| ``` | ||||||
|
|
||||||
| Verify with: | ||||||
|
|
||||||
| ```bash | ||||||
| uctl get cluster-resource-attribute -p <project> -d <domain> | ||||||
| ``` | ||||||
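With several teams, writing one attribute file per project-domain pair by hand gets repetitive. A minimal sketch of generating them from a single quota plan follows. The plan contents, the `cra-<project>-<domain>.yaml` file names, and the helper names are hypothetical; only the `projectQuotaCpu` and `projectQuotaMemory` keys come from the example above.

```python
def render_cra(project, domain, cpu, memory):
    """Render a cluster-resource-attribute file for one project-domain pair."""
    return (
        f"domain: {domain}\n"
        f"project: {project}\n"
        "attributes:\n"
        f'  projectQuotaCpu: "{cpu}"\n'
        f"  projectQuotaMemory: {memory}\n"
    )


# Hypothetical quota plan: (project, domain) -> (cpu, memory).
QUOTA_PLAN = {
    ("team-alpha", "development"): ("500", "2Ti"),
    ("team-alpha", "production"): ("1000", "4Ti"),
}


def render_all(plan):
    """Map a hypothetical file name to the rendered YAML for every pair in the plan."""
    return {
        f"cra-{project}-{domain}.yaml": render_cra(project, domain, cpu, memory)
        for (project, domain), (cpu, memory) in plan.items()
    }


files = render_all(QUOTA_PLAN)
for name, body in files.items():
    print(f"# {name}\n{body}")
```

Each rendered file can then be written to disk and applied with the `uctl update cluster-resource-attribute --attrFile` command shown above, one pair at a time.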
|
|
||||||
| ### GPU quotas need explicit setup | ||||||
|
|
||||||
| `projectQuotaGpu` exists in Union BYOC but is not in Flyte OSS. If any team runs GPU workloads, work with Union to set GPU quotas explicitly. | ||||||
|
|
||||||
| Without GPU quotas, you risk starvation across teams: Propeller doesn't queue executions; it dispatches them to Kubernetes immediately, and pods then sit pending while the execution shows as "running." | ||||||
|
|
||||||
| ## Task-level resources | ||||||
|
|
||||||
| ### Always declare explicit requests and limits | ||||||
|
|
||||||
| Pass a `(request, limit)` tuple to `flyte.Resources` for each resource dimension you want to bound: | ||||||
|
|
||||||
| ```python | ||||||
| import flyte | ||||||
|
|
||||||
| env = flyte.TaskEnvironment( | ||||||
|     name="my_env", | ||||||
|     resources=flyte.Resources(cpu=("4", "8"), memory=("16Gi", "32Gi")), | ||||||
| ) | ||||||
|
|
||||||
| @env.task | ||||||
| def my_task(): | ||||||
|     ... | ||||||
| ``` | ||||||
|
|
||||||
| If you attempt to execute a workflow with unsatisfiable resource requests, the execution fails immediately rather than queueing forever. This fail-fast behavior is a Union-specific improvement over silent Kubernetes pending — but it requires that the node types you request are physically available in the data plane. | ||||||
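Kubernetes rejects a container whose request exceeds its limit, so it can help to sanity-check `(request, limit)` tuples before submitting anything. The sketch below is a standalone pre-check using only the standard library; `parse_quantity` and `check_bounds` are hypothetical helpers, not part of the `flyte` SDK, and the parser covers only the common Kubernetes quantity suffixes.

```python
from fractions import Fraction

# Kubernetes quantity suffixes: binary (Ki, Mi, ...) and decimal (m, k, M, ...).
_SUFFIXES = {
    "Ki": Fraction(2**10), "Mi": Fraction(2**20), "Gi": Fraction(2**30),
    "Ti": Fraction(2**40), "Pi": Fraction(2**50), "Ei": Fraction(2**60),
    "m": Fraction(1, 1000), "k": Fraction(10**3), "M": Fraction(10**6),
    "G": Fraction(10**9), "T": Fraction(10**12), "P": Fraction(10**15),
    "E": Fraction(10**18),
}


def parse_quantity(text):
    """Convert a quantity string such as "500m" or "16Gi" to an exact number."""
    # Check two-character suffixes first so they win over one-character ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Fraction(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Fraction(text)


def check_bounds(resources):
    """Return the names of dimensions whose request exceeds the limit."""
    return [
        name
        for name, (request, limit) in resources.items()
        if parse_quantity(request) > parse_quantity(limit)
    ]


ok = check_bounds({"cpu": ("4", "8"), "memory": ("16Gi", "32Gi")})
bad = check_bounds({"cpu": ("500m", "250m"), "memory": ("2Gi", "1024Mi")})
print(ok)   # []
print(bad)  # ['cpu', 'memory']
```

This only catches inverted tuples. It says nothing about whether the requested node types exist in the data plane, which is what the fail-fast behavior described above depends on.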
|
|
||||||
| ### Be explicit about ephemeral storage | ||||||
|
|
||||||
| The `disk` default is zero, which means a task pod will consume node storage as needed. A pod can be evicted if the node runs short on storage. Any team doing heavy data processing should always set `disk` explicitly: | ||||||
Suggested change:
- The `disk` default is zero, which means a task pod will consume node storage as needed. A pod can be evicted if the node runs short on storage. Any team doing heavy data processing should always set `disk` explicitly:
+ By default, `disk` is unset, so no ephemeral-storage request or limit is applied. A task pod can still consume node storage as needed, and it may be evicted if the node comes under storage pressure. Any team doing heavy data processing should always set `disk` explicitly:
ppiegaze marked this conversation as resolved.
Outdated