Add KEP-12513: Introduce PVC-based artifact storage for Kubeflow Pipe… #12515

base: master

Conversation
…lines Signed-off-by: Helber Belmiro <[email protected]>
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Here are some thoughts, but I still need to read the full document. That is very far away from V1 artifact passing via PVC and would violate zero-overhead namespaces. Just imagine a cluster with 1000 namespaces for scalability; then you add 1000 permanent pods, which is massive overhead. In V1 we just accepted that the artifacts would not be in the UI instead. A PVC for all namespaces also sounds scary. Why should we offer such a security nightmare in the first place? That would break the namespace isolation contract. Maybe I am missing something, and I very much admire that you spent the time on the KEP, but there might be some fundamental problems as mentioned above regarding security and scalability that we did not have in the KFP V1 implementation. But as I said, some of my statements could be wrong, and that is just my initial assessment without checking it thoroughly. What I also do not understand is that by default we ship SeaweedFS and you do not have to configure object storage yourself. So where does this make it easier for beginners? I think it is actually more difficult.
Signed-off-by: Helber Belmiro <[email protected]>
Signed-off-by: Helber Belmiro <[email protected]>
There will be two modes: central and namespace-local. Users should choose whichever mode fits their scenario, or not use the filesystem storage at all if it doesn't make sense for them. The idea is not to replace the existing storage solutions, but to add a new alternative.
I'm not sure I'm following. Please correct me if I'm wrong, but the existing s3 solutions already break the namespace isolation contract once we have one instance for all namespaces. With the namespace-local mode proposed here we can achieve complete namespace isolation, which we don't have today.
Thanks for clarifying regarding SeaweedFS. I updated the motivation with that in mind. You can see the specific commit here. Taking a look at the Goals and Non-Goals may clarify some things before going deep into the proposal.
…pace configuration. Signed-off-by: Helber Belmiro <[email protected]>
That is not the case anymore. SeaweedFS is now multi-tenant, with ACLs and credentials per namespace, as of KFP release 2.15. We even have tests that verify the namespace isolation of SeaweedFS. Therefore I doubt that the central mode is needed at all. It just adds complexity and decreases security.
> #### Story 1: User Running Pipelines on Local Kubernetes
>
> As a user with KFP on kind/minikube/k3s, I want my pipeline artifacts to automatically use the local cluster's default `StorageClass` via the `kfp-artifacts://` scheme, so that I can develop pipelines offline without any storage configuration.
>
> **Acceptance Criteria:**
>
> - KFP works out-of-the-box with filesystem storage on local clusters
> - Artifacts are stored using the `kfp-artifacts://` URI scheme
> - No S3/GCS credentials required
> - Artifact viewing in UI works seamlessly
That is already covered by the default SeaweedFS with zero effort. No storage configuration is needed; SeaweedFS just uses the default StorageClass. It already satisfies:

- KFP works out-of-the-box with filesystem storage on local clusters
- No S3/GCS credentials required
- Artifact viewing in UI works seamlessly

So I recommend removing this story.
I agree. This user story is already covered. The KEP just proposes a new way to do this.
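For context, the zero-configuration PVC this story relies on could be as simple as the sketch below; the name and size are illustrative assumptions, and omitting `storageClassName` makes Kubernetes bind it to the cluster's default `StorageClass`, which is the behavior the story depends on.

```yaml
# Hypothetical PVC a local KFP install could create for artifact storage.
# Leaving storageClassName unset binds it to the cluster's default StorageClass.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kfp-artifacts          # illustrative name, not defined by the KEP
  namespace: kubeflow
spec:
  accessModes:
    - ReadWriteOnce            # RWO is the default access mode proposed in the KEP
  resources:
    requests:
      storage: 20Gi            # illustrative size
```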
> #### Story 2: Operator Deploying KFP Without External Object Storage
>
> As an operator for a Kubeflow distribution, I want to deploy KFP with filesystem storage so that I don't need to productize and support a separate object storage system.
>
> **Acceptance Criteria:**
>
> - Single configuration option to enable filesystem storage
> - Artifact handling is part of KFP (no separate object storage component)
> - Storage automatically provisioned via PVCs
> - Backup/restore follows standard Kubernetes PVC procedures
That is already covered by the default SeaweedFS with zero effort. No storage configuration is needed; SeaweedFS just uses the default StorageClass. It already satisfies:

- Artifact handling is part of KFP (no separate object storage component)
- Storage automatically provisioned via PVCs
- Backup/restore follows standard Kubernetes PVC procedures

So I recommend removing this story.
I think this user story still applies as it's about the admin not wanting to maintain an object storage service, which is valid for use cases where the organization wants on-premise storage (no AWS access) but doesn't have an enterprise license/support for SeaweedFS.
> #### Story 3: Operator Configuring Storage Class and Size
>
> As an operator, I want to configure KFP to use a specific StorageClass and PVC size instead of defaults, so that I can match storage performance and capacity to my workload requirements.
>
> **Acceptance Criteria:**
>
> - Can specify `StorageClass` in KFP configuration
> - Can set PVC size limits (global configuration)
> - Storage quotas enforced via Kubernetes `ResourceQuotas`
> - Clear error messages when storage limits are reached
> - Can choose between RWO and RWX access modes based on needs
That is already covered by the default SeaweedFS with zero effort. You can already decide which StorageClass is used for SeaweedFS. It already satisfies:

- Can specify `StorageClass` in KFP configuration
- Can set PVC size limits (global configuration)
- Clear error messages when storage limits are reached
- Can choose between RWO and RWX access modes based on needs

So I recommend removing this story.

The only thing remaining is "Storage quotas enforced via Kubernetes `ResourceQuotas`". We can set it per bucket already, but I am not sure about per folder. So it is partially solved, and fully solved if you create one bucket per namespace. This item can be moved to another story.
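For reference, per-namespace caps on PVC-backed artifact storage can already be expressed with a standard Kubernetes `ResourceQuota`; the values below are illustrative, not taken from the KEP.

```yaml
# Illustrative ResourceQuota capping total PVC storage and PVC count in a team namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: artifact-storage-quota   # illustrative name
  namespace: team-a
spec:
  hard:
    requests.storage: 100Gi      # total storage all PVCs in this namespace may request
    persistentvolumeclaims: "5"  # maximum number of PVCs in this namespace
```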
> #### Story 5: Operator Deploying Multi-Tenant KFP with Namespace Isolation
>
> As an operator, I want to deploy KFP in namespace-local mode where each namespace annotated with `pipelines.kubeflow.org/enabled=true` gets its own artifact server pod and dedicated PVC, so that Team A's artifacts in namespace `team-a` are physically isolated from Team B's artifacts in namespace `team-b`.
>
> **Acceptance Criteria:**
>
> - Each namespace with `pipelines.kubeflow.org/enabled=true` annotation gets its own artifact server deployment
> - Each namespace gets its own dedicated PVC (no shared storage)
> - Artifact server in `team-a` namespace cannot access PVC in `team-b` namespace
> - Users can only access artifacts in namespaces they have RBAC permissions for (via `SubjectAccessReview`)
> - Physical isolation verified: deleting `team-a` namespace doesn't affect `team-b`'s artifacts
This story has been made completely obsolete by the multi-tenant default SeaweedFS. It would just be more complicated and break zero-overhead namespaces. It's all done already in a secure manner and without massive overhead from additional artifact servers per namespace. Just imagine 1000 namespaces and 1000 extra pods when idle.
> #### Story 6: Operator Preferring KFP-Native Storage
>
> As an operator in a regulated environment (e.g., healthcare, finance), I want to deploy KFP with filesystem storage using an encrypted `StorageClass` (e.g., `encrypted-gp3`), so that artifact handling stays within the KFP codebase and I don't need to include a separate object storage system in my security audits.
>
> **Acceptance Criteria:**
>
> - All artifacts stored on PVCs within the cluster
> - KFP configuration uses `Filesystem.Type: "pvc"` with encrypted `StorageClass`
> - `SubjectAccessReview` validates all artifact access requests
> - Encryption at rest provided by the configured `StorageClass` (e.g., `encrypted-gp3`)
> - No separate object storage component to audit
This story is also obsolete because you can just make the SeaweedFS PVC encrypted.
I agree. I think we can remove this.
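For reference, encryption at rest in either approach comes from the `StorageClass` the PVC references; a minimal sketch, assuming the AWS EBS CSI driver (the `encrypted-gp3` name mirrors the example in the quoted story and is not prescribed anywhere):

```yaml
# Illustrative encrypted StorageClass (AWS EBS CSI driver assumed); any PVC that
# references it, including the SeaweedFS PVC, gets encryption at rest.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: encrypted-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
```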
> #### Story 7: Operator Running KFP on Storage-Constrained Infrastructure
>
> As an operator with limited storage budget (only 1TB total available), I want to deploy KFP in central mode with a single 500GB PVC shared across 10 team namespaces, so that all teams can run pipelines without each needing their own 100GB PVC (which would require 1TB total).
>
> **Acceptance Criteria:**
>
> - Central mode configured with: `DeploymentMode: "central"`, `Size: "500Gi"`
> - Single PVC created in `kubeflow` namespace mounted by one artifact server
> - All 10 teams' artifacts stored in `/artifacts/<namespace>/` directories on same PVC
> - Teams can run pipelines concurrently without storage allocation failures
> - No per-namespace storage limits (trade-off of central mode - shared PVC means no per-team quotas)
That is also fully covered by the default multi-tenant SeaweedFS: just one single PVC in the kubeflow namespace.
> #### Story 8: Operator Scaling High-Throughput Model Training Platform
>
> As an operator supporting 100+ concurrent pipeline runs with multi-GB model checkpoints, I want to deploy KFP in central mode with RWX storage (e.g., NFS/CephFS) and multiple artifact server replicas behind a load balancer, so that artifact upload/download operations can scale horizontally without bottlenecks.
>
> **Acceptance Criteria:**
>
> - Can deploy artifact servers in both central and namespace-local modes
> - Artifact servers stream large files without loading into memory
> - Can use high-performance `StorageClasses` (e.g., SSD-backed)
> - Horizontal scaling possible in central mode
> - Direct pod-to-pod communication in namespace-local mode reduces latency
That is also made obsolete by merging #12391, i.e. a distributed SeaweedFS.
> #### Story 9: Operator with Mixed Isolation Requirements
>
> As an operator, I want to deploy KFP in central mode by default, but configure specific namespaces (e.g., `team-finance`) to use namespace-local mode for stricter isolation, so that most teams share the simple central server while sensitive teams get dedicated resources.
>
> **Acceptance Criteria:**
>
> - Global deployment mode set to `central` (default)
> - Specific namespaces can override to `namespaced` via their `kfp-launcher` ConfigMap
> - Teams using central mode share the central artifact server
> - Teams with namespace-local override get their own artifact server and PVC
> - UI correctly routes artifact requests based on each namespace's deployment mode
> - No cluster-wide restart needed to change a namespace's mode
Suggested addition:

> #### Story 9: Operator with enterprise isolation requirements and fast local storage for giant models
>
> I want to have, as with `data_passing_method()` in KFP v1, a way to have a per-pipeline or per-namespace PVC backed by very fast local storage, so that I do not need to upload and download large models to and from S3 for each step.
That is a security nightmare I would not want to support or recommend to any enterprise. It is just worse than the default multi-tenant SeaweedFS. I added a typical use case I have seen so far.
> ##### Mode 1: Central Artifact Server (Default)
>
> A single artifact server in the main KFP namespace serves all namespaces, configured via `ObjectStoreConfig.ArtifactServer.DeploymentMode: "central"`.
>
> **Central Mode Characteristics:**
>
> - **Single PVC** with directory structure: `/artifacts/<namespace>/<pipeline>/<run-id>/<node-id>/<artifact-name>`
> - **Authorization**: Uses `SubjectAccessReview` to verify namespace access
> - **Best for**: Simple deployments, single-user setups, small teams
> - **Advantages**: Simple setup, single storage location, easy backup
> - **Limitations**: All namespaces share same PVC and storage quota
For the reasons mentioned above and below, this mode has been made fully obsolete by the default multi-tenant SeaweedFS. Furthermore, it is a security nightmare and adds unnecessary complexity. There is zero benefit in implementing this.
> ##### Mode 2: Namespace-Local Artifact Servers
>
> Each namespace runs its own artifact server, configured via `ObjectStoreConfig.ArtifactServer.DeploymentMode: "namespaced"`.
>
> **Namespace-Local Mode Characteristics:**
>
> - **PVC per namespace**: Complete storage isolation
> - **Direct access**: Clients connect directly to namespace servers (no proxying)
> - **Authorization**: Natural isolation (each server only accesses its namespace's PVC)
> - **Best for**: Large multi-tenant deployments, strict isolation requirements
> - **Advantages**: True multi-tenancy, per-namespace scaling, independent quotas
> - **Deployment**: Proactive initialization when namespace has `pipelines.kubeflow.org/enabled=true` annotation
We already have multi-tenancy from the default SeaweedFS. The question is how we can avoid the per-namespace overhead and not break zero-overhead namespaces when idle. This was already working well in KFP v1. Just porting the KFP v1 `data_passing_method()` to KFP v2 would be enough.
> ##### Mixed Mode Support
>
> For multi-tenant deployments with varying isolation requirements, administrators can configure different deployment modes per namespace using the `kfp-launcher` ConfigMap:
>
> ```yaml
> apiVersion: v1
> kind: ConfigMap
> metadata:
>   name: kfp-launcher
>   namespace: team-requiring-isolation
> data:
>   defaultPipelineRoot: "kfp-artifacts://team-requiring-isolation"
>   artifactServer: |
>     deploymentMode: namespaced
> ```
>
> The `artifactServer` key contains a YAML block that mirrors the global `ObjectStoreConfig.ArtifactServer` structure, allowing consistent configuration patterns across global and per-namespace settings.
>
> This enables scenarios where:
>
> - Most namespaces use the simpler central mode (global default)
> - Specific namespaces requiring strict isolation use namespace-local mode
> - Teams can be migrated between modes without cluster-wide changes
>
> #### Request Routing
>
> Based on the configured mode (global default or per-namespace override from `kfp-launcher` ConfigMap), artifact URIs are resolved differently:
>
> **Central Mode:**
>
> ```text
> Client
>   │
>   │ GET kfp-artifacts://<namespace>/...
>   ▼
> KFP API Server (central)
>   │
>   │ Authorization check (SubjectAccessReview)
>   ▼
> Serve from /artifacts/<namespace>/...
> ```
As explained above, the central mode is dangerous and worse than the default multi-tenant SeaweedFS.
> ###### Central Mode Architecture
>
> ```text
> ┌──────────────────────────────────────┐
> │               KFP SDK                │
> └──────────────────┬───────────────────┘
>                    │ pipeline_root: "kfp-artifacts://..."
>                    ▼
> ┌──────────────────────────────────────┐
> │            Pipeline Spec             │
> └──────────────────┬───────────────────┘
>                    │
>                    ▼
> ┌──────────────────────────────────────┐
> │               Compiler               │
> │    (generates artifact API calls)    │
> └──────────────────┬───────────────────┘
>                    │
>                    ▼
> ┌──────────────────────────────────────┐
> │                Driver                │
> │  (validates artifact server exists)  │
> └──────────────────┬───────────────────┘
>                    │
>                    ▼
> ┌──────────────────────────────────────┐
> │               Launcher               │
> │     (uploads/downloads via API)      │
> └──────────────────┬───────────────────┘
>                    │ API calls
>                    ▼
> ┌──────────────────────────────────────┐
> │       Central Artifact Server        │
> │        (namespace: kubeflow)         │
> │                                      │
> │  • SubjectAccessReview               │
> │  • Mounts central PVC                │
> │  • Serves all namespaces             │
> └──────────────────┬───────────────────┘
>                    │ mounts
>                    ▼
> ┌──────────────────────────────────────┐
> │ Central PVC (kfp-artifacts-central)  │
> │   /artifacts/                        │
> │   ├── ns1/                           │
> │   ├── ns2/                           │
> │   └── ns3/                           │
> └──────────────────────────────────────┘
> ```
As explained above, the central mode is dangerous and worse than the default multi-tenant SeaweedFS.
/hold
Are you saying that SeaweedFS uses one PVC per namespace (like the namespace-local mode in the proposal)?
No, it uses hard multi-tenant, separated S3 storage that is accessed directly by the ml-pipeline-ui by default; no proxy needed. See also https://github.com/kubeflow/manifests#architecture for my diagram.
This is also covered in the new release and blog posts.
> This KEP proposes adding filesystem-based storage as an alternative artifact storage backend for Kubeflow Pipelines v2. While KFP currently ships with S3-compatible storage by default, some deployments prefer not to depend on a separate object storage system. This proposal introduces filesystem storage as an additional option where artifact handling is integrated into KFP itself, eliminating the need for an external object storage component.
>
> The filesystem backend will primarily use `PersistentVolumeClaim` (PVC) based storage in Kubernetes environments, providing namespace-isolated storage using Kubernetes native `PersistentVolumes`. However, the design is flexible enough to support other filesystem backends (e.g., local filesystem for development). Users can specify any access mode value which KFP will pass through to Kubernetes without validation (e.g., `ReadWriteMany` for parallel task execution across nodes, `ReadWriteOnce` for single node access, etc.), with RWO as the default if not specified. The actual behavior depends on what the underlying storage class supports. Existing pipelines will work without modification unless they contain hardcoded S3/object storage paths.
This paragraph implies the PVCs are mounted to the pods "which KFP will pass through to Kubernetes without validation". The following paragraph contradicts that.
> | **URI Scheme** | `s3://`, `gs://`, `minio://` | `kfp-artifacts://` |
> | **Architecture** | Separate object storage service | KFP-native artifact server |
> | **Required Knowledge** | S3 concepts (buckets, endpoints, regions) | Kubernetes concepts (PVCs, StorageClasses) |
> | **Multi-tenancy** | Shared storage (single instance) | Per-namespace PVCs (in namespace-local mode) |
Per-namespace PVCs should be optional. In other words, default to the central artifact server, respecting multi-tenancy, but allow individual namespaces to use a different solution (e.g. S3 or another artifact server).
> ```text
> Driver (detects "kfp-artifacts://")
>   │
>   ▼
> Artifact Server (mounts PVC)
> ```
Wouldn't the PVC always be mounted?
> ### Key Components
>
> - **Storage**: Kubernetes `PersistentVolumeClaims` (one per namespace)
> - **URI Format**: `kfp-artifacts://<namespace>/<pipeline>/<run-id>/<node-id>/<artifact-name>`
Could you be more specific on <pipeline>? Is it a name or ID? Is it a pipeline or pipeline version?
Also, what is ?
> A dedicated endpoint provides filesystem storage configuration and routing:
>
> ```http
> GET /apis/v2beta1/filesystem-storage/config
> ```
I don't think we need this. The artifact URI should contain the hostname of the artifact server that was used and it can be parsed directly to know how to route the request.
> **This proposal does not aim to replace existing object storage solutions.** S3-compatible storage remains fully supported and recommended for most production workloads. Instead, this KEP provides an additional option for deployments where a simpler, KFP-native artifact storage solution is preferred.
>
> While KFP currently ships with S3-compatible storage by default, this still requires deploying and maintaining a separate object storage service. For some deployment scenarios, this additional component may not be desired.
This also helps with running KFP locally, outside of Kubernetes.
> While KFP currently ships with S3-compatible storage by default, this still requires deploying and maintaining a separate object storage service. For some deployment scenarios, this additional component may not be desired.
>
> ### Reduced External Dependencies
I don't think this section adds a lot and could be consolidated in the Motivation section.
> Many enterprises and Kubeflow distributions prefer not to have additional external dependencies. With filesystem storage:
>
> - No separate object storage project to productize and support
Also include that this will directly leverage Kubernetes RBAC, aligned with existing permission mechanisms used by other parts of KFP. This makes onboarding and provisioning new namespaces simpler.
> ### Enterprise Considerations
>
> Many enterprises and Kubeflow distributions prefer not to have additional external dependencies. With filesystem storage:
Another aspect is that you'll automatically get namespace isolation of artifacts on the central artifact server through namespace-aware paths and Kubernetes RBAC.
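For reference, the per-request authorization check discussed here maps onto a standard Kubernetes `SubjectAccessReview`; the resource attributes below are an assumption about what a central artifact server might check, since the KEP does not pin down the exact resource.

```yaml
# Illustrative SubjectAccessReview a central artifact server could create to verify
# that the requesting user may read artifacts in the namespace parsed from the path.
apiVersion: authorization.k8s.io/v1
kind: SubjectAccessReview
spec:
  user: alice@example.com          # identity taken from the incoming request (assumed)
  resourceAttributes:
    namespace: team-a              # namespace segment of the artifact path
    verb: get
    group: pipelines.kubeflow.org  # assumed API group for the check
    resource: runs                 # assumed resource; not specified by the KEP
```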
> In namespace-local mode, each namespace gets its own dedicated artifact server and PVC. This provides:
>
> - **Storage isolation**: Each team's artifacts are physically separated in their own PVC
> - **Independent scaling**: Teams can scale their artifact server horizontally (with RWX storage) and size their PVC based on workload requirements
Suggested change:

> - **Independent scaling**: Teams can scale their artifact server horizontally and size their PVC based on workload requirements
> ### Per-Namespace Isolation and Scaling
>
> In namespace-local mode, each namespace gets its own dedicated artifact server and PVC. This provides:
I don't think we should have a specific mode.
Essentially, the default behavior when using the KFP artifact server would be to use the central/default instance, but every namespace can override this configuration using the kfp-launcher ConfigMap or equivalent.
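As an illustration of that override path, a namespace could point its pipeline root somewhere else entirely through its `kfp-launcher` ConfigMap; this sketch only uses the existing `defaultPipelineRoot` key, and any credentials or provider settings the S3 root needs are assumed to be configured separately.

```yaml
# Hypothetical per-namespace override: team-b keeps using S3 instead of the
# central/default KFP artifact server. Only defaultPipelineRoot is shown;
# provider credentials are configured elsewhere.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kfp-launcher
  namespace: team-b
data:
  defaultPipelineRoot: "s3://team-b-artifacts/pipelines"
```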
> **When to use filesystem storage:**
>
> - Deployments where eliminating object storage dependency is preferred
> - Environments where Kubeflow distributions or platform providers prefer not to support additional storage systems
Suggested change:

> - Environments where Kubeflow distributions or platform providers do not offer a fully supported object store solution
> - Deployments where eliminating object storage dependency is preferred
> - Environments where Kubeflow distributions or platform providers prefer not to support additional storage systems
> - Multi-tenant deployments requiring per-namespace storage isolation, scaling, and quotas
I think this can all be achieved through S3.
> Based on the user story "As a user, I want to provision Kubeflow Pipelines with just a PVC for artifact storage so that I can quickly get started", this KEP aims to:
>
> 1. **Add filesystem storage as an additional backend option** alongside S3-compatible and Google Cloud Storage, primarily using PVC but not limited to it
What do you mean by "but not limited to it"?
> 1. **Add filesystem storage as an additional backend option** alongside S3-compatible and Google Cloud Storage, primarily using PVC but not limited to it
> 2. **Enable zero-configuration storage** for experimentation use cases - a KFP server can be installed with just a PVC for artifact storage
> 3. **Provide namespace-isolated artifact storage** with proper subject access review guards in multi-user mode
> 4. **Allow any Kubernetes access mode to be configured** - KFP passes through the configuration to Kubernetes (RWO default)
What does this mean?
> 4. **Allow any Kubernetes access mode to be configured** - KFP passes through the configuration to Kubernetes (RWO default)
> 5. **Support existing pipelines** that use KFP's standard artifact types (Dataset, Model, etc.) - pipelines work unchanged with the new filesystem backend
> 6. **Match existing artifact persistence behavior** - artifacts persist indefinitely until explicitly deleted (no automatic cleanup)
> 7. **Enable separate scaling of artifact serving** through an artifacts-only KFP instance with `--artifacts-only` flag
We should also support deploying the artifact server as a DaemonSet (ensuring a pod is on every Kubernetes node) and set the Service to internalTrafficPolicy: Local to keep all artifact traffic local to the Kubernetes node.
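A minimal sketch of that deployment shape, assuming a hypothetical `kfp-artifact-server` image, port, and labels; the points being illustrated are only the DaemonSet kind and `internalTrafficPolicy: Local` on the Service.

```yaml
# Illustrative DaemonSet + Service that keeps artifact traffic on the local node.
# Image, port, and labels are placeholders, not defined by the KEP.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kfp-artifact-server
  namespace: kubeflow
spec:
  selector:
    matchLabels:
      app: kfp-artifact-server
  template:
    metadata:
      labels:
        app: kfp-artifact-server
    spec:
      containers:
        - name: artifact-server
          image: ghcr.io/example/kfp-artifact-server:latest  # placeholder image
          ports:
            - containerPort: 8443
---
apiVersion: v1
kind: Service
metadata:
  name: kfp-artifact-server
  namespace: kubeflow
spec:
  selector:
    app: kfp-artifact-server
  ports:
    - port: 8443
      targetPort: 8443
  internalTrafficPolicy: Local   # only route to the artifact server pod on the same node
```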
> This KEP proposes adding a new artifact storage backend that uses filesystem storage (primarily Kubernetes `PersistentVolumeClaims`) instead of object storage. The implementation will:
>
> 1. Create one PVC per namespace for artifact storage
We should not require this as the central artifact server should be namespace aware (namespace in the path and subject access review based on the namespace in the path).
> This KEP proposes adding a new artifact storage backend that uses filesystem storage (primarily Kubernetes `PersistentVolumeClaims`) instead of object storage. The implementation will:
>
> 1. Create one PVC per namespace for artifact storage
> 2. Use configurable access mode with sensible defaults (RWO)
What does this mean?
> 1. Create one PVC per namespace for artifact storage
> 2. Use configurable access mode with sensible defaults (RWO)
> 3. Organize artifacts in a filesystem hierarchy within the PVC
Suggested change:

> 3. Organize artifacts in a filesystem hierarchy within the PVC that is namespace aware
> 1. Create one PVC per namespace for artifact storage
> 2. Use configurable access mode with sensible defaults (RWO)
> 3. Organize artifacts in a filesystem hierarchy within the PVC
> 4. Provide transparent access through the existing KFP artifact APIs with new `kfp-artifacts://` URI scheme
Do we have "KFP artifact APIs"? Perhaps this is referring to @HumairAK 's MLMD removal PR.
> 2. Use configurable access mode with sensible defaults (RWO)
> 3. Organize artifacts in a filesystem hierarchy within the PVC
> 4. Provide transparent access through the existing KFP artifact APIs with new `kfp-artifacts://` URI scheme
> 5. Maintain compatibility with existing pipeline definitions that don't have hardcoded storage paths
When is this ever the case?
> #### Story 3: Operator Configuring Storage Class and Size
>
> As an operator, I want to configure KFP to use a specific StorageClass and PVC size instead of defaults, so that I can match storage performance and capacity to my workload requirements.
This is implied whenever you use a PVC, so if the user story is about making artifact storage configurable per namespace (e.g. use a dedicated artifact server or use S3 in this namespace), then I think this can be removed.
> #### Story 4: User Migrating from S3 to Filesystem Storage
>
> As a user with existing pipelines containing components that call `boto3.upload_file()` directly, I want KFP system artifacts to use `kfp-artifacts://` with PVC storage while my custom components continue accessing S3, so that I can migrate incrementally without rewriting all components at once.
I don't think this user story is relevant. That's just a user's custom Python code.
> #### Story 5: Operator Deploying Multi-Tenant KFP with Namespace Isolation
>
> As an operator, I want to deploy KFP in namespace-local mode where each namespace annotated with `pipelines.kubeflow.org/enabled=true` gets its own artifact server pod and dedicated PVC, so that Team A's artifacts in namespace `team-a` are physically isolated from Team B's artifacts in namespace `team-b`.
Like I said, we don't need the concept of modes. We just need a central artifact server and allow each namespace to override to use a different solution (e.g. dedicated artifact server or S3) through the kfp-launcher ConfigMap.
An admin can still opt in to this by configuring each provisioned namespace, but this largely defeats the spirit of the KEP of making administration easier (and potentially cheaper if enterprise licensing is required) with fewer components involved.
> 6. Support separate scaling of artifact serving through artifacts-only instances
> 7. Update the UI to seamlessly handle artifact downloads from filesystem storage
>
> ### User Stories
I think we only need 2 user stories:
- As an admin with a requirement to have storage on-premise, I want a simple artifact storage solution for KFP artifacts without having to maintain a separate service for artifacts due to administrative overhead or enterprise licensing costs.
- As an admin using the KFP artifact server, I want the ability to override a namespace's artifact configuration to use alternative storage such as S3 or a dedicated artifact server.

Resolves: #12513