Conversation

@hbelmiro hbelmiro commented Dec 3, 2025

Resolves: #12513

@google-oss-prow

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@hbelmiro hbelmiro marked this pull request as ready for review December 3, 2025 19:53
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zazulam for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


juliusvonkohout commented Dec 4, 2025

Here are some thoughts, but I still need to read the full document.

That is very far away from the V1 artifact passing via PVC and would violate zero-overhead namespaces. Just imagine a cluster with 1000 namespaces for scalability: then you add 1000 permanent pods, which is massive overhead. In V1 we simply accepted that the artifacts would not appear in the UI. A PVC for all namespaces also sounds scary. Why should we offer such a security nightmare in the first place? That would break the namespace isolation contract.

Maybe I am missing something, and I very much admire that you spent the time on the KEP, but there might be some fundamental problems, as mentioned above, regarding security and scalability that we did not have in the KFP V1 implementation.

But as I said, some of my statements could be wrong; this is just my initial assessment without checking it thoroughly.

What I also do not understand is that by default we ship SeaweedFS, so you do not have to configure object storage yourself. So where does this make things easier for beginners? I think it is actually more difficult.
I see the main benefit in enterprise environments if you want purely per-namespace storage/PVCs for faster performance than S3, per-namespace storage quotas, and stricter isolation. For everyone else I would recommend sticking with SeaweedFS.


hbelmiro commented Dec 4, 2025

@juliusvonkohout

Here are some thoughts, but I still need to read the full document.

That is very far away from the V1 artifact passing via PVC and would violate zero-overhead namespaces. Just imagine a cluster with 1000 namespaces for scalability: then you add 1000 permanent pods, which is massive overhead.

There will be two modes: central and namespace-local.

Users should choose whichever mode fits their scenario, or not use the filesystem storage at all if it doesn't make sense for them. The idea is not to replace the existing storage solutions but to add a new alternative.

In V1 we simply accepted that the artifacts would not appear in the UI. A PVC for all namespaces also sounds scary. Why should we offer such a security nightmare in the first place? That would break the namespace isolation contract.

Maybe I am missing something, and I very much admire that you spent the time on the KEP, but there might be some fundamental problems, as mentioned above, regarding security and scalability that we did not have in the KFP V1 implementation.

But as I said, some of my statements could be wrong; this is just my initial assessment without checking it thoroughly.

I'm not sure I'm following. Please correct me if I'm wrong, but the existing S3 solutions already break the namespace isolation contract once there is one instance for all namespaces. With the namespace-local mode proposed here we can achieve complete namespace isolation, which we don't have today.

What I also do not understand is that by default we ship SeaweedFS, so you do not have to configure object storage yourself. So where does this make things easier for beginners? I think it is actually more difficult. I see the main benefit in enterprise environments if you want purely per-namespace storage/PVCs for faster performance than S3, per-namespace storage quotas, and stricter isolation. For everyone else I would recommend sticking with SeaweedFS.

Thanks for clarifying regarding SeaweedFS. I updated the motivation with that in mind. You can see the specific commit here.

Maybe taking a look at Goals and Non-Goals will clarify some things before going deep into the proposal.

@juliusvonkohout

I'm not sure I'm following. Please correct me if I'm wrong, but the existing S3 solutions already break the namespace isolation contract once there is one instance for all namespaces. With the namespace-local mode proposed here we can achieve complete namespace isolation, which we don't have today.

That is not the case anymore. SeaweedFS is now multi-tenant, with ACLs and credentials per namespace, as of KFP release 2.15. We even have tests that verify the namespace isolation of SeaweedFS. Therefore I doubt that the central mode is needed at all. It just adds complexity and decreases security.

Comment on lines +205 to +214
#### Story 1: User Running Pipelines on Local Kubernetes

As a user with KFP on kind/minikube/k3s, I want my pipeline artifacts to automatically use the local cluster's default `StorageClass` via the `kfp-artifacts://` scheme, so that I can develop pipelines offline without any storage configuration.

**Acceptance Criteria:**

- KFP works out-of-the-box with filesystem storage on local clusters
- Artifacts are stored using the `kfp-artifacts://` URI scheme
- No S3/GCS credentials required
- Artifact viewing in UI works seamlessly
Member

Suggested change
#### Story 1: User Running Pipelines on Local Kubernetes
As a user with KFP on kind/minikube/k3s, I want my pipeline artifacts to automatically use the local cluster's default `StorageClass` via the `kfp-artifacts://` scheme, so that I can develop pipelines offline without any storage configuration.
**Acceptance Criteria:**
- KFP works out-of-the-box with filesystem storage on local clusters
- Artifacts are stored using the `kfp-artifacts://` URI scheme
- No S3/GCS credentials required
- Artifact viewing in UI works seamlessly

That is already covered by the default SeaweedFS with zero effort. No storage configuration is needed; SeaweedFS just uses the default StorageClass. It already satisfies:

  • KFP works out-of-the-box with filesystem storage on local clusters
  • No S3/GCS credentials required
  • Artifact viewing in UI works seamlessly

So I recommend removing this story.

Collaborator

I agree. This user story is already covered. The KEP just proposes a new way to do this.

Comment on lines +216 to +225
#### Story 2: Operator Deploying KFP Without External Object Storage

As an operator for a Kubeflow distribution, I want to deploy KFP with filesystem storage so that I don't need to productize and support a separate object storage system.

**Acceptance Criteria:**

- Single configuration option to enable filesystem storage
- Artifact handling is part of KFP (no separate object storage component)
- Storage automatically provisioned via PVCs
- Backup/restore follows standard Kubernetes PVC procedures
Member

Suggested change
#### Story 2: Operator Deploying KFP Without External Object Storage
As an operator for a Kubeflow distribution, I want to deploy KFP with filesystem storage so that I don't need to productize and support a separate object storage system.
**Acceptance Criteria:**
- Single configuration option to enable filesystem storage
- Artifact handling is part of KFP (no separate object storage component)
- Storage automatically provisioned via PVCs
- Backup/restore follows standard Kubernetes PVC procedures

That is already covered by the default SeaweedFS with zero effort. No storage configuration is needed; SeaweedFS just uses the default StorageClass. It already satisfies:

  • Artifact handling is part of KFP (no separate object storage component)
  • Storage automatically provisioned via PVCs
  • Backup/restore follows standard Kubernetes PVC procedures

So I recommend removing this story.

Collaborator

I think this user story still applies as it's about the admin not wanting to maintain an object storage service, which is valid for use cases where the organization wants on-premise storage (no AWS access) but doesn't have an enterprise license/support for SeaweedFS.

Comment on lines +227 to +237
#### Story 3: Operator Configuring Storage Class and Size

As an operator, I want to configure KFP to use a specific StorageClass and PVC size instead of defaults, so that I can match storage performance and capacity to my workload requirements.

**Acceptance Criteria:**

- Can specify `StorageClass` in KFP configuration
- Can set PVC size limits (global configuration)
- Storage quotas enforced via Kubernetes `ResourceQuotas`
- Clear error messages when storage limits are reached
- Can choose between RWO and RWX access modes based on needs
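For reference, the per-namespace quota item in this story maps onto a standard Kubernetes `ResourceQuota`; a minimal sketch (the quota name, namespace, and sizes are illustrative, not from the KEP):

```yaml
# Illustrative sketch: cap the total PVC capacity a namespace can request.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: artifact-storage-quota   # hypothetical name
  namespace: team-a              # hypothetical namespace
spec:
  hard:
    requests.storage: "100Gi"    # total storage requestable via PVCs
    persistentvolumeclaims: "5"  # maximum number of PVCs in the namespace
```

With such a quota in place, a PVC creation that would exceed the limit is rejected by the API server, which is one way the "clear error messages when storage limits are reached" criterion could surface.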
@juliusvonkohout juliusvonkohout Dec 5, 2025

Suggested change
#### Story 3: Operator Configuring Storage Class and Size
As an operator, I want to configure KFP to use a specific StorageClass and PVC size instead of defaults, so that I can match storage performance and capacity to my workload requirements.
**Acceptance Criteria:**
- Can specify `StorageClass` in KFP configuration
- Can set PVC size limits (global configuration)
- Storage quotas enforced via Kubernetes `ResourceQuotas`
- Clear error messages when storage limits are reached
- Can choose between RWO and RWX access modes based on needs

That is already covered by the default SeaweedFS with zero effort. You can already decide which StorageClass is used for SeaweedFS. It already satisfies:

  • Can specify StorageClass in KFP configuration
  • Can set PVC size limits (global configuration)
  • Clear error messages when storage limits are reached
  • Can choose between RWO and RWX access modes based on needs

So I recommend removing this story.

The only thing remaining is "Storage quotas enforced via Kubernetes ResourceQuotas". We can set quotas per bucket already, but I am not sure about per folder. So it is partially solved, and fully solved if you create one bucket per namespace. This item can be moved to another story.

Comment on lines +250 to +260
#### Story 5: Operator Deploying Multi-Tenant KFP with Namespace Isolation

As an operator, I want to deploy KFP in namespace-local mode where each namespace annotated with `pipelines.kubeflow.org/enabled=true` gets its own artifact server pod and dedicated PVC, so that Team A's artifacts in namespace `team-a` are physically isolated from Team B's artifacts in namespace `team-b`.

**Acceptance Criteria:**

- Each namespace with `pipelines.kubeflow.org/enabled=true` annotation gets its own artifact server deployment
- Each namespace gets its own dedicated PVC (no shared storage)
- Artifact server in `team-a` namespace cannot access PVC in `team-b` namespace
- Users can only access artifacts in namespaces they have RBAC permissions for (via `SubjectAccessReview`)
- Physical isolation verified: deleting `team-a` namespace doesn't affect `team-b`'s artifacts
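The RBAC check named in the acceptance criteria is Kubernetes' standard `SubjectAccessReview` API. As a sketch, a request the artifact server might submit could look like this (the `group` and `resource` attribute values are assumptions for illustration; the KEP does not fix them):

```yaml
# Sketch: ask the Kubernetes API server whether a user may read
# artifacts in namespace team-a. Attribute values are assumed.
apiVersion: authorization.k8s.io/v1
kind: SubjectAccessReview
spec:
  user: alice@example.com
  resourceAttributes:
    namespace: team-a
    verb: get
    group: pipelines.kubeflow.org   # assumed API group
    resource: artifacts             # assumed resource name
```

The API server answers with `status.allowed: true/false`, which the artifact server would use to grant or deny the request.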
Member

Suggested change
#### Story 5: Operator Deploying Multi-Tenant KFP with Namespace Isolation
As an operator, I want to deploy KFP in namespace-local mode where each namespace annotated with `pipelines.kubeflow.org/enabled=true` gets its own artifact server pod and dedicated PVC, so that Team A's artifacts in namespace `team-a` are physically isolated from Team B's artifacts in namespace `team-b`.
**Acceptance Criteria:**
- Each namespace with `pipelines.kubeflow.org/enabled=true` annotation gets its own artifact server deployment
- Each namespace gets its own dedicated PVC (no shared storage)
- Artifact server in `team-a` namespace cannot access PVC in `team-b` namespace
- Users can only access artifacts in namespaces they have RBAC permissions for (via `SubjectAccessReview`)
- Physical isolation verified: deleting `team-a` namespace doesn't affect `team-b`'s artifacts

This story has been made completely obsolete by the multi-tenant default SeaweedFS. It would just be more complicated and would break zero-overhead namespaces. It's all done already in a secure manner and without massive overhead from additional artifact servers per namespace. Just imagine 1000 namespaces and 1000 extra pods when idle.

Comment on lines +262 to +272
#### Story 6: Operator Preferring KFP-Native Storage

As an operator in a regulated environment (e.g., healthcare, finance), I want to deploy KFP with filesystem storage using an encrypted `StorageClass` (e.g., `encrypted-gp3`), so that artifact handling stays within the KFP codebase and I don't need to include a separate object storage system in my security audits.

**Acceptance Criteria:**

- All artifacts stored on PVCs within the cluster
- KFP configuration uses `Filesystem.Type: "pvc"` with encrypted `StorageClass`
- `SubjectAccessReview` validates all artifact access requests
- Encryption at rest provided by the configured `StorageClass` (e.g., `encrypted-gp3`)
- No separate object storage component to audit
Member

Suggested change
#### Story 6: Operator Preferring KFP-Native Storage
As an operator in a regulated environment (e.g., healthcare, finance), I want to deploy KFP with filesystem storage using an encrypted `StorageClass` (e.g., `encrypted-gp3`), so that artifact handling stays within the KFP codebase and I don't need to include a separate object storage system in my security audits.
**Acceptance Criteria:**
- All artifacts stored on PVCs within the cluster
- KFP configuration uses `Filesystem.Type: "pvc"` with encrypted `StorageClass`
- `SubjectAccessReview` validates all artifact access requests
- Encryption at rest provided by the configured `StorageClass` (e.g., `encrypted-gp3`)
- No separate object storage component to audit

This story is also obsolete because you can just make the SeaweedFS PVC encrypted.

Collaborator

I agree. I think we can remove this.

Comment on lines +274 to +284
#### Story 7: Operator Running KFP on Storage-Constrained Infrastructure

As an operator with limited storage budget (only 1TB total available), I want to deploy KFP in central mode with a single 500GB PVC shared across 10 team namespaces, so that all teams can run pipelines without each needing their own 100GB PVC (which would require 1TB total).

**Acceptance Criteria:**

- Central mode configured with: `DeploymentMode: "central"`, `Size: "500Gi"`
- Single PVC created in `kubeflow` namespace mounted by one artifact server
- All 10 teams' artifacts stored in `/artifacts/<namespace>/` directories on same PVC
- Teams can run pipelines concurrently without storage allocation failures
- No per-namespace storage limits (trade-off of central mode - shared PVC means no per-team quotas)
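Pulling the values from these acceptance criteria together, a central-mode configuration might look roughly like this (the field layout is a sketch extrapolated from the `ObjectStoreConfig.ArtifactServer` and `Filesystem.Type` names used elsewhere in the KEP, not a finalized schema):

```yaml
# Sketch of a central-mode configuration; field layout is an assumption.
ObjectStoreConfig:
  ArtifactServer:
    DeploymentMode: "central"
    Filesystem:
      Type: "pvc"
      Size: "500Gi"      # single PVC shared by all namespaces
      StorageClass: ""   # empty = cluster default StorageClass
```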
Member

Suggested change
#### Story 7: Operator Running KFP on Storage-Constrained Infrastructure
As an operator with limited storage budget (only 1TB total available), I want to deploy KFP in central mode with a single 500GB PVC shared across 10 team namespaces, so that all teams can run pipelines without each needing their own 100GB PVC (which would require 1TB total).
**Acceptance Criteria:**
- Central mode configured with: `DeploymentMode: "central"`, `Size: "500Gi"`
- Single PVC created in `kubeflow` namespace mounted by one artifact server
- All 10 teams' artifacts stored in `/artifacts/<namespace>/` directories on same PVC
- Teams can run pipelines concurrently without storage allocation failures
- No per-namespace storage limits (trade-off of central mode - shared PVC means no per-team quotas)

That is also fully covered by the default multi-tenant SeaweedFS, with just one single PVC in the kubeflow namespace.

Comment on lines +286 to +296
#### Story 8: Operator Scaling High-Throughput Model Training Platform

As an operator supporting 100+ concurrent pipeline runs with multi-GB model checkpoints, I want to deploy KFP in central mode with RWX storage (e.g., NFS/CephFS) and multiple artifact server replicas behind a load balancer, so that artifact upload/download operations can scale horizontally without bottlenecks.

**Acceptance Criteria:**

- Can deploy artifact servers in both central and namespace-local modes
- Artifact servers stream large files without loading into memory
- Can use high-performance `StorageClasses` (e.g., SSD-backed)
- Horizontal scaling possible in central mode
- Direct pod-to-pod communication in namespace-local mode reduces latency
@juliusvonkohout juliusvonkohout Dec 5, 2025

Suggested change
#### Story 8: Operator Scaling High-Throughput Model Training Platform
As an operator supporting 100+ concurrent pipeline runs with multi-GB model checkpoints, I want to deploy KFP in central mode with RWX storage (e.g., NFS/CephFS) and multiple artifact server replicas behind a load balancer, so that artifact upload/download operations can scale horizontally without bottlenecks.
**Acceptance Criteria:**
- Can deploy artifact servers in both central and namespace-local modes
- Artifact servers stream large files without loading into memory
- Can use high-performance `StorageClasses` (e.g., SSD-backed)
- Horizontal scaling possible in central mode
- Direct pod-to-pod communication in namespace-local mode reduces latency

That is also made obsolete by merging #12391, i.e. a distributed SeaweedFS.

Comment on lines +298 to +309
#### Story 9: Operator with Mixed Isolation Requirements

As an operator, I want to deploy KFP in central mode by default, but configure specific namespaces (e.g., `team-finance`) to use namespace-local mode for stricter isolation, so that most teams share the simple central server while sensitive teams get dedicated resources.

**Acceptance Criteria:**

- Global deployment mode set to `central` (default)
- Specific namespaces can override to `namespaced` via their `kfp-launcher` ConfigMap
- Teams using central mode share the central artifact server
- Teams with namespace-local override get their own artifact server and PVC
- UI correctly routes artifact requests based on each namespace's deployment mode
- No cluster-wide restart needed to change a namespace's mode
Member

Suggested change
#### Story 9: Operator with Mixed Isolation Requirements
As an operator, I want to deploy KFP in central mode by default, but configure specific namespaces (e.g., `team-finance`) to use namespace-local mode for stricter isolation, so that most teams share the simple central server while sensitive teams get dedicated resources.
**Acceptance Criteria:**
- Global deployment mode set to `central` (default)
- Specific namespaces can override to `namespaced` via their `kfp-launcher` ConfigMap
- Teams using central mode share the central artifact server
- Teams with namespace-local override get their own artifact server and PVC
- UI correctly routes artifact requests based on each namespace's deployment mode
- No cluster-wide restart needed to change a namespace's mode
#### Story 9: Operator with enterprise isolation requirements and fast local storage for giant models
I want, as with KFP v1's data_passing_method(), a way to have a per-pipeline or per-namespace PVC backed by very fast local storage, so that I do not need to upload and download large models to and from S3 for each step.

That is a security nightmare I would not want to support or recommend to any enterprise. It is just worse than the default multi-tenant SeaweedFS. I added a typical use case I have seen so far.

Comment on lines +409 to +419
##### Mode 1: Central Artifact Server (Default)

A single artifact server in the main KFP namespace serves all namespaces, configured via `ObjectStoreConfig.ArtifactServer.DeploymentMode: "central"`.
**Central Mode Characteristics:**
- **Single PVC** with directory structure: `/artifacts/<namespace>/<pipeline>/<run-id>/<node-id>/<artifact-name>`
- **Authorization**: Uses `SubjectAccessReview` to verify namespace access
- **Best for**: Simple deployments, single-user setups, small teams
- **Advantages**: Simple setup, single storage location, easy backup
- **Limitations**: All namespaces share same PVC and storage quota
@juliusvonkohout juliusvonkohout Dec 5, 2025

Suggested change
##### Mode 1: Central Artifact Server (Default)
A single artifact server in the main KFP namespace serves all namespaces, configured via `ObjectStoreConfig.ArtifactServer.DeploymentMode: "central"`.
**Central Mode Characteristics:**
- **Single PVC** with directory structure: `/artifacts/<namespace>/<pipeline>/<run-id>/<node-id>/<artifact-name>`
- **Authorization**: Uses `SubjectAccessReview` to verify namespace access
- **Best for**: Simple deployments, single-user setups, small teams
- **Advantages**: Simple setup, single storage location, easy backup
- **Limitations**: All namespaces share same PVC and storage quota

For the reasons mentioned above and below, this mode has been made fully obsolete by the default multi-tenant SeaweedFS. Furthermore, it is a security nightmare and adds unnecessary complexity. There is zero benefit in implementing this.

Comment on lines +421 to +432
##### Mode 2: Namespace-Local Artifact Servers

Each namespace runs its own artifact server, configured via `ObjectStoreConfig.ArtifactServer.DeploymentMode: "namespaced"`.
**Namespace-Local Mode Characteristics:**
- **PVC per namespace**: Complete storage isolation
- **Direct access**: Clients connect directly to namespace servers (no proxying)
- **Authorization**: Natural isolation (each server only accesses its namespace's PVC)
- **Best for**: Large multi-tenant deployments, strict isolation requirements
- **Advantages**: True multi-tenancy, per-namespace scaling, independent quotas
- **Deployment**: Proactive initialization when namespace has `pipelines.kubeflow.org/enabled=true` annotation
@juliusvonkohout juliusvonkohout Dec 5, 2025

We already have multi-tenancy from the default SeaweedFS. The question is how to avoid the per-namespace overhead and not break zero-overhead namespaces when idle. This was already working well in KFP v1. Just copying the KFP v1 data_passing_method() to KFP v2 would be enough.

Comment on lines +434 to +473
##### Mixed Mode Support

For multi-tenant deployments with varying isolation requirements, administrators can configure different deployment modes per namespace using the `kfp-launcher` ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kfp-launcher
  namespace: team-requiring-isolation
data:
  defaultPipelineRoot: "kfp-artifacts://team-requiring-isolation"
  artifactServer: |
    deploymentMode: namespaced
```

The `artifactServer` key contains a YAML block that mirrors the global `ObjectStoreConfig.ArtifactServer` structure, allowing consistent configuration patterns across global and per-namespace settings.

This enables scenarios where:

- Most namespaces use the simpler central mode (global default)
- Specific namespaces requiring strict isolation use namespace-local mode
- Teams can be migrated between modes without cluster-wide changes

#### Request Routing

Based on the configured mode (global default or per-namespace override from `kfp-launcher` ConfigMap), artifact URIs are resolved differently:

**Central Mode:**

```text
Client
│ GET kfp-artifacts://<namespace>/...
KFP API Server (central)
│ Authorization check (SubjectAccessReview)
Serve from /artifacts/<namespace>/...
```
Member

Suggested change
##### Mixed Mode Support
For multi-tenant deployments with varying isolation requirements, administrators can configure different deployment modes per namespace using the `kfp-launcher` ConfigMap:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kfp-launcher
  namespace: team-requiring-isolation
data:
  defaultPipelineRoot: "kfp-artifacts://team-requiring-isolation"
  artifactServer: |
    deploymentMode: namespaced
```
The `artifactServer` key contains a YAML block that mirrors the global `ObjectStoreConfig.ArtifactServer` structure, allowing consistent configuration patterns across global and per-namespace settings.
This enables scenarios where:
- Most namespaces use the simpler central mode (global default)
- Specific namespaces requiring strict isolation use namespace-local mode
- Teams can be migrated between modes without cluster-wide changes
#### Request Routing
Based on the configured mode (global default or per-namespace override from `kfp-launcher` ConfigMap), artifact URIs are resolved differently:
**Central Mode:**
```text
Client
│ GET kfp-artifacts://<namespace>/...
KFP API Server (central)
│ Authorization check (SubjectAccessReview)
Serve from /artifacts/<namespace>/...
```

As explained above, the central mode is dangerous and worse than the default multi-tenant SeaweedFS.

Comment on lines +499 to +546
###### Central Mode Architecture

```text
┌──────────────────────────────────────┐
│ KFP SDK │
└──────────────────┬───────────────────┘
│ pipeline_root: "kfp-artifacts://..."
┌──────────────────────────────────────┐
│ Pipeline Spec │
└──────────────────┬───────────────────┘
┌──────────────────────────────────────┐
│ Compiler │
│ (generates artifact API calls) │
└──────────────────┬───────────────────┘
┌──────────────────────────────────────┐
│ Driver │
│ (validates artifact server exists) │
└──────────────────┬───────────────────┘
┌──────────────────────────────────────┐
│ Launcher │
│ (uploads/downloads via API) │
└──────────────────┬───────────────────┘
│ API calls
┌──────────────────────────────────────┐
│ Central Artifact Server │
│ (namespace: kubeflow) │
│ │
│ • SubjectAccessReview │
│ • Mounts central PVC │
│ • Serves all namespaces │
└──────────────────┬───────────────────┘
│ mounts
┌──────────────────────────────────────┐
│ Central PVC (kfp-artifacts-central) │
│ /artifacts/ │
│ ├── ns1/ │
│ ├── ns2/ │
│ └── ns3/ │
└──────────────────────────────────────┘
```
Member

As explained above, the central mode is dangerous and worse than the default multi-tenant SeaweedFS.

Suggested change
###### Central Mode Architecture
```text
┌──────────────────────────────────────┐
│ KFP SDK │
└──────────────────┬───────────────────┘
│ pipeline_root: "kfp-artifacts://..."
┌──────────────────────────────────────┐
│ Pipeline Spec │
└──────────────────┬───────────────────┘
┌──────────────────────────────────────┐
│ Compiler │
│ (generates artifact API calls) │
└──────────────────┬───────────────────┘
┌──────────────────────────────────────┐
│ Driver │
│ (validates artifact server exists) │
└──────────────────┬───────────────────┘
┌──────────────────────────────────────┐
│ Launcher │
│ (uploads/downloads via API) │
└──────────────────┬───────────────────┘
│ API calls
┌──────────────────────────────────────┐
│ Central Artifact Server │
│ (namespace: kubeflow) │
│ │
│ • SubjectAccessReview │
│ • Mounts central PVC │
│ • Serves all namespaces │
└──────────────────┬───────────────────┘
│ mounts
┌──────────────────────────────────────┐
│ Central PVC (kfp-artifacts-central) │
│ /artifacts/ │
│ ├── ns1/ │
│ ├── ns2/ │
│ └── ns3/ │
└──────────────────────────────────────┘
```

@juliusvonkohout

/hold


hbelmiro commented Dec 8, 2025

@juliusvonkohout

Seaweedfs is now multi-tenant with ACLS and credentials per namespace as of KFP release 2.15. We even have tests that verify the namespace isolation of seaweedfs.

Just imagine 1000 namespaces and 1000 extra pods when idle.

Are you saying that SeaweedFS uses one PVC per namespace (like the namespace-local mode in the proposal)?
I'm curious how it works around the physical limitation between the API server and the PVCs in different namespaces without an additional pod between them.


juliusvonkohout commented Dec 10, 2025

@juliusvonkohout

Seaweedfs is now multi-tenant with ACLS and credentials per namespace as of KFP release 2.15. We even have tests that verify the namespace isolation of seaweedfs.

Just imagine 1000 namespaces and 1000 extra pods when idle.

Are you saying that SeaweedFS uses one PVC per namespace (like the namespace-local mode in the proposal)? I'm curious how it works around the physical limitation between the API server and the PVCs in different namespaces without an additional pod between them.

No, it uses hard multi-tenant, separated S3 storage that is directly accessed by ml-pipeline-ui by default; no proxy is needed. See also https://github.com/kubeflow/manifests#architecture for my diagram.


This is also covered in the new release notes and blog posts:

  1. https://github.com/kubeflow/manifests/releases/tag/v1.11.0-rc.1 (check the highlights)
  2. https://blog.kubeflow.org/gsoc/community/kubeflow/2025/09/06/kubeflow-and-gsoc2025.html#project-1-kubeflow-platform-enhancements
  3. https://medium.com/@hpotpose26/kubeflow-pipelines-embraces-seaweedfs-9a7e022d5571 has the technical details: each namespace has its own S3 user, and ml-pipeline-ui has administrative access. Read access would also be enough.


This KEP proposes adding filesystem-based storage as an alternative artifact storage backend for Kubeflow Pipelines v2. While KFP currently ships with S3-compatible storage by default, some deployments prefer not to depend on a separate object storage system. This proposal introduces filesystem storage as an additional option where artifact handling is integrated into KFP itself, eliminating the need for an external object storage component.

The filesystem backend will primarily use `PersistentVolumeClaim` (PVC) based storage in Kubernetes environments, providing namespace-isolated storage using Kubernetes native `PersistentVolumes`. However, the design is flexible enough to support other filesystem backends (e.g., local filesystem for development). Users can specify any access mode value which KFP will pass through to Kubernetes without validation (e.g., `ReadWriteMany` for parallel task execution across nodes, `ReadWriteOnce` for single node access, etc.), with RWO as the default if not specified. The actual behavior depends on what the underlying storage class supports. Existing pipelines will work without modification unless they contain hardcoded S3/object storage paths.
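As a concrete illustration of the pass-through behavior described in this paragraph, the PVC that KFP would create might look like the following sketch (names and sizes are illustrative; the point is that `accessModes` is forwarded to Kubernetes as given, and the StorageClass decides whether it is honored):

```yaml
# Illustrative PVC sketch; KFP would forward accessModes unvalidated
# and let the underlying StorageClass accept or reject them.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kfp-artifacts      # hypothetical name
  namespace: team-a        # hypothetical namespace
spec:
  accessModes:
    - ReadWriteOnce        # the stated default when unspecified
  storageClassName: standard   # whatever class the operator configured
  resources:
    requests:
      storage: 10Gi
```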

This paragraph implies the PVCs are mounted to the pods "which KFP will pass through to Kubernetes without validation". The following paragraph contradicts that.
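The pass-through behavior under discussion can be sketched as follows. This is a minimal sketch: `build_artifact_pvc`, its defaults, and the PVC name are illustrative assumptions, not part of the KEP; only the field names follow the Kubernetes PVC API.

```python
from typing import Optional

# Illustrative sketch: KFP would forward whatever access mode the operator
# configures straight into the PVC spec, defaulting to ReadWriteOnce.
def build_artifact_pvc(namespace: str, size: str = "10Gi",
                       access_mode: str = "ReadWriteOnce",
                       storage_class: Optional[str] = None) -> dict:
    spec = {
        "accessModes": [access_mode],  # passed through unvalidated
        "resources": {"requests": {"storage": size}},
    }
    if storage_class is not None:
        spec["storageClassName"] = storage_class
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": "kfp-artifacts", "namespace": namespace},
        "spec": spec,
    }

pvc = build_artifact_pvc("team-a", access_mode="ReadWriteMany")
print(pvc["spec"]["accessModes"])  # ['ReadWriteMany']
```

Whether `ReadWriteMany` actually works then depends entirely on the underlying storage class, as the paragraph notes.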

| | Object Storage (current) | Filesystem (proposed) |
| --- | --- | --- |
| **URI Scheme** | `s3://`, `gs://`, `minio://` | `kfp-artifacts://` |
| **Architecture** | Separate object storage service | KFP-native artifact server |
| **Required Knowledge** | S3 concepts (buckets, endpoints, regions) | Kubernetes concepts (PVCs, StorageClasses) |
| **Multi-tenancy** | Shared storage (single instance) | Per-namespace PVCs (in namespace-local mode) |

Per-namespace PVCs should be optional. In other words, default to the central artifact server, respecting multi-tenancy, but allow individual namespaces to use a different solution (e.g. S3 or another artifact server).

Driver (detects "kfp-artifacts://")
Artifact Server (mounts PVC)

Wouldn't the PVC always be mounted?

### Key Components

- **Storage**: Kubernetes `PersistentVolumeClaims` (one per namespace)
- **URI Format**: `kfp-artifacts://<namespace>/<pipeline>/<run-id>/<node-id>/<artifact-name>`

Could you be more specific on <pipeline>? Is it a name or ID? Is it a pipeline or pipeline version?

Also, what is ?
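For illustration, the proposed URI layout could be parsed like this. The parser is an assumption, not proposed API, and, as the comment above notes, the semantics of each segment (name vs. ID) are still an open question.

```python
from urllib.parse import urlparse

# Hypothetical parser for the proposed kfp-artifacts:// scheme:
#   kfp-artifacts://<namespace>/<pipeline>/<run-id>/<node-id>/<artifact-name>
def parse_artifact_uri(uri: str) -> dict:
    parts = urlparse(uri)
    if parts.scheme != "kfp-artifacts":
        raise ValueError(f"not a kfp-artifacts URI: {uri}")
    # urlparse puts the first segment (<namespace>) in netloc; the rest is the path.
    segments = [parts.netloc] + parts.path.strip("/").split("/")
    keys = ("namespace", "pipeline", "run_id", "node_id", "artifact_name")
    if len(segments) != len(keys):
        raise ValueError(f"expected {len(keys)} segments, got {len(segments)}")
    return dict(zip(keys, segments))

uri = "kfp-artifacts://team-a/my-pipeline/run-123/node-7/model"
print(parse_artifact_uri(uri)["namespace"])  # team-a
```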

A dedicated endpoint provides filesystem storage configuration and routing:

```http
GET /apis/v2beta1/filesystem-storage/config
```
@mprahl (Dec 11, 2025):

I don't think we need this. The artifact URI should contain the hostname of the artifact server that was used and it can be parsed directly to know how to route the request.
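The alternative suggested in this comment could look like the following: embed the artifact server's host in the URI authority so clients route directly, with no config endpoint. This is purely illustrative; the host naming convention is an assumption.

```python
from urllib.parse import urlparse

# Sketch of the comment's suggestion: if the URI carried the artifact
# server's hostname, e.g.
#   kfp-artifacts://artifact-server.team-a.svc:8443/run-123/node-7/model
# the client could route the request without any config endpoint.
def route_for(uri: str) -> tuple:
    parts = urlparse(uri)
    host = parts.netloc           # where to send the request
    key = parts.path.lstrip("/")  # which artifact to fetch
    return host, key

host, key = route_for(
    "kfp-artifacts://artifact-server.team-a.svc:8443/run-123/node-7/model")
print(host)  # artifact-server.team-a.svc:8443
```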


**This proposal does not aim to replace existing object storage solutions.** S3-compatible storage remains fully supported and recommended for most production workloads. Instead, this KEP provides an additional option for deployments where a simpler, KFP-native artifact storage solution is preferred.

While KFP currently ships with S3-compatible storage by default, this still requires deploying and maintaining a separate object storage service. For some deployment scenarios, this additional component may not be desired.

This also helps with running KFP locally, outside of Kubernetes.


While KFP currently ships with S3-compatible storage by default, this still requires deploying and maintaining a separate object storage service. For some deployment scenarios, this additional component may not be desired.

### Reduced External Dependencies

I don't think this section adds a lot and could be consolidated in the Motivation section.


Many enterprises and Kubeflow distributions prefer not to have additional external dependencies. With filesystem storage:

- No separate object storage project to productize and support
@mprahl (Dec 11, 2025):

Also include that this will directly leverage Kubernetes RBAC, aligned with existing permission mechanisms used by other parts of KFP. This makes onboarding and provisioning new namespaces simpler.


### Enterprise Considerations

Many enterprises and Kubeflow distributions prefer not to have additional external dependencies. With filesystem storage:

Another aspect is that you'll automatically get namespace isolation of artifacts on the central artifact server through namespace-aware paths and Kubernetes RBAC.

In namespace-local mode, each namespace gets its own dedicated artifact server and PVC. This provides:

- **Storage isolation**: Each team's artifacts are physically separated in their own PVC
- **Independent scaling**: Teams can scale their artifact server horizontally (with RWX storage) and size their PVC based on workload requirements

Suggested change:
> - **Independent scaling**: Teams can scale their artifact server horizontally and size their PVC based on workload requirements


### Per-Namespace Isolation and Scaling

In namespace-local mode, each namespace gets its own dedicated artifact server and PVC. This provides:

I don't think we should have a specific mode.

Essentially, the default behavior when using the KFP artifact server would be to use the central/default instance, but every namespace can override this configuration using the kfp-launcher ConfigMap or equivalent.

**When to use filesystem storage:**

- Deployments where eliminating object storage dependency is preferred
- Environments where Kubeflow distributions or platform providers prefer not to support additional storage systems

Suggested change:
> - Environments where Kubeflow distributions or platform providers do not offer a fully supported object store solution


- Deployments where eliminating object storage dependency is preferred
- Environments where Kubeflow distributions or platform providers prefer not to support additional storage systems
- Multi-tenant deployments requiring per-namespace storage isolation, scaling, and quotas

I think this can all be achieved through S3.


Based on the user story "As a user, I want to provision Kubeflow Pipelines with just a PVC for artifact storage so that I can quickly get started", this KEP aims to:

1. **Add filesystem storage as an additional backend option** alongside S3-compatible and Google Cloud Storage, primarily using PVC but not limited to it

What do you mean by "but not limited to it"?

1. **Add filesystem storage as an additional backend option** alongside S3-compatible and Google Cloud Storage, primarily using PVC but not limited to it
2. **Enable zero-configuration storage** for experimentation use cases - a KFP server can be installed with just a PVC for artifact storage
3. **Provide namespace-isolated artifact storage** with proper subject access review guards in multi-user mode
4. **Allow any Kubernetes access mode to be configured** - KFP passes through the configuration to Kubernetes (RWO default)

What does this mean?

4. **Allow any Kubernetes access mode to be configured** - KFP passes through the configuration to Kubernetes (RWO default)
5. **Support existing pipelines** that use KFP's standard artifact types (Dataset, Model, etc.) - pipelines work unchanged with the new filesystem backend
6. **Match existing artifact persistence behavior** - artifacts persist indefinitely until explicitly deleted (no automatic cleanup)
7. **Enable separate scaling of artifact serving** through an artifacts-only KFP instance with `--artifacts-only` flag

We should also support deploying the artifact server as a DaemonSet (ensuring a pod is on every Kubernetes node) and set the Service to internalTrafficPolicy: Local to keep all artifact traffic local to the Kubernetes node.
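The node-local serving idea in this comment would roughly amount to a DaemonSet plus a Service like the following, sketched here as a Python dict for brevity. The field names follow the Kubernetes Service API; the resource names and port are assumptions.

```python
# Sketch of the suggested node-local setup: the artifact server runs as a
# DaemonSet (one pod per node), and the Service keeps traffic on-node.
def artifact_service(namespace: str) -> dict:
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": "kfp-artifact-server", "namespace": namespace},
        "spec": {
            "selector": {"app": "kfp-artifact-server"},
            # Route each client's requests to the artifact-server pod on the
            # same node; requires the DaemonSet so every node has one.
            "internalTrafficPolicy": "Local",
            "ports": [{"port": 8443, "targetPort": 8443}],
        },
    }

print(artifact_service("kubeflow")["spec"]["internalTrafficPolicy"])  # Local
```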


This KEP proposes adding a new artifact storage backend that uses filesystem storage (primarily Kubernetes `PersistentVolumeClaims`) instead of object storage. The implementation will:

1. Create one PVC per namespace for artifact storage

We should not require this as the central artifact server should be namespace aware (namespace in the path and subject access review based on the namespace in the path).

This KEP proposes adding a new artifact storage backend that uses filesystem storage (primarily Kubernetes `PersistentVolumeClaims`) instead of object storage. The implementation will:

1. Create one PVC per namespace for artifact storage
2. Use configurable access mode with sensible defaults (RWO)

What does this mean?


1. Create one PVC per namespace for artifact storage
2. Use configurable access mode with sensible defaults (RWO)
3. Organize artifacts in a filesystem hierarchy within the PVC

Suggested change:
> 3. Organize artifacts in a filesystem hierarchy within the PVC that is namespace aware

1. Create one PVC per namespace for artifact storage
2. Use configurable access mode with sensible defaults (RWO)
3. Organize artifacts in a filesystem hierarchy within the PVC
4. Provide transparent access through the existing KFP artifact APIs with new `kfp-artifacts://` URI scheme

Do we have "KFP artifact APIs"? Perhaps this is referring to @HumairAK's MLMD removal PR.

2. Use configurable access mode with sensible defaults (RWO)
3. Organize artifacts in a filesystem hierarchy within the PVC
4. Provide transparent access through the existing KFP artifact APIs with new `kfp-artifacts://` URI scheme
5. Maintain compatibility with existing pipeline definitions that don't have hardcoded storage paths

When is this ever the case?
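The namespace-aware hierarchy from item 3 of the implementation list could mirror the URI segments one-to-one, e.g. as below. The mount point, layout, and traversal guard are assumptions for illustration only.

```python
from pathlib import PurePosixPath

# Hypothetical on-disk layout inside the mounted PVC, mirroring
# kfp-artifacts://<namespace>/<pipeline>/<run-id>/<node-id>/<artifact-name>.
MOUNT = PurePosixPath("/kfp-artifacts")  # assumed mount point

def artifact_path(namespace, pipeline, run_id, node_id, name):
    for seg in (namespace, pipeline, run_id, node_id, name):
        if "/" in seg or seg in ("", ".", ".."):  # basic traversal guard
            raise ValueError(f"invalid path segment: {seg!r}")
    return MOUNT / namespace / pipeline / run_id / node_id / name

p = artifact_path("team-a", "my-pipeline", "run-123", "node-7", "model")
print(p)  # /kfp-artifacts/team-a/my-pipeline/run-123/node-7/model
```

Keeping the namespace as the first path component is what would let a central server enforce per-namespace access checks on the path alone.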


#### Story 3: Operator Configuring Storage Class and Size

As an operator, I want to configure KFP to use a specific StorageClass and PVC size instead of defaults, so that I can match storage performance and capacity to my workload requirements.

This is implied whenever you use a PVC, so if the user story is about making artifact storage configurable per namespace (e.g. use a dedicated artifact server or use S3 in this namespace), then I think this can be removed.


#### Story 4: User Migrating from S3 to Filesystem Storage

As a user with existing pipelines containing components that call `boto3.upload_file()` directly, I want KFP system artifacts to use `kfp-artifacts://` with PVC storage while my custom components continue accessing S3, so that I can migrate incrementally without rewriting all components at once.

I don't think this user story is relevant. That's just a user's custom Python code.


#### Story 5: Operator Deploying Multi-Tenant KFP with Namespace Isolation

As an operator, I want to deploy KFP in namespace-local mode where each namespace annotated with `pipelines.kubeflow.org/enabled=true` gets its own artifact server pod and dedicated PVC, so that Team A's artifacts in namespace `team-a` are physically isolated from Team B's artifacts in namespace `team-b`.

Like I said, we don't need the concept of modes. We just need a central artifact server and allow each namespace to override to use a different solution (e.g. dedicated artifact server or S3) through the kfp-launcher ConfigMap.

An admin can still opt in to this by configuring each provisioned namespace, but this largely defeats the spirit of the KEP of making administration easier (and potentially cheaper if enterprise licensing is required) with less components involved.

6. Support separate scaling of artifact serving through artifacts-only instances
7. Update the UI to seamlessly handle artifact downloads from filesystem storage

### User Stories

I think we only need 2 user stories:

  1. As an admin with a requirement to have storage on-premise, I want a simple artifact storage solution for KFP artifacts without having to maintain a separate service for artifacts due to administrative overhead or enterprise licensing costs.
  2. As an admin using the KFP artifact server, I want the ability to override a namespace's artifact configuration to use alternative storage such as S3 or a dedicated artifact server.

Successfully merging this pull request may close these issues.

KEP-12513: Filesystem-Based Artifact Storage for Kubeflow Pipelines
