[k8s] Template for Slurm-like NFS on Kubernetes #4956

Closed
wants to merge 3 commits into from

@romilbhardwaj romilbhardwaj commented Mar 14, 2025

Adds a template showing how to set up Slurm-like NFS home directories on Kubernetes. With this template, the home directory of the specified NFS_USERNAME is used as the SkyPilot task's home directory.

Tested on a multi-node k3s cluster with EFS as the underlying shared storage.

This is implemented purely in user space: it requires no code changes to SkyPilot and is defined entirely in the task.yaml. For deeper integration, we could add a config field that sets this up automatically and passes the currently active username to use for mounting.

Some gotchas:

  • We keep the image's default username (e.g., sky or root) and don't change it to NFS_USERNAME. We are not sure whether any applications depend on the username matching NFS_USERNAME.
  • Since the home directory is set in user space, some initialization may still happen in the original ~ beforehand, notably the sky_workdir sync. This is noted in the README.

Other alternatives we considered:

  • Mounting the NFS directly at /home/sky. This is problematic because the sky logs, sky workdir, and skypilot-runtime would then be shared across the workers of a multi-node cluster.
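
For illustration, a minimal sketch of what such a user-space task.yaml could look like. This is not the template from this PR: the NFS server address, export path, mount point, the NFS_MOUNT_PATH variable, and the example username are all placeholder assumptions. It mounts the share away from /home/sky (to avoid the sharing problem described above) and only switches HOME inside the run section.

```yaml
# Hypothetical sketch, not the template added by this PR.
envs:
  NFS_USERNAME: alice        # assumed example user
  NFS_MOUNT_PATH: /mnt/nfs   # assumed mount point inside the pod (must match mountPath below)

experimental:
  config_overrides:
    kubernetes:
      pod_config:
        spec:
          containers:
            - volumeMounts:
                - name: nfs-home
                  mountPath: /mnt/nfs   # deliberately not /home/sky
          volumes:
            - name: nfs-home
              nfs:
                server: nfs.example.internal   # placeholder NFS server
                path: /home                    # placeholder export path

run: |
  # Switch the home directory in user space. Anything SkyPilot did before
  # this point (e.g., the sky_workdir sync) used the image's original ~.
  export HOME="${NFS_MOUNT_PATH}/${NFS_USERNAME}"
  cd "$HOME"
  echo "HOME is now $HOME"
```

Because the switch happens only inside run, the SkyPilot runtime files stay in the image's default home directory on each node, while the task itself sees the per-user directory on the share.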

@romilbhardwaj
Collaborator Author

Another requirement is to write files to the NFS as a specific user/group. With the current implementation, files are written by default as whatever user/uid is baked into our base image.

In k8s, this can be emulated with:

experimental:
  config_overrides:
    kubernetes:
      pod_config:
        spec:
          # Set security context for proper file ownership
          securityContext:
            runAsUser: 1001
            runAsGroup: 1001
            fsGroup: 1001

However, SkyPilot setup fails with:

D 03-17 15:50:15 subprocess_utils.py:86] Using 1 workers for file mounts.
E 03-17 15:50:15 subprocess_utils.py:151] mkdir: cannot create directory ‘/home/sky/.sky’: Permission denied
E 03-17 15:50:15 subprocess_utils.py:151] command terminated with exit code 1

We could update the directory permissions in our base image to make this work, but that is not a robust solution, especially since users can bring their own custom images. Users looking for a workaround can build an image that grants all of their required uids rwx permissions on the image's default $HOME directory.
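
As a rough sketch of the task-side half of that workaround: assuming a custom image (the image name below is hypothetical) whose default $HOME was made writable for uid/gid 1001 at build time, the task could point at that image and keep the same securityContext as above. Whether this is sufficient for every setup step has not been verified here.

```yaml
# Hypothetical sketch. The image is assumed to have been built so that
# its default $HOME is rwx for uid/gid 1001.
resources:
  image_id: docker:registry.example.com/team/sky-nfs:latest   # placeholder image

experimental:
  config_overrides:
    kubernetes:
      pod_config:
        spec:
          # Same security context as above; it must match the uid/gid
          # that was granted access when the image was built.
          securityContext:
            runAsUser: 1001
            runAsGroup: 1001
            fsGroup: 1001
```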

@romilbhardwaj
Collaborator Author

Also, we need to fix #4975 before this example can work.

@romilbhardwaj romilbhardwaj marked this pull request as draft March 17, 2025 23:18
@romilbhardwaj romilbhardwaj added the blocked PR blocked by other issues label Mar 17, 2025

This PR is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Jul 17, 2025

This PR was closed because it has been stalled for 10 days with no activity.

@github-actions github-actions bot closed this Jul 27, 2025