This quickstart covers multi-node PyTorch FSDP training on Slurm and Kubernetes. It is designed to be simple: there is no data to prepare and no tokenizer to download, and it uses a Python virtual environment.
To run FSDP training, you will need a training cluster based on Slurm or Kubernetes with an Amazon FSx for Lustre file system. You can find instructions for creating an Amazon SageMaker HyperPod cluster with Slurm or Kubernetes, or a cluster in Amazon EKS.
This folder provides examples of how to train with PyTorch FSDP on Slurm or Kubernetes. Instructions for each are in the corresponding subdirectories.