
feat: Add Hyperpod Optimum-neuron LoRA example #631


Merged
11 commits merged into aws-samples:main on May 5, 2025

Conversation

Captainia
Contributor

Issue #, if available: N/A, new feature

Description of changes:

This PR adds an example to the test_cases, using the Hugging Face optimum-neuron library for PEFT fine-tuning. optimum-neuron is the interface between the Hugging Face Transformers library and AWS accelerators, including AWS Trainium and AWS Inferentia. It provides a set of tools enabling easy model loading, training, and inference in single- and multi-accelerator settings for different downstream tasks.

The example uses the HyperPod EKS environment, and the code will be used in the HyperPod workshop.
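For readers who have not used the library before, here is a minimal sketch of what LoRA (PEFT) fine-tuning with optimum-neuron looks like. It is not the code added by this PR: the model ID, dataset, and hyperparameters are placeholders, and recent optimum-neuron releases also provide an SFT trainer that accepts a LoRA config directly instead of wrapping the model with peft by hand.

```python
# Minimal, illustrative sketch of LoRA (PEFT) fine-tuning with optimum-neuron.
# NOT the training script added by this PR: the model ID, dataset,
# hyperparameters, and output path below are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling
from optimum.neuron import NeuronTrainer, NeuronTrainingArguments

model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

model = AutoModelForCausalLM.from_pretrained(model_id)

# LoRA: train small low-rank adapter matrices instead of the full parameter set.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Placeholder dataset; the actual example may use a different one.
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def tokenize(example):
    text = f"{example['instruction']}\n{example['response']}"
    return tokenizer(text, truncation=True, max_length=1024)

train_dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

# NeuronTrainingArguments / NeuronTrainer mirror the transformers Trainer API
# but compile and run the training loop on Trainium (trn1) devices.
training_args = NeuronTrainingArguments(
    output_dir="./llama3-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=5e-5,
    bf16=True,
)

trainer = NeuronTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```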

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Contributor

@mhuguesaws mhuguesaws left a comment


Thank you for your contribution.

Could you please:

  1. Join the AWS organization on GitHub.
  2. Organize the folder as pytorch/fine-tuning/optimum-neuron.
  3. Add the Kubernetes manifest into a kubernetes folder. The repo structure is organized by scheduler.

@Captainia
Contributor Author


Thank you for the quick review!

  1. I am in the AWS organizations https://github.com/aws and https://github.com/awslabs. Is there another one I should join?
  2, 3. There will be another continuous pre-training example under pytorch/optimum-neuron/llama3/kubernetes/continuous-pre-training, so I currently organized this one as pytorch/optimum-neuron/llama3/kubernetes/fine-tuning, following neuronx-distributed. Does this sound good?

@mhuguesaws
Contributor


The aws-samples organization.

2 and 3 are good.


# # Update Neuron Compiler and Framework
RUN python -m pip install --upgrade neuronx-cc==2.* torch-neuronx==2.1.* torchvision
RUN python -m pip install --upgrade neuronx-distributed neuronx-distributed-training
Contributor

Can we pin the library versions? A new release could break compatibility.

Contributor Author

Thanks Jianying, I have pinned the dependencies in this CR. Ideally, once this PR is merged and released, we will switch to the optimum-neuron 0.1.0 SageMaker image: https://github.com/aws/deep-learning-containers/pull/4670/files#diff-0f776bad437279bcc3d6005ec1b29170f0b4e53dfbc5c3c234fb202817f707a3. It has the required dependencies too.
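For illustration, pinning here means replacing the wildcard constraints in the Dockerfile with exact versions. The version numbers below are placeholders, not necessarily the ones pinned in this CR; take the actual versions from the Neuron SDK release notes that match the base image.

```dockerfile
# Illustrative only: the version numbers are placeholders; use the versions
# listed in the Neuron SDK release notes for the base image you build from.
RUN python -m pip install --no-cache-dir \
        neuronx-cc==2.15.128.0 \
        torch-neuronx==2.1.2.2.3.0 \
        torchvision==0.16.2
RUN python -m pip install --no-cache-dir \
        neuronx-distributed==0.9.0 \
        neuronx-distributed-training==1.0.1
```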

Collaborator

@KeitaW KeitaW left a comment


@Captainia
Contributor Author


Thank you, I have been added to the aws-samples organization.

@Captainia Captainia requested review from KeitaW and mhuguesaws April 14, 2025 13:59
@KeitaW KeitaW added the enhancement and New model labels Apr 19, 2025
@KeitaW KeitaW removed the enhancement label Apr 19, 2025
Contributor

@jianyinglangaws jianyinglangaws left a comment


Looks good to me!

Captainia

This comment was marked as duplicate.

@mhuguesaws mhuguesaws removed their request for review May 5, 2025 14:44
@mhuguesaws mhuguesaws dismissed their stale review May 5, 2025 14:45

Not needed.

@KeitaW KeitaW merged commit fbe5373 into aws-samples:main May 5, 2025
Captainia added a commit to Captainia/awsome-distributed-training that referenced this pull request May 7, 2025
* Add Hyperpod Optimum-neuron LoRA example

* fix README

* restructure files

* fix

* Update to use newer version of peft and llama 3.8

* Update 3.test_cases/pytorch/optimum-neuron/llama3/kubernetes/fine-tuning/README.md

Co-authored-by: Keita Watanabe <[email protected]>

* Update 3.test_cases/pytorch/optimum-neuron/llama3/kubernetes/fine-tuning/README.md

Co-authored-by: Keita Watanabe <[email protected]>

* Update 3.test_cases/pytorch/optimum-neuron/llama3/kubernetes/fine-tuning/README.md

Co-authored-by: Keita Watanabe <[email protected]>

* pin dependencies and address comments

* fix

* switch model, remove HF token, update compile steps

---------

Co-authored-by: Keita Watanabe <[email protected]>