-
Notifications
You must be signed in to change notification settings - Fork 731
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added an example Notebook to fine-tune Llama3 model using PyTorchJob #2419
base: release-1.9
Are you sure you want to change the base?
Added an example Notebook to fine-tune Llama3 model using PyTorchJob #2419
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this effort @aishwaryaraimule21!
I am fine with merging this KFP example.
Any thoughts @johnugeorge @tenzen-y @Electronic-Waste @astefanutti ?
" )\n", | ||
" \n", | ||
" # check the status of the job\n", | ||
" from kubeflow.pytorchjob import PyTorchJobClient\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you use TrainingClient here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the PR. Now using TrainingClient().get_job_conditions() to fetch the job status.
I have no objections:) |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
In that case, what are the relationship Training examples in KFP repository something like https://github.com/kubeflow/pipelines/tree/472f8779ded18f8904c5cbe15c0573d461d57af5/components/kubeflow/pytorch-launcher? |
I think, you can use PyTorch launcher or you can directly use |
SGTM. |
@aishwaryaraimule21 Can you sign the DCO please ? |
891bb0c
to
d62081a
Compare
Signed-off-by: aishwarya.raimule <[email protected]>
Signed-off-by: aishwarya.raimule <[email protected]>
Signed-off-by: aishwarya.raimule <[email protected]>
Signed-off-by: aishwarya.raimule <[email protected]>
Signed-off-by: aishwarya.raimule <[email protected]>
d62081a
to
b65cfbf
Compare
@andreyvelich I have signed the DCO. Please check. Thanks. |
"\n", | ||
"In this component, use TrainingClient() to create PyTorchJob which will fine-tune Llama3 model on 1 worker with 1 GPU.\n", | ||
"\n", | ||
"Specify the required packages in the *dsl.component* decorator. We would need kubeflow-pytorchjob, kubeflow-training[huggingface] and numpy packages in this Kubeflow component.\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is kubeflow-pytorchjob
really necessary since TrainingClient
is used now?
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"@dsl.component(packages_to_install=['kubeflow-pytorchjob', 'kubeflow-training[huggingface]','numpy<1.24'])\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dito, is kubeflow-pytorchjob
really necessary since TrainingClient
is used now?
" ),\n", | ||
" # it is assumed for text related tasks, you have 'text' column in the dataset.\n", | ||
" # for more info on how dataset is loaded check load_and_preprocess_data function in sdk/python/kubeflow/trainer/hf_llm_training.py\n", | ||
" dataset_provider_parameters=HuggingFaceDatasetParams(repo_id=\"aishwaryayyy/events_data\"),\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to remove dependencies on user specific repository.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@astefanutti may I ask why?
Do you recommend using something like https://huggingface.co/datasets/Yelp/yelp_review_full for the example?
" name=\"llama-3-1-8b-kubecon\",\n", | ||
" num_workers=1,\n", | ||
" num_procs_per_worker=1,\n", | ||
" # specify the storage class if you don't want to use the default one for the storage-initializer PVC\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be useful to mention a provisioner capable of provisioning RWX PVC is needed when distributing the training on multiple nodes / workers.
" \"storage_class\": \"nfs-storage\",\n", | ||
" },\n", | ||
" model_provider_parameters=HuggingFaceModelParams(\n", | ||
" model_uri=\"hf://meta-llama/Llama-3.1-8B-Instruct\",\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we cover the distributed training case, and provide the configuration so the model does not get downloaded on each local node / worker?
Pull Request Test Coverage Report for Build 13375853453Details
💛 - Coveralls |
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Fixes #
Checklist: