Adjust the README file in cater with users #2440
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,6 +11,8 @@ | |
|
||
## Overview | ||
|
||
> **Warning**: Kubeflow Trainer is currently in **alpha** status, and APIs may change. If you want to use stable release of Kubeflow Training Operator V1, please check [this section](#kubeflow-training-operator-v1). | ||
|
||
Kubeflow Trainer is a Kubernetes-native project designed for large language models (LLMs) | ||
fine-tuning and enabling scalable, distributed training of machine learning (ML) models across | ||
various frameworks, including PyTorch, JAX, TensorFlow, and others. | ||
|
@@ -35,9 +37,18 @@ The following KubeCon + CloudNativeCon 2024 talk provides an overview of Kubeflo | |
|
||
## Getting Started | ||
|
||
Please check [the official Kubeflow documentation](https://www.kubeflow.org/docs/components/trainer/getting-started) | ||
You can simply run these commands to install Kubeflow Trainer if your Kubernetes cluster is ready: | ||
|
||
```bash | ||
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=master" | ||
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=master" | ||
Comment on lines +44 to +45

@Electronic-Waste Previously, we discussed with @rimolive, @varodrig, and @kubeflow/wg-training-leads that we want a single source of truth for the Kubeflow Trainer docs, since it is hard to keep all of these docs up to date. Thus, we just redirect users to the Kubeflow website for the installation steps: https://www.kubeflow.org/docs/components/trainer/operator-guides/installation/#installing-the-kubeflow-trainer-controller-manager

From the user's side, they want to install Kubeflow Trainer quickly. So we'd better put the installation guide in the README file so they can get started quickly, like many other OSS projects, including Katib:
They also have detailed guidance on the website. So I think it is probably a better choice to put some installation commands in the README.

Currently, they need to click "the official Kubeflow documentation" -> "the installation guide" -> scroll down, to see the installation commands. I think the guide is too deep for users. They may prefer a straightforward way, and search the official documentation only if the straightforward way does not work.

Not really, since we directly link the installation guide from the README: https://github.com/kubeflow/trainer?tab=readme-ov-file#getting-started.

It's a better choice compared to the current one. But from the user's perspective, they prefer to install Trainer directly with some simple commands listed in the README, which is more straightforward, and to search for the details only if they want a customized installation (like installing only the manager).

Can we talk about it at the next Training WG call, please?

Yeah, of course. |
||
``` | ||
|
||
For more details, please check [the official Kubeflow documentation](https://www.kubeflow.org/docs/components/trainer/getting-started) | ||
to install and get started with Kubeflow Trainer. | ||
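
The two commands above both track `master`. While the project is in alpha, some users may prefer to pin a specific Git ref instead. The following is only a minimal sketch, not part of the README under review: `trainer_kustomize_url` is a hypothetical helper name, and the default ref simply mirrors the commands above. It prints the `kubectl` commands rather than running them, so they can be reviewed first.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the Kubeflow Trainer repository): builds the
# kustomize URL for one overlay at a chosen Git ref.
trainer_kustomize_url() {
  local overlay="$1"        # overlay directory, e.g. "manager" or "runtimes"
  local ref="${2:-master}"  # branch, tag, or commit; defaults to master as in the commands above
  printf 'https://github.com/kubeflow/trainer.git/manifests/overlays/%s?ref=%s\n' "$overlay" "$ref"
}

# Print the install commands instead of executing them.
for overlay in manager runtimes; do
  echo "kubectl apply --server-side -k \"$(trainer_kustomize_url "$overlay")\""
done
```

Passing a second argument, for example `trainer_kustomize_url manager <ref>`, swaps in whichever branch, tag, or commit is desired; which refs exist is left to the repository itself.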
|
||
If you are using Kubeflow Training Operator V1, please refer [to this migration document](/docs/components/trainer/operator-guides/migration). | ||
|
||
## Community | ||
|
||
The following links provide information on how to get involved in the community: | ||
|
@@ -56,12 +67,16 @@ Please refer to the [CHANGELOG](CHANGELOG.md). | |
|
||
## Kubeflow Training Operator V1 | ||
|
||
Kubeflow Trainer project is currently in <strong>alpha</strong> status, and APIs may change. | ||
If you are using Kubeflow Training Operator V1, please refer [to this migration document](https://www.kubeflow.org/docs/components/trainer/operator-guides/migration/). | ||
Kubeflow Trainer project is currently in <strong>alpha</strong> status, and APIs may change. You can install the stable release of the Kubeflow Training Operator V1 with: | ||
|
||
```bash | ||
kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.9.0" | ||
``` | ||
|
||
For more details, please check [this guide](https://www.kubeflow.org/docs/components/trainer/legacy-v1/installation/) to install and get started with Kubeflow Training Operator V1. | ||
|
||
Kubeflow Community will maintain the Training Operator V1 source code at | ||
[the `release-1.9` branch](https://github.com/kubeflow/training-operator/tree/release-1.9). | ||
|
||
You can find the documentation for Kubeflow Training Operator V1 in [these guides](https://www.kubeflow.org/docs/components/trainer/legacy-v1). | ||
|
||
## Acknowledgement | ||
|
This is a GitHub Markdown way: https://github.com/orgs/community/discussions/16925
Anyway, do we really want to notify users of this? @kubeflow/wg-training-leads @astefanutti
If yes, when can we remove this warning? @Electronic-Waste Do you have a specific schedule for removing this warning?

Currently, we are implementing Trainer according to KEP-2170. It's unstable and is not yet ready for production. So I think it's necessary to remind users of this. They can use the stable release of Training Operator V1 instead, at least until we release the first stable version of Trainer.
As for the removal timeline, I think we can probably remove this warning when the first stable release of Trainer is ready, maybe `v2.0.0`. It means that we are production-ready.

We will cut v2.0.0 after the MPI implementation, so I'm not sure why we could say we are production-ready in v2.0.0.

Maybe we can turn the warning into a note. And also rephrase it in a way that says the v2 APIs are still subject to change.

I'm not sure about v2.0.0. I mean we can remove this warning when we are production-ready.

+1

Here, what counts as production-ready matters. If we do not decide on graduation criteria for production readiness, we lose the timeline for it, and we keep saying "this is not production ready, so any kind of break should be fine". Then users will leave this project.
At the very least, it would be better to define production readiness criteria.

SGTM.

Yeah, that's true. We should discuss it further in another dedicated issue for graduation criteria. And maybe we could list it as an agenda item for the next WG AutoML/Training community call.
/cc @andreyvelich