docs: add initial GPU explanation #18
Overall everything looks good. Just a few minor requested changes for clarity.
explanation/gpus/slurmconf.md (Outdated)

## Slurm configuration

Each GPU-equipped node is added to the `gres.conf` configuration file as its own `NodeName` entry, following the format defined in the [Slurm `gres.conf` documentation](https://slurm.schedmd.com/gres.conf.html). Individual `NodeName` entries are used over an entry per GRES resource to provide greater support for heterogeneous environments, such as a cluster where the same model of GPU is not consistently the same device file across compute nodes.
Clarify or change 'entries are used over an entry per GRES resource'. As currently phrased, it's hard to understand what's what.
I've revised this to give more information on what the `NodeName` parameter is and why it's used. Hopefully things are a bit clearer now. The table I've added to explain the parameters might be unnecessary repetition, given we already link to the `gres.conf` docs where they're explained in greater detail. Let me know what you think.

I've also removed the comment on heterogeneous environments as, while it's true, the real motivation is so we can have a single `gres.conf` file on the `slurmctld` controller that all compute nodes share (rather than an individual `gres.conf` per compute node). The greater support is an effect of that.
Latest commit duplicates the … Not sure what happened with the CI. It failed checking https://jwt.io/ but the link is still live for me.

I'm having the same issue on my PR. Link is still active but …
Just a couple of non-blocking thoughts. Curious to know your opinions 😃
explanation/gpus/slurmconf.md (Outdated)

Each GPU-equipped node is added to the `gres.conf` configuration file following the format defined in the [Slurm `gres.conf` documentation](https://slurm.schedmd.com/gres.conf.html). A single `gres.conf` is shared by all compute nodes in the cluster, using the optional `NodeName` specification to define GPU resources per node. Each line in `gres.conf` consists of the following parameters:

| Parameter  | Value                                                      |
| ---------- | ---------------------------------------------------------- |
| `NodeName` | Node the `gres.conf` line applies to.                      |
| `Name`     | Name of the generic resource. Always `gpu` here.           |
| `Type`     | GPU model name.                                            |
| `File`     | Path of the device file(s) associated with this GPU model. |
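
To make the table concrete, here is a minimal Python sketch that assembles `gres.conf` lines from those four parameters. The node names, GPU model names, and device paths are made up for illustration and do not come from this PR, and `gres_conf_line` is a hypothetical helper, not part of the Slurm charms.

```python
def gres_conf_line(node_name: str, gpu_type: str, device_files: str,
                   name: str = "gpu") -> str:
    """Build one gres.conf line from the four parameters in the table.

    All argument values used below are illustrative assumptions.
    """
    return (
        f"NodeName={node_name} Name={name} "
        f"Type={gpu_type} File={device_files}"
    )


# One shared gres.conf holds a line per node (hypothetical nodes and models).
lines = [
    gres_conf_line("compute-0", "tesla_t4", "/dev/nvidia[0-3]"),
    gres_conf_line("compute-1", "tesla_v100", "/dev/nvidia0"),
]
print("\n".join(lines))
```

Because every line carries its own `NodeName`, the same file can describe nodes whose GPU models or device files differ.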
Typically I italicize filenames like slurm.conf and gres.conf rather than encapsulate them in inline code blocks. That way it's clear, to me at least, that I am referring to something technical that isn't code in a sentence. You can see it in the README for slurmutils.

This is just something I started doing and stuck with, but it highlights that we should decide now how to reference important file names in our documentation. Italicize, code block, or something else? Thoughts about this @AshleyCliff @dsloanm @jedel1043?

[side note]: @dsloanm I like the table 😎
Agreed, using different formats for file names and code makes sense. Italics works for me - updated.
Agreed that it should be different than how code is distinguished from regular text; italics for filenames works well.
A couple suggestions for the table and slurm.conf sections, otherwise looks great!
explanation/gpus/slurmconf.md (Outdated)

- In `slurm.conf`, the configuration for GPU-equipped nodes has a comma-separated list in its `Gres=` element, giving the name, type and count for each GPU on the node.
+ In _slurm.conf_, the configuration for GPU-equipped nodes has a comma-separated list in its `Gres=` element, giving the name, type, and count for each GPU on the node.
- In _slurm.conf_, the configuration for GPU-equipped nodes has a comma-separated list in its `Gres=` element, giving the name, type, and count for each GPU on the node.
+ In _slurm.conf_, the `Gres=` element of each line provides a comma-separated list of GPU-equipped node configurations. The format for each configuration is: `<name>:<type>:<count>`, as seen in the example below.
That's probably not quite right, but some way to make the pattern explicit and then refer to the example - this way if they don't scroll the text in the code example they still see the explicit pattern.
Reworded this with the pattern and pointing the reader to the example. I also added a bit more on when the `Gres=` element is and isn't included.
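
For anyone following the thread, the `<name>:<type>:<count>` pattern discussed above can be sketched in Python as below. This is only an illustration: `GresEntry` and `parse_gres` are hypothetical names, the GPU model names are invented, and Slurm itself accepts other forms of the `Gres=` value that this sketch does not handle.

```python
from typing import NamedTuple


class GresEntry(NamedTuple):
    name: str   # generic resource name, e.g. "gpu"
    type: str   # GPU model name
    count: int  # number of GPUs of this model on the node


def parse_gres(value: str) -> list[GresEntry]:
    """Split a comma-separated Gres= value into <name>:<type>:<count> entries.

    Assumes every entry has exactly three colon-separated fields and a
    plain integer count; anything else raises ValueError.
    """
    entries = []
    for item in value.split(","):
        name, gpu_type, count = item.strip().split(":")
        entries.append(GresEntry(name, gpu_type, int(count)))
    return entries


# Hypothetical node with two GPU models.
example = "gpu:tesla_t4:2,gpu:tesla_v100:1"
for entry in parse_gres(example):
    print(entry)
```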
Looks good!
Adds sections to Explanation for the GPU support recently merged into the Slurm charms. Once we have Reference docs, we'll likely want to update these sections to include links to at least a table of supported GPU models.
I've gone with `Driver auto-install` and `Slurm enlistment` as subsection titles but wonder if these should be `GPU driver auto-install` and `Slurm enlistment of GPUs`. This would make the context obvious when, e.g., looking through search results, but would mean a lot of repetition of "GPU". Let me know your thoughts.