
docs: add initial GPU explanation #18

Merged: 9 commits merged into charmed-hpc:main on Jan 24, 2025

Conversation

@dsloanm (Contributor) commented on Jan 15, 2025:

Adds sections to Explanation for the GPU support recently merged into the Slurm charms. Once we have Reference docs, we'll likely want to update these sections to include links to at least a table of supported GPU models.

I've gone with _Driver auto-install_ and _Slurm enlistment_ as subsection titles but wonder if these should be _GPU driver auto-install_ and _Slurm enlistment of GPUs_. This would make the context obvious when, e.g., looking through search results, but would mean a lot of repetition of "GPU". Let me know your thoughts.

@dsloanm requested a review from a team as a code owner on January 15, 2025 at 14:36
@dsloanm requested review from AshleyCliff and removed the request for a team on January 15, 2025 at 14:36
@dsloanm force-pushed the explanation-gpu branch 3 times, most recently from 5110674 to 9268320, on January 17, 2025 at 17:49
@AshleyCliff (Contributor) left a comment:

Overall everything looks good. Just a few minor requested changes for clarity.


## Slurm configuration

Each GPU-equipped node is added to the `gres.conf` configuration file as its own `NodeName` entry, following the format defined in the [Slurm `gres.conf` documentation](https://slurm.schedmd.com/gres.conf.html). Individual `NodeName` entries are used over an entry per GRES resource to provide greater support for heterogeneous environments, such as a cluster where the same model of GPU is not consistently the same device file across compute nodes.
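A minimal sketch of what these per-node entries might look like, using hypothetical node names, GPU model, and device paths (not taken from the charms or this PR):

```
# Hypothetical gres.conf: one NodeName entry per GPU-equipped node.
# The same GPU model sits at /dev/nvidia0 on one node and /dev/nvidia1 on another,
# which per-node entries can capture in a single shared file.
NodeName=compute-0 Name=gpu Type=tesla_t4 File=/dev/nvidia0
NodeName=compute-1 Name=gpu Type=tesla_t4 File=/dev/nvidia1
```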
Contributor (inline comment):

Clarify/change 'entries are used over an entry per GRES resource'. As phrased currently, it's hard to understand what's what.

@dsloanm (PR author) replied:
I've revised this to give more information on what the NodeName parameter is and why it's used. Hopefully things are a bit clearer now. The table I've added to explain the parameters might be unnecessary repetition, given we already link to the gres.conf docs where they're explained in greater detail. Let me know what you think.

I've also removed the comment on heterogeneous environments: while it's true, the real motivation is to have a single gres.conf file on the slurmctld controller that all compute nodes share (rather than an individual gres.conf per compute node). The greater support for heterogeneous environments is an effect of that.

@dsloanm (PR author) commented on Jan 21, 2025:

Latest commit duplicates the gres.conf table in the Reference section and adds a related link.

Not sure what happened with the CI. It failed checking https://jwt.io/ but the link is still live for me.

@NucciTheBoss (Member) replied:

> Latest commit duplicates the gres.conf table in the Reference section and adds a related link.
>
> Not sure what happened with the CI. It failed checking https://jwt.io/ but the link is still live for me.

I'm having the same issue on my PR. The link is still active but `make linkcheck` is throwing a fit. Hopefully an intermittent issue.

@NucciTheBoss (Member) left a comment:

Just a couple of non-blocking thoughts. Curious to know your opinions 😃

Comment on lines 16 to 23
Each GPU-equipped node is added to the `gres.conf` configuration file following the format defined in the [Slurm `gres.conf` documentation](https://slurm.schedmd.com/gres.conf.html). A single `gres.conf` is shared by all compute nodes in the cluster, using the optional `NodeName` specification to define GPU resources per node. Each line in `gres.conf` consists of the following parameters:

| Parameter | Description |
| ---------- | ---------------------------------------------------------- |
| `NodeName` | Node the `gres.conf` line applies to. |
| `Name` | Name of the generic resource. Always `gpu` here. |
| `Type` | GPU model name. |
| `File` | Path of the device file(s) associated with this GPU model. |
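As a concrete illustration of the table, one such gres.conf line might read as follows; the node name, GPU type, and device range are hypothetical, not taken from the PR:

```
# Hypothetical gres.conf line: node gpu-node-0 exposes two GPUs of the same model,
# declared with the NodeName, Name, Type, and File parameters from the table above.
NodeName=gpu-node-0 Name=gpu Type=a100 File=/dev/nvidia[0-1]
```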
Member (inline comment):

Typically I italicize filenames like slurm.conf and gres.conf rather than encapsulating them in inline code blocks. That way it's clear, to me at least, that I'm referring to something technical in a sentence that isn't code. You can see it in the README for slurmutils.

This is kinda just something I started doing and stuck with, but it highlights that we should decide now how to reference important file names in our documentation. Italicize, code block, or something else? Thoughts about this @AshleyCliff @dsloanm @jedel1043?

[side note]: @dsloanm I like the table 😎

@dsloanm (PR author) replied:

Agreed, using different formats for file names and code makes sense. Italics works for me; updated.

Contributor:

Agreed that it should be different from how code is distinguished from regular text; italics for filenames works well.

@AshleyCliff (Contributor) left a comment:

A couple of suggestions for the table and slurm.conf sections; otherwise it looks great!


Comment on the updated line:

> In _slurm.conf_, the configuration for GPU-equipped nodes has a comma-separated list in its `Gres=` element, giving the name, type, and count for each GPU on the node.
Contributor (suggested change):

> In _slurm.conf_, the `Gres=` element of each line provides a comma-separated list of GPU-equipped node configurations. The format for each configuration is: `<name>:<type>:<count>`, as seen in the example below.

Contributor:

That's probably not quite right, but the idea is some way to make the pattern explicit and then refer to the example; that way, even if readers don't scroll the text in the code example, they still see the explicit pattern.

@dsloanm (PR author) commented on Jan 24, 2025:

Reworded this to state the pattern and point the reader to the example. I also added a bit more on when the Gres= element is and isn't included.
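For readers following along, a minimal sketch of what such a slurm.conf node line could look like; the node name, hardware figures, and GPU types are hypothetical and not taken from this PR:

```
# Hypothetical slurm.conf node entry. The Gres= element lists one name:type:count
# triple per GPU model, comma-separated when a node carries more than one model;
# nodes without GPUs simply omit Gres= from their line.
NodeName=gpu-node-0 CPUs=32 RealMemory=128000 Gres=gpu:a100:1,gpu:t4:2 State=UNKNOWN
```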

@AshleyCliff (Contributor):

Looks good!

@AshleyCliff merged commit 9f8b6f1 into charmed-hpc:main on Jan 24, 2025. 2 checks passed.