Open
Labels
content: Improvements or additions to documentation content
Description
@khossain4337 made several updates to the PyTorch page in October in 29d9187, but not every change made it to the oneCCL page in 88afc24
Both pages could use review and improvement, since they haven't had many editors this year. We should include more folks beyond just @FilippoSimini, @khossain4337, and @rickybalin.
Possible updates needed, especially to https://docs.alcf.anl.gov/aurora/data-science/frameworks/oneCCL/
- Remove all mentions of Horovod? Or at least the PyTorch + Horovod example?
- Remove IPEX mentions?
- Check that the oneCCL env vars are set to the latest recommendations
- Deduplicate example scripts between the two pages. Create a `.py` file stored in that directory and included in each page via https://squidfunk.github.io/mkdocs-material/setup/extensions/python-markdown-extensions/#snippets
This came up because @brianhol42 adapted the oneCCL > DDP example to an FSDP2 example and was getting hangs between two nodes. It is unclear whether the cause is a bug in the stack, a bug in his code, or a vital environment variable setting missing from the documentation.
See argonne-lcf/test_frameworks#7
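One way to help rule out the "missing environment variable" explanation is to dump the relevant variables on every node and compare the output against what the docs recommend. The following is only a minimal sketch: the helper names `collect_env` and `format_report` are hypothetical, and filtering on the `CCL_` and `FI_` prefixes is an assumption about which variables matter here, not a list taken from the documentation.

```python
import os
import socket


def collect_env(prefixes=("CCL_", "FI_"), environ=None):
    """Return sorted (name, value) pairs whose names start with one of the prefixes.

    The default prefixes are an assumption: CCL_ for oneCCL settings and
    FI_ for libfabric settings. Adjust them to match the docs under review.
    """
    environ = os.environ if environ is None else environ
    return sorted(
        (name, value)
        for name, value in environ.items()
        if name.startswith(tuple(prefixes))
    )


def format_report(pairs, host):
    """Format one line per variable, tagged with the host name for easy diffing."""
    lines = [f"[{host}] {name}={value}" for name, value in pairs]
    return "\n".join(lines) if lines else f"[{host}] no matching variables set"


if __name__ == "__main__":
    # Run once per node (or per rank) and diff the output between nodes.
    print(format_report(collect_env(), socket.gethostname()))
```

Running this on both nodes of the failing job and diffing the reports would show whether the two nodes see the same settings, and whether any variable recommended on the oneCCL page is absent.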
Decisions, questions, uncertainties, and dependencies
No response
Priority Level
Medium (Normal priority)
Label Confirmation
- I have selected at least one appropriate label