Aurora > Data Science > Frameworks > PyTorch, oneCCL pages need updates #1062

@felker

Description

@khossain4337 made several updates to the PyTorch page in October in 29d9187, but not all of those changes were carried over to the oneCCL page in 88afc24.

Both pages could use review and improvement, since few people have edited them this year. We should involve more people than just @FilippoSimini, @khossain4337, and @rickybalin.

Possible updates needed, especially to https://docs.alcf.anl.gov/aurora/data-science/frameworks/oneCCL/

This came up because @brianhol42 adapted the oneCCL > DDP example into an FSDP2 example and was seeing hangs between two nodes. It is unclear whether the cause is a bug in the software stack, a bug in his code, or documentation that is missing a vital setting such as an environment variable.

See argonne-lcf/test_frameworks#7

Decisions, questions, uncertainties, and dependencies

No response

Priority Level

Medium (Normal priority)

Label Confirmation

  • I have selected at least one appropriate label

Metadata

Assignees

No one assigned

    Labels

    content: Improvements or additions to documentation content

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
