Graceful Failure with Incorrect API Usage #1174

@wfaderhold21

An application using UCC may attempt a collective operation on a UCC team whose underlying UCC context has already been destroyed. This usage is invalid, and is documented as such in the API, but it can fail in two ways:

  1. ucc_collective_init succeeds, but ucc_collective_post produces a segmentation fault. This can occur when the context has been destroyed but the library has not been finalized.
  2. ucc_collective_init produces errors similar to the following. This occurs when the UCC context has been destroyed and the UCC library has been finalized.

[1751278711.720649] [eos0260:399988:0]          ucc_mc.c:143  UCC  ERROR no components supported memory type host available

These errors are difficult to track down for users who are not familiar with UCC. Currently, we only check for reuse of a destroyed UCC team. It may be beneficial to also check for these two cases in ucc_collective_init, to prevent such failures and let applications keep executing long enough to shut down gracefully.
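
To make the proposal concrete, here is a minimal, self-contained sketch of the kind of check being suggested. It does not use UCC's real types or internals: every name prefixed with `sk_` is hypothetical and exists only for illustration. It also assumes that the context and library objects remain readable after being destroyed or finalized (here they are marked as destroyed rather than freed). A real implementation would have to guarantee that separately, for example with reference counting or a handle registry.

```c
#include <stddef.h>

/* Hypothetical status codes for this sketch only; UCC has its own ucc_status_t. */
typedef enum { SK_OK = 0, SK_ERR_INVALID_PARAM = -1 } sk_status_t;

/* Lifecycle state tracked on every object in the sketch. */
typedef enum { SK_STATE_ACTIVE, SK_STATE_DESTROYED } sk_state_t;

/* Hypothetical stand-ins for the library, context, and team objects. */
typedef struct { sk_state_t state; } sk_lib_t;
typedef struct { sk_lib_t *lib; sk_state_t state; } sk_context_t;
typedef struct { sk_context_t *ctx; sk_state_t state; } sk_team_t;

void sk_lib_init(sk_lib_t *lib) { lib->state = SK_STATE_ACTIVE; }
void sk_lib_finalize(sk_lib_t *lib) { lib->state = SK_STATE_DESTROYED; }

void sk_context_create(sk_context_t *ctx, sk_lib_t *lib)
{
    ctx->lib   = lib;
    ctx->state = SK_STATE_ACTIVE;
}

/* Assumption: destroy only marks the object, so later checks can still read it. */
void sk_context_destroy(sk_context_t *ctx) { ctx->state = SK_STATE_DESTROYED; }

void sk_team_create(sk_team_t *team, sk_context_t *ctx)
{
    team->ctx   = ctx;
    team->state = SK_STATE_ACTIVE;
}

void sk_team_destroy(sk_team_t *team) { team->state = SK_STATE_DESTROYED; }

/*
 * Validate the whole team -> context -> library chain before doing any work.
 * Returns an error instead of proceeding into a state that could crash later.
 */
sk_status_t sk_collective_init(const sk_team_t *team)
{
    /* Existing kind of check: the team itself must still be alive. */
    if (team == NULL || team->state != SK_STATE_ACTIVE) {
        return SK_ERR_INVALID_PARAM;
    }
    /* Proposed check 1: the context the team was created on must be alive. */
    if (team->ctx == NULL || team->ctx->state != SK_STATE_ACTIVE) {
        return SK_ERR_INVALID_PARAM;
    }
    /* Proposed check 2: the library behind that context must not be finalized. */
    if (team->ctx->lib == NULL || team->ctx->lib->state != SK_STATE_ACTIVE) {
        return SK_ERR_INVALID_PARAM;
    }
    return SK_OK;
}
```

With this shape, calling `sk_collective_init` on a team after `sk_context_destroy` or `sk_lib_finalize` returns `SK_ERR_INVALID_PARAM`, so the caller can log the misuse and proceed to shut down rather than hit a segmentation fault in a later post call.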
