RFE: support ULFM2 mpiexec in cafrun #623

Open
@nathanweeks

ULFM2 is a fork of Open MPI that implements fault tolerance based on User-Level Failure Mitigation (ULFM).

When OpenCoarrays 2.4.0 is configured with CAF_ENABLE_FAILED_IMAGES=TRUE, the cafrun wrapper script adds the --disable-auto-cleanup option to mpiexec so that an MPICH-based MPI can continue execution in the event of an MPI process failure. A user who doesn't want fault tolerance can specify the (MPICH-mpiexec-specific) --reenable-auto-cleanup option.

The ULFM2 mpiexec has neither of these options; rather, the equivalent of --disable-auto-cleanup is assumed by default. Fault tolerance can be disabled with the --disable-recovery mpiexec option.

It would be beneficial if cafrun could accommodate the ULFM2 mpiexec syntax. One possible approach would be to rename the --reenable-auto-cleanup cafrun option to something more generic and descriptive (like --disable-failed-images), and to have cafrun select the appropriate mpiexec option based on the MPI implementation.
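The mapping proposed above could be sketched roughly as follows. This is only an illustration, not the actual cafrun code: the function name `ft_mpiexec_option` and the flavor strings `mpich` and `ulfm2` are hypothetical, and how cafrun would detect the MPI implementation is left open. The only mpiexec options used are the ones mentioned in this issue.

```shell
# Hypothetical helper: print the mpiexec option that matches the requested
# failed-images behaviour for a given MPI implementation.
#   $1: MPI flavor, "mpich" or "ulfm2" (detection is not shown here)
#   $2: "true" if the user passed the proposed --disable-failed-images option,
#       "false" otherwise
ft_mpiexec_option() {
  case "$1:$2" in
    mpich:false) echo "--disable-auto-cleanup" ;; # keep running after a process failure
    mpich:true)  : ;;                             # add nothing; mpiexec keeps its default cleanup
    ulfm2:false) : ;;                             # add nothing; recovery is already the default
    ulfm2:true)  echo "--disable-recovery" ;;     # turn fault tolerance off
    *)
      echo "unsupported combination: $1:$2" >&2
      return 1
      ;;
  esac
}
```

With something like this, cafrun could splice the result of `ft_mpiexec_option "$flavor" "$disable_failed_images"` into the mpiexec command line, so the user-facing option stays the same regardless of the underlying MPI.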
