Description
ULFM2 is a fork of Open MPI that implements User-Level Failure Mitigation (ULFM) fault tolerance.
When OpenCoarrays 2.4.0 is configured with CAF_ENABLE_FAILED_IMAGES=TRUE, the cafrun wrapper script passes the --disable-auto-cleanup option to mpiexec so that an MPICH-based MPI can continue executing after an MPI process failure. A user who does not want fault tolerance can specify the (MPICH-mpiexec-specific) --reenable-auto-cleanup option.
The ULFM2 mpiexec has neither of these options; rather, the equivalent of --disable-auto-cleanup is assumed by default. Fault tolerance can be disabled with the --disable-recovery mpiexec option.
It would be beneficial if cafrun could accommodate the ULFM2 mpiexec syntax. One possible approach would be to rename the --reenable-auto-cleanup cafrun option to something more generic and descriptive (like --disable-failed-images), and have cafrun select the appropriate mpiexec option based on the MPI implementation.
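
A rough sketch of how that selection might look inside cafrun is given below. It is only an illustration under stated assumptions: the function name failed_images_mpiexec_option and the flavor labels "mpich" and "ulfm2" are hypothetical, and the sketch leaves open how cafrun would detect which MPI implementation it is wrapping. The two mpiexec options are the ones described above.

```shell
# Hypothetical helper: print the mpiexec option (if any) that matches the
# MPI flavor and the user's choice about failed-image support.
#   $1: MPI flavor, "mpich" or "ulfm2" (labels assumed for this sketch)
#   $2: "true" if the proposed --disable-failed-images option was given
failed_images_mpiexec_option() {
  local flavor="$1" disable_failed_images="$2"
  case "${flavor}:${disable_failed_images}" in
    mpich:false) echo "--disable-auto-cleanup" ;;  # let surviving processes keep running
    mpich:true)  ;;                                # nothing to add: auto cleanup stays on
    ulfm2:false) ;;                                # nothing to add: recovery is the default
    ulfm2:true)  echo "--disable-recovery" ;;      # turn fault tolerance off
    *)
      echo "unsupported combination: ${flavor}:${disable_failed_images}" >&2
      return 1
      ;;
  esac
}
```

With something like this in place, cafrun could append the printed option (possibly empty) to the mpiexec command line, so users would only ever see the single implementation-neutral --disable-failed-images option.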