gdicker1
Collaborator

This PR adds a workaround for problems encountered on systems using OpenMPI v5.x (observed with 5.0.7). The inlist argument to mpas_dmpar_scatter_ints seems to be affected by the MPI_Scatterv call, which then causes run-time failures in mpas_block_decomp_cells_for_proc when re-reading the block decomposition file. Making global_list an allocatable, associating a pointer with it, and using that pointer as the inlist argument to mpas_dmpar_scatter_ints seems to resolve the issue.

NOTE: So that this PR can be applied broadly, it is based on a very, very old commit: I think around v4.0, and at least as old as the v6.0 tag.
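
The workaround described above can be sketched roughly as follows. This is only an illustration of the allocatable-plus-pointer pattern, not the MPAS code: sum_list is a hypothetical stand-in for mpas_dmpar_scatter_ints (it makes no MPI calls), and global_list_ptr is an assumed name for the associated pointer.

```fortran
! Hypothetical sketch of the pattern in this PR; not taken from MPAS.
module scatter_workaround_sketch
   implicit none
contains
   ! Stand-in for a routine whose input list is a pointer dummy argument.
   ! The real routine would call MPI; this one only sums the list.
   function sum_list(inlist) result(total)
      integer, dimension(:), pointer :: inlist
      integer :: total
      total = sum(inlist)
   end function sum_list
end module scatter_workaround_sketch

program demo_workaround
   use scatter_workaround_sketch, only : sum_list
   implicit none
   ! global_list is an allocatable (with the target attribute so that a
   ! pointer may be associated with it) rather than a pointer itself.
   integer, dimension(:), allocatable, target :: global_list
   integer, dimension(:), pointer :: global_list_ptr   ! assumed name
   integer :: i

   allocate(global_list(5))
   global_list = [(i, i = 1, 5)]

   ! Associate a pointer with the allocatable and pass the pointer.
   global_list_ptr => global_list
   if (sum_list(global_list_ptr) /= 15) error stop 'unexpected sum'
   print *, 'sum =', sum_list(global_list_ptr)

   nullify(global_list_ptr)
   deallocate(global_list)
end program demo_workaround
```

The storage is owned by the allocatable here, and the pointer is only an alias used to satisfy the pointer dummy argument, so only global_list is deallocated.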

…_proc

This is a workaround for problems encountered on systems using OpenMPI
v5.x (observed with 5.0.7).  The inlist argument to
mpas_dmpar_scatter_ints seems to be affected by the MPI_ScatterV call,
which then causes run-time fails in mpas_block_decomp_cells_for_proc
when re-reading the block decomposition file. Making global_list an
allocatable, associating a pointer with it, and passing that pointer to
mpas_dmpar_scatter_ints seems to resolve the issue.
@gdicker1
Collaborator Author

The problems addressed in this PR were first noted around August of 2024, when a collaborator on the EarthWorks project who was working on TACC's Vista system ran into issues getting MPAS-A to work as a step towards running EarthWorks (CESM).

Another EarthWorks user posted about this in EarthWorksOrg/EarthWorks#109 and proposed this fix. When they were running on the Narval system in Canada, runs would die with a "FIO-F-231/list-directed read/unit=1/error on data conversion." message which pointed back to mpas_block_decomp.F near line 181.

On Derecho, I ran the develop branch (from the merge of PR#1298) with nvhpc/25.1 and openmpi/5.0.7 modules. This reported a "double free or corruption (!prev)" which crashed the model. Examining this with gdb seems to point back to mpas_block_decomp.F line 261. Once the fix was applied, I could not recreate this error.

@mgduda
Contributor

mgduda commented Aug 5, 2025

I think I'm able to reproduce the issue that this PR aims to resolve. An alternative fix to the one proposed here may involve simply changing the inlist argument to the mpas_dmpar_scatter_ints routine to a regular array (removing the pointer attribute); i.e., changing the inlist dummy argument so that it is declared as

      integer, dimension(:), intent(in) :: inlist !< Input: List of integers to send
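
As a rough illustration of why a non-pointer declaration like this is more flexible, the sketch below passes allocatable, pointer, fixed-size, and array-section actual arguments to the same assumed-shape dummy. The routine sum_list and all variable names are hypothetical stand-ins, and no MPI calls are made.

```fortran
! Hypothetical sketch; sum_list stands in for a routine such as
! mpas_dmpar_scatter_ints and is not taken from MPAS.
module scatter_dummy_sketch
   implicit none
contains
   function sum_list(inlist) result(total)
      ! Assumed-shape, non-pointer dummy argument, declared as suggested above.
      integer, dimension(:), intent(in) :: inlist
      integer :: total
      total = sum(inlist)
   end function sum_list
end module scatter_dummy_sketch

program demo_assumed_shape
   use scatter_dummy_sketch, only : sum_list
   implicit none
   integer, dimension(:), allocatable, target :: alloc_list
   integer, dimension(:), pointer :: ptr_list
   integer, dimension(4) :: fixed_list
   integer :: i

   allocate(alloc_list(4))
   alloc_list = [(i, i = 1, 4)]
   ptr_list => alloc_list
   fixed_list = [10, 20, 30, 40]

   ! Each kind of actual argument is accepted by the same dummy.
   if (sum_list(alloc_list) /= 10) error stop 'allocatable actual failed'
   if (sum_list(ptr_list) /= 10) error stop 'pointer actual failed'
   if (sum_list(fixed_list) /= 100) error stop 'fixed-size actual failed'
   if (sum_list(fixed_list(2:3)) /= 50) error stop 'array section failed'
   print *, 'all calls accepted'

   nullify(ptr_list)
   deallocate(alloc_list)
end program demo_assumed_shape
```

With an assumed-shape dummy, an associated pointer actual argument is passed as its target, so existing callers that hold a pointer would not need to change under this sketch.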

@mgduda
Contributor

mgduda commented Aug 26, 2025

PR #1361 implements an alternative fix that has the positive side effect of allowing the mpas_dmpar_scatter_ints routine to work with non-pointer send buffer arrays.

@mgduda mgduda closed this Aug 26, 2025