
Conversation

abishekg7 (Collaborator) commented Aug 7, 2025

This PR enables execution of halo exchanges on GPUs via OpenACC directives. It uses #1315 as its base branch, so #1315 must be merged before the current PR can be merged.

The packing and unpacking code around the halo exchanges uses `!$acc parallel` regions.

The actual MPI_Isend and MPI_Irecv operations use CUDA-aware MPI by wrapping these calls within

```fortran
!$acc host_data use_device(pointer_to_buffer)
!$acc end host_data
```
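As a rough sketch of what this pattern looks like in practice (the subroutine name, argument list, and buffer types below are illustrative, not the PR's actual code), the `host_data use_device` construct hands the device addresses of the buffers to MPI:

```fortran
! Illustrative sketch only: exchange a halo buffer with CUDA-aware MPI.
! Buffers are assumed to already be present on the device, filled by
! an !$acc parallel packing loop.
subroutine halo_exchange_sketch(sendBuf, recvBuf, neighbor, tag, comm, reqs)
    use mpi
    implicit none
    real, intent(in)    :: sendBuf(:)
    real, intent(inout) :: recvBuf(:)
    integer, intent(in) :: neighbor, tag, comm
    integer, intent(out) :: reqs(2)
    integer :: ierr

    ! host_data makes MPI see device pointers, so no host staging is needed
    !$acc host_data use_device(sendBuf, recvBuf)
    call MPI_Irecv(recvBuf, size(recvBuf), MPI_REAL, neighbor, tag, comm, reqs(1), ierr)
    call MPI_Isend(sendBuf, size(sendBuf), MPI_REAL, neighbor, tag, comm, reqs(2), ierr)
    !$acc end host_data
end subroutine halo_exchange_sketch
```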

abishekg7 force-pushed the framework/acc_halo_exch branch from f5a7287 to acdba1c on August 13, 2025 17:43
This PR consolidates much of the OpenACC host and device data transfer during dynamical execution into two subroutines, mpas_atm_pre_dynamics_h2d and mpas_atm_post_dynamics_d2h, which are called before and after the call to the atm_srk3 subroutine. Because atm_compute_solve_diagnostics is also called once before the start of the model run, there is also a pair of subroutines, mpas_atm_pre_computesolvediag_h2d and mpas_atm_post_computesolvediag_d2h, to handle data movement around that first call to atm_compute_solve_diagnostics. Any fields copied onto the device in these subroutines are removed from explicit data-movement statements in the dynamical core.
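The resulting call structure can be sketched as follows (argument lists are illustrative, not the PR's actual interfaces):

```fortran
! Illustrative sketch of the consolidated data-transfer pattern:
call mpas_atm_pre_dynamics_h2d(domain)   ! copy needed fields host -> device
call atm_srk3(domain, dt, itimestep)     ! dynamical core runs on the GPU
call mpas_atm_post_dynamics_d2h(domain)  ! copy updated fields device -> host
```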

The mesh/time-invariant fields are still copied onto the device in mpas_atm_dynamics_init and removed from the device in mpas_atm_dynamics_finalize, with the exception of select fields moved in mpas_atm_pre_computesolvediag_h2d and mpas_atm_post_computesolvediag_d2h. This is a special case because atm_compute_solve_diagnostics is called for the first time before the call to mpas_atm_dynamics_init.

This PR also includes explicit host-device data transfers in the mpas_atm_iau, mpas_atmphys_interface, and mpas_atmphys_todynamics modules to ensure that the physics and IAU regions, which run on the CPU, use the latest values from the dynamical core running on GPUs, and vice versa. It also adds explicit data transfers around halo exchanges in the atm_srk3 subroutine.
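A minimal sketch of the CPU/GPU handoff around a CPU-side region, assuming hypothetical field and driver names (the actual fields and routines differ in the PR):

```fortran
! Illustrative only: hand updated dynamics fields to CPU physics,
! then return physics tendencies to the device for the next GPU step.
!$acc update self(theta_m, rho_zz, u)        ! device -> host before CPU physics
call physics_driver(domain)                  ! physics/IAU runs on the CPU
!$acc update device(tend_theta, tend_rho)    ! host -> device before GPU dynamics
```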

These data-transfer subroutines and the acc update statements are an interim solution until a book-keeping method for device data is in place. This PR also introduces a couple of new timers to track the cost of data transfers.
…t_2d

This commit introduces two OpenACC data transfer routines, mpas_reconstruct_2d_h2d and mpas_reconstruct_2d_d2h, in order to remove the data transfers from the mpas_reconstruct_2d routine itself. This also allows us to remove extraneous data movements within the atm_srk3 routine.

mpas_reconstruct_2d_h2d and mpas_reconstruct_2d_d2h are called before and after the call to mpas_reconstruct in atm_mpas_init_block, and the reconstructed vector fields are also copied to and from the device before and after every dynamics call, in mpas_atm_pre_dynamics_h2d and mpas_atm_post_dynamics_d2h. This commit also introduces changes to ensure that building with -DCURVATURE still produces correct results compared to the NVHPC CPU reference. This involves removing the data movement of the reconstructed zonal and meridional velocities in the atm_compute_dyn_tend_work subroutine and instead using copyin for those fields in mpas_atm_pre_dynamics_h2d.

This commit also removes the ACC data transfer timers for the atm_compute_dyn_tend_work subroutine, as we only have create/delete statements there.
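Sketched out, the transfers now bracket the reconstruction call instead of living inside it (argument lists below are illustrative, not the actual mpas_reconstruct interface):

```fortran
! Illustrative sketch of the hoisted data transfers:
call mpas_reconstruct_2d_h2d(meshPool, diagPool)   ! inputs host -> device
call mpas_reconstruct(meshPool, u, uReconstructX, uReconstructY, &
                      uReconstructZ, uReconstructZonal, uReconstructMeridional)
call mpas_reconstruct_2d_d2h(meshPool, diagPool)   ! results device -> host
```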
```fortran
use mpas_derived_types, only : domain_type, mpas_halo_group, MPAS_HALO_REAL, MPAS_LOG_CRIT
use mpas_pool_routines, only : mpas_pool_get_array
use mpas_log, only : mpas_log_write
use mpas_timer, only : mpas_timer_start, mpas_timer_stop
```
A Contributor commented:
Don't forget to update the Makefile to add a dependency on mpas_timer.o for the mpas_halo.o target.

abishekg7 and others added 17 commits October 3, 2025 15:05
This commit does work and matches the previous results!
NOTE: The last commit was successful!
Last commit had differences from the baseline. It's either this, or the change dropping `update device(group % sendBuf(:))` in the last commit
Last commit still had answer differences
This should make the dependency analysis easier on the compiler.

NOTE: The last commit succeeded and had no diffs after 1 timestep
compared to a reference run!
…o force GPUDirect MPI

NOTE: The last commit ran successfully and matched previous 1 step
results
…r variables

Last run failed with CUDA_ERROR_ILLEGAL_ADDRESS, I think keeping these
on the GPU would help!
Last commit gave me some big differences, let's see if this helps.

If this helps, then that means I wasn't using GPU-aware MPI routines
like I thought...
…calls instead

Last commit still had answer differences.

NOTE: This commit does too
Introducing a new namelist option under development, config_gpu_aware_mpi, which controls whether an OpenACC run of MPAS on GPUs uses GPU-aware MPI or performs device<->host updates of variables around a call to a purely CPU-based halo exchange.

Note: This feature is not available when config_halo_exch_method is set to 'mpas_dmpar'
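The runtime branch this option controls might look roughly like the following (the exchange routine and buffer names are illustrative, not the PR's actual code):

```fortran
! Illustrative sketch of the config_gpu_aware_mpi branch:
if (config_gpu_aware_mpi) then
    ! pass device buffer addresses straight to MPI
    !$acc host_data use_device(sendBuf, recvBuf)
    call exchange_halo(sendBuf, recvBuf)
    !$acc end host_data
else
    ! stage through the host for a purely CPU-based halo exchange
    !$acc update self(sendBuf)
    call exchange_halo(sendBuf, recvBuf)
    !$acc update device(recvBuf)
end if
```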
abishekg7 force-pushed the framework/acc_halo_exch branch from d0c1431 to 1e08917 on October 13, 2025 22:50