Is your feature request related to a problem? Please describe.
Proper use of GPU-aware MPI is critical to the performance of domain-decomposed, GPU-accelerated code.
The FIAT MPL component is currently not GPU-aware: it cannot correctly pass device-resident buffers to the underlying MPI implementation.
Describe the solution you'd like
Two possible ways of lifting this restriction are proposed here:
- using OpenACC / OpenMP semantics to specify when buffers are device-resident
- using the (de facto industry standard) compiler extension !DIR$ IGNORE_TKR (d) to allow passing device-resident buffers to existing MPL routines
The first option requires user-side code changes to specify the residency of buffers, either per MPL call or through a default-behaviour switch. It has the advantage of being standards-based, but requires FIAT itself to be compiled with GPU (OpenACC / OpenMP offload) support.
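As a rough illustration of the first option, the sketch below wraps a blocking send in an OpenACC `host_data use_device` region when the caller flags the buffer as device-resident. The module name, the routine name `send_r8_sketch` and the `on_device` argument are hypothetical and are not part of the existing MPL interface; the sketch assumes an OpenACC-capable compiler and a GPU-aware MPI library.

```fortran
! Hypothetical sketch, not the actual FIAT MPL code.
module mpl_acc_sketch_mod
  use mpi
  implicit none
contains
  subroutine send_r8_sketch(buf, dest, tag, comm, on_device)
    real(kind=8), contiguous, intent(in) :: buf(:)
    integer, intent(in) :: dest, tag, comm
    logical, intent(in), optional :: on_device  ! assumed residency flag
    integer :: ierr
    logical :: use_dev

    ! Default to host-resident buffers unless the caller says otherwise.
    use_dev = .false.
    if (present(on_device)) use_dev = on_device

    if (use_dev) then
      ! Hand the device address of buf to MPI (needs GPU-aware MPI).
      !$acc host_data use_device(buf)
      call MPI_Send(buf, size(buf), MPI_DOUBLE_PRECISION, dest, tag, comm, ierr)
      !$acc end host_data
    else
      call MPI_Send(buf, size(buf), MPI_DOUBLE_PRECISION, dest, tag, comm, ierr)
    end if
  end subroutine send_r8_sketch
end module mpl_acc_sketch_mod
```

A caller would then write something like `call send_r8_sketch(buf, dest, tag, comm, on_device=.true.)` while `buf` is present on the device, which is the kind of user-side change this option implies. An OpenMP variant would presumably use `!$omp target data use_device_addr(buf)` in place of the OpenACC directives.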
The second option has the advantage of not requiring user-side code changes, but relies on a compiler extension that is not part of any standard, so support across compilers is not guaranteed.
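The second option could look roughly like the following, where the directive is attached to the buffer dummy argument so that the compiler stops checking its device attribute. The module and routine names are hypothetical, and the directive placement follows the convention used by NVIDIA-style compilers to our understanding; other compilers may treat the line as a plain comment or reject the `d` letter.

```fortran
! Hypothetical sketch, not the actual FIAT MPL code.
module mpl_tkr_sketch_mod
  use mpi
  implicit none
contains
  subroutine send_r8_tkr_sketch(buf, n, dest, tag, comm)
    ! Compiler extension: do not check the device attribute of buf,
    ! so host and device actual arguments are both accepted.
    !DIR$ IGNORE_TKR (d) buf
    real(kind=8), intent(in) :: buf(*)
    integer, intent(in) :: n, dest, tag, comm
    integer :: ierr

    ! buf is forwarded untouched; telling host and device addresses
    ! apart is left to the GPU-aware MPI library.
    call MPI_Send(buf, n, MPI_DOUBLE_PRECISION, dest, tag, comm, ierr)
  end subroutine send_r8_tkr_sketch
end module mpl_tkr_sketch_mod
```

With this approach the call site keeps its current form, and the caller is responsible for passing a device address when the data lives on the GPU (for example from inside its own `host_data use_device` region).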
Working but unfinished example branches for both approaches, each tested against ectrans, are listed below:
Which option should we choose going forward?
Describe alternatives you've considered
A third possibility, relying on CUDA Fortran to specify residency in routine interfaces, is discounted here because of vendor lock-in.
Additional context
No response
Organisation
No response