PWR026: Annotate function for OpenMP offload

Issue

An OpenMP offload version of the function can be generated by the compiler.

Actions

Annotate the function with the OpenMP declare target directive.

Relevance

When a loop is offloaded to the GPU, the compiler translates the relevant code sections into GPU-friendly instructions, which are embedded into the main CPU program. The runtime is in charge of executing these code sections in the GPU and handles data movements between the CPU and GPU memories.

If an offloaded loop calls functions, GPU versions of those functions must be generated as well. To achieve this, mark the relevant functions with the OpenMP declare target directive. If this isn't done, the CPU version will be called instead. This results in significant performance issues because the computation data must be moved from the GPU to the CPU to execute the function, and then back to the GPU once the function returns.

Code example

C

While the loop below is correctly offloaded to the GPU, the function foo() isn't marked for GPU execution. Consequently, the compiler will only create a CPU version of foo(). This means every time foo() is called, the data must be transferred between the GPU and the CPU, negatively impacting performance:

__attribute__((const)) int foo(int a) {
  return 2 * a;
}

void example(int n, int *A) {
  #pragma omp target teams distribute parallel for default(none) shared(A, n)
  for (int i = 0; i < n; i++) {
    A[i] = foo(i);
  }
}

To prevent the performance loss caused by repeated data transfers, enclose foo() between the #pragma omp declare target and #pragma omp end declare target directives. This instructs the compiler to also create a GPU version of the function, allowing the loop to run entirely on the device:

#pragma omp declare target
__attribute__((const)) int foo(int a) {
  return 2 * a;
}
#pragma omp end declare target

void example(int n, int *A) {
  #pragma omp target teams distribute parallel for default(none) shared(A, n)
  for (int i = 0; i < n; i++) {
    A[i] = foo(i);
  }
}

Fortran

While the loop below is correctly offloaded to the GPU, the function foo() isn't marked for GPU execution. Consequently, the compiler will only create a CPU version of foo(). This means every time foo() is called, the data must be transferred between the GPU and the CPU, negatively impacting performance:

subroutine example(A)
  implicit none
  integer, intent(out) :: A(:)
  integer :: i

  !$omp target teams distribute parallel do default(none) private(i) shared(A)
  do i = 1, size(A, 1)
    A(i) = foo(i)
  end do

contains

  pure integer function foo(a)
    implicit none
    integer, intent(in) :: a
    foo = 2 * a
  end function foo
end subroutine example

To prevent the performance loss caused by repeated data transfers, add the !$omp declare target directive to foo(). This instructs the compiler to also create a GPU version of the function, allowing the loop to run entirely on the device:

subroutine example(A)
  implicit none
  integer, intent(out) :: A(:)
  integer :: i

  !$omp target teams distribute parallel do default(none) private(i) shared(A)
  do i = 1, size(A, 1)
    A(i) = foo(i)
  end do

contains

  pure integer function foo(a)
    !$omp declare target
    implicit none
    integer, intent(in) :: a
    foo = 2 * a
  end function foo
end subroutine example
