Skip to content

Latest commit

 

History

History
893 lines (652 loc) · 34.6 KB

utilization.md

File metadata and controls

893 lines (652 loc) · 34.6 KB
title
Utilisation cheat sheet for Kokkos

Kokkos utilization cheat sheet

Warning Only for Kokkos 4.5 and more, for older version look at the doc.

Doc https://kokkos.org/kokkos-core-wiki/programmingguide.html

Doc https://kokkos.org/kokkos-core-wiki/API/core-index.html

Doc https://github.com/kokkos/kokkos-tutorials/blob/main/LectureSeries/KokkosTutorial_01_Introduction.pdf

Header

#include <Kokkos_Core.hpp>

Initialization

Initialize and finalize

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    { /* ... */ }
    Kokkos::finalize();
}

Scope guard

int main(int argc, char* argv[]) {
    Kokkos::ScopeGuard kokkos(argc, argv);
    /* ... */
}

Kokkos concepts

Execution spaces

Execution space Device backend Host backend
Kokkos::DefaultExecutionSpace On device On host
Kokkos::DefaultHostExecutionSpace On host On host

Doc https://kokkos.org/kokkos-core-wiki/API/core/execution_spaces.html

Memory spaces

Doc https://kokkos.org/kokkos-core-wiki/API/core/memory_spaces.html

Generic memory spaces

Memory space Device backend Host backend
Kokkos::DefaultExecutionSpace::memory_space On device On host
Kokkos::DefaultHostExecutionSpace::memory_space On host On host

Specific memory spaces

Memory space Description
Kokkos::HostSpace Accessible from the host but maybe not from the device
Kokkos::SharedSpace Accessible from the host and the device; copy managed by the driver
Kokkos::SharedHostPinnedSpace Accessible from the host and the device; zero copy access in small chunks
Examples
// Host space
Kokkos::View<double*, Kokkos::HostSpace> hostView("hostView", numberOfElements);

// Shared space
Kokkos::View<double*, Kokkos::SharedSpace> sharedView("sharedView", numberOfElements);

// Shared host pinned space
Kokkos::View<double*, Kokkos::SharedHostPinnedSpace> sharedHostPinnedView("sharedHostPinnedView", numberOfElements);

Memory management

View

Doc https://kokkos.org/kokkos-core-wiki/API/core/view/view.html

Create

Kokkos::View<DataType, LayoutType, MemorySpace, MemoryTraits> view("label", numberOfElementsAtRuntimeI, numberOfElementsAtRuntimeJ);
Template argument Description
DataType ScalarType for the data type, followed by a * for each runtime dimension, then by a [numberOfElements] for each compile time dimension, mandatory
LayoutType See memory layouts, optional
MemorySpace See memory spaces, optional
MemoryTraits See memory traits, optional

Warning The order of template arguments is important.

Doc https://kokkos.org/kokkos-core-wiki/API/core/view/view.html#constructors

Examples
// A 1D view of doubles
Kokkos::View<double*> view1D("view1D", 100);

// Const view of doubles
Kokkos::View<const double*> constView1D("constView1D", 100) = view1D;

// A 2D view of integers
Kokkos::View<int**> view2D("view2D", 50, 50);

// 3D view with 2 runtime dimensions and 1 compile time dimension
Kokkos::View<double**[25]> view3D("view3D", 50, 42, 25);

Manage

Method Description
(i, j...) Returns and sets the value at index i, j, etc.
size() Returns the total number of elements in the view
rank() Returns the number of dimensions
layout() Returns the layout of the view
extent(dim) Returns the number of elements in the requested dimension
data() Returns a pointer to the underlying data

Doc https://kokkos.org/kokkos-core-wiki/API/core/view/view.html#data-access-functions

Doc https://kokkos.org/kokkos-core-wiki/API/core/view/view.html#data-layout-dimensions-strides

Resize and preserve content
Kokkos::resize(view, newNumberOfElementsI, newNumberOfElementsJ...);

Doc https://kokkos.org/kokkos-core-wiki/API/core/view/resize.html

Reallocate and do not preserve content
Kokkos::realloc(view, newNumberOfElementsI, newNumberOfElementsJ...);

Doc https://kokkos.org/kokkos-core-wiki/API/core/view/realloc.html

Memory Layouts

Layout Description Default
Kokkos::LayoutRight Strides increase from the right most to the left most dimension, also known as row-major or C-like CPU
Kokkos::LayoutLeft Strides increase from the left most to the right most dimension, also known as column-major or Fortran-like GPU
Kokkos::LayoutStride Strides can be arbitrary for each dimension

By default, a layout suited for loops on the high frequency index is used.

Doc https://kokkos.org/kokkos-core-wiki/API/core/view/view.html#data-layout-dimensions-strides

Example
// 2D view with LayoutRight
Kokkos::View<double**, Kokkos::LayoutRight> view2D("view2D", 50, 50);

Memory trait

Memory traits are indicated with Kokkos::MemoryTraits<> and are combined with the | (pipe) operator.

Memory trait Description
Kokkos::Unmanaged The allocation has to be managed manually
Kokkos::Atomic All accesses to the view are atomic
Kokkos::RandomAccess Hint that the view is used in a random access manner; if the view is also const this may trigger more efficient load operations on GPUs
Kokkos::Restrict There is no aliasing of the view by other data structures in the current scope

Doc https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/View.html#access-traits

Examples
// Unmanaged view on CPU
double* data = new double[numberOfElements];
Kokkos::View<double*, Kokkos::HostSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged>> unmanagedView(data, numberOfElements);

// Unmanaged view on GPU using CUDA
double* data;
cudaMalloc(&data, numberOfElements * sizeof(double));
Kokkos::View<double*, Kokkos::MemoryTraits<Kokkos::Unmanaged>, Kokkos::CudaSpace> unmanagedView(data, numberOfElements);

// Atomic view
Kokkos::View<double*, Kokkos::MemoryTraits<Kokkos::Atomic>> atomicView("atomicView", numberOfElements);

// Random access with constant data
// first, allocate non constant view
Kokkos::View<int*> nonConstView ("data", numberOfElements);
// then, make it constant
Kokkos::View<const int*, Kokkos::MemoryTraits<Kokkos::RandomAccess>> randomAccessView = nonConstView;

// Unmanaged, atomic, random access view on GPU using CUDA
double* data;
cudaMalloc(&data, numberOfElements* sizeof(double));
Kokkos::View<const double*,  Kokkos::CudaSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged | Kokkos::Atomic | Kokkos::RandomAccess>> unmanagedView(data, numberOfElements);

Deep copy

Warning Copying or assigning a view does a shallow copy, data are not synchronized in this case.

Kokkos::deep_copy(dest, src);

The views must have the same dimensions, data type, and reside in the same memory space (mirror views can be deep copied on different memory spaces).

Doc https://kokkos.org/kokkos-core-wiki/API/core/view/deep_copy.html

Code

Example
Kokkos::View<double*> view1("view1", numberOfElements);
Kokkos::View<double*> view2("view2", numberOfElements);

// Deep copy of view1 to view2
Kokkos::deep_copy(view2, view1);

Mirror view

Doc https://kokkos.org/kokkos-core-wiki/API/core/view/create_mirror.html

Code

Create and always allocate on host

auto mirrorView = Kokkos::create_mirror(view);

Create and allocate on host if source view is not in host space

auto mirrorView = Kokkos::create_mirror_view(view);

Create, allocate and synchronize if source view is not in same space as destination view

auto mirrorView = Kokkos::create_mirror_view_and_copy(ExecutionSpace(), view);

Subview

A subview has the same reference count as its parent view, so the parent view won't be deallocated before all subviews go away.

Doc https://kokkos.org/kokkos-core-wiki/API/core/view/subview.html

Doc https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/Subviews.html

auto subview = Kokkos::subview(view, selector1, selector2, ...);
Subset selector Description
Kokkos::ALL All elements in this dimension
Kokkos::pair(first, last) Range of elements in this dimension
value Specific element in this dimension

Scatter view (experimental)

Doc https://kokkos.org/kokkos-core-wiki/API/containers/ScatterView.html

Training https://github.com/kokkos/kokkos-tutorials/blob/main/LectureSeries/KokkosTutorial_03_MDRangeMoreViews.pdf

Code

Specific header

#include <Kokkos_ScatterView.hpp>

Create

auto scatterView = Kokkos::Experimental::create_scatter_view<Operation, Duplication, Contribution>(targetView);
Template argument Description
Operation See scatter operation; defaults to Kokkos::Experimental::ScatterSum
Duplication Whether to duplicate the grid or not; choices are Kokkos::Experimental::ScatterDuplicated, and Kokkos::Experimental::ScatterNonDuplicated; defaults to the option that is the most optimised for targetView's execution space
Contribution Whether to contribute using atomics or not; choices are Kokkos::Experimental::ScatterAtomic, or Kokkos::Experimental::ScatterNonAtomic; defaults to the option that is the most optimised for targetView's execution space

Scatter operation

Operation Description
Kokkos::Experimental::ScatterSum Sum
Kokkos::Experimental::ScatterProd Product
Kokkos::Experimental::ScatterMin Minimum value
Kokkos::Experimental::ScatterMax Maximum value

Scatter, compute, and gather

Kokkos::parallel_for(
    "label",
    /* ... */,
    KOKKOS_LAMBDA (/* ... */) {
        // scatter
        auto scatterAccess = scatterView.access();

        // compute
        scatterAccess(/* index */) /* operation */ /* contribution */;
    }
);

// gather
Kokkos::Experimental::contribute(targetView, scatterView);
Full example
#include<Kokkos_ScatterView.hpp>

// Compute histogram of values in view1D
KOKKOS_INLINE_FUNCTION int getIndex(double position) { /* ... */ }
KOKKOS_INLINE_FUNCTION double compute(double weight) { /* ... */ }

// Views of 100 elements to process
Kokkos::View<double*> position("position", 100);
Kokkos::View<double*> weight("weight", 100);

// Histogram of 10 slots
Kokkos::View<double*> histogram("bar", 10);
auto histogramScatter = Kokkos::Experimental::create_scatter_view(histogram);

Kokkos::parallel_for(
    100,
    KOKKOS_LAMBDA (const int i) {
        // scatter
        auto access = histogramScatter.access();

        // compute
        const auto index = getIndex(position(i));
        const auto contribution = compute(weight(i));
        access(index) += contribution;
    }
);

// gather
Kokkos::Experimental::contribute(histogram, histogramScatter);

Parallel constructs

For loop

Kokkos::parallel_for(
    "label",
    ExecutionPolicy</* ... */>(/* ... */),
    KOKKOS_LAMBDA (/* ... */) { /* ... */ }
);

Reduction

ScalarType result;
Kokkos::parallel_reduce(
    "label",
    ExecutionPolicy</* ... */>(/* ... */),
    KOKKOS_LAMBDA (/* ... */, ScalarType& resultLocal) { /* ... */ },
    Kokkos::ReducerConcept<ScalarType>(result)
);

With Kokkos::ReducerConcept being one of the following:

Reducer Operation Description
Kokkos::BAnd & Binary and
Kokkos::BOr | Binary or
Kokkos::LAnd && Logical and
Kokkos::LOr || Logical or
Kokkos::Max std::max Maximum
Kokkos::MaxLoc std::max_element Maximum and associated index
Kokkos::Min std::min Minimum
Kokkos::MinLoc std::min_element Minimum and associated index
Kokkos::MinMax std::minmax Minimum and maximum
Kokkos::MinMaxLoc std::minmax_element Minimum and maximum and associated indices
Kokkos::Prod * Product
Kokkos::Sum + Sum

A scalar value may be passed, for which the reduction is limited to a sum. When using the TeamVectorMDRange, the TeamThreadMDRange, or the ThreadVectorMDRange execution policy, only a scalar value may be passed, for which the reduction is also limited to a sum.

Doc https://kokkos.org/kokkos-core-wiki/API/core/parallel-dispatch/parallel_reduce.html

Doc https://kokkos.org/kokkos-core-wiki/API/core/builtin_reducers.html

Fences

Global fence

Kokkos::fence("label");

Doc https://kokkos.org/kokkos-core-wiki/API/core/parallel-dispatch/fence.html

Execution space fence

ExecutionSpace().fence("label");

Team barrier

Kokkos::TeamPolicy<>::member_type().team_barrier();

Doc https://kokkos.org/kokkos-core-wiki/API/core/execution_spaces.html#functionality

Execution policy

Create

ExecutionPolicy<ExecutionSpace, Schedule, IndexType, LaunchBounds, WorkTag> policy(/* ... */);
Template argument Description
ExecutionSpace See execution spaces; defaults to Kokkos::DefaultExecutionSpace
Schedule How to schedule work items; defaults to machine and backend specifics
IndexType Integer type to be used for the index; defaults to int64_t
LaunchBounds Hints for CUDA and HIP launch bounds
WorkTag Empty tag class to call the functor

Doc https://kokkos.org/kokkos-core-wiki/API/core/Execution-Policies.html

Ranges

One-dimensional range

Kokkos::RangePolicy<ExecutionSpace, Schedule, IndexType LaunchBounds, WorkTag> policy(first, last);

If the range starts at 0 and uses default parameters, can be replaced by just the number of elements.

Doc https://kokkos.org/kokkos-core-wiki/API/core/policies/RangePolicy.html

Multi-dimensional (dimension 2)

Kokkos::MDRangePolicy<ExecutionSpace, Schedule, IndexType, LaunchBounds, WorkTag, Kokkos::Rank<2>> policy({firstI, firstJ}, {lastI, lastJ});

Doc https://kokkos.org/kokkos-core-wiki/API/core/policies/MDRangePolicy.html

Hierarchical parallelism

Doc https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/HierarchicalParallelism.html

Team policy

Kokkos::TeamPolicy<ExecutionSpace, Schedule, IndexType, LaunchBounds, WorkTag> policy(leagueSize, teamSize);

Usually, teamSize is replaced by Kokkos::AUTO to let Kokkos determine it. A kernel running in a team policy has a Kokkos::TeamPolicy<>::member_type argument:

Method Description
league_size() Number of teams in the league
league_rank() Index of the team within the league
team_size() Number of threads in the team
team_rank() Index of the thread within the team

Note that nested parallel constructs do not use KOKKOS_LAMBDA to create lambdas. One must use the C++ syntax, for example [=] or [&].

Doc https://kokkos.org/kokkos-core-wiki/API/core/policies/TeamPolicy.html

Team vector level (2-level hierarchy)

Kokkos::parallel_for(
    "label",
    Kokkos::TeamPolicy(numberOfElementsI, Kokkos::AUTO),
    KOKKOS_LAMBDA (const Kokkos::TeamPolicy<>::member_type& teamMember) {
        const int i = teamMember.team_rank();

        Kokkos::parallel_for(
            Kokkos::TeamVectorRange(teamMember, firstJ, lastJ),
            [=] (const int j) { /* ... */ }
        );
    }
);
One-dimensional range
Kokkos::TeamVectorRange range(teamMember, firstJ, lastJ);

Doc https://kokkos.org/kokkos-core-wiki/API/core/policies/TeamVectorRange.html

Multi-dimensional range (dimension 2)
Kokkos::TeamVectorMDRange<Kokkos::Rank<2>, Kokkos::TeamPolicy<>::member_type> range(teamMember, numberOfElementsJ, numberOfElementsK);

Doc https://kokkos.org/kokkos-core-wiki/API/core/policies/TeamVectorMDRange.html

Team thread vector level (3-level hierarchy)

Kokkos::parallel_for(
    "label",
    Kokkos::TeamPolicy(numberOfElementsI, Kokkos::AUTO),
    KOKKOS_LAMBDA (const Kokkos::TeamPolicy<>::member_type& teamMember) {
        const int i = teamMember.team_rank();

        Kokkos::parallel_for(
            Kokkos::TeamThreadRange(teamMember, firstJ, lastJ),
            [=] (const int j) {
                Kokkos::parallel_for(
                    Kokkos::ThreadVectorRange(teamMember, firstK, lastK),
                    [=] (const int k) { /* ... */ }
                );
            }
        );
    }
);
One-dimensional range
Kokkos::TeamThreadRange range(teamMember, firstJ, lastJ);
Kokkos::ThreadVectorRange range(teamMember, firstK, lastK);

Doc https://kokkos.org/kokkos-core-wiki/API/core/policies/TeamThreadRange.html

Doc https://kokkos.org/kokkos-core-wiki/API/core/policies/ThreadVectorRange.html

Multi-dimensional range (dimension 2)
Kokkos::TeamThreadMDRange<Kokkos::Rank<2>, Kokkos::TeamPolicy<>::member_type> range(teamMember, numberOfElementsJ, numberOfElementsK);
Kokkos::ThreadVectorMDRange<Kokkos::Rank<2>, Kokkos::TeamPolicy<>::member_type> range(teamMember, numberOfElementsL, numberOfElementsM);

Doc https://kokkos.org/kokkos-core-wiki/API/core/policies/TeamThreadMDRange.html

Doc https://kokkos.org/kokkos-core-wiki/API/core/policies/ThreadVectorMDRange.html

Scratch memory

Each team has access to a scratch memory pad, which has the team's lifetime, and is only accessible by the team's threads.

Doc https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/HierarchicalParallelism.html#team-scratch-pad-memory

Scratch memory space

Space level Memory size Access speed
0 Limited (tens of kilobytes) Fast
1 Larger (few gigabytes) Medium

Used when passing the team policy to the parallel construct and when creating the scratch memory pad.

Create and populate

// Define a scratch memory pad type
using ScratchPad = Kokkos::View<DataType, Kokkos::DefaultExecutionSpace::scratch_memory_space, Kokkos::MemoryTraits<Kokkos::Unmanaged>>;

// Compute how much scratch memory is needed (in bytes)
size_t bytes = ScratchPad::shmem_size(vectorSize);

// Create the team policy and specify the total scratch memory needed
Kokkos::parallel_for(
    "label",
    Kokkos::TeamPolicy<>(leagueSize, teamSize).set_scratch_size(spaceLevel, Kokkos::PerTeam(bytes)),
    KOKKOS_LAMBDA (const Kokkos::TeamPolicy<>::member_type& teamMember) {
        const int i = teamMember.league_rank();

        // Create the scratch pad
        ScratchPad scratch(teamMember.team_scratch(spaceLevel), vectorSize);

        // Initialize it
        Kokkos::parallel_for(
            Kokkos::TeamVectorRange(teamMember, vectorSize),
            [=] (const int j) { scratch(j) = getScratchData(i, j); }
        );

        // Synchronize
        teamMember.team_barrier();
    }
);

Atomics

Doc https://kokkos.org/kokkos-core-wiki/ProgrammingGuide/Atomic-Operations.html

Doc https://kokkos.org/kokkos-core-wiki/API/core/atomics.html

Atomic operations

Operation Replaces
Kokkos::atomic_add(&x, y) x += y
Kokkos::atomic_and(&x, y) x &= y
Kokkos::atomic_dec(&x) x--
Kokkos::atomic_inc(&x) x++
Kokkos::atomic_lshift(&x, y) x = x << y
Kokkos::atomic_max(&x, y) x = std::max(x, y)
Kokkos::atomic_min(&x, y) x = std::min(x, y)
Kokkos::atomic_mod(&x, y) x %= y
Kokkos::atomic_nand(&x, y) x = !(x && y)
Kokkos::atomic_or(&x, y) x |= y
Kokkos::atomic_rshift(&x, y) x = x >> y
Kokkos::atomic_sub(&x, y) x -= y
Kokkos::atomic_store(&x, y) x = y
Kokkos::atomic_xor(&x, y) x ^= y
Example
Kokkos::parallel_for(
    numberOfElements,
    KOKKOS_LAMBDA (const int i) {
        const int value = /* ... */;
        const int bucketIndex = computeBucketIndex (value);
        Kokkos::atomic_inc(&histogram(bucketIndex));
    }
);

Atomic exchanges

Operation Description
Kokkos::atomic_exchange(&x, desired) Assign desired value to object and return old value
Kokkos::atomic_compare_exchange(&x, expected, desired) Assign desired value to object if the object has the expected value and return the old value
Example
// Assign desired value to object and return old value
int desired = 20;
int obj = 10;
int old = Kokkos::atomic_exchange(&obj, desired);

// Assign desired value to object if the object has the expected value and return the old value
int desired = 20;
int obj = 10;
int expected = 10;
int old = atomic_compare_exchange(&obj, expected, desired);

Mathematics

Math functions

Function type List of functions (prefixed by Kokkos::)
Basic operations abs, fabs, fmod, remainder, fma, fmax, fmin, fdim, nan
Exponential exp, exp2, expm1, log, log2, log10, log1p
Power pow, sqrt, cbrt, hypot
Trigonometric sin, cos, tan, asin, acos, atan, atan2
Hyperbolic sinh, cosh, tanh, asinh, acosh, atanh
Error and gamma erf, erfc, tgamma, lgamma
Nearest ceil, floor, trunc, round, nearbyint
Floating point logb, nextafter, copysign
Comparisons isfinite, isinf, isnan, signbit

Note that not all C++ standard math functions are available.

Doc https://kokkos.org/kokkos-core-wiki/API/core/numerics/mathematical-functions.html?highlight=math

Complex numbers

Create

Kokkos::complex<double> complex(realPart, imagPart);

Manage

Method Description
real() Returns or sets the real part
imag() Returns or sets the imaginary part

Doc https://kokkos.org/kokkos-core-wiki/API/core/utilities/complex.html

Utilities

Doc https://kokkos.org/kokkos-core-wiki/API/core/Utilities.html

Code interruption

Kokkos::abort("message");

Print inside a kernel

Kokkos::printf("format string", arg1, arg2);

Similar to std::printf.

Timer

Create

Kokkos::Timer timer;

Manage

Method Description
seconds() Returns the time in seconds since construction or last reset
reset() Resets the timer to zero

Manage parallel environment

Function Description
Kokkos::device_id() Returns the device ID of the current device
Kokkos::num_devices() Returns the number of devices available to the current execution space

Macros

Doc https://kokkos.org/kokkos-core-wiki/API/core/Macros.html

Essential macros

Macro Description
KOKKOS_LAMBDA Replaces capture argument for lambdas
KOKKOS_CLASS_LAMBDA Replaces capture argument for lambdas, captures this
KOKKOS_FUNCTION Functor attribute
KOKKOS_INLINE_FUNCTION Inlined functor attribute

Extra macros

Macro Description
KOKKOS_VERSION Kokkos full version
KOKKOS_VERSION_MAJOR Kokkos major version
KOKKOS_VERSION_MINOR Kokkos minor version
KOKKOS_VERSION_PATCH Kokkos patch level
KOKKOS_ENABLE_* Any equivalent CMake option passed when building Kokkos, see installation cheat sheet
KOKKOS_ARCH_* Any equivalent CMake option passed when building Kokkos, see installation cheat sheet