Add joss-paper

Yuuichi Asahi · Yuuichi Asahi · commit c9d3ad684912 · 2025-07-11T01:17:30.000+09:00
diff --git a/.github/workflows/draft-pdf.yaml b/.github/workflows/draft-pdf.yaml
@@ -0,0 +1,24 @@
+name: Draft PDF
+on: [push]
+
+jobs:
+  paper:
+    runs-on: ubuntu-latest
+    name: Paper Draft
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+      - name: Build draft PDF
+        uses: openjournals/openjournals-draft-action@master
+        with:
+          journal: joss
+          # This should be the path to the paper within your repo.
+          paper-path: paper/paper.md
+      - name: Upload
+        uses: actions/upload-artifact@v4
+        with:
+          name: paper
+          # This is the output path where Pandoc will write the compiled
+          # PDF. Note, this should be the same directory as the input
+          # paper.md
+          path: paper/paper.pdf
diff --git a/paper/hw2D.png b/paper/hw2D.png
diff --git a/paper/paper.bib b/paper/paper.bib
@@ -0,0 +1,77 @@
+@ARTICLE{Trott2021,
+author={Trott, Christian and Berger-Vergiat, Luc and Poliakoff, David and Rajamanickam, Sivasankaran and Lebrun-Grandie, Damien and Madsen, Jonathan and Al Awar, Nader and Gligoric, Milos and Shipman, Galen and Womeldorff, Geoff},
+journal={ Computing in Science \& Engineering },
+title={{ The Kokkos EcoSystem: Comprehensive Performance Portability for High Performance Computing }},
+year={2021},
+volume={23},
+number={05},
+ISSN={1558-366X},
+pages={10-18},
+abstract={ State-of-the-art engineering and science codes have grown in complexity dramatically over the last two decades. Application teams have adopted more sophisticated development strategies, leveraging third party libraries, deploying comprehensive testing, and using advanced debugging and profiling tools. In today’s environment of diverse hardware platforms, these applications also desire performance portability—avoiding the need to duplicate work for various platforms. The Kokkos EcoSystem provides that portable software stack. Based on the Kokkos Core Programming Model, the EcoSystem provides math libraries, interoperability capabilities with Python and Fortran, and Tools for analyzing, debugging, and optimizing applications. In this article, we overview the components, discuss some specific use cases, and highlight how codesigning these components enables a more developer friendly experience. },
+keywords={High performance computing;Performance evaluation;Programming;Computer architecture;Debugging;Ecosystems},
+doi={10.1109/MCSE.2021.3098509},
+url = {https://doi.ieeecomputersociety.org/10.1109/MCSE.2021.3098509},
+publisher={IEEE Computer Society},
+address={Los Alamitos, CA, USA},
+month=sep}
+
+@ARTICLE{Rockmore2000,
+  author={Rockmore, D.N.},
+  journal={Computing in Science \& Engineering},
+  title={The FFT: an algorithm the whole family can use}, 
+  year={2000},
+  volume={2},
+  number={1},
+  pages={60-64},
+  keywords={Discrete Fourier transforms;Equations;Signal processing algorithms;Fast Fourier transforms;Mathematics;Orbital calculations;Convolution;Information processing;Internet;Modems},
+  doi={10.1109/5992.814659}}
+
+@ARTICLE{Trott2022,
+  author={Trott, Christian R. and Lebrun-Grandi{\'e}, Damien and Arndt, Daniel and Ciesko, Jan and Dang, Vinh and Ellingwood, Nathan and Gayatri, Rahulkumar and Harvey, Evan and Hollman, Daisy S. and Ibanez, Dan and Liber, Nevin and Madsen, Jonathan and Miles, Jeff and Poliakoff, David and Powell, Amy and Rajamanickam, Sivasankaran and Simberg, Mikael and Sunderland, Dan and Turcksin, Bruno and Wilke, Jeremiah},
+  journal={IEEE Transactions on Parallel and Distributed Systems}, 
+  title={Kokkos 3: Programming Model Extensions for the Exascale Era}, 
+  year={2022},
+  volume={33},
+  number={4},
+  pages={805-817},
+  keywords={Programming;Hardware;Kernel;Graphics processing units;Layout;Laboratories;Benchmark testing;Performance portability;programming models;high-performance computing;heterogeneous computing;exascale},
+  doi={10.1109/TPDS.2021.3097283}}
+
+@Article{Harris2020,
+ title         = {Array programming with {NumPy}},
+ author        = {Charles R. Harris and K. Jarrod Millman and St{\'{e}}fan J.
+                 van der Walt and Ralf Gommers and Pauli Virtanen and David
+                 Cournapeau and Eric Wieser and Julian Taylor and Sebastian
+                 Berg and Nathaniel J. Smith and Robert Kern and Matti Picus
+                 and Stephan Hoyer and Marten H. van Kerkwijk and Matthew
+                 Brett and Allan Haldane and Jaime Fern{\'{a}}ndez del
+                 R{\'{i}}o and Mark Wiebe and Pearu Peterson and Pierre
+                 G{\'{e}}rard-Marchant and Kevin Sheppard and Tyler Reddy and
+                 Warren Weckesser and Hameer Abbasi and Christoph Gohlke and
+                 Travis E. Oliphant},
+ year          = {2020},
+ month         = sep,
+ journal       = {Nature},
+ volume        = {585},
+ number        = {7825},
+ pages         = {357--362},
+ doi           = {10.1038/s41586-020-2649-2},
+ publisher     = {Springer Science and Business Media {LLC}},
+ url           = {https://doi.org/10.1038/s41586-020-2649-2}
+}
+
+@article{Wakatani1984,
+    author = {Wakatani, Masahiro and Hasegawa, Akira},
+    title = {A collisional drift wave description of plasma edge turbulence},
+    journal = {The Physics of Fluids},
+    volume = {27},
+    number = {3},
+    pages = {611-618},
+    year = {1984},
+    month = {03},
+    abstract = {Model mode‐coupling equations for the resistive drift wave instability are numerically solved for realistic parameters found in tokamak edge plasmas. The Bohm diffusion is found to result if the parallel wavenumber is chosen to maximize the growth rate for a given value of the perpendicular wavenumber. The saturated turbulence energy has a broad frequency spectrum with a large fluctuation level proportional to κ̄ (=ρs/Ln, the normalized inverse scale length of the density gradient) and a wavenumber spectrum of the two‐dimensional Kolmogorov–Kraichnan type, ∼k−3.},
+    issn = {0031-9171},
+    doi = {10.1063/1.864660},
+    url = {https://doi.org/10.1063/1.864660},
+    eprint = {https://pubs.aip.org/aip/pfl/article-pdf/27/3/611/12476138/611\_1\_online.pdf},
+}
diff --git a/paper/paper.md b/paper/paper.md
@@ -0,0 +1,139 @@
+---
+title: 'kokkos-fft: A shared-memory FFT for the Kokkos ecosystem'
+tags:
+  - C++
+  - FFT
+  - High performance computing
+  - Performance portability
+authors:
+  - name: Yuuichi Asahi
+    orcid: 0000-0002-9997-1274
+    equal-contrib: true
+    affiliation: "1" # (Multiple affiliations must be quoted)
+  - name: Thomas Padioleau
+    orcid: 0000-0001-5496-0013
+    equal-contrib: true
+    affiliation: "1" # (Multiple affiliations must be quoted)
+  - name: Paul Zehner
+    orcid: 0000-0002-4811-0079
+    equal-contrib: true
+    affiliation: "1" # (Multiple affiliations must be quoted)
+  - name: Julien Bigot
+    orcid: 0000-0002-0015-4304
+    equal-contrib: true
+    affiliation: "1" # (Multiple affiliations must be quoted)
+  - name: Damien Lebrun-Grandie
+    orcid: 0000-0003-1952-7219
+    equal-contrib: true
+    affiliation: "2" # (Multiple affiliations must be quoted)
+affiliations:
+ - name: Université Paris-Saclay, UVSQ, CNRS, CEA, Maison de la Simulation, 91191, Gif-sur-Yvette, France
+   index: 1
+ - name: Oak Ridge National Laboratory, Oak Ridge, Tennessee, US
+   index: 2
+
+date: 6 June 2025
+bibliography: paper.bib
+
+# Optional fields if submitting to a AAS journal too, see this blog post:
+# https://blog.joss.theoj.org/2018/12/a-new-collaboration-with-aas-publishing
+aas-doi: 10.3847/xxxxx <- update this with the DOI from AAS once you know it.
+aas-journal: Astrophysical Journal <- The name of the AAS journal.
+---
+
+# Summary
+
+kokkos-fft provides a unified, performance-portable interface for Fast Fourier Transforms (FFTs) within the Kokkos ecosystem [@Trott2021]. It seamlessly integrates with leading local FFT libraries including FFTW, cuFFT, rocFFT, and oneMKL. Designed for simplicity and efficiency, kokkos-fft offers a user experience akin to numpy.fft for in-place and out-of-place transforms, while leveraging the raw speed of vendor-optimized libraries. A demonstration solving 2D Hasegawa-Wakatani turbulence with the Fourier spectral method illustrates how kokkos-fft can deliver significant speedups over Python-based alternatives without drastically increasing code complexity, empowering researchers to perform high-performance FFTs simply and effectively.
+
+# Statement of need
+
+The fast Fourier transform (FFT) is a family of fundamental algorithms that is widely used in scientific computing and other areas [@Rockmore2000]. [kokkos-fft](https://github.com/kokkos/kokkos-fft) is designed to help [Kokkos](https://github.com/kokkos/kokkos) [@Trott2022] users who are:
+
+* developing a Kokkos application which relies on FFT libraries. E.g., fluid simulation codes with periodic boundaries, plasma turbulence, etc.
+
+* inclined to integrate in-situ signal and image processing with FFTs. E.g., spectral analyses, low pass filtering, etc.
+
+* willing to use de facto standard FFT libraries just like [`numpy.fft`](https://numpy.org/doc/stable/reference/routines.fft.html) [@Harris2020].
+
+kokkos-fft can benefit such users through the following features:
+
+* A simple interface like [`numpy.fft`](https://numpy.org/doc/stable/reference/routines.fft.html) with in-place and out-of-place transforms:  
+kokkos-fft only accepts [Kokkos Views](https://kokkos.org/kokkos-core-wiki/API/core/view/view.html), which correspond to [numpy.array](https://numpy.org/doc/stable/reference/generated/numpy.array.html), to keep the APIs simple and safe.
+
+* 1D, 2D, 3D standard and real FFT functions (similar to [`numpy.fft`](https://numpy.org/doc/stable/reference/routines.fft.html)) over 1D to 8D Kokkos Views:  
+Batched plans are used automatically if the View dimension is larger than the FFT dimension.
+
+* A reusable [FFT plan](https://kokkosfft.readthedocs.io/en/latest/api/plan/plan.html) which wraps the vendor libraries for each Kokkos backend:  
+[FFTW](http://www.fftw.org), [cuFFT](https://developer.nvidia.com/cufft), [rocFFT](https://github.com/ROCm/rocFFT), and [oneMKL](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html) are automatically enabled based on the enabled Kokkos backend.
+
+* Support for multiple CPU and GPU backends:  
+FFT libraries for the enabled Kokkos backend are executed on the stream/queue of the [`ExecutionSpace`](https://kokkos.org/kokkos-core-wiki/API/core/execution_spaces.html) instance where the parallel operations are performed.
+
+* Compile-time and/or runtime errors for invalid usage (e.g., mismatched `View` extents).
+
+# How to use kokkos-fft
+
+Users who are familiar with [`numpy.fft`](https://numpy.org/doc/stable/reference/routines.fft.html) should find kokkos-fft easy to use. In fact, all of the numpy.fft functions (`numpy.fft.<function_name>`) have an analogous counterpart in kokkos-fft (`KokkosFFT::<function_name>`), which can run on the Kokkos device. In addition, kokkos-fft supports [in-place transform](https://kokkosfft.readthedocs.io/en/latest/intro/using.html#inplace-transform) and [plan reuse](https://kokkosfft.readthedocs.io/en/latest/intro/using.html#reuse-fft-plan) capabilities.
+
+Let's start with a simple example that performs a 1D real-to-complex transform using `rfft` in kokkos-fft.
+
+```C++
+#include <Kokkos_Core.hpp>
+#include <Kokkos_Random.hpp>
+#include <KokkosFFT.hpp>
+int main(int argc, char* argv[]) {
+  Kokkos::ScopeGuard guard(argc, argv);
+  const int n = 4;
+  Kokkos::View<double*> x("x", n);
+  Kokkos::View<Kokkos::complex<double>*> x_hat("x_hat", n/2+1);
+  // initialize the input array with random values
+  Kokkos::DefaultExecutionSpace exec;
+  Kokkos::Random_XorShift64_Pool<> random_pool(/*seed=*/12345);
+  Kokkos::fill_random(exec, x, random_pool, /*range=*/1.0);
+  KokkosFFT::rfft(exec, x, x_hat);
+  // block the current thread until all work enqueued into exec is finished
+  exec.fence();
+}
+```
+
+This is equivalent to the following Python code.
+
+```python
+import numpy as np
+x = np.random.rand(4)
+x_hat = np.fft.rfft(x)
+```
+
+There are two additional arguments in the Kokkos version:
+
+* `exec`: [*Kokkos execution space instance*](https://kokkos.org/kokkos-core-wiki/API/core/execution_spaces.html) that encapsulates the underlying compute resources (e.g., CPU cores, GPU devices) where the task will be dispatched for execution.
+
+* `x_hat`: a [*Kokkos View*](https://kokkos.org/kokkos-core-wiki/API/core/view/view.html) where the complex-valued FFT output will be stored. By accepting this view as an argument, the function allows the user to pre-allocate memory and optimize data placement, avoiding unnecessary allocations and copies.
+
+Also, kokkos-fft only accepts [Kokkos Views](https://kokkos.org/kokkos-core-wiki/API/core/view/view.html) as input data. Whether a View is accessible from the `ExecutionSpace` is checked statically, and an inaccessible View results in a compilation error. See the [documentation](https://kokkosfft.readthedocs.io/en/latest/intro/quick_start.html) for basic usage.
+
+# Benchmark: 2D Hasegawa-Wakatani turbulence with the Fourier spectral method
+
+As a more scientific example, we solve a typical 2D plasma turbulence model, the Hasegawa-Wakatani equation [@Wakatani1984], using the Fourier spectral method (see \autoref{fig:hw2D} for the vorticity structure).
+
+![Vorticity.\label{fig:hw2D}](hw2D.png)
+
+Using Kokkos and kokkos-fft, we can implement the code easily (see [example](https://github.com/kokkos/kokkos-fft/tree/main/examples/10_HasegawaWakatani/README.md)), much as in Python, while getting a significant acceleration. The core computational kernel of the code is the nonlinear term, which is computed with FFTs. We construct the forward and backward FFT plans once during initialization and reuse them in the time evolution loop.
+
+We benchmarked this application on multiple backends, running a simulation for 100 steps at a resolution of `1024 x 1024` with I/O disabled. The following table shows the achieved performance.
+
+| Device | Icelake (python) | Icelake | A100 | H100 | MI250X | PVC |
+| --- | --- | --- | --- | --- | --- | --- |
+| Kokkos Backend | - | OpenMP | CUDA | CUDA | HIP | SYCL |
+| LOC | 568 | 738 | 738 | 738 | 738 | 738 |
+| Compiler/version | Python 3.12.3 | IntelLLVM 2023.0.0 | nvcc 12.2 | nvcc 12.3 | rocm 5.7 | IntelLLVM 2024.0.2 |
+| Theoretical peak memory bandwidth [GB/s] | 205 | 205 | 1555 | 3350 | 1600 | 3276.8 |
+| Elapsed time [s] | 463 | 9.28 | 0.25 | 0.14 | 0.41 | 0.30 |
+
+Here, the testbed includes Intel Xeon Platinum 8360Y (referred to as Icelake), NVIDIA A100 and H100 GPUs, AMD MI250X GPU (1 GCD) and Intel Data Center GPU Max 1550 (referred to as PVC). On Icelake, we use 36 cores with OpenMP parallelization. As expected, the Python version is the simplest in terms of lines of code (LOC). With Kokkos and kokkos-fft, the same logic can be implemented without significantly increasing the source code size (roughly 1.3 times longer, 738 versus 568 lines). The benefit, however, is enormous: a single, simple code runs efficiently on multiple architectures.
+
+# Acknowledgements
+
+This work has received support from the CExA Moonshot project of the CEA [cexa-project](https://cexa-project.org). This work was carried out using FUJITSU PRIMERGY GX2570 (Wisteria/BDEC-01) at The University of Tokyo. This work was partly supported by JHPCN project jh220036. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. This work was also granted access to the HPC resources of CINES under the allocation 2023-cin4492 made by GENCI.
+
+# References
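
The speedups and the code-size ratio implied by the benchmark table in paper/paper.md can be checked with a few lines of Python. This is only an illustrative sketch: the numbers are copied from the table (elapsed time for 100 steps at `1024 x 1024`, and LOC), while the names `elapsed`, `loc`, and `speedups` are hypothetical helpers and not part of kokkos-fft or the example code.

```python
# Elapsed times in seconds, copied from the benchmark table.
elapsed = {
    "Icelake (python)": 463.0,
    "Icelake": 9.28,
    "A100": 0.25,
    "H100": 0.14,
    "MI250X": 0.41,
    "PVC": 0.30,
}

# Lines of code, copied from the same table.
loc = {"python": 568, "kokkos": 738}


def speedups(times, baseline):
    """Return baseline_time / time for every entry except the baseline."""
    ref = times[baseline]
    return {name: ref / t for name, t in times.items() if name != baseline}


# Speedup of each Kokkos run relative to the Python run on Icelake.
ratios = speedups(elapsed, "Icelake (python)")

# How much longer the Kokkos source is than the Python source.
loc_ratio = loc["kokkos"] / loc["python"]

if __name__ == "__main__":
    for name, ratio in ratios.items():
        print(f"{name:>8}: {ratio:7.0f}x faster than the Python version")
    print(f"LOC ratio (Kokkos / Python): {loc_ratio:.2f}")
```

The ratios compare wall-clock times only. The Python and Kokkos runs on Icelake share the same CPU, so that ratio isolates the implementation, whereas the GPU ratios also reflect the higher memory bandwidth listed in the table.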