
0.2.0 Release #74

Merged: 53 commits
Feb 10, 2025
ba32717
feat: 1. Modified the 'lib' rule in the Makefile file located in the …
buttfa Sep 27, 2024
0217886
Merge pull request #64 from buttfa/feature_library
austinvhuang Oct 5, 2024
7addf83
Add kShaderMatmul2DTiling in kernels.h
junjihashimoto Oct 13, 2024
da1f32d
Reduce matmul-kernel creation time
junjihashimoto Oct 13, 2024
dd2a25f
Change Kernel to shared_ptr<RawKernel> to support cached kernels
junjihashimoto Oct 13, 2024
9fff8cd
Add caches for ops.cpp
junjihashimoto Oct 16, 2024
b9fb38b
Add caches for unittests.cpp
junjihashimoto Oct 16, 2024
efb87ee
Fix bugs
junjihashimoto Oct 16, 2024
c474fba
Remove global variables of kernels
junjihashimoto Oct 16, 2024
438be22
Fix the matmul of version 1 in unittests
junjihashimoto Oct 16, 2024
dd145ac
Add the duration-time of matmul_forward_dummy to compare GPU's one wi…
junjihashimoto Oct 16, 2024
590f257
Add wgpuBufferRelease for CopyData
junjihashimoto Oct 18, 2024
6c52b98
Add wgpuCommandBufferRelease after calling wgpuQueueSubmit
junjihashimoto Oct 18, 2024
a6140e0
Add wgpuCommandEncoderRelease after calling wgpuCommandEncoderFinish
junjihashimoto Oct 18, 2024
30ed026
Add wgpuComputePassEncoderRelease after calling wgpuComputePassEncode…
junjihashimoto Oct 18, 2024
3c06137
Merge branch 'main' into dev
austinvhuang Oct 19, 2024
0a9437f
chore: Set a check-os target and improved the description of how to i…
buttfa Oct 19, 2024
f4e1683
Merge pull request #68 from junjihashimoto/feature/cache
austinvhuang Oct 20, 2024
0e89e65
Merge pull request #69 from buttfa/feature_library
austinvhuang Oct 20, 2024
d4eb571
Add the ops of AoT
junjihashimoto Oct 21, 2024
43e4ac0
Update
junjihashimoto Oct 21, 2024
1d8e435
Update
junjihashimoto Oct 22, 2024
4985930
Update
junjihashimoto Oct 22, 2024
f3e0dbc
Add summantion kernels
junjihashimoto Oct 30, 2024
f956f2b
Add SumKernel
junjihashimoto Oct 30, 2024
c13833f
Add SumKernel2d
junjihashimoto Nov 3, 2024
189375f
Merge branch 'dev' into feature/reduce
junjihashimoto Nov 3, 2024
e94aa02
Merge pull request #71 from junjihashimoto/feature/reduce
austinvhuang Nov 3, 2024
c9b7018
fix printf format codes
austinvhuang Nov 3, 2024
7ef40b0
Add a flag to disable bardward-pass
junjihashimoto Nov 4, 2024
6be7e1e
Fix the bug of memory allocation
junjihashimoto Nov 16, 2024
f629a33
Remove NUM_PARAMETER_LAYERS
junjihashimoto Nov 16, 2024
28c7062
Merge pull request #70 from junjihashimoto/feature/aot
austinvhuang Nov 18, 2024
6e3a240
Add python bindings
junjihashimoto Dec 25, 2024
3228b1b
Add haskell bindings
junjihashimoto Dec 28, 2024
a7520ce
Merge pull request #73 from junjihashimoto/feature/python
austinvhuang Dec 29, 2024
4669791
migrate to updated dawn commit 556f960f44690b3b808c779c08b44d48d42929…
austinvhuang Jan 28, 2025
254e4ea
remove legacy dir from experimental
austinvhuang Jan 28, 2025
f3f3b27
gpt2_webgpu_aot runs on mac after updating experimental/kernels/Makef…
austinvhuang Jan 28, 2025
b397959
add detailed note regarding dawn modifications to fix linker errors o…
austinvhuang Jan 28, 2025
40fd25d
correct commit hash
austinvhuang Jan 28, 2025
46db79d
update setup.py auto-downloads to point to updated libwebgpu_dawn.dyl…
austinvhuang Jan 28, 2025
73f438a
clang-format cleanup
austinvhuang Jan 28, 2025
d8d618a
make Context lifetime more robust dont rely on RVO which seems to fai…
austinvhuang Jan 30, 2025
ec68b14
Change priority of internal logging from kInfo to kTrace. Make julia …
austinvhuang Jan 30, 2025
4589f1f
bump dawn version to c469d593ac and remove WebGPU-distribution from t…
austinvhuang Feb 1, 2025
c3ee69b
move web build example to experimental due to emscriptens webgpu impl…
austinvhuang Feb 2, 2025
3924552
update artifact link
austinvhuang Feb 2, 2025
a8a44d3
skip float16 targets in CI
austinvhuang Feb 2, 2025
7dc064c
test float16 in CI
austinvhuang Feb 2, 2025
041d2fd
Fix pybind
junjihashimoto Feb 7, 2025
89f9097
Fix haskell binding
junjihashimoto Feb 7, 2025
6447d85
Merge pull request #75 from junjihashimoto/fix/pybind
austinvhuang Feb 8, 2025
4 changes: 0 additions & 4 deletions .gitmodules
Original file line number Diff line number Diff line change
@@ -1,7 +1,3 @@
[submodule "third_party/local/WebGPU-distribution"]
path = third_party/local/WebGPU-distribution
url = https://github.com/eliemichel/WebGPU-distribution.git
branch = dawn
[submodule "third_party/llm.c"]
path = third_party/llm.c
url = https://github.com/karpathy/llm.c
58 changes: 48 additions & 10 deletions Makefile
@@ -19,13 +19,42 @@ pch:
mkdir -p build && $(CXX) -std=c++17 $(INCLUDES) -x c++-header gpu.hpp -o build/gpu.hpp.pch

# TODO(avh): change extension based on platform
lib:
mkdir -p build && $(CXX) -std=c++17 $(INCLUDES) -L$(LIBDIR) -ldawn -ldl -shared -fPIC gpu.cpp -o build/libgpucpp.dylib
# Get the current OS name
OS = $(shell uname | tr -d '\n')
# Set the specific variables for each platform
LIB_PATH ?= /usr/lib
HEADER_PATH ?= /usr/include
ifeq ($(OS), Linux)
OS_TYPE ?= Linux
GPU_CPP_LIB_NAME ?= libgpucpp.so
DAWN_LIB_NAME ?= libwebgpu_dawn.so
else ifeq ($(OS), Darwin)
OS_TYPE ?= macOS
GPU_CPP_LIB_NAME ?= libgpucpp.dylib
DAWN_LIB_NAME ?= libwebgpu_dawn.dylib
else
OS_TYPE ?= unknown
endif

lib: check-clang dawnlib
mkdir -p build && $(CXX) -std=c++17 $(INCLUDES) -L$(LIBDIR) -lwebgpu_dawn -ldl -shared -fPIC gpu.cpp -o build/$(GPU_CPP_LIB_NAME)
python3 build.py
cp third_party/lib/$(DAWN_LIB_NAME) build/

install:
cp build/$(GPU_CPP_LIB_NAME) $(LIB_PATH)
cp build/$(DAWN_LIB_NAME) $(LIB_PATH)
cp build/gpu.hpp $(HEADER_PATH)

uninstall:
rm $(LIB_PATH)/$(GPU_CPP_LIB_NAME)
rm $(LIB_PATH)/$(DAWN_LIB_NAME)
rm $(HEADER_PATH)/gpu.hpp

examples/hello_world/build/hello_world: check-clang dawnlib examples/hello_world/run.cpp check-linux-vulkan
$(LIBSPEC) && cd examples/hello_world && make build/hello_world && ./build/hello_world

dawnlib: $(if $(wildcard third_party/lib/libdawn.so third_party/lib/libdawn.dylib),,run_setup)
dawnlib: $(if $(wildcard third_party/lib/libwebgpu_dawn.so third_party/lib/libwebgpu_dawn.dylib),,run_setup)

run_setup: check-python
python3 setup.py
@@ -42,7 +71,7 @@ all: dawnlib check-clang check-linux-vulkan lib pch

# Test 16-bit floating point type
test-half: dawnlib check-clang
$(LIBSPEC) && clang++ -std=c++17 $(INCLUDES) numeric_types/half.cpp -L$(LIBDIR) -ldawn -ldl -o build/half && ./build/half
$(LIBSPEC) && clang++ -std=c++17 $(INCLUDES) numeric_types/half.cpp -L$(LIBDIR) -lwebgpu_dawn -ldl -o build/half && ./build/half

docs: Doxyfile
doxygen Doxyfile
@@ -73,7 +102,7 @@ all-cmake: check-clang check-cmake
################################################################################

clean-dawnlib:
rm -f third_party/lib/libdawn.so third_party/lib/libdawn.dylib
rm -f third_party/lib/libwebgpu_dawn.so third_party/lib/libwebgpu_dawn.dylib

clean:
read -r -p "This will delete the contents of build/*. Are you sure? [CTRL-C to abort] " response && rm -rf build/*
@@ -90,21 +119,30 @@ clean:
rm -f build/half

clean-all:
read -r -p "This will delete the contents of build/* and third_party/*. Are you sure? [CTRL-C to abort] " response && rm -rf build/* third_party/fetchcontent/* third_party/gpu-build third_party/gpu-subbuild third_party/gpu-src third_party/lib/libdawn.so third_party/lib/libdawn.dylib
read -r -p "This will delete the contents of build/* and third_party/*. Are you sure? [CTRL-C to abort] " response && rm -rf build/* third_party/fetchcontent/* third_party/gpu-build third_party/gpu-subbuild third_party/gpu-src third_party/lib/libwebgpu_dawn.so third_party/lib/libwebgpu_dawn.dylib

################################################################################
# Checks
################################################################################

# Check all
check-all: check-os check-clang check-cmake check-python

# check the os
check-os:
ifeq ($(OS_TYPE), unknown)
$(error Unsupported operating system)
endif

# check for the existence of clang++ and cmake
check-clang:
@command -v clang++ >/dev/null 2>&1 || { echo >&2 "Please install clang++ with 'sudo apt-get install clang' or 'brew install llvm'"; exit 1; }
@command -v clang++ >/dev/null 2>&1 || { echo -e >&2 "clang++ is not installed. Please install clang++ to continue.\nOn Debian / Ubuntu: 'sudo apt-get install clang'\nOn macOS: 'brew install llvm'\nOn CentOS: 'sudo yum install clang'"; exit 1; }

check-cmake:
@command -v cmake >/dev/null 2>&1 || { echo >&2 "Please install cmake with 'sudo apt-get install cmake' or 'brew install cmake'"; exit 1; }
@command -v cmake >/dev/null 2>&1 || { echo -e >&2 "CMake is not installed. Please install cmake to continue.\nOn Debian / Ubuntu: 'sudo apt-get install cmake'\nOn macOS: 'brew install cmake'\nOn CentOS: 'sudo yum install cmake'"; exit 1; }

check-python:
@command -v python3 >/dev/null 2>&1 || { echo >&2 "Python needs to be installed and in your path."; exit 1; }
@command -v python3 >/dev/null 2>&1 || { echo -e >&2 "Python is not installed. Please install python3 to continue.\nOn Debian / Ubuntu: 'sudo apt-get install python3'\nOn CentOS: 'sudo yum install python3'"; exit 1; }

check-linux-vulkan:
@echo "Checking system type and Vulkan availability..."
@@ -113,7 +151,7 @@ check-linux-vulkan:
echo "Vulkan is installed."; \
vulkaninfo; \
else \
echo "Vulkan is not installed. Please install Vulkan drivers to continue. On Debian / Ubuntu: sudo apt install libvulkan1 mesa-vulkan-drivers vulkan-tools"; \
echo -e "Vulkan is not installed. Please install Vulkan drivers to continue.\nOn Debian / Ubuntu: 'sudo apt install libvulkan1 mesa-vulkan-drivers vulkan-tools'.\nOn CentOS: 'sudo yum install vulkan vulkan-tools'."; \
exit 1; \
fi \
else \
16 changes: 4 additions & 12 deletions README.md
@@ -8,7 +8,7 @@ GPU code in C++ projects and have it run on Nvidia, Intel, AMD, and other GPUs.
The same C++ code can work on a wide variety of laptops, workstations, mobile
devices or virtually any hardware with Vulkan, Metal, or DirectX support.

## Technical Objectives: Lightweight, Fast Iteration, and Low Boilerplate
## Objectives: Lightweight, Fast Iteration, and Low Boilerplate

With gpu.cpp we want to enable a high-leverage library for individual developers and researchers to incorporate GPU computation into programs relying on nothing more than a standard C++ compiler as tooling. Our goals are:

@@ -189,7 +189,7 @@ illustrate how to use gpu.cpp as a library.

After you have run `make` in the top-level directory which retrieves the prebuilt Dawn shared library, you can run each example by navigating to its directory and running `make` from the example's directory.

An example of tiled matrix multiplication is in [examples/matmul](https://github.com/AnswerDotAI/gpu.cpp/blob/main/examples/matmul/). This implements a WebGPU version of the first few kernels of Simon Boehm's [How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog](https://siboehm.com/articles/22/CUDA-MMM) post. It currently runs at ~ 2.5+ TFLOPs on a Macbook Pro M1 Max laptop, which has a theoretical peak of 10.4 TFLOPs. Contributions to optimize this further are welcome.
An example of tiled matrix multiplication is in [examples/matmul](https://github.com/AnswerDotAI/gpu.cpp/blob/main/examples/matmul/). This implements a WebGPU version of the first few kernels of Simon Boehm's [How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog](https://siboehm.com/articles/22/CUDA-MMM) post. It currently runs at ~ 3.5+ TFLOPs on a MacBook Pro M1 Max laptop. Contributions to optimize this further are welcome.

A parallel physics simulation of an ensemble of double pendulums simulated in parallel with different initial conditions on the GPU is shown in [examples/physics](https://github.com/AnswerDotAI/gpu.cpp/tree/main/examples/physics).

@@ -198,9 +198,7 @@ A parallel physics simulation of an ensemble of double pendulums simulated in pa
<img src="docs/images/pendulum.gif" alt="physics example animated gif" width=42%>
</div>

We also show some examples of signed distance function computations, rendered in the terminal as ascii. A 3D SDF of spheres is shown in [examples/render](https://github.com/AnswerDotAI/gpu.cpp/tree/main/examples/render]) and a shadertoy-like live-reloading example is in [examples/shadertui](https://github.com/AnswerDotAI/gpu.cpp/tree/main/examples/shadertui).

Interestingly, given a starting example, LLMs such as Claude 3.5 Sonnet can be quite capable at writing low-level WGSL code for you - the other shaders in the shadertui example are written by the LLM.
We also show some examples of signed distance function computations, rendered in the terminal as ascii. A 3D SDF of spheres is shown in [examples/render](https://github.com/AnswerDotAI/gpu.cpp/tree/main/examples/render) and a shadertoy-like live-reloading example is in [examples/shadertui](https://github.com/AnswerDotAI/gpu.cpp/tree/main/examples/shadertui).

<div align="center">
<img src="docs/images/shadertui.gif" alt="shadertui example animated gif" width=88%>
@@ -232,22 +230,16 @@ gpu.cpp lets us implement and drop-in any algorithm with fine-grained control of

gpu.cpp is meant for developers with some familiarity with C++ and GPU programming. It is not a high-level numerical computing or machine learning framework or inference engine, though it can be used in support of such implementations.

Second, in spite of the name, WebGPU has native implementations decoupled from the web and the browser. gpu.cpp leverages WebGPU as a portable _native_ GPU API first and foremost, with the possibility of running in the browser being a convenient additional benefit in the future.

If you find it counterintuitive, as many do, that WebGPU is a native technology and not just for the web, watch Elie Michel's excellent talk ["WebGPU is Not Just About the Web"](https://www.youtube.com/watch?v=qHrx41aOTUQ).
Second, in spite of the name, WebGPU has native implementations decoupled from the web and the browser. If you find it counterintuitive, watch Elie Michel's excellent talk ["WebGPU is Not Just About the Web"](https://www.youtube.com/watch?v=qHrx41aOTUQ).

Finally, the focus of gpu.cpp is general-purpose GPU computation rather than rendering/graphics on the GPU, although it can be useful for offline rendering or video processing use cases. We may explore directions with graphics in the future, but for now our focus is GPU compute.

## Limitations and Upcoming Features

_API Improvements_ - gpu.cpp is a work-in-progress and there are many features and improvements to come. At this early stage, we expect the API design to evolve as we identify improvements / needs from use cases. In particular, the handling of structured parameters and asynchronous dispatch will undergo refinement and maturation in the short-term.

_Browser Targets_ - In spite of using WebGPU, we haven't tested builds targeting the browser yet, though this is a short-term priority.

_Reusable Kernel Library_ - Currently the core library is strictly the operations and types for interfacing with the WebGPU API, with some specific use case example WGSL implementations in `examples/`. Over time, as kernel implementations mature we may migrate some of the reusable operations from specific examples into a small reusable kernel library.

_More Use Case Examples and Tests_ - Expect an iteration loop of use cases to design tweaks and improvements, which in turn make the use cases cleaner and easier to write. One short-term use case is to flesh out the kernels from [llm.c](https://github.com/karpathy/llm.c) in WebGPU form. As these mature into a reusable kernel library, we hope to help realize the potential for WebGPU compute in AI.

## Troubleshooting

If you run into issues building the project, please open an issue.
5 changes: 5 additions & 0 deletions bindings/haskell/CHANGELOG.md
@@ -0,0 +1,5 @@
# Revision history for gpu-cpp

## 0.1.0.0 -- 2024-12-28

* First version.
3 changes: 3 additions & 0 deletions bindings/haskell/Makefile
@@ -0,0 +1,3 @@
all:
cabal configure --extra-include-dirs=$(PWD)/../.. --extra-include-dirs=$(PWD)/../../third_party/headers --extra-lib-dirs=$(PWD)/../../third_party/lib
cabal build .
37 changes: 37 additions & 0 deletions bindings/haskell/app/Main.hs
@@ -0,0 +1,37 @@
module Main where

import GpuCpp.Types
import GpuCpp
import qualified Data.Vector.Storable as V
import Foreign.C.Types

main :: IO ()
main = do
context <- createContext
input <- createTensor context [12] kf32
output <- createTensor context [12] kf32
kernelCode <- createKernelCode
(
"const GELU_SCALING_FACTOR: f32 = 0.7978845608028654; // sqrt(2.0 / PI)\n" <>
"@group(0) @binding(0) var<storage, read_write> inp: array<{{precision}}>;\n" <>
"@group(0) @binding(1) var<storage, read_write> out: array<{{precision}}>;\n" <>
"@compute @workgroup_size({{workgroupSize}})\n" <>
"fn main(\n" <>
" @builtin(global_invocation_id) GlobalInvocationID: vec3<u32>) {\n" <>
" let i: u32 = GlobalInvocationID.x;\n" <>
" if (i < arrayLength(&inp)) {\n" <>
" let x: f32 = inp[i];\n" <>
" out[i] = select(0.5 * x * (1.0 + tanh(GELU_SCALING_FACTOR \n" <>
" * (x + .044715 * x * x * x))), x, x > 10.0);\n" <>
" }\n" <>
"}\n"
)
256
kf32
kernel <- createKernel context kernelCode [input, output] [0,0] [12,1,1]
toGpu context (V.fromList [1 :: CFloat,2,3,4,1,2,3,4,1,2,3,4]) input
async <- dispatchKernel context kernel
wait context async
vec <- toCpu context output :: IO (V.Vector CFloat)
print vec
49 changes: 49 additions & 0 deletions bindings/haskell/gpu-cpp.cabal
@@ -0,0 +1,49 @@
cabal-version: 3.0
name: gpu-cpp
version: 0.1.0.0
license: BSD-3-Clause
author: Junji Hashimoto
maintainer: [email protected]
category: Math
build-type: Simple

extra-doc-files: CHANGELOG.md

common warnings
ghc-options: -Wall

library
import: warnings
exposed-modules: GpuCpp
, GpuCpp.Types
build-depends: base ^>=4.18.1.0
, inline-c
, inline-c-cpp
, containers
, template-haskell
, safe-exceptions
, vector
hs-source-dirs: src
default-language: Haskell2010
ghc-options: -optcxx-std=c++17
extra-libraries: webgpu_dawn

executable gpu-cpp
import: warnings
main-is: Main.hs
build-depends: base ^>=4.18.1.0
, gpu-cpp
, vector
hs-source-dirs: app
default-language: Haskell2010

test-suite gpu-cpp-test
import: warnings
default-language: Haskell2010
type: exitcode-stdio-1.0
hs-source-dirs: test
main-is: Main.hs
build-depends: base ^>=4.18.1.0
, gpu-cpp
, vector
, hspec