Commit a05a22b

fix up resource freeing, add to demo tutorial add DEVELOPERS.md
1 parent 61b92da commit a05a22b

4 files changed: +149 -51 lines


DEVELOPERS.md

+86 (new file)

# Developers

This note is for developers who want to contribute to the gpu.cpp library.

## Design Objectives

1. Maximal Leverage. Maximize the space of implementations that this
library is useful for with the least amount of implementation complexity.

2. Minimize integration complexity. Whereas the common integration pattern for
custom low-level GPU algorithm code is to integrate it into an existing engine
(e.g. an inference runtime or a compiler), the objective of gpu.cpp is to
enable adding GPU computation code inside your own project with a minimal
amount of integration complexity.

3. High ceiling on low-level control.
- Direct control of on-device GPU code, unconstrained by a fixed set of ops
- Direct control of on-device GPU memory management

## Separating Resource Acquisition and Dispatch

We can think of gpu.cpp in terms of GPU resources, modeled by the type
definitions of the library, and actions on those resources, modeled by the
functions of the library.

The key functions can be further subdivided into two categories in relation to
when the GPU computation occurs:

1) Ahead-of-time preparation of resources and state: these are functions that
acquire resources and prepare state for GPU computation. They are less
performance critical.

2) Performance-critical dispatch of GPU computation: these are functions that
dispatch GPU computation to the GPU, usually in a tight hot-path loop.

This pattern is different from non-performance-critical application code, where
resource acquisition is often interleaved with computation throughout program
execution.

gpu.cpp is intended for this pattern of performance-critical GPU computation.
Some example use cases that fit it are custom neural network inference engines,
render loops, simulation loops, etc.

We'll see how the functions and types of the library are organized around these
two types of actions.

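In code, the intended shape of a gpu.cpp program is roughly as in the sketch
below. This is illustrative only: the concrete `Create*` and `Dispatch*`
functions are covered in the next two sections, argument lists are elided or
assumed, and names like `shader`, `input`, `output`, and `running` are
placeholders.

```cpp
// Ahead of time: acquire resources and prepare state (not performance critical).
GPUContext ctx = CreateContext();
GPUTensor input  = CreateTensor(ctx, /* shape, dtype, host data ... */);
GPUTensor output = CreateTensor(ctx, /* shape, dtype, host data ... */);
Kernel op = CreateKernel(ctx, shader, input, output);

// Performance critical: the hot loop only dispatches prepared kernels;
// it does not acquire resources.
while (running) {
  DispatchKernel(ctx, op);
  // ... synchronize and consume results ...
}
```
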
## Resource Type Definitions and Acquisition

The main resources are:

- `GPUContext` - the state of resources for interacting with the GPU.
- `GPUTensor` - a buffer of data on the GPU.
- `ShaderCode` - the code for a shader program that can be dispatched to the
GPU. This is a thin wrapper around a WGSL string but also includes the
workgroup size the code is designed to run with.
- `Kernel` - a GPU program that can be dispatched to the GPU. This accepts a
`ShaderCode` and a list of `GPUTensor` resources to bind for the dispatch
computation.
- `MultiKernel` - a collection of kernels that can be dispatched to the GPU.

Resources are acquired using the `Create` functions. These are assumed to be
ahead-of-time and not performance critical.

- `GPUContext CreateContext(...)` - creates a GPU context.
- `GPUTensor CreateTensor(...)` - creates and allocates a buffer for a tensor
on the GPU.
- `Kernel CreateKernel(...)` - creates and prepares a kernel on the GPU,
including underlying GPU buffer data bindings and compute pipeline for the
shader code.
- `MultiKernel CreateMultiKernel(...)` - same as `CreateKernel`, but for
multiple kernels to be dispatched together.

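As a concrete sketch, mirroring the GELU example in the `run.cpp` tutorial
(where `kGELU` is a WGSL source string and `inputArr`/`outputArr` are host
arrays of length `N`), and using `CreateContext()` as declared in `gpu.h`:

```cpp
GPUContext ctx = CreateContext();
// Allocate and bind GPU buffers for the input and output data.
GPUTensor input  = CreateTensor(ctx, {N}, kf32, inputArr.data());
GPUTensor output = CreateTensor(ctx, {N}, kf32, outputArr.data());
// ShaderCode pairs the WGSL string with the workgroup size it is written for.
Kernel op = CreateKernel(ctx, ShaderCode{kGELU, 256}, input, output);
```
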
In addition to these main resource types, there are a few supporting types.
`Shape` is a simple type to specify the shape of a tensor. `KernelDesc` and
`MultiKernelDesc` are effectively descriptors used to construct `Kernel` and
`MultiKernel` instances. `TensorPool` manages `GPUTensor` resources and is used
as context for allocating and deallocating tensor data on the GPU. In practice
`TensorPool` is managed as a member variable of `GPUContext`.

## Dispatching GPU Computation

GPU computation is launched using the `Dispatch` functions. These are assumed
to be performance critical.

- `void DispatchKernel(...)` - dispatches a single kernel to the GPU.
- `void DispatchMultiKernel(...)` - dispatches multiple kernels to the GPU.
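
Continuing the sketch above, dispatching the prepared kernel might look like
the following. The exact `DispatchKernel()` argument list and the
synchronization step are assumptions in this sketch; `Kernel` does carry a
`std::promise`/`std::future` pair (see `gpu.h`), and waiting on that future is
one way to block until the result is ready.

```cpp
// Hot path: dispatch the prepared kernel (in real use this typically sits
// inside a render or simulation loop).
DispatchKernel(ctx, op);
// Wait on the kernel's future for completion before reading results
// (one option; exact synchronization details are an assumption here).
op.future.wait();
```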

examples/physics/TODO

Whitespace-only changes.

gpu.h

+26 -39

```diff
@@ -131,7 +131,7 @@ struct Kernel {
   size_t numInputs;
   WGPUCommandBuffer commandBuffer;
   WGPUBuffer readbackBuffer;
-  CallbackDataDyn callbackData;
+  CallbackDataDyn callbackData;
   std::promise<void> promise;
   std::future<void> future;
 };
@@ -174,26 +174,27 @@ bool operator<(const Kernel &lhs, const Kernel &rhs) {
   return lhs.commandBuffer < rhs.commandBuffer;
 }

-void FreeKernel(Kernel* op) {
+void FreeKernel(Kernel *op) {
   log(kDefLog, kInfo, "Freeing kernel");
+  // TODO(avh): nullptr is insufficient check for freeable resources
   if (op->commandBuffer != nullptr) {
-    // wgpuCommandBufferRelease(op->commandBuffer);
+    wgpuCommandBufferRelease(op->commandBuffer);
   }
   if (op->readbackBuffer != nullptr) {
-    // wgpuBufferRelease(op->readbackBuffer);
+    wgpuBufferRelease(op->readbackBuffer);
   }
   if (op->callbackData.buffer != nullptr) {
-    // wgpuBufferRelease(op->callbackData.buffer);
+    wgpuBufferRelease(op->callbackData.buffer);
   }
 }

-void FreeMultiKernel(MultiKernel &pipeline) {
+void FreeMultiKernel(MultiKernel *pipeline) {
   log(kDefLog, kInfo, "Freeing multi kernel");
-  if (pipeline.commandBuffer) {
-    wgpuCommandBufferRelease(pipeline.commandBuffer);
+  if (pipeline->commandBuffer) {
+    // wgpuCommandBufferRelease(pipeline->commandBuffer);
   }
-  if (pipeline.readbackBuffer) {
-    wgpuBufferRelease(pipeline.readbackBuffer);
+  if (pipeline->readbackBuffer) {
+    // wgpuBufferRelease(pipeline->readbackBuffer);
   }
 }

@@ -202,7 +203,16 @@ struct KernelPool {
   GPUContext *ctx;
   std::set<Kernel *> data;
   std::set<MultiKernel *> multiData;
-  ~KernelPool();
+  ~KernelPool() {
+    for (auto kernelPtr : data) {
+      FreeKernel(kernelPtr);
+    }
+    data.clear();
+    for (MultiKernel *multiKernelPtr : multiData) {
+      FreeMultiKernel(multiKernelPtr);
+    }
+    multiData.clear();
+  }
 };

 struct GPUContext {
@@ -212,11 +222,8 @@ struct GPUContext {
   WGPUQueue queue;
   TensorPool pool = TensorPool(this);
   KernelPool kernelPool = KernelPool(this);
-  /*
   ~GPUContext() {
     log(kDefLog, kInfo, "Destroying context");
-    pool.~TensorPool();
-    kernelPool.~KernelPool();
     if (queue) {
       wgpuQueueRelease(queue);
       wgpuInstanceProcessEvents(instance);
@@ -240,29 +247,10 @@ struct GPUContext {
     } else {
       log(kDefLog, kWarn, "Instance is null");
     }
+    log(kDefLog, kInfo, "Destroyed context");
   }
-  */
 };

-KernelPool::~KernelPool() {
-  for (auto kernelPtr : data) {
-    FreeKernel(kernelPtr);
-    // data.erase(kernelPtr);
-  }
-  /*
-  for (MultiKernel *multiKernelPtr : multiData) {
-    while (multiKernelPtr->future.wait_for(std::chrono::seconds(0)) !=
-           std::future_status::ready) {
-      log(kDefLog, kWarn,
-          "MultiKernel future not ready, waiting before freeing");
-      wgpuInstanceProcessEvents(ctx->instance);
-    }
-    FreeMultiKernel(*multiKernelPtr);
-    multiData.erase(multiKernelPtr);
-  }
-  */
-}
-
 /* Tensor factory function */
 GPUTensor CreateTensor(TensorPool &pool, WGPUDevice &device, const Shape &shape,
                        NumType dtype,
@@ -380,9 +368,9 @@ void showDeviceInfo(WGPUAdapter &adapter) {
 }

 GPUContext CreateContext(bool quietLogging = true,
-                         const WGPUInstanceDescriptor &desc = {},
-                         const WGPURequestAdapterOptions &adapterOpts = {},
-                         WGPUDeviceDescriptor devDescriptor = {}) {
+                         const WGPUInstanceDescriptor &desc = {},
+                         const WGPURequestAdapterOptions &adapterOpts = {},
+                         WGPUDeviceDescriptor devDescriptor = {}) {
   if (quietLogging) {
     kDefLog.level = kError;
   }
@@ -732,8 +720,7 @@ Kernel CreateKernel(GPUContext &ctx, const ShaderCode &shader,
   }

   log(kDefLog, kInfo, "Initializing callbackData");
-  op.callbackData =
-      {op.readbackBuffer, op.outputSize, nullptr, &op.promise};
+  op.callbackData = {op.readbackBuffer, op.outputSize, nullptr, &op.promise};

   ctx.kernelPool.data.insert(&op);

```
run.cpp

+37 -12

````diff
@@ -196,7 +196,11 @@ when the GPU computation occurs:
 )");

   section(R"(
+
+
 *Ahead-of-time GPU Resource Preparation*
+
+
 )");

   section(R"(
@@ -214,11 +218,12 @@ The main resources are:
 `ShaderCode` and a list of `GPUTensor` resources to bind for the dispatch
 computation.
 - `MultiKernel` - a collection of kernels that can be dispatched to the GPU.
+
 )");

   section(R"(
-Preparing GPU Resources II: Acquiring GPU Resources with `Create*` Functions
-----------------------------------------------------------------------------
+Preparing GPU Resources II: Acquiring GPU Resources with `Create*()` Functions
+------------------------------------------------------------------------------

 Resources are acquired using the `Create` functions. These are assumed to be
 ahead-of-time and not performance critical.
@@ -296,8 +301,8 @@ wait();


   section(R"(
-WGSL Compute Kernels Define GPU Computation Programs
-----------------------------------------------------
+WGSL Compute Kernels are Programs that run Computation on the GPU
+------------------------------------------------------------------

 Device code in WebGPU uses the WGSL shading language. In addition to mechanisms
 for invoking WGSL shaders as compute kernels as shown so far, you can write
@@ -330,18 +335,38 @@ The `@group(0)` and `@binding(0)` annotations are used to specify the binding
 points for the input and output buffers. The `@compute` annotation specifies
 that this is a compute kernel. The `@workgroup_size(256)` annotation specifies
 the workgroup size for the kernel.
-
-Workgroups are a concept in WebGPU that are similar to CUDA blocks. They are
-groups of threads that can share memory and synchronize with each other. The
-workgroup size is the number of threads in a workgroup.
-
 )");

   section(R"(
-Creating a kernel
-------------------
+`CreateKernel()` is used to create a Kernel
+-------------------------------------------
+
+Reviewing our GELU example and after using `CreateTensor()` to allocate and
+bind buffers for input and output data, we can use `CreateKernel()` to create a
+kernel.
+
+```
+GPUTensor input = CreateTensor(ctx, {N}, kf32, inputArr.data());
+GPUTensor output = CreateTensor(ctx, {N}, kf32, outputArr.data());
+Kernel op =
+    CreateKernel(ctx, ShaderCode{kGELU, 256}, input, output);
+```
+
+Note this *does not run* the kernel, it just prepares the kernel as a resource
+to be dispatched later.
+
+There are four arguments to `CreateKernel()`:
+- `GPUContext` - the context for the GPU
+- `ShaderCode` - the shader code for the kernel
+- `GPUTensor` - the input tensor. Even though the kernel is not executed,
+GPUTensor provides a handle to the buffers on the GPU to be loaded when the
+kernel is run. If there's more than one input, `GPUTensors` can be used which
+is an ordered collection of `GPUTensor`.
+- `GPUTensor` - the output tensor. As with the input tensor, the values are not
+important at this point, the underlying reference to the GPU buffer is bound to
+the kernel so that when the kernel is dispatched, it will know where to write
+the output data.

-TODO(avh)
 )");


````