Commit a05a22b

fix up resource freeing, add to demo tutorial add DEVELOPERS.md
1 parent 61b92da commit a05a22b

4 files changed: +149 -51 lines


DEVELOPERS.md

+86 (new file)

# Developers

This note is for developers who want to contribute to the gpu.cpp library.

## Design Objectives

1. Maximal Leverage. Maximize the space of implementations that this
library is useful for with the least amount of implementation complexity.

2. Minimize integration complexity. Whereas the common integration pattern for
custom low-level GPU algorithm code is to integrate it into an existing engine
(e.g. an inference runtime or a compiler), the objective of gpu.cpp is to
enable adding GPU computation code inside your own project with a minimal
amount of integration complexity.

3. High ceiling on low-level control.
- Direct control of on-device GPU code, unconstrained by a fixed set of ops
- Direct control of on-device GPU memory management

## Separating Resource Acquisition and Dispatch

We can think of gpu.cpp in terms of GPU resources, modeled by the type
definitions of the library, and actions on those resources, modeled by the
functions of the library.

The key functions can be further subdivided into two categories in relation to
when the GPU computation occurs:

1) Ahead-of-time preparation of resources and state: these are functions that
acquire resources and prepare state for GPU computation. They are less
performance critical.

2) Performance-critical dispatch of GPU computation: these are functions that
dispatch GPU computation to the GPU, usually in a tight hot-path loop.

This pattern is different from non-performance-critical application code, where
resource acquisition is often interleaved with computation throughout program
execution.

gpu.cpp is intended for this pattern of performance-critical GPU computation.
Some example use cases that fit it are custom neural network inference engines,
render loops, simulation loops, etc.

We'll see how the functions and types of the library are organized around these
two types of actions.

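In code, the intended shape of a gpu.cpp program is roughly as in the sketch
below. This is illustrative only: the concrete `Create*` and `Dispatch*`
functions are covered in the next two sections, argument lists are elided or
assumed, and names like `shader`, `input`, `output`, and `running` are
placeholders.

```cpp
// Ahead of time: acquire resources and prepare state (not performance critical).
GPUContext ctx = CreateContext();
GPUTensor input  = CreateTensor(ctx, /* shape, dtype, host data ... */);
GPUTensor output = CreateTensor(ctx, /* shape, dtype, host data ... */);
Kernel op = CreateKernel(ctx, shader, input, output);

// Performance critical: the hot loop only dispatches prepared kernels;
// it does not acquire resources.
while (running) {
  DispatchKernel(ctx, op);
  // ... synchronize and consume results ...
}
```
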
## Resource Type Definitions and Acquisition

The main resources are:

- `GPUContext` - the state of resources for interacting with the GPU.
- `GPUTensor` - a buffer of data on the GPU.
- `ShaderCode` - the code for a shader program that can be dispatched to the
GPU. This is a thin wrapper around a WGSL string but also includes the
workgroup size the code is designed to run with.
- `Kernel` - a GPU program that can be dispatched to the GPU. This accepts a
`ShaderCode` and a list of `GPUTensor` resources to bind for the dispatch
computation.
- `MultiKernel` - a collection of kernels that can be dispatched to the GPU.

Resources are acquired using the `Create` functions. These are assumed to be
ahead-of-time and not performance critical.

- `GPUContext CreateContext(...)` - creates a GPU context.
- `GPUTensor CreateTensor(...)` - creates and allocates a buffer for a tensor
on the GPU.
- `Kernel CreateKernel(...)` - creates and prepares a kernel on the GPU,
including underlying GPU buffer data bindings and compute pipeline for the
shader code.
- `MultiKernel CreateMultiKernel(...)` - same as `CreateKernel`, but for
multiple kernels to be dispatched together.

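As a concrete sketch, mirroring the GELU example in the `run.cpp` tutorial
(where `kGELU` is a WGSL source string and `inputArr`/`outputArr` are host
arrays of length `N`), and using `CreateContext()` as declared in `gpu.h`:

```cpp
GPUContext ctx = CreateContext();
// Allocate and bind GPU buffers for the input and output data.
GPUTensor input  = CreateTensor(ctx, {N}, kf32, inputArr.data());
GPUTensor output = CreateTensor(ctx, {N}, kf32, outputArr.data());
// ShaderCode pairs the WGSL string with the workgroup size it is written for.
Kernel op = CreateKernel(ctx, ShaderCode{kGELU, 256}, input, output);
```
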
In addition to these main resource types, there are a few supporting types.
`Shape` is a simple type to specify the shape of a tensor. `KernelDesc` and
`MultiKernelDesc` are effectively descriptors used to construct `Kernel` and
`MultiKernel` instances. `TensorPool` manages `GPUTensor` resources and is used
as context for allocating and deallocating tensor data on the GPU. In practice
`TensorPool` is managed as a member variable of `GPUContext`.

## Dispatching GPU Computation

GPU computation is launched using the `Dispatch` functions. These are assumed
to be performance critical.

- `void DispatchKernel(...)` - dispatches a single kernel to the GPU.
- `void DispatchMultiKernel(...)` - dispatches multiple kernels to the GPU.
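
Continuing the sketch above, dispatching the prepared kernel might look like
the following. The exact `DispatchKernel()` argument list and the
synchronization step are assumptions in this sketch; `Kernel` does carry a
`std::promise`/`std::future` pair (see `gpu.h`), and waiting on that future is
one way to block until the result is ready.

```cpp
// Hot path: dispatch the prepared kernel (in real use this typically sits
// inside a render or simulation loop).
DispatchKernel(ctx, op);
// Wait on the kernel's future for completion before reading results
// (one option; exact synchronization details are an assumption here).
op.future.wait();
```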

examples/physics/TODO

Whitespace-only changes.

gpu.h

+26 -39

```diff
@@ -131,7 +131,7 @@ struct Kernel {
   size_t numInputs;
   WGPUCommandBuffer commandBuffer;
   WGPUBuffer readbackBuffer;
-  CallbackDataDyn callbackData;
+  CallbackDataDyn callbackData;
   std::promise<void> promise;
   std::future<void> future;
 };
@@ -174,26 +174,27 @@ bool operator<(const Kernel &lhs, const Kernel &rhs) {
   return lhs.commandBuffer < rhs.commandBuffer;
 }

-void FreeKernel(Kernel* op) {
+void FreeKernel(Kernel *op) {
   log(kDefLog, kInfo, "Freeing kernel");
+  // TODO(avh): nullptr is insufficient check for freeable resources
   if (op->commandBuffer != nullptr) {
-    // wgpuCommandBufferRelease(op->commandBuffer);
+    wgpuCommandBufferRelease(op->commandBuffer);
   }
   if (op->readbackBuffer != nullptr) {
-    // wgpuBufferRelease(op->readbackBuffer);
+    wgpuBufferRelease(op->readbackBuffer);
   }
   if (op->callbackData.buffer != nullptr) {
-    // wgpuBufferRelease(op->callbackData.buffer);
+    wgpuBufferRelease(op->callbackData.buffer);
   }
 }

-void FreeMultiKernel(MultiKernel &pipeline) {
+void FreeMultiKernel(MultiKernel *pipeline) {
   log(kDefLog, kInfo, "Freeing multi kernel");
-  if (pipeline.commandBuffer) {
-    wgpuCommandBufferRelease(pipeline.commandBuffer);
+  if (pipeline->commandBuffer) {
+    // wgpuCommandBufferRelease(pipeline->commandBuffer);
   }
-  if (pipeline.readbackBuffer) {
-    wgpuBufferRelease(pipeline.readbackBuffer);
+  if (pipeline->readbackBuffer) {
+    // wgpuBufferRelease(pipeline->readbackBuffer);
   }
 }

@@ -202,7 +203,16 @@ struct KernelPool {
   GPUContext *ctx;
   std::set<Kernel *> data;
   std::set<MultiKernel *> multiData;
-  ~KernelPool();
+  ~KernelPool() {
+    for (auto kernelPtr : data) {
+      FreeKernel(kernelPtr);
+    }
+    data.clear();
+    for (MultiKernel *multiKernelPtr : multiData) {
+      FreeMultiKernel(multiKernelPtr);
+    }
+    multiData.clear();
+  }
 };

 struct GPUContext {
@@ -212,11 +222,8 @@ struct GPUContext {
   WGPUQueue queue;
   TensorPool pool = TensorPool(this);
   KernelPool kernelPool = KernelPool(this);
-  /*
   ~GPUContext() {
     log(kDefLog, kInfo, "Destroying context");
-    pool.~TensorPool();
-    kernelPool.~KernelPool();
     if (queue) {
       wgpuQueueRelease(queue);
       wgpuInstanceProcessEvents(instance);
@@ -240,29 +247,10 @@ struct GPUContext {
     } else {
       log(kDefLog, kWarn, "Instance is null");
     }
+    log(kDefLog, kInfo, "Destroyed context");
   }
-  */
 };

-KernelPool::~KernelPool() {
-  for (auto kernelPtr : data) {
-    FreeKernel(kernelPtr);
-    // data.erase(kernelPtr);
-  }
-  /*
-  for (MultiKernel *multiKernelPtr : multiData) {
-    while (multiKernelPtr->future.wait_for(std::chrono::seconds(0)) !=
-           std::future_status::ready) {
-      log(kDefLog, kWarn,
-          "MultiKernel future not ready, waiting before freeing");
-      wgpuInstanceProcessEvents(ctx->instance);
-    }
-    FreeMultiKernel(*multiKernelPtr);
-    multiData.erase(multiKernelPtr);
-  }
-  */
-}
-
 /* Tensor factory function */
 GPUTensor CreateTensor(TensorPool &pool, WGPUDevice &device, const Shape &shape,
                        NumType dtype,
@@ -380,9 +368,9 @@ void showDeviceInfo(WGPUAdapter &adapter) {
 }

 GPUContext CreateContext(bool quietLogging = true,
-                         const WGPUInstanceDescriptor &desc = {},
-                         const WGPURequestAdapterOptions &adapterOpts = {},
-                         WGPUDeviceDescriptor devDescriptor = {}) {
+                         const WGPUInstanceDescriptor &desc = {},
+                         const WGPURequestAdapterOptions &adapterOpts = {},
+                         WGPUDeviceDescriptor devDescriptor = {}) {
   if (quietLogging) {
     kDefLog.level = kError;
   }
@@ -732,8 +720,7 @@ Kernel CreateKernel(GPUContext &ctx, const ShaderCode &shader,
   }

   log(kDefLog, kInfo, "Initializing callbackData");
-  op.callbackData =
-      {op.readbackBuffer, op.outputSize, nullptr, &op.promise};
+  op.callbackData = {op.readbackBuffer, op.outputSize, nullptr, &op.promise};

   ctx.kernelPool.data.insert(&op);

```
run.cpp

+37 -12

````diff
@@ -196,7 +196,11 @@ when the GPU computation occurs:
 )");

   section(R"(
+
+
 *Ahead-of-time GPU Resource Preparation*
+
+
 )");

   section(R"(
@@ -214,11 +218,12 @@ The main resources are:
 `ShaderCode` and a list of `GPUTensor` resources to bind for the dispatch
 computation.
 - `MultiKernel` - a collection of kernels that can be dispatched to the GPU.
+
 )");

   section(R"(
-Preparing GPU Resources II: Acquiring GPU Resources with `Create*` Functions
-----------------------------------------------------------------------------
+Preparing GPU Resources II: Acquiring GPU Resources with `Create*()` Functions
+------------------------------------------------------------------------------

 Resources are acquired using the `Create` functions. These are assumed to be
 ahead-of-time and not performance critical.
@@ -296,8 +301,8 @@ wait();


   section(R"(
-WGSL Compute Kernels Define GPU Computation Programs
-----------------------------------------------------
+WGSL Compute Kernels are Programs that run Computation on the GPU
+------------------------------------------------------------------

 Device code in WebGPU uses the WGSL shading language. In addition to mechanisms
 for invoking WGSL shaders as compute kernels as shown so far, you can write
@@ -330,18 +335,38 @@ The `@group(0)` and `@binding(0)` annotations are used to specify the binding
 points for the input and output buffers. The `@compute` annotation specifies
 that this is a compute kernel. The `@workgroup_size(256)` annotation specifies
 the workgroup size for the kernel.
-
-Workgroups are a concept in WebGPU that are similar to CUDA blocks. They are
-groups of threads that can share memory and synchronize with each other. The
-workgroup size is the number of threads in a workgroup.
-
 )");

   section(R"(
-Creating a kernel
-------------------
+`CreateKernel()` is used to create a Kernel
+-------------------------------------------
+
+Reviewing our GELU example and after using `CreateTensor()` to allocate and
+bind buffers for input and output data, we can use `CreateKernel()` to create a
+kernel.
+
+```
+GPUTensor input = CreateTensor(ctx, {N}, kf32, inputArr.data());
+GPUTensor output = CreateTensor(ctx, {N}, kf32, outputArr.data());
+Kernel op =
+    CreateKernel(ctx, ShaderCode{kGELU, 256}, input, output);
+```
+
+Note this *does not run* the kernel, it just prepares the kernel as a resource
+to be dispatched later.
+
+There are four arguments to `CreateKernel()`:
+- `GPUContext` - the context for the GPU
+- `ShaderCode` - the shader code for the kernel
+- `GPUTensor` - the input tensor. Even though the kernel is not executed,
+GPUTensor provides a handle to the buffers on the GPU to be loaded when the
+kernel is run. If there's more than one input, `GPUTensors` can be used which
+is an ordered collection of `GPUTensor`.
+- `GPUTensor` - the output tensor. As with the input tensor, the values are not
+important at this point, the underlying reference to the GPU buffer is bound to
+the kernel so that when the kernel is dispatched, it will know where to write
+the output data.

-TODO(avh)
 )");


````