
Implement matmul 2D tiling for gpt2 #67

Closed

Conversation

@junjihashimoto (Collaborator) commented Oct 13, 2024

Implement matmul 2D tiling for gpt2.
It is faster than the previous version, but still seems slower than the CPU.
According to microbenchmarks, the cost of loop unrolling and createKernel is high.
The parameters of matmul-forward are B=4, T=64, C=768, and OC=2304.

| Stage | Individual time |
| --- | --- |
| Total (GPU) | 184,968 µs |
| Create Context (GPU) | 2,309 µs |
| Create Tensors (GPU) | 1,223 µs |
| Loop Unrolling (GPU) | 65,904 µs |
| Create Kernel (GPU) | 111,024 µs |
| Calculation (GPU) | 4,508 µs |
| Calculation (CPU) | 3,757 µs |
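
For context, the 2D tiling scheme computes each output tile with one workgroup, accumulating over K-slices of the inputs. The sketch below shows the same blocking on the CPU for matmul-forward's shapes (M = B·T, K = C, N = OC, weight stored as (OC, C) as in llm.c); it is only an illustration of the technique, and the TILE size is an assumption, not the value used by this PR's WGSL kernel.

```cpp
#include <vector>

// 2D-tiled matmul sketch: out[M x N] = inp[M x K] * weight, with weight laid
// out as (N x K), i.e. (OC x C), matching llm.c's matmul_forward.
// For B=4, T=64, C=768, OC=2304: M = B*T = 256, K = 768, N = 2304.
// TILE = 16 is an illustrative choice, not the PR's workgroup tile size.
constexpr int TILE = 16;

void matmulTiled(const std::vector<float> &inp, const std::vector<float> &weight,
                 std::vector<float> &out, int M, int K, int N) {
  for (int i0 = 0; i0 < M; i0 += TILE) {
    for (int j0 = 0; j0 < N; j0 += TILE) {
      // One "workgroup": a TILE x TILE tile of the output, accumulated locally
      // (the GPU version keeps this tile and the input sub-tiles in shared memory).
      float acc[TILE][TILE] = {};
      for (int k0 = 0; k0 < K; k0 += TILE) {
        for (int i = i0; i < i0 + TILE && i < M; ++i) {
          for (int j = j0; j < j0 + TILE && j < N; ++j) {
            float sum = 0.0f;
            for (int k = k0; k < k0 + TILE && k < K; ++k) {
              sum += inp[i * K + k] * weight[j * K + k];
            }
            acc[i - i0][j - j0] += sum;
          }
        }
      }
      // Write the finished tile back to the output matrix.
      for (int i = i0; i < i0 + TILE && i < M; ++i)
        for (int j = j0; j < j0 + TILE && j < N; ++j)
          out[i * N + j] = acc[i - i0][j - j0];
    }
  }
}
```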

@junjihashimoto (Collaborator, Author)

Kernel compilation may be what takes most of that time; in that case it would be nice to be able to cache the compiled kernel.
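
One possible shape for such a cache is sketched below: key on the WGSL source and reuse the compiled kernel across calls. The `Kernel` struct and `compileKernel` function here are hypothetical stand-ins, not gpu.cpp API.

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Stand-in for a compiled compute kernel; in practice this would wrap whatever
// handle the library's create-kernel step returns.
struct Kernel {
  std::string source;
};

// Placeholder for the expensive compile step (not an actual gpu.cpp call).
std::shared_ptr<Kernel> compileKernel(const std::string &wgslSource) {
  return std::make_shared<Kernel>(Kernel{wgslSource});
}

// Kernel cache sketch: compile each distinct shader source once, then reuse
// the compiled kernel, so the ~111 ms "Create Kernel" cost is paid only on the
// first dispatch for a given shader.
class KernelCache {
public:
  std::shared_ptr<Kernel> get(const std::string &wgslSource) {
    auto it = cache_.find(wgslSource);
    if (it != cache_.end())
      return it->second;                 // cache hit: skip recompilation
    auto k = compileKernel(wgslSource);  // cache miss: compile once
    cache_.emplace(wgslSource, k);
    return k;
  }

private:
  std::unordered_map<std::string, std::shared_ptr<Kernel>> cache_;
};
```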

@junjihashimoto (Collaborator, Author) commented Oct 13, 2024

I tried a cached kernel (junjihashimoto@3ee5944), but I get this error:

[error] Device uncaptured error: [CommandBuffer] cannot be submitted more than once.
 - While calling [Queue].Submit([[CommandBuffer]])

The following issue may be related:
gpuweb/gpuweb#4138
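
That error matches WebGPU's rule that a command buffer is single-use: the compiled pipeline and bind group can be cached, but a fresh command buffer has to be encoded for every submit. A rough sketch of that split using the webgpu.h C API (exact entry-point names can vary between Dawn/wgpu-native versions; descriptor and bind-group setup are omitted and assumed to be cached elsewhere):

```cpp
#include <webgpu/webgpu.h>

// Re-encode a fresh command buffer for every dispatch; only the pipeline and
// bind group are reused. Submitting the *same* WGPUCommandBuffer twice is what
// triggers "cannot be submitted more than once".
void dispatchCached(WGPUDevice device, WGPUQueue queue,
                    WGPUComputePipeline pipeline, // cached, compiled once
                    WGPUBindGroup bindGroup,      // cached, created once
                    uint32_t wgX, uint32_t wgY) {
  WGPUCommandEncoder encoder = wgpuDeviceCreateCommandEncoder(device, nullptr);
  WGPUComputePassEncoder pass =
      wgpuCommandEncoderBeginComputePass(encoder, nullptr);
  wgpuComputePassEncoderSetPipeline(pass, pipeline);
  wgpuComputePassEncoderSetBindGroup(pass, 0, bindGroup, 0, nullptr);
  wgpuComputePassEncoderDispatchWorkgroups(pass, wgX, wgY, 1);
  wgpuComputePassEncoderEnd(pass);

  // A new command buffer each time; it is consumed by this submit.
  WGPUCommandBuffer commands = wgpuCommandEncoderFinish(encoder, nullptr);
  wgpuQueueSubmit(queue, 1, &commands);

  wgpuCommandBufferRelease(commands);
  wgpuComputePassEncoderRelease(pass);
  wgpuCommandEncoderRelease(encoder);
}
```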
