
Implement matmul 2D tiling for gpt2 #67

Closed

Conversation

@junjihashimoto (Collaborator) commented Oct 13, 2024

Implement matmul 2D tiling for gpt2.
It is faster than the previous version, but still seems slower than the CPU.
According to microbenchmarks, the cost of loop unrolling and createKernel is high.
The parameters of matmul-forward are B=4, T=64, C=768, and OC=2304.

| Stage | Individual time |
| --- | --- |
| Total (GPU) | 184,968 µs |
| Create Context (GPU) | 2,309 µs |
| Create Tensors (GPU) | 1,223 µs |
| Loop Unrolling (GPU) | 65,904 µs |
| Create Kernel (GPU) | 111,024 µs |
| Calculation (GPU) | 4,508 µs |
| Calculation (CPU) | 3,757 µs |
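
For context, the 2D tiling scheme computes each output tile with one workgroup, accumulating over K-slices of the inputs. The sketch below shows the same blocking on the CPU for matmul-forward's shapes (M = B·T, K = C, N = OC, weight stored as (OC, C) as in llm.c); it is only an illustration of the technique, and the TILE size is an assumption, not the value used by this PR's WGSL kernel.

```cpp
#include <vector>

// 2D-tiled matmul sketch: out[M x N] = inp[M x K] * weight, with weight laid
// out as (N x K), i.e. (OC x C), matching llm.c's matmul_forward.
// For B=4, T=64, C=768, OC=2304: M = B*T = 256, K = 768, N = 2304.
// TILE = 16 is an illustrative choice, not the PR's workgroup tile size.
constexpr int TILE = 16;

void matmulTiled(const std::vector<float> &inp, const std::vector<float> &weight,
                 std::vector<float> &out, int M, int K, int N) {
  for (int i0 = 0; i0 < M; i0 += TILE) {
    for (int j0 = 0; j0 < N; j0 += TILE) {
      // One "workgroup": a TILE x TILE tile of the output, accumulated locally
      // (the GPU version keeps this tile and the input sub-tiles in shared memory).
      float acc[TILE][TILE] = {};
      for (int k0 = 0; k0 < K; k0 += TILE) {
        for (int i = i0; i < i0 + TILE && i < M; ++i) {
          for (int j = j0; j < j0 + TILE && j < N; ++j) {
            float sum = 0.0f;
            for (int k = k0; k < k0 + TILE && k < K; ++k) {
              sum += inp[i * K + k] * weight[j * K + k];
            }
            acc[i - i0][j - j0] += sum;
          }
        }
      }
      // Write the finished tile back to the output matrix.
      for (int i = i0; i < i0 + TILE && i < M; ++i)
        for (int j = j0; j < j0 + TILE && j < N; ++j)
          out[i * N + j] = acc[i - i0][j - j0];
    }
  }
}
```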

@junjihashimoto (Collaborator, Author)

Kernel compilation may be what takes most of that time; in that case it would be nice to be able to cache the compiled kernel.
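
One possible shape for such a cache is sketched below: key on the WGSL source and reuse the compiled kernel across calls. The `Kernel` struct and `compileKernel` function here are hypothetical stand-ins, not gpu.cpp API.

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Stand-in for a compiled compute kernel; in practice this would wrap whatever
// handle the library's create-kernel step returns.
struct Kernel {
  std::string source;
};

// Placeholder for the expensive compile step (not an actual gpu.cpp call).
std::shared_ptr<Kernel> compileKernel(const std::string &wgslSource) {
  return std::make_shared<Kernel>(Kernel{wgslSource});
}

// Kernel cache sketch: compile each distinct shader source once, then reuse
// the compiled kernel, so the ~111 ms "Create Kernel" cost is paid only on the
// first dispatch for a given shader.
class KernelCache {
public:
  std::shared_ptr<Kernel> get(const std::string &wgslSource) {
    auto it = cache_.find(wgslSource);
    if (it != cache_.end())
      return it->second;                 // cache hit: skip recompilation
    auto k = compileKernel(wgslSource);  // cache miss: compile once
    cache_.emplace(wgslSource, k);
    return k;
  }

private:
  std::unordered_map<std::string, std::shared_ptr<Kernel>> cache_;
};
```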

@junjihashimoto (Collaborator, Author) commented Oct 13, 2024

I tried a cached kernel (junjihashimoto@3ee5944), but I get this error:

[error] Device uncaptured error: [CommandBuffer] cannot be submitted more than once.
 - While calling [Queue].Submit([[CommandBuffer]])

The following issue may be related:
gpuweb/gpuweb#4138
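
That error matches WebGPU's rule that a command buffer is single-use: the compiled pipeline and bind group can be cached, but a fresh command buffer has to be encoded for every submit. A rough sketch of that split using the webgpu.h C API (exact entry-point names can vary between Dawn/wgpu-native versions; descriptor and bind-group setup are omitted and assumed to be cached elsewhere):

```cpp
#include <webgpu/webgpu.h>

// Re-encode a fresh command buffer for every dispatch; only the pipeline and
// bind group are reused. Submitting the *same* WGPUCommandBuffer twice is what
// triggers "cannot be submitted more than once".
void dispatchCached(WGPUDevice device, WGPUQueue queue,
                    WGPUComputePipeline pipeline, // cached, compiled once
                    WGPUBindGroup bindGroup,      // cached, created once
                    uint32_t wgX, uint32_t wgY) {
  WGPUCommandEncoder encoder = wgpuDeviceCreateCommandEncoder(device, nullptr);
  WGPUComputePassEncoder pass =
      wgpuCommandEncoderBeginComputePass(encoder, nullptr);
  wgpuComputePassEncoderSetPipeline(pass, pipeline);
  wgpuComputePassEncoderSetBindGroup(pass, 0, bindGroup, 0, nullptr);
  wgpuComputePassEncoderDispatchWorkgroups(pass, wgX, wgY, 1);
  wgpuComputePassEncoderEnd(pass);

  // A new command buffer each time; it is consumed by this submit.
  WGPUCommandBuffer commands = wgpuCommandEncoderFinish(encoder, nullptr);
  wgpuQueueSubmit(queue, 1, &commands);

  wgpuCommandBufferRelease(commands);
  wgpuComputePassEncoderRelease(pass);
  wgpuCommandEncoderRelease(encoder);
}
```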
