Implement kernel cache #68
Conversation
Tensor output = createTensor(ctx, Shape{b * t * c}, kf32);
printf("Created tensors\n");
// Generate the cache key from the arguments.
std::string key = "encoder_forward_" + std::to_string(B) + "_" + std::to_string(T) + "_" + std::to_string(C);
This might be okay as a start, but we probably don't want to keep all the implicit string allocations here from the to_string calls and concatenation.
To begin with, we might use snprintf for the string construction, and later we can consider some alternative to a string->Kernel map for the pool cache.
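For illustration, a minimal sketch of the snprintf route (the helper name, buffer size, and parameter types are assumptions, not part of the PR):

#include <cstdio>
#include <cstddef>

// Sketch only: build the cache key into a caller-provided buffer instead of
// allocating intermediate std::strings via to_string/concatenation.
inline void makeKernelKey(char *out, std::size_t outSize, const char *opName,
                          int B, int T, int C) {
  std::snprintf(out, outSize, "%s_%d_%d_%d", opName, B, T, C);
}

// Hypothetical usage:
//   char key[64];
//   makeKernelKey(key, sizeof(key), "encoder_forward", B, T, C);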
Tensor mean_t = createTensor(ctx, Shape{b * t}, kf32);
Tensor rstd_t = createTensor(ctx, Shape{b * t}, kf32);
// Generate the cache key from the arguments.
std::string key = "layernorm_forward_" + std::to_string(B) + "_" + std::to_string(T) + "_" + std::to_string(C);
Same as above re: strings (in general, trying not to use too much STL unless it's too onerous to find a lighter alternative).
std::string key = "encoder_forward_" + std::to_string(B) + "_" + std::to_string(T) + "_" + std::to_string(C); | ||
Kernel op; | ||
if (ctx.kernelPool.data.find(key) == ctx.kernelPool.data.end()) { | ||
Tensor input = createTensor(ctx, Shape{b * t}, ki32); |
This might be out of scope for this PR, but eventually there shouldn't be any createTensor operations in ops - they should all be passed in.
In the ideal state (we don't have to tackle all of this in this PR; see the sketch below):
- an op should take in any resources needed for a dispatch and submit the dispatch,
- it should not do any GPU allocation or perform any CPU/GPU data movement, and
- the inputs to the op function should probably be the GPU resources themselves rather than pointers to CPU resources.
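For illustration only, a rough sketch of that end state, assuming the existing Context/Tensor/Kernel types and a dispatch call; the signature and names are stand-ins, not the actual gpu.cpp API:

// Sketch: the op receives GPU-resident tensors and a pre-built kernel and
// only submits the dispatch; no createTensor calls, no host/device copies.
// inp/wte/wpe/out are assumed to have been bound when the kernel was created.
void encoder_forward(Context &ctx, Kernel &kernel,
                     const Tensor &inp,   // GPU token ids, allocated by the caller
                     const Tensor &wte,   // GPU token embedding table
                     const Tensor &wpe,   // GPU positional embedding table
                     Tensor &out) {       // GPU output, allocated by the caller
  dispatchKernel(ctx, kernel);  // submit work; resources were prepared upstream
}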
};

typedef std::shared_ptr<RawKernel> Kernel;
Could this be a unique_ptr, with operations that need a non-owning view taking a raw pointer?
shared_ptr often makes it unclear which resource is responsible for ownership/lifetime. In this case, I think it should be clear that ownership and lifetime are handled by the KernelPool.
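A small sketch of that ownership split, with illustrative field and method names (not the existing KernelPool API):

#include <memory>
#include <string>
#include <unordered_map>

struct RawKernel { /* stand-in for the library's kernel state */ };

// Sketch: the pool exclusively owns kernels via unique_ptr; callers that only
// need to dispatch receive a non-owning raw pointer whose lifetime is bounded
// by the pool.
struct KernelPool {
  std::unordered_map<std::string, std::unique_ptr<RawKernel>> data;

  RawKernel *get(const std::string &key) {
    auto it = data.find(key);
    return it == data.end() ? nullptr : it->second.get();
  }
};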
Also, what if we skip the typedef indirection and keep RawKernel as Kernel? Leaving the pointer type visible makes it clearer how it's used (e.g. -> vs .).
If we cache a RawKernel, it is owned by the KernelPool, so createKernel would have to return a reference to the Kernel rather than the Kernel itself.
When we cache it, the KernelPool owns it; when we don't cache it, the KernelPool does not. That would require two separate calls, one that returns a reference and one that returns the value, depending on whether we cache it or not. For now, I used shared_ptr to avoid having to provide both.
Alternatively, we might want a separate function that the user calls to put the kernel into the cache after calling createKernel, so we don't need two functions, one that returns a reference and one that returns the value.
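Sketching that alternative, building on the unique_ptr pool above (the function name cacheKernel is hypothetical): createKernel always returns an owned kernel, and caching becomes an explicit, separate step.

// Sketch only: move ownership into the pool when caching is wanted and hand
// back a non-owning view; uncached kernels simply stay owned by the caller.
RawKernel *cacheKernel(KernelPool &pool, const std::string &key,
                       std::unique_ptr<RawKernel> kernel) {
  RawKernel *view = kernel.get();
  pool.data.emplace(key, std::move(kernel));
  return view;
}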
Thanks a lot! Targeting dev so we can merge and follow up on broader refactoring in the dev branch. Have a look at the comments and we can go ahead and merge there. I feel like a lot of things will become much clearer once the function signature for an op is settled; it should eventually look pretty different from the current state (GPU resources as inputs, no allocations or data movement), but I probably need to implement a few examples of this to get the ball rolling (or find gaps/flaws in my conceptualization).
Thank you for your review!
This PR implements shader caching to reduce the cost of createKernel. In the case of matmul (#67), the cost of createKernel is about 50 times that of the GPU operation itself. To do the caching, it changes the Kernel data type to a shared pointer.
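The per-op pattern then looks roughly like the diff above: build a key from the launch parameters, reuse the cached kernel on a hit, and create and store it on a miss. In this sketch the createKernel arguments and the final dispatch are elided:

// Sketch of the caching pattern this PR introduces inside each op.
std::string key = "encoder_forward_" + std::to_string(B) + "_" +
                  std::to_string(T) + "_" + std::to_string(C);
Kernel op;  // std::shared_ptr<RawKernel> after this PR
if (ctx.kernelPool.data.find(key) == ctx.kernelPool.data.end()) {
  op = createKernel(ctx, /* shader, bindings, ... */);  // cache miss: compile once
  ctx.kernelPool.data[key] = op;                        // store for reuse
} else {
  op = ctx.kernelPool.data[key];                        // cache hit: reuse the shared_ptr
}
// dispatch op as usual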