Sentencepiece alternative #5
Thanks for the feedback!
1-3: Great, thanks for the info.
I tried both of them two months ago but still had some problems.
Hope it helps :)
Closing it; reopen if needed.
BERT & DistilBERT Performers now work with pretrained transformers (https://colab.research.google.com/drive/1A9reiUZbA7DELuJ8keTo73sIQ4dJJVoT#scrollTo=F5k6jxicGf3E). Still working on autoregressive models like GPT, but it seems the gains won't be that big for them; will keep you updated!
Great work! @Muennighoff
Yeah, the problem seems to be the loop multiplication - we could try the causal multiplication from lucidrains; it shouldn't be too difficult to copy it over. I'm also still running into shape errors when trying a GPT-2 Performer - did you get it to work?
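For reference, here is a minimal sketch of the kind of per-position loop being discussed (illustrative only; the function name and shapes are assumptions, not the actual performer-pytorch code):

```python
import numpy as np

def causal_linear_attention_loop(q, k, v):
    """Naive causal linear attention with an explicit Python loop.

    q, k: (L, D) feature-mapped queries/keys; v: (L, M) values.
    Only a single (D, M) running state is kept, so memory stays small,
    but the per-position Python loop is what makes this formulation slow.
    """
    L, D = q.shape
    M = v.shape[1]
    state = np.zeros((D, M))      # running sum of outer products k_j (x) v_j for j <= i
    out = np.zeros((L, M))
    for i in range(L):
        state += np.outer(k[i], v[i])
        out[i] = q[i] @ state     # unnormalized causal attention output for position i
    return out
```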
It looks so cool! I'll try it and keep you updated!
@Muennighoff Hi, I got an OOM error after using the
Yes, I think that's why he also has the other version above (https://github.com/lucidrains/performer-pytorch/blob/3bff14e39284e7dc82952153099a63dcd3561dc0/performer_pytorch/performer_pytorch.py#L142). Did you try that one as well?
I didn't try it, but I think rewriting the loop multiplication with the basic CUDA API is a way to work out the performance. Something like
Hi all, I'm happy to join this conversation if there's anything to be done. Is this too inefficient?

import numpy as np

q0 = np.array([[1, 2, 3], [4, 5, 6]])
k0 = np.array([[2, 3, 4], [5, 6, 7]])
v0 = np.array([[3, 4, 5, 6], [7, 8, 9, 10]])
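# q0, k0: two positions with 3-dim (feature-mapped) queries/keys; v0: two positions with 4-dim values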
q = np.expand_dims(q0, axis=1)
k = np.expand_dims(k0, axis=2)
v = np.expand_dims(v0, axis=2).transpose(0, 2, 1)
assert q.shape == (2, 1, 3)
assert k.shape == (2, 3, 1)
assert v.shape == (2, 1, 4)
kv = (k @ v)
assert kv.shape == (2, 3, 4)
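# prefix sum over the position axis: kv_sum[i] = sum of k_j (outer) v_j for all j <= i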
kv_sum = np.cumsum(kv, axis=0)
assert kv_sum.shape == (2, 3, 4)
qkv = q @ kv_sum
assert qkv.shape == (2, 1, 4)
qkv = np.squeeze(qkv, axis=1)
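# equals causal (lower-triangular masked) attention without softmax normalization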
assert (qkv == ((q0 @ k0.T) * np.array([[1, 0], [1, 1]])) @ v0).all()
Welcome, @JamesDeAntonis!
No, I think it can speed up the operation, but it will spend more memory. If so, I think the code above is equivalent; as I commented above, it will spend a lot of memory.
Yes, that's what I meant.
Don't you iterate across heads in the code, to save memory? This would mean
Does this mean
Great, that's a good idea.
Generally speaking, the value of D is in [64, 128]; if we have
Actually, I have no idea how to speed up the causal multiplication while saving memory. 😷
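To make the memory concern concrete, here is a rough back-of-the-envelope calculation (the batch size, head count and sequence length below are assumed for illustration, not numbers from this thread): materializing the full cumulative k⊗v tensor, as the cumsum formulation above does, stores an L×D×M state per head instead of a single D×M running state.

```python
# Illustrative sizes only: batch 16, 8 heads, sequence length 1024,
# feature dim D = 64, value dim M = 64, fp32 (4 bytes per element).
batch, heads, seq_len, d, m = 16, 8, 1024, 64, 64
bytes_fp32 = 4

cumsum_state = batch * heads * seq_len * d * m * bytes_fp32  # full prefix-sum tensor
loop_state = batch * heads * d * m * bytes_fp32              # single running state

print(f"cumsum formulation: {cumsum_state / 2**30:.1f} GiB")  # 2.0 GiB
print(f"loop formulation:   {loop_state / 2**20:.1f} MiB")    # 2.0 MiB
```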
Isn't this only solvable by implementing the for-loop directly in the lower-level language? I imagine this is effectively what fast attention does
Yes, but I haven't tested that yet, because I can't find the TensorFlow version and I haven't mastered C++. 😂
Yes, I have the same guess.
Can we talk over a call? I just emailed you.
I was just reading the fast attention code, and I think it does exactly what we want. Typing is really the only reason the C++ code is torch-specific. Otherwise, all the logic should directly transfer to TF. My issue is that I don't know TF. I think we only need to change the
@JamesDeAntonis Cool!
Cool, you are so efficient, @JamesDeAntonis! Muennighoff seems to be working on the masking problem of the T5 decoder.
@ice-americano (who I work with) ran it, and it seemed to work to some degree. Compared to regular attention, he was getting significant improvements in memory usage, but a noticeable slowdown. We're not sure why the slowdown is occurring.
@mymusise I made some experiments with the fast-attention.CausalDotProduct on the SST notebook (https://colab.research.google.com/drive/1A9reiUZbA7DELuJ8keTo73sIQ4dJJVoT#scrollTo=f_R7D-mZyRXm). The CausalDotProduct itself seems to work; however, I get a shape mismatch in the denominator/normalization, since the sequence length of the keys may not be the same as that of the queries in a T5 enc-dec (the keys come from the encoder, but the queries from the decoder input). If I force the shapes to fit or just skip the normalization, it goes through but always predicts the same value, giving 50% accuracy, whereas the equivalent transformer converges (check it under t5/pytorch/encdec in the notebook). This might be unrelated to the CausalDotProduct, though, and could e.g. be caused by the attention mask or something else. @JamesDeAntonis, did @ice-americano get it to converge?
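For context, a minimal sketch of the causal normalization term being referred to (assumed shapes, not the notebook's actual code): the cumulative sum pairs query position i with key positions j <= i, which only lines up when queries and keys have the same sequence length.

```python
import numpy as np

L, D = 5, 4                                 # assumed: equal query/key length L
q = np.random.rand(L, D)                    # feature-mapped queries
k = np.random.rand(L, D)                    # feature-mapped keys

# Denominator of causal linear attention: denom[i] = q_i . sum_{j <= i} k_j
k_cumsum = np.cumsum(k, axis=0)             # (L, D)
denom = np.einsum("ld,ld->l", q, k_cumsum)  # (L,)

# If the keys came from an encoder with a different length L_k != L,
# k_cumsum would be (L_k, D) and the element-wise pairing above breaks --
# the shape mismatch described in the comment.
```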
Yes, we got it to converge. On our task, in relation to regular attention, it was (i) a step worse in terms of loss, (ii) noticeably lower memory utilization on long sequences, and (iii) slower (this is the most perplexing to us)
Amazing work! Currently the spmtrain in build_tokenizer doesn't work, because I think it needs a local installation of sentencepiece to be able to use the command. Is there a specific reason you chose Google's sentencepiece over just:
Also, I was wondering: did fast attention work for you?
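If the problem is only that the spmtrain command needs a separate local installation, one possible workaround is to call SentencePiece through its Python API instead of the CLI. A rough sketch (the file names, vocab size and other settings below are placeholders, not values from this repo):

```python
import sentencepiece as spm

# Placeholder paths/settings for illustration; build_tokenizer's real
# arguments may differ.
spm.SentencePieceTrainer.train(
    input="corpus.txt",           # training text, one sentence per line
    model_prefix="tokenizer",     # writes tokenizer.model and tokenizer.vocab
    vocab_size=32000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode("Hello world", out_type=str))
```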