Sentencepiece alternative #5

Open · Muennighoff opened this issue Feb 22, 2021 · 27 comments
@Muennighoff commented Feb 22, 2021

Amazing work! Currently the spmtrain in build_tokenizer doesn't work, because I think it needs a local installation of sentencepiece to be able to use the command. Is there a specific reason you chose Google's sentencepiece over just:

import os
from tokenizers import SentencePieceBPETokenizer

# paths: list of training text files; configs.data.path: output directory
tokenizer = SentencePieceBPETokenizer()
tokenizer.train(files=paths, vocab_size=3000, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.save(os.path.join(configs.data.path, "spiece.model"))
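(For comparison, the rough equivalent through Google's sentencepiece Python API would look something like the sketch below; the corpus path, vocab size, and options are only illustrative, not this repo's actual settings.)

import sentencepiece as spm

# Train a BPE model directly from a raw-text corpus (one sentence per line);
# this writes spiece.model and spiece.vocab next to the script.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spiece",
    vocab_size=3000,
    model_type="bpe",
)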

Also, I was wondering: did fast attention work for you?

@mymusise (Owner) commented Feb 23, 2021

Thanks for the feedback!

  1. Yes, a local installation of sentencepiece is required. Try pip install -r requirements.txt to update your environment.
  2. It's OK to use tokenizers.SentencePieceBPETokenizer, but both of them require the input sentences to be pre-tokenized for languages without spaces, such as Chinese or Japanese (see the sketch after this list).
  3. I forgot to add pre-tokenization in build_tokenizer.py; I will update it later.
  4. Fast attention is still not available in this repo; hopefully norabelrose's work will be ready soon.
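A minimal sketch of what pre-tokenization means for items 2 and 3, using jieba for Chinese word segmentation (jieba and the file names here are just an illustrative choice, not what build_tokenizer.py actually does):

import jieba

# Segment each Chinese line into words and re-join with spaces, so a
# whitespace-based tokenizer trainer can see word boundaries.
def pretokenize(line):
    return " ".join(jieba.cut(line.strip()))

with open("raw.txt", encoding="utf-8") as fin, open("pretok.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(pretokenize(line) + "\n")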

@Muennighoff (Author)

> Thanks for the feedback!
>
> 1. Yes, a local installation of sentencepiece is required. Try pip install -r requirements.txt to update your environment.
> 2. It's OK to use tokenizers.SentencePieceBPETokenizer, but both of them require the input sentences to be pre-tokenized for languages without spaces, such as Chinese or Japanese.
> 3. I forgot to add pre-tokenization in build_tokenizer.py; I will update it later.
> 4. Fast attention is still not available in this repo; hopefully norabelrose's work will be ready soon.

1-3: Great, thanks for the info.
4: Yeah, that would be awesome. I've also been working on it in a fork of transformers, but it's still unable to converge somehow, so I was thinking of using the original TensorFlow implementation from g-research as you did. But I guess it didn't work out for you?

@mymusise (Owner) commented Feb 24, 2021

> so I was thinking of using the original TensorFlow implementation from g-research as you did. But I guess it didn't work out for you?

I tried both of them two months ago but still had some problems.
In my tests:

  • transformer.PerformerAttention works, but it trains and predicts slower than the normal Transformer.
  • Google's fast-attention trains and predicts faster, but the prediction results are bad. It's confusing that the accuracy stalls at 66% when training on a small corpus like dataset/test/raw.txt. Perhaps I built the model in the wrong way.

Hope it helps :)
and wait for your good news 🚀

@mymusise (Owner) commented Mar 8, 2021

Closing this; reopen if needed.

mymusise closed this as completed Mar 8, 2021
@Muennighoff (Author)

> Closing this; reopen if needed.

BERT & DistilBERT Performers now work with pretrained transformers (https://colab.research.google.com/drive/1A9reiUZbA7DELuJ8keTo73sIQ4dJJVoT#scrollTo=F5k6jxicGf3E)

Still working on autoregressive models like GPT, but it seems the gains won't be that big for them; will keep you updated!

@mymusise (Owner)

Great work! @Muennighoff
After trying your code, the BERT & DistilBERT Performers do work better than the normal Transformer.
I tried getting GPT2 to work with Performers a few days ago and got some clues. I think the reason the Performer performs well in non-causal models like BERT and T5 but has a negative effect in causal models like GPT may be that the matrix multiplication softmax(Q @ K) @ CausalMask @ V is faster than the loop multiplication in _headwise_causal_numerator.
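To make the contrast concrete, here is a minimal sketch (my own toy shapes and names, omitting the Performer feature map and the normalization) of the two computations: dense masked attention as one batched matmul versus a per-position loop for the causal linear-attention numerator.

import torch

B, H, L, D = 2, 4, 8, 16
q = torch.randn(B, H, L, D)
k = torch.randn(B, H, L, D)
v = torch.randn(B, H, L, D)

# Dense causal attention: one big matmul over [L, L] scores.
# O(L^2) memory, but a single GPU-friendly kernel.
mask = torch.ones(L, L).tril().bool()
scores = (q @ k.transpose(-1, -2)) / D ** 0.5
dense_out = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1) @ v

# Causal linear-attention numerator: keep a running [D, D] state of k_j v_j^T
# outer products and loop over positions. O(L*D*D) work, but a Python loop.
state = q.new_zeros(B, H, D, D)
outs = []
for t in range(L):
    state = state + k[:, :, t, :, None] * v[:, :, t, None, :]
    outs.append(torch.einsum("bhd,bhde->bhe", q[:, :, t], state))
linear_numerator = torch.stack(outs, dim=2)  # [B, H, L, D]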

mymusise reopened this Mar 10, 2021
@Muennighoff (Author) commented Mar 10, 2021

> Great work! @Muennighoff
> After trying your code, the BERT & DistilBERT Performers do work better than the normal Transformer.
> I tried getting GPT2 to work with Performers a few days ago and got some clues. I think the reason the Performer performs well in non-causal models like BERT and T5 but has a negative effect in causal models like GPT may be that the matrix multiplication softmax(Q @ K) @ CausalMask @ V is faster than the loop multiplication in _headwise_causal_numerator.

Yeah, the problem seems to be the loop multiplication. We could try the causal multiplication from lucidrains; it shouldn't be too difficult to copy it over.

I'm also still running into shape errors when trying GPT2 Performer - did you get it to work?

@mymusise (Owner)

> Yeah, the problem seems to be the loop multiplication. We could try the causal multiplication from lucidrains; it shouldn't be too difficult to copy it over.
>
> I'm also still running into shape errors when trying GPT2 Performer - did you get it to work?

It looks so cool! I'll try it and keep you updated!

@mymusise (Owner)

@Muennighoff Hi, I got an OOM error after using the causal_linear_attention_noncuda function from lucidrains.
I think it's a bad idea to generate a context matrix with torch.einsum('...nd,...ne->...nde', k, v).
If the shape of Q, K, V is [B, H, L, D], then the context will have a huge size [B, H, L, D, D]. In many applications the value of D is in [64, 128]. Maybe this deviates a little from the design of the Performer.
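A rough back-of-the-envelope check of that size (my own example numbers, assuming float32):

B, H, L, D = 1, 12, 1024, 64
context = B * H * L * D * D        # elements in the [B, H, L, D, D] tensor
scores = B * H * L * L             # elements in the usual [B, H, L, L] attention scores
print(context * 4 / 2**20, "MiB")  # 192.0 MiB for a single layer
print(scores * 4 / 2**20, "MiB")   # 48.0 MiB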

@Muennighoff (Author)

> @Muennighoff Hi, I got an OOM error after using the causal_linear_attention_noncuda function from lucidrains.
> I think it's a bad idea to generate a context matrix with torch.einsum('...nd,...ne->...nde', k, v).
> If the shape of Q, K, V is [B, H, L, D], then the context will have a huge size [B, H, L, D, D]. In many applications the value of D is in [64, 128]. Maybe this deviates a little from the design of the Performer.

Yes, I think that's why he also has the other version above (https://github.com/lucidrains/performer-pytorch/blob/3bff14e39284e7dc82952153099a63dcd3561dc0/performer_pytorch/performer_pytorch.py#L142). Did you try that one as well?

@mymusise (Owner)

> Yes, I think that's why he also has the other version above (https://github.com/lucidrains/performer-pytorch/blob/3bff14e39284e7dc82952153099a63dcd3561dc0/performer_pytorch/performer_pytorch.py#L142). Did you try that one as well?

I didn't try it, but I think rewriting the loop multiplication with the low-level CUDA API might be a way to recover the performance, like fast_transformers.causal_product.CausalDotProduct does.
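For reference, and assuming I'm remembering the pytorch-fast-transformers package correctly (the import path and calling convention below are from memory, so treat this as a sketch rather than gospel), usage would be roughly:

import torch
from fast_transformers.causal_product import CausalDotProduct

B, H, L, D = 2, 8, 1024, 64
q = torch.randn(B, H, L, D, device="cuda")
k = torch.randn(B, H, L, D, device="cuda")
v = torch.randn(B, H, L, D, device="cuda")

# Same math as the per-position prefix-sum loop, but the loop runs inside a
# fused CUDA kernel instead of Python, so no [L, D, D] tensor is materialized.
out = CausalDotProduct.apply(q, k, v)  # [B, H, L, D]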

@JamesDeAntonis commented Mar 12, 2021

Hi all, I'm happy to join this conversation if there's anything to be done.

Is this too inefficient?

import numpy as np

# Toy example with L=2 tokens, D=3 query/key dims, M=4 value dims.
q0 = np.array([[1, 2, 3], [4, 5, 6]])
k0 = np.array([[2, 3, 4], [5, 6, 7]])
v0 = np.array([[3, 4, 5, 6], [7, 8, 9, 10]])

# Reshape so that each token contributes an outer product k_i v_i^T.
q = np.expand_dims(q0, axis=1)                      # (L, 1, D)
k = np.expand_dims(k0, axis=2)                      # (L, D, 1)
v = np.expand_dims(v0, axis=2).transpose(0, 2, 1)   # (L, 1, M)

assert q.shape == (2, 1, 3)
assert k.shape == (2, 3, 1)
assert v.shape == (2, 1, 4)

kv = (k @ v)                      # per-token outer products, (L, D, M)

assert kv.shape == (2, 3, 4)

kv_sum = np.cumsum(kv, axis=0)    # causal prefix sums over the token axis

assert kv_sum.shape == (2, 3, 4)

qkv = q @ kv_sum                  # each q_i against its own prefix sum

assert qkv.shape == (2, 1, 4)

qkv = np.squeeze(qkv, axis=1)     # (L, M)

# Matches explicitly masked attention without softmax: (q k^T * causal_mask) v.
assert (qkv == ((q0 @ k0.T) * np.array([[1, 0], [1, 1]])) @ v0).all()

@mymusise (Owner) commented Mar 14, 2021

Welcome, @JamesDeAntonis!

> Is this too inefficient?

No, I think it can speed up the operation but will use more memory.
And I'm not sure: does the dimension of q0, (2, 3), mean (L, D)? A sequence with two tokens, each with 3 dimensions?

If so, I think the code above is equivalent to torch.einsum('...nd,...ne->...nde', k, v) in a way.

As I commented above, it will use a lot of memory:

> If the shape of Q, K, V is [B, H, L, D], then the context will have a huge size [B, H, L, D, D]. In many applications the value of D is in [64, 128]. Maybe this deviates a little from the design of the Performer.

@JamesDeAntonis

> And I'm not sure: does the dimension of q0, (2, 3), mean (L, D)? A sequence with two tokens, each with 3 dimensions?

Yes, that's what I meant.

> If the shape of Q, K, V is [B, H, L, D], then the context will have a huge size [B, H, L, D, D]. In many applications the value of D is in [64, 128]. Maybe this deviates a little from the design of the Performer.

Don't you iterate across heads in the code, to save memory? This would mean [B, 1, L, D, D], whereas a regular transformer has [B, H, L, L]. Generally speaking, I'm unsure what exactly you are trying to change.

@JamesDeAntonis

Does this mean CausalDotProduct in fast-attention is what we want?

@mymusise (Owner)

> Don't you iterate across heads in the code, to save memory?

Great, that's a good idea.

> This would mean [B, 1, L, D, D], whereas a regular transformer has [B, H, L, L]

Generally speaking, the value of D is in [64, 128]. If we have L = 2048 and D = 128, then L*D*D = 2048*128*128 is much bigger than the regular transformer's L*L = 2048*2048; but if H = 8, then H*L*L == L*D*D, which means we can save more memory when L > 2048. We would still have to iterate over each head, though, if we do what you mention above.
I'm not sure I've understood your idea.
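Spelling out that arithmetic (same numbers as above):

L, D, H = 2048, 128, 8
print(L * D * D)   # 33554432 -> per-head context of the linear attention
print(L * L)       # 4194304  -> per-head score matrix of a regular transformer
print(H * L * L)   # 33554432 -> all heads of the regular transformer; equals L*D*D when H = 8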

> I'm unsure what exactly you are trying to change.

Actually, I have no idea how to speed up the causal multiplication while saving memory. 😷

@JamesDeAntonis

Isn't this only solvable by implementing the for-loop directly in a lower-level language? I imagine this is effectively what fast attention does.

@mymusise (Owner)

> Does this mean CausalDotProduct in fast-attention is what we want?

Yes, but I haven't tested it yet, because I can't find a TensorFlow version and I haven't mastered C++. 😂

@mymusise (Owner)

> I imagine this is effectively what fast attention does.

Yes, I have the same guess.

@JamesDeAntonis commented Mar 15, 2021

Can we talk over a call? I just emailed you

@JamesDeAntonis

I was just reading the fast attention code, and I think it does exactly what we want. Typing is really the only reason the C++ code is torch-specific. Otherwise, all the logic should transfer directly to TF. My issue is that I don't know TF. I think we only need to change the #include line at the top of the C++ file, the typing in the C++ file, and the wrapper code in __init__.py, and then it will be TF compatible.

@mymusise (Owner)

@JamesDeAntonis Cool!
I haven't read the fast attention code carefully, but it sounds like it's worth trying.

@JamesDeAntonis commented Mar 16, 2021

I don't think I'm the guy to do this (I don't use C++ or TensorFlow), but I think this is a pretty easy problem for someone who at least knows TensorFlow.

I started an implementation here. @mymusise will you take a look?

@mymusise (Owner) commented Mar 17, 2021

Cool! You are so efficient! @JamesDeAntonis
Actually, before converting to a TF version, I'd prefer to test fast-attention.CausalDotProduct in PyTorch first.
At the same time, I will try to convert it based on your implementation.

Muennighoff seems to be solving the masking problem of the T5 decoder.
@Muennighoff Did you try fast-attention.CausalDotProduct? Does it work?

@JamesDeAntonis

@ice-americano (who I work with) ran it and it seemed to work to some degree. Compared to regular attention, he was getting significant improvements in memory usage, but a noticeable slowdown. We're not sure why the slowdown is occurring.

@Muennighoff (Author)

@mymusise I ran some experiments with fast-attention.CausalDotProduct in the SST notebook (https://colab.research.google.com/drive/1A9reiUZbA7DELuJ8keTo73sIQ4dJJVoT#scrollTo=f_R7D-mZyRXm). The CausalDotProduct itself seems to work, but I get a shape mismatch in the denominator/normalization, since the sequence length of the keys may not be the same as that of the queries in a T5 enc-dec: the keys come from the encoder, but the queries come from the decoder input.

If I force the shapes to fit or just skip the normalization, it goes through but always predicts the same value, giving 50% accuracy, whereas the equivalent Transformer converges (check it under t5/pytorch/encdec in the notebook).

This might be unrelated to the CausalDotProduct, though, and could e.g. be caused by the attention mask or something else. @JamesDeAntonis did @ice-americano get it to converge?
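For context, here is a sketch of the normalization I mean and why its causal form assumes the query and key lengths match (toy shapes and names are mine):

import torch

B, H, Lq, Lk, D = 2, 8, 16, 32, 64   # decoder queries vs. encoder keys
q = torch.randn(B, H, Lq, D)
k = torch.randn(B, H, Lk, D)

# Causal denominator: D_i = q_i . sum_{j <= i} k_j. The cumulative sum is
# indexed by the query position, so it only lines up when Lq == Lk
# (decoder self-attention). With encoder keys, Lq != Lk and the shapes clash.
k_cumsum = k.cumsum(dim=2)                               # [B, H, Lk, D]
# denom = torch.einsum("bhld,bhld->bhl", q, k_cumsum)    # fails here: Lq != Lk

# Non-causal (cross-attention) denominator: every query attends to all keys.
denom = torch.einsum("bhqd,bhd->bhq", q, k.sum(dim=2))   # [B, H, Lq]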

@JamesDeAntonis

Yes, we got it to converge.

On our task, relative to regular attention, it was (i) a step worse in terms of loss, (ii) noticeably lower in memory utilization on long sequences, and (iii) slower (this is the most perplexing part to us).
