Conversation

@no2chem (Contributor) commented Jun 22, 2023

An analysis of my previous patches suggests that use_kv_caching=True is not a win at all for the text model.

The current KV caching code has quite a bit of overhead. In particular, it (re)allocates large tensors each time it appends to the cache via a torch.cat call. With the coarse model this still seems to yield some performance gain, but it is definitely not a win for the smaller text model. The KV cache also seems to be what was causing the bimodality in #366; I'm guessing we would sometimes get unlucky and the model would have to reallocate.
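For illustration, here is a minimal sketch of the allocation pattern in question versus a preallocated alternative. This is not the actual Bark code; the tensor names, shapes, and loop structure are made up for the example.

```python
# Sketch only: contrasts a torch.cat-style cache append with a preallocated
# buffer. All names and shapes here are illustrative, not taken from Bark.
import torch

n_steps, n_heads, head_dim, batch = 256, 16, 64, 1

# Pattern 1: grow the cache with torch.cat, which allocates a new tensor
# (and copies the existing cache contents) on every decoding step.
k_cache = torch.empty(batch, n_heads, 0, head_dim)
for _ in range(n_steps):
    new_k = torch.randn(batch, n_heads, 1, head_dim)
    k_cache = torch.cat([k_cache, new_k], dim=2)  # reallocates each step

# Pattern 2: preallocate once and write into a slice, so appending a step
# is an in-place copy with no reallocation of the existing cache.
k_cache = torch.empty(batch, n_heads, n_steps, head_dim)
for t in range(n_steps):
    new_k = torch.randn(batch, n_heads, 1, head_dim)
    k_cache[:, :, t : t + 1, :] = new_k  # in-place write
```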

With this change I can consistently get about 280 it/s for the text model on an H100 after warmup.

@no2chem (Contributor, Author) commented Jun 22, 2023

Hm, never mind here; I think I drew an invalid conclusion due to a mix-up. It seems that caching still results in a small performance gain on the text model.

no2chem closed this Jun 22, 2023
