The ppl script uses the full 2048 context length, which takes about 8 GB of GPU RAM with both the original CUDA kernel and the Triton kernel. That's probably why you're getting OOM. You can modify the ppl script to use a shorter context length and then it should work fine. I haven't set up a CLI arg to adjust that yet, sorry.
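For illustration only, here is a minimal sketch of what such a change could look like: a perplexity loop that evaluates a long token stream in fixed-size windows, where you can lower the window size to reduce memory. The names (`evaluate_ppl`, `CONTEXT_LEN`, the model/tokenizer objects) are placeholders and not the actual variables in the ppl script, which may structure this differently.

```python
import torch

CONTEXT_LEN = 1024  # lowering this from 2048 roughly halves activation memory

@torch.no_grad()
def evaluate_ppl(model, input_ids, context_len=CONTEXT_LEN):
    """Compute perplexity over a long token stream in fixed-size windows."""
    model.eval()
    nlls = []
    n_windows = input_ids.numel() // context_len
    for i in range(n_windows):
        window = input_ids[:, i * context_len : (i + 1) * context_len].to(model.device)
        # passing labels=window makes the model return the mean cross-entropy loss
        out = model(window, labels=window)
        nlls.append(out.loss * context_len)
    # perplexity = exp(total NLL / total tokens)
    return torch.exp(torch.stack(nlls).sum() / (n_windows * context_len))
```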
What I don't understand: the GPTQ CUDA version works with a 2048 context length (the benchmarks that output ppl). So does your version use a little more memory?
If I recall correctly, the benchmarks in the GPTQ-for-LLaMA codebase do some caching and other tricks to lower inference memory a bit, probably just enough to squeeze under that 8 GB threshold.
Thanks, I wanted to try your Triton version, but I only have 8 GB of GPU RAM.
The GPTQ CUDA version works (7B model). Your version (the ppl script) crashes with CUDA OOM.
Is that to be expected, or can it be solved?