Needs more VRAM than normal GPTQ CUDA version? #1

Open
DanielWe2 opened this issue Mar 28, 2023 · 3 comments

Comments

@DanielWe2
Contributor

Thanks, I wanted to try your Triton version, but I only have 8 GB of VRAM.

The GPTQ CUDA version works (7B model). Your version (the ppl script) crashes with a CUDA OOM.

Is that to be expected, or can it be solved?

@fpgaminer
Owner

Thank you for the bug report.

The ppl script uses the full 2048 context length, which on both the original CUDA kernel and the Triton kernel takes about 8 GB of GPU RAM. That's probably why you're getting the OOM. You can modify the ppl script to use a shorter context length and then it should work fine. I haven't set up a CLI arg to adjust that yet, sorry.
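
A minimal sketch of what that modification might look like, assuming the ppl script does the usual sliding-window perplexity loop over a pre-tokenized corpus. The names `perplexity`, `model`, `input_ids`, and the value 512 are illustrative, not the repo's actual identifiers; the real script's structure may differ:

```python
import torch

# Illustrative: shrink the evaluation window so activation memory fits in 8 GB.
CONTEXT_LENGTH = 512  # instead of the full 2048

def perplexity(model, input_ids, context_length=CONTEXT_LENGTH):
    """Sliding-window perplexity over a pre-tokenized corpus of shape [1, N]."""
    model.eval()
    nlls = []
    n_chunks = input_ids.numel() // context_length
    for i in range(n_chunks):
        chunk = input_ids[:, i * context_length:(i + 1) * context_length].to(model.device)
        with torch.no_grad():
            # Hugging Face-style causal LM: passing labels returns the mean
            # cross-entropy loss over the chunk.
            loss = model(chunk, labels=chunk).loss
        nlls.append(loss.float() * context_length)  # total NLL for this chunk
    return torch.exp(torch.stack(nlls).sum() / (n_chunks * context_length))
```

Lowering the context length only reduces activation memory during evaluation; the quantized weights themselves still need to fit, and the reported perplexity will differ slightly from a 2048-token evaluation.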

@DanielWe2
Contributor Author

No problem.

What I don't understand: the GPTQ CUDA version works with the 2048 context length (the benchmarks that output ppl). So does your version use a little bit more memory?

@fpgaminer
Owner

If I recall correctly, the benchmarks in the GPTQ-for-LLaMA codebase do some caching and other tricks to lower inference memory a little bit. Probably just enough to squeeze under that 8 GB threshold.
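
For illustration only, one common trick of that kind is to evaluate token by token while reusing the KV cache, so the largest activation tensor covers a single position instead of all 2048. This is a rough sketch assuming a Hugging Face-style interface, not the actual GPTQ-for-LLaMA benchmark code:

```python
import torch
import torch.nn.functional as F

def perplexity_token_by_token(model, input_ids):
    """Feed one token at a time, reusing past_key_values, to keep peak
    activation memory low (at the cost of slower evaluation)."""
    model.eval()
    past = None
    nll = 0.0
    n = input_ids.shape[1]
    with torch.no_grad():
        for i in range(n - 1):
            out = model(input_ids[:, i:i + 1].to(model.device),
                        past_key_values=past, use_cache=True)
            past = out.past_key_values
            # Negative log-likelihood of the next token given the cache so far.
            logits = out.logits[:, -1, :]
            nll += F.cross_entropy(logits, input_ids[:, i + 1].to(model.device)).item()
    return torch.exp(torch.tensor(nll / (n - 1)))
```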
