Needs more VRAM than normal GPTQ CUDA version? #1

Open
DanielWe2 opened this issue Mar 28, 2023 · 3 comments

Comments

@DanielWe2
Contributor

Thanks, I wanted to try your Triton version, but I only have 8 GB of VRAM.

The GPTQ CUDA version works (7B model). Your version (the ppl script) crashes with a CUDA OOM.

Is that to be expected, or can it be solved?

@fpgaminer
Owner

Thank you for the bug report.

The ppl script uses the full 2048 context length, which on both the original CUDA kernel and the Triton kernel takes about 8 GB of GPU RAM. That's probably why you're getting the OOM. You can modify the ppl script to use a shorter context length and then it should work fine. I haven't set up a CLI arg to adjust that yet, sorry.
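
A minimal sketch of what that modification might look like, assuming the ppl script does the usual sliding-window perplexity loop over a pre-tokenized corpus. The names `perplexity`, `model`, `input_ids`, and the value 512 are illustrative, not the repo's actual identifiers; the real script's structure may differ:

```python
import torch

# Illustrative: shrink the evaluation window so activation memory fits in 8 GB.
CONTEXT_LENGTH = 512  # instead of the full 2048

def perplexity(model, input_ids, context_length=CONTEXT_LENGTH):
    """Sliding-window perplexity over a pre-tokenized corpus of shape [1, N]."""
    model.eval()
    nlls = []
    n_chunks = input_ids.numel() // context_length
    for i in range(n_chunks):
        chunk = input_ids[:, i * context_length:(i + 1) * context_length].to(model.device)
        with torch.no_grad():
            # Hugging Face-style causal LM: passing labels returns the mean
            # cross-entropy loss over the chunk.
            loss = model(chunk, labels=chunk).loss
        nlls.append(loss.float() * context_length)  # total NLL for this chunk
    return torch.exp(torch.stack(nlls).sum() / (n_chunks * context_length))
```

Lowering the context length only reduces activation memory during evaluation; the quantized weights themselves still need to fit, and the reported perplexity will differ slightly from a 2048-token evaluation.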

@DanielWe2
Contributor Author

No problem.

What I don't understand: the GPTQ CUDA version works with the 2048 context length (the benchmarks that output ppl). So does your version use a little bit more memory?

@fpgaminer
Owner

If I recall correctly, the benchmarks in the GPTQ-for-LLaMA codebase do some caching and other tricks to lower inference memory a little bit. Probably just enough to squeeze under that 8 GB threshold.
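
For illustration only, one common trick of that kind is to evaluate token by token while reusing the KV cache, so the largest activation tensor covers a single position instead of all 2048. This is a rough sketch assuming a Hugging Face-style interface, not the actual GPTQ-for-LLaMA benchmark code:

```python
import torch
import torch.nn.functional as F

def perplexity_token_by_token(model, input_ids):
    """Feed one token at a time, reusing past_key_values, to keep peak
    activation memory low (at the cost of slower evaluation)."""
    model.eval()
    past = None
    nll = 0.0
    n = input_ids.shape[1]
    with torch.no_grad():
        for i in range(n - 1):
            out = model(input_ids[:, i:i + 1].to(model.device),
                        past_key_values=past, use_cache=True)
            past = out.past_key_values
            # Negative log-likelihood of the next token given the cache so far.
            logits = out.logits[:, -1, :]
            nll += F.cross_entropy(logits, input_ids[:, i + 1].to(model.device)).item()
    return torch.exp(torch.tensor(nll / (n - 1)))
```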
