Replies: 1 comment
-
@bmabi17 It's a fantastic idea that could align with CodeInterviewAssist's goals of zero cost and full privacy for end users, but there are some challenges we should consider, especially given the hardware many in our community might have, like 4GB GTX cards, and even in comparison with 8GB VRAM RTX cards.

Challenges with 4GB GTX Cards
- Inference Performance: Even if a model fits in VRAM, processing could be sluggish, potentially taking 5-10 seconds per response, which would disrupt real-time features like debugging or solution generation during interviews.
- Resource Contention: With limited system RAM (e.g., 8GB or 16GB total) and a 4GB GPU, multitasking with an IDE, browser, and interview tools (e.g., Zoom) could cause memory swapping or instability.

Comparison with 8GB VRAM RTX Cards

Additional Considerations

Collaborative Next Steps
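To make the 4GB-vs-8GB trade-off concrete, here is a minimal sketch of how a default model could be chosen based on available VRAM. This is not the project's actual logic; the function name `suggestModelForVram`, the thresholds, and the Ollama model tags are illustrative assumptions (quantized builds commonly available through Ollama), and it assumes a Node/TypeScript codebase.

```ts
// Hypothetical sketch: map available VRAM to a default quantized Ollama model tag.
// Thresholds are rough assumptions, not benchmarks; tags are examples only.
function suggestModelForVram(vramGiB: number): string {
  if (vramGiB >= 8) {
    // 8GB RTX cards: an 8B-class 4-bit quant fits with room for context.
    return "llama3.1:8b-instruct-q4_K_M";
  }
  if (vramGiB >= 4) {
    // 4GB GTX cards: a 3B-class model; expect slower, terser responses.
    return "llama3.2:3b-instruct-q4_K_M";
  }
  // CPU/iGPU fallback; latency will be noticeable.
  return "llama3.2:1b-instruct-q4_K_M";
}

console.log(suggestModelForVram(4)); // -> a 3B-class quantized model for 4GB cards
```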
-
Hi,
I was wondering if we could also add support for self-hosted LLM models via Ollama? It exposes OpenAI-compatible API endpoints. This would let end users run the model at zero cost and with full privacy.
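For illustration, here is a minimal sketch of what using Ollama's OpenAI-compatible endpoint could look like from a Node/TypeScript app (assumed stack), using the official `openai` client pointed at a local Ollama server. The model name `llama3.2` is just an example; use whatever model you have pulled locally.

```ts
// Minimal sketch, not the project's actual integration: reuse the OpenAI SDK
// against Ollama's OpenAI-compatible endpoint at http://localhost:11434/v1.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1", // local Ollama server
  apiKey: "ollama", // required by the client, ignored by Ollama
});

async function main() {
  const completion = await client.chat.completions.create({
    model: "llama3.2", // example; any locally pulled model works
    messages: [
      { role: "user", content: "Explain binary search in two sentences." },
    ],
  });
  console.log(completion.choices[0].message.content);
}

main().catch(console.error);
```

Because the endpoint is API-compatible, the existing OpenAI call paths would mostly only need a configurable base URL and model name to support this.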