Local, private, and fast LLM chat interface and orchestrator — with optional intelligent web search via SearxNG.
Built for self-hosting on everything from Raspberry Pi and low-power mini-PCs to full servers.
Lightweight, secure, and purpose-built to outperform bloated open-web UI stacks.
Go-LLama is a modern, high-performance chat interface for local LLM backends such as vLLM, llama.cpp, and llamafile.
It provides user accounts, persistent chat history, streaming responses, and context-aware web search for real-time information.
This project prioritizes:
- Low resource usage (~50 MB RAM typical)
- Fast response times, even on low-end hardware
- Privacy, simplicity, and stability
- Complete offline operation when web search is disabled
Core capabilities include:
- Connects to local LLM APIs (vLLM, llama.cpp, llamafile, etc.)
- True streaming output with tokens-per-second metrics
- Session reuse for models that support incremental conversation memory
- Automatic context trimming to stay within model limits
- Per-user chat isolation and persistent history
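
The streaming and tokens-per-second behaviour listed above can be pictured with a small, self-contained sketch. It assumes an OpenAI-compatible streaming endpoint of the kind vLLM and llama.cpp's server expose; the URL, port, and model name are placeholders, token throughput is approximated as one token per streamed chunk, and this is an illustration rather than Go-LLama's actual client code.

```go
// Sketch: consume a streaming chat completion from an OpenAI-compatible
// local backend (e.g. vLLM or llama.cpp's server) and estimate tokens/s.
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
	"time"
)

func main() {
	// Placeholder endpoint and model name; adjust to your backend.
	reqBody, _ := json.Marshal(map[string]any{
		"model":  "local-model",
		"stream": true,
		"messages": []map[string]string{
			{"role": "user", "content": "Hello"},
		},
	})

	resp, err := http.Post("http://localhost:8000/v1/chat/completions",
		"application/json", bytes.NewReader(reqBody))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	start := time.Now()
	chunks := 0
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		// Server-sent events arrive as lines prefixed with "data: ".
		line := strings.TrimPrefix(scanner.Text(), "data: ")
		if line == "" || line == "[DONE]" {
			continue
		}
		var chunk struct {
			Choices []struct {
				Delta struct {
					Content string `json:"content"`
				} `json:"delta"`
			} `json:"choices"`
		}
		if json.Unmarshal([]byte(line), &chunk) != nil || len(chunk.Choices) == 0 {
			continue
		}
		fmt.Print(chunk.Choices[0].Delta.Content) // show text as it arrives
		chunks++                                  // roughly one token per chunk
	}
	fmt.Printf("\n~%.1f tokens/s\n", float64(chunks)/time.Since(start).Seconds())
}
```

Counting streamed chunks is only an approximation of true token throughput, but it is enough to surface the kind of tokens-per-second figure the interface reports.
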
Go-LLama’s search pipeline is more than a simple fetch-and-dump of results:
1. Smart Auto-Search Trigger
   - Searches the web automatically when the user’s question implies a need for current or factual data (dates, tickers, “latest”, etc.).
   - Users can force or block searches naturally via phrases like “search the web for…” or “don’t search online”.
2. Result Ranking & Filtering
   - Raw SearxNG results are ranked by semantic relevance to the query.
   - Irrelevant or low-content hits are dropped automatically.
   - Only the top 50% of the most relevant results (respecting your configured limit) are retained.
3. Content Extraction & Enrichment
   - Each remaining result is visited and its full HTML content extracted (not just the snippet).
   - Boilerplate and noise are stripped; the core text is identified.
   - Extracted text is summarised and compressed into a short, LLM-optimised snippet.
4. LLM-Optimised Context Assembly
   - Summaries are formatted into a concise, numbered context block fed directly into the model prompt (sketched below).
   - The LLM is instructed to answer using those references and cite them inline ([1], [2], …).
   - The user sees a clean, cited answer with expandable source links appended.
This process produces higher-quality responses than naive web-injection — fast, relevant, and grounded without overwhelming the model.
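
To make the trigger and context-assembly steps concrete, here is a rough sketch of how such a pipeline might decide when to search and how it might turn summarised results into a numbered, citeable context block. The function names, trigger keywords, and prompt wording are illustrative assumptions, not Go-LLama's actual internals.

```go
// Sketch: a toy auto-search trigger plus numbered-context prompt assembly.
package main

import (
	"fmt"
	"strings"
)

// searchResult holds one ranked, already-summarised hit.
type searchResult struct {
	Title, URL, Summary string
}

// shouldSearch mimics the auto-search trigger: explicit user phrases win,
// otherwise a few "freshness" keywords nudge the search on.
func shouldSearch(q string) bool {
	l := strings.ToLower(q)
	switch {
	case strings.Contains(l, "don't search online"):
		return false
	case strings.Contains(l, "search the web"):
		return true
	}
	for _, kw := range []string{"latest", "today", "current price", "news"} {
		if strings.Contains(l, kw) {
			return true
		}
	}
	return false
}

// buildPrompt assembles the numbered context block and the inline-citation
// instruction that precede the user's question in the model prompt.
func buildPrompt(question string, results []searchResult) string {
	var b strings.Builder
	b.WriteString("Answer using the numbered web references below and cite them inline as [1], [2], ...\n\n")
	for i, r := range results {
		fmt.Fprintf(&b, "[%d] %s (%s)\n%s\n\n", i+1, r.Title, r.URL, r.Summary)
	}
	b.WriteString("Question: " + question + "\n")
	return b.String()
}

func main() {
	q := "What is the latest Go release?"
	if shouldSearch(q) {
		refs := []searchResult{{
			Title:   "Go release history",
			URL:     "https://go.dev/doc/devel/release",
			Summary: "Lists Go versions and their release dates.",
		}}
		fmt.Print(buildPrompt(q, refs))
	}
}
```

In the real pipeline, the results slice would come from the ranked, summarised SearxNG hits produced by steps 2 and 3.
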

Additional features:
- JWT-based authentication
- Admin and standard user roles
- Built-in user management endpoints
- Private, per-user chat storage
- Clean Bootstrap-based interface
- Mobile-friendly layout
- Real-time streaming message bubbles
- Manual Stop Generation button
- Optional auto-search notification when triggered dynamically
- Runs under a custom sub-path (useful for reverse proxies)
- Docker-ready stack: PostgreSQL + Redis + Go-based backend
- OpenAPI spec included for integration and extension
Optimised for low-power devices with:
- Efficient Go concurrency for streaming and background fetches
- Minimal JavaScript footprint
- Very low RAM footprint (~50 MB idle)
- On-demand SearxNG queries only when needed
The result: an interface that feels instant even on hardware where most AI dashboards crawl.
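
As an illustration of the concurrency point above, result pages can be fetched in parallel so that total latency is bounded by the slowest page rather than the sum of all pages. The sketch below shows the general pattern with a WaitGroup; the URLs, timeout, and error handling are simplified placeholders rather than the project's actual fetch code.

```go
// Sketch: fetch several pages concurrently and wait for all of them.
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
	"time"
)

func main() {
	urls := []string{
		"https://example.com/",
		"https://example.org/",
	}

	client := &http.Client{Timeout: 5 * time.Second}
	sizes := make([]int, len(urls)) // one slot per URL, so no locking is needed
	var wg sync.WaitGroup

	for i, u := range urls {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			resp, err := client.Get(u)
			if err != nil {
				return // a failed fetch simply leaves an empty slot
			}
			defer resp.Body.Close()
			body, _ := io.ReadAll(resp.Body)
			sizes[i] = len(body)
		}(i, u)
	}
	wg.Wait()

	for i, u := range urls {
		fmt.Printf("%s -> %d bytes\n", u, sizes[i])
	}
}
```
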
Install without cloning the repository:
```bash
# Download templates
curl -L -o docker-compose.yml https://raw.githubusercontent.com/TheFozid/go-llama/main/docker-compose.yml.sample
curl -L -o config.json https://raw.githubusercontent.com/TheFozid/go-llama/main/config.sample.json

# Edit with your DB, Redis, LLM, and SearxNG settings
nano config.json
nano docker-compose.yml

# Launch
docker compose up -d
```

Application URL: http://localhost:8070/go-llama

To update to the latest version:
```bash
docker compose pull
docker compose up -d
```

All user data in Docker volumes is preserved across updates.