Package path: offdev/micro-agent-go/internal/provider/llamacpp
Last updated: 2026-03-26
llamacpp.Provider implements core.LLMProvider by speaking to a running llama-server
process over HTTP. It uses the OpenAI-compatible REST API exposed by llama-server. No CGo,
no embedding of llama.cpp, no API key required. User messages with core.Message.Images are
sent as multimodal content: a JSON array of text and image_url parts (data URLs with base64).
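
The multimodal content shape described above can be sketched as follows. This is a minimal illustration, not the provider's actual code: the `contentPart` and `imageURL` types, the `buildParts` helper, and the explicit MIME type argument are all hypothetical names chosen for the example.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// contentPart is a hypothetical wire type for one element of a multimodal
// "content" array in the OpenAI-compatible chat format.
type contentPart struct {
	Type     string    `json:"type"`
	Text     string    `json:"text,omitempty"`
	ImageURL *imageURL `json:"image_url,omitempty"`
}

type imageURL struct {
	URL string `json:"url"`
}

// buildParts returns one text part followed by one image_url part per image.
// Each image is embedded as a base64 data URL. Passing the MIME type
// explicitly is an assumption of this sketch.
func buildParts(text, mime string, images [][]byte) []contentPart {
	parts := []contentPart{{Type: "text", Text: text}}
	for _, img := range images {
		url := "data:" + mime + ";base64," + base64.StdEncoding.EncodeToString(img)
		parts = append(parts, contentPart{Type: "image_url", ImageURL: &imageURL{URL: url}})
	}
	return parts
}

func main() {
	// The four bytes below are only the start of a PNG signature, not a real image.
	parts := buildParts("What is in this picture?", "image/png", [][]byte{{0x89, 'P', 'N', 'G'}})
	out, err := json.Marshal(parts)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The resulting array would be placed in the "content" field of the user message in place of a plain string.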
Default target in examples: llama-server on http://localhost:8080 (override with config / LLAMA_URL).
```go
provider := llamacpp.New("") // uses http://localhost:8080
// Or pass the URL explicitly:
provider = llamacpp.New("http://localhost:8080")

// Synchronous completion (TopP, TopK, MinP, PresencePenalty, RepetitionPenalty optional; 0 = server default)
resp, err := provider.Complete(ctx, &core.CompletionRequest{
	Model:       "default",
	Messages:    conv.Messages,
	Tools:       toolDefs,
	MaxTokens:   8192,
	Temperature: 0.7,
})

// Streaming (content and optional thinking)
ch, err := provider.Stream(ctx, req)
for delta := range ch {
	if delta.Done {
		break
	}
	if delta.Thinking {
		fmt.Print("[thinking] ", delta.Content)
	} else {
		fmt.Print(delta.Content)
	}
}
// When delta.Done, delta.Final holds the accumulated Response (content + tool_calls).

// Optional: trace raw response shape when parsed content/tool_calls are empty (e.g. -vvv)
provider.SetTraceLogger(log)

// Embeddings (not part of core.LLMProvider interface)
vec, err := provider.Embed(ctx, "some text to embed")
```

Requests are sent as JSON to /v1/chat/completions (OpenAI-compatible format).
| core type | wire field |
|---|---|
| Message{Role: RoleAssistant, ToolCalls: [...]} | "content": null, "tool_calls": [...] |
| Message{Role: RoleTool, ...} | "role": "tool", "tool_call_id": "..." |
| ToolDef.Parameters | function.parameters (passed through verbatim) |
Content nullability: Assistant messages with only tool calls send "content": null.
Messages with text content send "content": "<text>".
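
One common way to get this null-versus-string behavior in Go is a pointer field, since a nil `*string` marshals as JSON null. The sketch below assumes that approach; `wireMessage` and `encodeMessage` are hypothetical names, and sending an empty string when a message has neither text nor tool calls is an assumption of the example, not something the provider is known to do.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// wireMessage is a hypothetical wire struct; the provider's real type may differ.
type wireMessage struct {
	Role      string            `json:"role"`
	Content   *string           `json:"content"` // nil marshals as JSON null
	ToolCalls []json.RawMessage `json:"tool_calls,omitempty"`
}

// encodeMessage builds the JSON for one message. Tool calls are passed as
// raw JSON strings to keep the example short.
func encodeMessage(role, text string, toolCallsJSON ...string) (string, error) {
	m := wireMessage{Role: role}
	for _, tc := range toolCallsJSON {
		m.ToolCalls = append(m.ToolCalls, json.RawMessage(tc))
	}
	// Content stays nil (JSON null) only for messages that carry tool calls
	// and no text. The empty-string fallback is an assumption of this sketch.
	if text != "" || len(m.ToolCalls) == 0 {
		m.Content = &text
	}
	b, err := json.Marshal(m)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	toolOnly, err := encodeMessage("assistant", "", `{"id":"call_1"}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(toolOnly)

	withText, err := encodeMessage("user", "hello")
	if err != nil {
		panic(err)
	}
	fmt.Println(withText)
}
```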
Response content: The provider accepts message.content as either a JSON string or an
array of parts (e.g. [{"type":"text","text":"..."}]). Some backends return content as an
array after tool-use turns; parsing both shapes ensures the agent loop continues correctly
(e.g. cron heartbeat with tool calls). When the server reports output tokens but parsed
content and tool_calls are empty, trace logging (-vvv) can log the raw message for
diagnostics (see SetTraceLogger).
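
Accepting both content shapes can be done with a custom `UnmarshalJSON`. The following is a sketch under assumptions: `flexContent`, `wireMessage`, and `parseContent` are hypothetical names, and joining the text parts by plain concatenation is a choice made for the example rather than confirmed provider behavior.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// flexContent is a hypothetical helper type that accepts message.content as
// either a JSON string or an array of parts.
type flexContent string

func (c *flexContent) UnmarshalJSON(data []byte) error {
	// Shape 1: a plain JSON string. A JSON null also succeeds here and
	// leaves the content empty.
	var s string
	if err := json.Unmarshal(data, &s); err == nil {
		*c = flexContent(s)
		return nil
	}
	// Shape 2: an array of parts. Text parts are concatenated (assumption).
	var parts []struct {
		Type string `json:"type"`
		Text string `json:"text"`
	}
	if err := json.Unmarshal(data, &parts); err != nil {
		return fmt.Errorf("content is neither a string nor an array of parts: %w", err)
	}
	var b strings.Builder
	for _, p := range parts {
		if p.Type == "text" {
			b.WriteString(p.Text)
		}
	}
	*c = flexContent(b.String())
	return nil
}

type wireMessage struct {
	Content flexContent `json:"content"`
}

// parseContent decodes one message object and returns its text content.
func parseContent(raw string) (string, error) {
	var m wireMessage
	if err := json.Unmarshal([]byte(raw), &m); err != nil {
		return "", err
	}
	return string(m.Content), nil
}

func main() {
	for _, raw := range []string{
		`{"content":"hello"}`,
		`{"content":[{"type":"text","text":"hel"},{"type":"text","text":"lo"}]}`,
	} {
		text, err := parseContent(raw)
		fmt.Printf("%q %v\n", text, err)
	}
}
```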
Thinking / reasoning_content: When llama-server is run with thinking models (e.g. DeepSeek R1,
Command R7B) and --reasoning-format deepseek (or equivalent), the stream delta may include
reasoning_content in addition to content. The provider emits Delta{Thinking: true, Content: ...}
for reasoning fragments and Delta{Thinking: false, Content: ...} for the main reply. The
final delta has Done: true and Final set with accumulated content and tool_calls. If the
server does not send reasoning_content, only content deltas are emitted (no thinking).
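
The mapping from one streaming chunk to thinking and reply deltas might look roughly like this. It is a sketch only: the local `Delta` type is a stand-in for the core type, `chunk` models just the fields used here, and emitting the reasoning fragment before the content fragment within a single chunk is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Delta is a local stand-in with the fields described above.
type Delta struct {
	Content  string
	Thinking bool
	Done     bool
}

// chunk models only the parts of an OpenAI-style streaming chunk used here.
type chunk struct {
	Choices []struct {
		Delta struct {
			Content          string `json:"content"`
			ReasoningContent string `json:"reasoning_content"`
		} `json:"delta"`
	} `json:"choices"`
}

// deltasFromChunk converts the JSON payload of one SSE "data:" line into zero
// or more deltas. Empty or null fields produce no delta.
func deltasFromChunk(payload string) ([]Delta, error) {
	var c chunk
	if err := json.Unmarshal([]byte(payload), &c); err != nil {
		return nil, err
	}
	var out []Delta
	for _, ch := range c.Choices {
		if r := ch.Delta.ReasoningContent; r != "" {
			out = append(out, Delta{Content: r, Thinking: true})
		}
		if t := ch.Delta.Content; t != "" {
			out = append(out, Delta{Content: t})
		}
	}
	return out, nil
}

func main() {
	payloads := []string{
		`{"choices":[{"delta":{"reasoning_content":"Let me think. "}}]}`,
		`{"choices":[{"delta":{"content":"Hello"}}]}`,
	}
	for _, p := range payloads {
		deltas, err := deltasFromChunk(p)
		if err != nil {
			panic(err)
		}
		for _, d := range deltas {
			fmt.Printf("thinking=%v content=%q\n", d.Thinking, d.Content)
		}
	}
}
```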
When CompletionRequest.ThinkingEnabled is non-nil, the provider maps it to the request body (e.g.
chat-template / thinking flags) as supported by the server.
Provider.Embed calls POST /v1/embeddings and returns the first embedding vector. This
method is outside the core.LLMProvider interface. internal/app passes it to memory.OpenStore
as memory.EmbedFunc when MEMORY_EMBED is true, for Milvus vector storage and search.
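
A request of this kind could be sketched as below. The `embed` function, the `embeddingsResponse` type, and the `cannedTransport` test double are hypothetical; the request body follows the OpenAI embeddings convention of an "input" field, and the decode and empty-response error texts are invented for the example. The canned transport only exists so the sketch runs without a live llama-server.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
)

type embeddingsResponse struct {
	Data []struct {
		Embedding []float32 `json:"embedding"`
	} `json:"data"`
}

// embed posts text to baseURL + /v1/embeddings and returns the first vector.
func embed(ctx context.Context, client *http.Client, baseURL, text string) ([]float32, error) {
	body, err := json.Marshal(map[string]any{"input": text})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+"/v1/embeddings", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("llamacpp: http status %d", resp.StatusCode)
	}
	var out embeddingsResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, fmt.Errorf("llamacpp: decode embeddings: %w", err) // hypothetical message
	}
	if len(out.Data) == 0 {
		return nil, errors.New("llamacpp: empty embeddings response") // hypothetical message
	}
	return out.Data[0].Embedding, nil
}

// cannedTransport is a test double that answers every request with a fixed
// status and body instead of opening a network connection.
type cannedTransport struct {
	status int
	body   string
}

func (t cannedTransport) RoundTrip(r *http.Request) (*http.Response, error) {
	return &http.Response{
		StatusCode: t.status,
		Header:     make(http.Header),
		Body:       io.NopCloser(strings.NewReader(t.body)),
		Request:    r,
	}, nil
}

// demoEmbed runs embed against the canned transport.
func demoEmbed(status int, body string) ([]float32, error) {
	client := &http.Client{Transport: cannedTransport{status: status, body: body}}
	return embed(context.Background(), client, "http://localhost:8080", "some text to embed")
}

func main() {
	vec, err := demoEmbed(http.StatusOK, `{"data":[{"embedding":[0.25,0.5]},{"embedding":[1,2]}]}`)
	fmt.Println(vec, err)
}
```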
The http.Client has no timeout by default. LLM inference duration is unbounded; callers
must use context cancellation to control per-request timeouts. The client is shared across all
calls (connection pooling via net/http default transport).
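
Because the client itself never times out, a caller would typically bound each request with a context deadline. The sketch below illustrates the pattern with plain net/http rather than the provider; `slowTransport`, `callWithTimeout`, and `timedOut` are hypothetical names, and the transport merely simulates a slow server so the example needs no network.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// slowTransport simulates a busy server: it answers after delay unless the
// request context ends first.
type slowTransport struct{ delay time.Duration }

func (t slowTransport) RoundTrip(r *http.Request) (*http.Response, error) {
	select {
	case <-time.After(t.delay):
		return &http.Response{
			StatusCode: http.StatusOK,
			Header:     make(http.Header),
			Body:       io.NopCloser(strings.NewReader("{}")),
			Request:    r,
		}, nil
	case <-r.Context().Done():
		return nil, r.Context().Err()
	}
}

// callWithTimeout sends one request through a client that has no Timeout set
// and relies on the context deadline alone to bound the call.
func callWithTimeout(delayMS, timeoutMS int) error {
	client := &http.Client{Transport: slowTransport{delay: time.Duration(delayMS) * time.Millisecond}}
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMS)*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		"http://localhost:8080/v1/chat/completions", strings.NewReader("{}"))
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

// timedOut reports whether err was caused by the context deadline.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	err := callWithTimeout(2000, 20) // server needs 2s, caller allows 20ms
	fmt.Println("timed out:", timedOut(err))
}
```

With the real provider, the same context would be passed as the first argument to Complete, Stream, or Embed.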
- Non-200 HTTP responses return fmt.Errorf("llamacpp: http status %d", ...).
- JSON decode failures return wrapped errors.
- Streaming goroutine errors are silently dropped (malformed SSE chunks are skipped); the channel is always closed cleanly.
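
The streaming behavior in the last bullet can be illustrated with a small reader goroutine. This is a sketch, not the provider's code: `sseChunk`, `streamContent`, and `collect` are hypothetical names, the `data: [DONE]` terminator follows the OpenAI streaming convention, and context cancellation is left out for brevity, so a consumer must drain the channel.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// sseChunk models only the content field of a streaming chunk.
type sseChunk struct {
	Choices []struct {
		Delta struct {
			Content string `json:"content"`
		} `json:"delta"`
	} `json:"choices"`
}

// streamContent reads SSE lines from r and sends each content fragment on the
// returned channel. Malformed chunks are skipped, and the channel is closed
// on every exit path.
func streamContent(r io.Reader) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		sc := bufio.NewScanner(r)
		for sc.Scan() {
			line := sc.Text()
			if !strings.HasPrefix(line, "data: ") {
				continue // blank separator lines and other SSE fields
			}
			payload := strings.TrimPrefix(line, "data: ")
			if payload == "[DONE]" {
				return
			}
			var c sseChunk
			if err := json.Unmarshal([]byte(payload), &c); err != nil {
				continue // malformed chunk: dropped without surfacing an error
			}
			for _, ch := range c.Choices {
				if ch.Delta.Content != "" {
					out <- ch.Delta.Content
				}
			}
		}
	}()
	return out
}

// collect drains the stream and joins the fragments.
func collect(stream string) string {
	var b strings.Builder
	for s := range streamContent(strings.NewReader(stream)) {
		b.WriteString(s)
	}
	return b.String()
}

func main() {
	stream := strings.Join([]string{
		`data: {"choices":[{"delta":{"content":"Hel"}}]}`,
		``,
		`data: {broken json`,
		``,
		`data: {"choices":[{"delta":{"content":"lo"}}]}`,
		``,
		`data: [DONE]`,
	}, "\n")
	fmt.Println(collect(stream))
}
```

The trade-off of this design is that a consumer cannot distinguish a short reply from one that lost chunks, which is where trace logging helps.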