Skip to content

Latest commit

 

History

History
108 lines (79 loc) · 4.18 KB

File metadata and controls

108 lines (79 loc) · 4.18 KB

provider/llamacpp — llama-server HTTP Provider

Package path: offdev/micro-agent-go/internal/provider/llamacpp
Last updated: 2026-03-26


Overview

llamacpp.Provider implements core.LLMProvider by speaking to a running llama-server process over HTTP. It uses the OpenAI-compatible REST API exposed by llama-server. No CGo, no embedding of llama.cpp, no API key required. User messages with core.Message.Images are sent as multimodal content: a JSON array of text and image_url parts (data URLs with base64).

Default target in examples: llama-server on http://localhost:8080 (override with config / LLAMA_URL).


Usage

provider := llamacpp.New("")  // uses http://localhost:8080
provider := llamacpp.New("http://localhost:8080")

// Synchronous completion (TopP, TopK, MinP, PresencePenalty, RepetitionPenalty optional; 0 = server default)
resp, err := provider.Complete(ctx, &core.CompletionRequest{
    Model:       "default",
    Messages:    conv.Messages,
    Tools:       toolDefs,
    MaxTokens:   8192,
    Temperature: 0.7,
})

// Streaming (content and optional thinking)
ch, err := provider.Stream(ctx, req)
for delta := range ch {
    if delta.Done { break }
    if delta.Thinking {
        fmt.Print("[thinking] ", delta.Content)
    } else {
        fmt.Print(delta.Content)
    }
}
// When delta.Done, delta.Final holds the accumulated Response (content + tool_calls).

// Optional: trace raw response shape when parsed content/tool_calls are empty (e.g. -vvv)
provider.SetTraceLogger(log)

// Embeddings (not part of core.LLMProvider interface)
vec, err := provider.Embed(ctx, "some text to embed")

Wire Format

Requests are sent as JSON to /v1/chat/completions (OpenAI-compatible format).

core type wire field
Message{Role: RoleAssistant, ToolCalls: [...]} "content": null, "tool_calls": [...]
Message{Role: RoleTool, ...} "role": "tool", "tool_call_id": "..."
ToolDef.Parameters function.parameters (passed through verbatim)

Content nullability: Assistant messages with only tool calls send "content": null. Messages with text content send "content": "<text>".

Response content: The provider accepts message.content as either a JSON string or an array of parts (e.g. [{"type":"text","text":"..."}]). Some backends return content as an array after tool-use turns; parsing both shapes ensures the agent loop continues correctly (e.g. cron heartbeat with tool calls). When the server reports output tokens but parsed content and tool_calls are empty, trace logging (-vvv) can log the raw message for diagnostics (see SetTraceLogger).

Thinking / reasoning_content: When llama-server is run with thinking models (e.g. DeepSeek R1, Command R7B) and --reasoning-format deepseek (or equivalent), the stream delta may include reasoning_content in addition to content. The provider emits Delta{Thinking: true, Content: ...} for reasoning fragments and Delta{Thinking: false, Content: ...} for the main reply. The final delta has Done: true and Final set with accumulated content and tool_calls. If the server does not send reasoning_content, only content deltas are emitted (no thinking).

When CompletionRequest.ThinkingEnabled is non-nil, the provider maps it to the request body (e.g. chat-template / thinking flags) as supported by the server.


Embed

Provider.Embed calls POST /v1/embeddings and returns the first embedding vector. This method is outside the core.LLMProvider interface. internal/app passes it to memory.OpenStore as memory.EmbedFunc when MEMORY_EMBED is true, for Milvus vector storage and search.


HTTP Client

The http.Client has no timeout by default. LLM inference duration is unbounded; callers must use context cancellation to control per-request timeouts. The client is shared across all calls (connection pooling via net/http default transport).


Error Handling

  • Non-200 HTTP responses return fmt.Errorf("llamacpp: http status %d", ...).
  • JSON decode failures return wrapped errors.
  • Streaming goroutine errors are silently dropped (malformed SSE chunks are skipped); the channel is always closed cleanly.