A lightweight, high-performance Rust-based microservice for Google’s EmbeddingGemma-300M. It serves embeddings via both HTTP and gRPC and is packaged in small, optimized Docker images for both CPU and GPU.
Built on ONNX Runtime, it features dynamic request batching to maximize throughput.
**GPU:**

```bash
docker run --gpus all -it --rm \
  -p 3000:3000 \
  -p 50051:50051 \
  -e MODEL_VARIANT=q4 \
  garvw/gemma-embedder:gpu
```

**CPU:**

```bash
docker run -it --rm \
  -p 3000:3000 \
  -p 50051:50051 \
  -e MODEL_VARIANT=q4 \
  garvw/gemma-embedder:cpu
```

The container will download the specified model variant on first startup.
- 🚀 High Performance: Written in Rust for minimal overhead and memory safety.
- 📦 Optimized Docker Images: Small, secure images for both CPU and GPU.
- ⛔ Strict Execution: The GPU image requires a CUDA-enabled GPU and exits if one is not found; no silent fallbacks.
- 🌐 Dual Endpoints: A simple JSON REST API (via Axum) and a high-performance gRPC endpoint (via Tonic).
- ⚙️ Configurable: Easily configure batch size, token length, and model variant via environment variables.
- 🧵 Dynamic Batching: Automatically batches incoming requests to maximize inference throughput.
This service is designed to be run as a Docker container. Two separate images are provided.
- `garvw/gemma-embedder:gpu`: For systems with an NVIDIA GPU and the NVIDIA Container Toolkit installed.
- `garvw/gemma-embedder:cpu`: For CPU-only environments.
The `MODEL_VARIANT` environment variable controls which model weights are downloaded. The `q4` (4-bit quantized) variant is recommended for a good balance of performance and quality.
Endpoint: `POST /v1/embed`
Request Body:
```json
{
  "text": "hello world"
}
```

Example with `curl`:
```bash
curl -X POST http://localhost:3000/v1/embed \
  -H "Content-Type: application/json" \
  -d '{"text":"hello world"}'
```
Service: `inference.Inferencer`

Method: `GetEmbedding`

Example with `grpcurl`:
```bash
grpcurl -plaintext \
  -d '{"text":"hello world"}' \
  localhost:50051 inference.Inferencer/GetEmbedding
```
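A native Rust client can be generated with `tonic`/`tonic-build` from the service's proto file. The sketch below assumes a proto package named `inference` with an `Inferencer` service whose request carries a `text` field, matching the grpcurl example above; the `EmbeddingRequest` message name and the response shape are assumptions:

```rust
// Tonic client sketch. Assumes `inference.proto` is compiled by tonic-build
// in build.rs; message names below are illustrative guesses.
pub mod inference {
    tonic::include_proto!("inference");
}

use inference::inferencer_client::InferencerClient;
use inference::EmbeddingRequest; // hypothetical request message name

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut client = InferencerClient::connect("http://localhost:50051").await?;
    let response = client
        .get_embedding(EmbeddingRequest { text: "hello world".into() })
        .await?;
    println!("{:?}", response.into_inner());
    Ok(())
}
```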
| Variable | Default | Description |
|---|---|---|
| `EXECUTION_PROVIDER` | `cpu` | Set to `gpu` in the GPU Dockerfile to enforce GPU execution. |
| `MODEL_VARIANT` | `q4` | Which model variant to download; `q4` or `fp32` (full precision) are recommended. |
| `MODEL_PATH` | (auto) | Set automatically by the startup script based on `MODEL_VARIANT`. |
| `MAX_TOKENS` | `2048` | Maximum sequence length for the tokenizer. |
| `MAX_BATCH_SIZE` | `32` | Maximum number of requests grouped into a single inference batch. |
| `MAX_WAIT_MS` | `5` | Time (in ms) to wait to fill a batch before running inference. |
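To illustrate how `MAX_BATCH_SIZE` and `MAX_WAIT_MS` interact, here is a minimal sketch of the usual dynamic-batching pattern in Rust with tokio. It is illustrative only, not this service's actual implementation:

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::{timeout_at, Instant};

const MAX_BATCH_SIZE: usize = 32;
const MAX_WAIT_MS: u64 = 5;

/// Drain the request queue into batches: run inference as soon as a batch
/// is full, or once MAX_WAIT_MS has elapsed since its first request.
async fn batching_loop(mut rx: mpsc::Receiver<String>) {
    while let Some(first) = rx.recv().await {
        let mut batch = vec![first];
        let deadline = Instant::now() + Duration::from_millis(MAX_WAIT_MS);

        // Top the batch up until it is full or the wait budget is spent.
        while batch.len() < MAX_BATCH_SIZE {
            match timeout_at(deadline, rx.recv()).await {
                Ok(Some(text)) => batch.push(text),
                _ => break, // deadline hit or channel closed
            }
        }

        run_inference(&batch); // stand-in for one batched ONNX Runtime call
    }
}

fn run_inference(batch: &[String]) {
    println!("inference on batch of {}", batch.len());
}
```

Larger `MAX_WAIT_MS` values trade a little latency for bigger batches and higher throughput; `MAX_BATCH_SIZE` caps memory use per inference call.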
You can build the Docker images locally instead of pulling from Docker Hub.
```bash
# GPU image
docker build -f Dockerfile.gpu -t gemma-embedder:gpu-local .

# CPU image
docker build -f Dockerfile.cpu -t gemma-embedder:cpu-local .
```

The EmbeddingGemma model weights are licensed under Google's Gemma Terms of Use. This project provides only a service wrapper around the model; by using the Docker images, you are responsible for complying with Google's terms.