A comprehensive benchmark suite for testing Gemma3n language models (2.6B and 4.5B parameters) using llama.cpp native executables. This benchmark focuses on Spanish administrative tasks and compares CPU vs GPU performance.
- Model Support: Benchmarks both Gemma3n 2.6B and 4.5B models
- Performance Comparison: CPU-only vs GPU-accelerated inference
- Administrative Focus: 8 Spanish administrative prompts covering government/office workflows
- Comprehensive Metrics: Tokens/second, response time, success rates, and GPU speedup ratios
- Multiple Output Formats: Detailed JSON results and CSV summaries
- System Information: Captures hardware and software configuration for reproducible results
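As a sketch of what the system-information capture implies (the exact fields recorded by the benchmark scripts may differ, and the real capture likely adds GPU details via psutil or `nvidia-smi`):

```python
import os
import platform

# Illustrative only: field names are assumptions, not the benchmark's schema.
def system_snapshot():
    """Collect a minimal, reproducibility-oriented hardware/software summary."""
    return {
        "platform": platform.platform(),
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
        "python_version": platform.python_version(),
    }
```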
- llama.cpp executable: `build/bin/llama-cli`
- Model files:
  - `models/gemma-3n-E2B-it-Q4_K_M.gguf` (2.6B model)
  - `models/gemma-3n-E4B-it-Q4_K_M.gguf` (4.5B model)
```bash
# Install minimal dependencies
pip install -r requirements.txt

# Or install manually
pip install pandas psutil

# All llama.cpp tool dependencies (optional)
pip install -r requirements/requirements-all.txt
```

- Python 3.8+
- Built llama.cpp with CLI support
- CUDA-compatible GPU (optional, for GPU benchmarks)
- Sufficient RAM for model loading (8GB+ recommended)
```bash
# Clone and navigate to the repository
cd gemma3n-research

# Install dependencies
pip install -r requirements.txt

# Ensure llama.cpp is built
# (Follow llama.cpp build instructions if needed)
```

```bash
# Run comprehensive benchmark (CPU + GPU)
./run_benchmark.sh

# Run CPU-only benchmark
./run_benchmark.sh --cpu-only

# Run GPU-only benchmark
./run_benchmark.sh --gpu-only

# Custom configuration
./run_benchmark.sh --threads 8 --gpu-layers 50
```

Results are saved in the `benchmark_results/` directory:

- `gemma3n_benchmark_YYYYMMDD_HHMMSS.json` - Detailed results
- `gemma3n_benchmark_summary_YYYYMMDD_HHMMSS.csv` - Summary table
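Because the filenames are timestamped, the most recent run can be located programmatically; a small helper, assuming the default output directory:

```python
import glob
import os

def latest_result(outdir="benchmark_results", pattern="gemma3n_benchmark_*.json"):
    """Return the path of the most recently modified results file, or None."""
    files = glob.glob(os.path.join(outdir, pattern))
    return max(files, key=os.path.getmtime) if files else None
```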
```
Options:
  -c, --cpu-only          Run CPU-only benchmark
  -g, --gpu-only          Run GPU-accelerated benchmark only
  -a, --all               Run comprehensive benchmark (CPU + GPU) [DEFAULT]
  -t, --threads <N>       Set number of CPU threads (default: 4)
  -l, --gpu-layers <N>    Set number of GPU layers (default: 99)
  -o, --output-dir <DIR>  Set output directory (default: benchmark_results)
  -h, --help              Show help message
```

```bash
python benchmark_gemma3n.py [options]
```
```
Options:
  --cpu-threads N     Number of CPU threads to use
  --gpu-layers N      Maximum GPU layers to use (default: 32)
  --output-dir DIR    Output directory for results (default: benchmark_results)
  --executable PATH   Path to llama-cli executable
```

The benchmark tests models on 8 Spanish administrative tasks:
- Remote Work Policy: Draft remote work guidelines for government agencies
- Budget Proposal: Create IT infrastructure budget summary for municipal offices
- Incident Report: Develop security incident report template
- Document Approval: Generate standard operating procedure for document workflows
- Compliance Memo: Write data protection compliance requirements memo
- Employee Onboarding: Create public administration employee onboarding instructions
- Performance Evaluation: Develop performance evaluation checklist for administrative staff
- Procurement Request: Create equipment and supplies procurement form
Detailed results include:
- System information (CPU, memory, GPU, etc.)
- Per-prompt results with response text, timing, and token counts
- Configuration details for each test run
- Success/failure status for each inference
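The detailed JSON can be post-processed directly; in this sketch the key names (`results`, `success`, `response_time`) are assumptions for illustration, so check an actual `gemma3n_benchmark_*.json` for the real schema:

```python
import json

def summarize(path):
    """Compute success rate and mean response time from a detailed results file."""
    with open(path) as f:
        data = json.load(f)
    runs = data["results"]  # assumed key: list of per-prompt records
    ok = [r for r in runs if r.get("success")]
    return {
        "success_rate": len(ok) / len(runs) if runs else 0.0,
        "avg_response_time_s": (
            sum(r["response_time"] for r in ok) / len(ok) if ok else 0.0
        ),
    }
```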
Summary table with:
- Model name and configuration
- Average tokens per second
- Average response time
- Total tokens generated
- Success rate
- GPU speedup ratios
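With pandas (already a dependency), the CSV summary can be sliced further. The column names below (`model`, `config`, `avg_tokens_per_sec`) are illustrative assumptions; adjust them to the header row of the generated file:

```python
import pandas as pd

def speedup_per_model(csv_path):
    """GPU-over-CPU throughput ratio per model from the summary CSV.

    Assumes a "config" column with "cpu"/"gpu" values; verify against
    the actual gemma3n_benchmark_summary_*.csv header.
    """
    df = pd.read_csv(csv_path)
    tps = df.pivot_table(index="model", columns="config",
                         values="avg_tokens_per_sec")
    return tps["gpu"] / tps["cpu"]
```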
- Tokens/Second: Inference speed measurement
- Response Time: Total time per prompt (seconds)
- Success Rate: Percentage of successful inferences
- GPU Speedup: Performance improvement ratio (GPU vs CPU)
- Total Tokens: Cumulative tokens generated across all prompts
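The relationships between these metrics are plain arithmetic; spelled out:

```python
def tokens_per_second(total_tokens, elapsed_s):
    """Inference speed: generated tokens divided by wall-clock seconds."""
    return total_tokens / elapsed_s

def gpu_speedup(gpu_tps, cpu_tps):
    """Ratio > 1.0 means the GPU run was faster than the CPU run."""
    return gpu_tps / cpu_tps

# Example: 150 tokens in 5.0 s (CPU) vs 2.0 s (GPU)
# -> 30.0 and 75.0 tok/s respectively, a 2.5x speedup
```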
- Quantization: Q4_K_M for both models
- Max Tokens: 150 per prompt
- Temperature: 0.7
- Top-p: 0.9
- Context Size: 2048 tokens
- Batch Size: 512
- CPU Threads: Configurable (default: auto-detect)
- GPU Layers: Configurable (default: 32-99 depending on script)
- Timeout: 120 seconds per inference
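These settings map onto llama-cli flags roughly as sketched below. Flag spellings vary between llama.cpp versions, so verify against `build/bin/llama-cli --help` before relying on them:

```python
import subprocess

def build_command(model_path, prompt, threads=4, gpu_layers=99):
    """Assemble a llama-cli invocation matching the configuration above."""
    return [
        "build/bin/llama-cli",
        "-m", model_path,
        "-p", prompt,
        "-n", "150",              # max tokens per prompt
        "--temp", "0.7",          # temperature
        "--top-p", "0.9",
        "-c", "2048",             # context size
        "-b", "512",              # batch size
        "-t", str(threads),       # CPU threads
        "-ngl", str(gpu_layers),  # GPU layers (0 for CPU-only)
    ]

def run_once(model_path, prompt, **kwargs):
    # 120 s timeout per inference, as configured above
    return subprocess.run(build_command(model_path, prompt, **kwargs),
                          capture_output=True, text=True, timeout=120)
```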
- llama-cli not found:
  - Ensure llama.cpp is built: `make -C llama.cpp`
  - Check executable path: `build/bin/llama-cli`
- Model files missing:
  - Download or convert models to GGUF format
  - Place in `models/` directory with expected filenames
- GPU not detected:
  - Verify CUDA installation
  - Check GPU layers setting
  - Ensure llama.cpp is built with CUDA support
- Memory issues:
  - Reduce GPU layers for large models
  - Ensure sufficient system RAM
  - Monitor memory usage during benchmarks
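Before a long run, the basic existence checks can also be scripted; a minimal preflight sketch using the file layout this repository expects:

```python
import os

EXPECTED = [
    "build/bin/llama-cli",
    "models/gemma-3n-E2B-it-Q4_K_M.gguf",
    "models/gemma-3n-E4B-it-Q4_K_M.gguf",
]

def preflight(paths=EXPECTED):
    """Return the list of missing files; an empty list means ready to benchmark."""
    return [p for p in paths if not os.path.exists(p)]
```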
```bash
# Check system prerequisites
python benchmark_gemma3n.py --help

# Test llama.cpp installation
build/bin/llama-cli --help

# Verify model files
ls -la models/
```

When contributing to this benchmark:
- Maintain focus on administrative/government use cases
- Test changes with both CPU and GPU configurations
- Ensure output format compatibility
- Update documentation for new features
This benchmark suite is part of the montevive research project. Please refer to the project's main license file for usage terms.