This repository provides an automated workflow for benchmarking LLM inference using the llm-d
stack. It includes tools for deployment, experiment execution, data collection, and teardown across multiple environments and deployment styles.
To provide a single source of automation for repeatable and reproducible experiments and performance evaluation on llm-d.
The benchmarking system drives synthetic or trace-based traffic into an llm-d-powered inference stack, orchestrated via Kubernetes. Requests are routed through a scalable load generator, with results collected and visualized for latency, throughput, and cache effectiveness.
Each benchmark run collects enough information to enable re-execution on different clusters/environments with minimal setup effort.
Multiple load generators and load profiles are available, in a pluggable architecture that allows for expansion.
Define and measure a representative set of metrics that allows not only meaningful comparisons between different stacks, but also performance characterization for different components.
For a discussion of candidate relevant metrics, please consult this document.
Category | Metric | Unit |
---|---|---|
Throughput | Output tokens / second | tokens / second |
Throughput | Input tokens / second | tokens / second |
Throughput | Requests / second | qps |
Latency | Time per output token (TPOT) | ms per output token |
Latency | Time to first token (TTFT) | ms |
Latency | Time per request (TTFT + TPOT * output length) | seconds per request |
Latency | Normalized time per output token (TTFT/output length +TPOT) aka NTPOT | ms per output token |
Latency | Inter Token Latency (ITL) - Time between decode tokens within a request | ms per output token |
Correctness | Failure rate | queries |
Experiment | Benchmark duration | seconds |
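To make the derived latency metrics concrete, here is a small worked example of the "time per request" and NTPOT formulas from the table above. The numbers are made up purely for illustration and are not measured data.

```bash
# Worked example of the derived latency formulas above, with made-up numbers.
TTFT_MS=250      # time to first token, in ms
TPOT_MS=12       # time per output token, in ms
OUTPUT_LEN=400   # output length, in tokens

# Time per request = TTFT + TPOT * output length (reported in seconds)
awk -v ttft="$TTFT_MS" -v tpot="$TPOT_MS" -v len="$OUTPUT_LEN" \
  'BEGIN { printf "time per request: %.2f s\n", (ttft + tpot * len) / 1000 }'

# NTPOT = TTFT / output length + TPOT (ms per output token)
awk -v ttft="$TTFT_MS" -v tpot="$TPOT_MS" -v len="$OUTPUT_LEN" \
  'BEGIN { printf "NTPOT: %.3f ms per output token\n", ttft / len + tpot }'
```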
Define a mix of workloads that express real-world use cases, allowing for llm-d performance characterization, evaluation, and stress investigation.
For a discussion of relevant workloads, please consult this document.
Workload | Use Case | ISL | ISV | OSL | OSV | OSP | Latency |
---|---|---|---|---|---|---|---|
Interactive Chat | Chat agent | Medium | High | Medium | Medium | Medium | Per token |
Classification of text | Sentiment analysis | Medium | | Short | Low | High | Request |
Classification of images | Nudity filter | Long | Low | Short | Low | High | Request |
Summarization / Information Retrieval | Q&A from docs, RAG | Long | High | Short | Medium | Medium | Per token |
Text generation | | Short | High | Long | Medium | Low | Per token |
Translation | | Medium | High | Medium | Medium | High | Per token |
Code completion | Type ahead | Long | High | Short | Medium | Medium | Request |
Code generation | Adding a feature | Long | High | Medium | High | Medium | Request |
llm-d-benchmark follows the practice of its parent project (llm-d) by also having its own Northstar design (a work in progress).
Pieces of information identifying a particular cluster. This information includes, but is not limited to, GPU model, LLM model, and llm-d parameters (an environment file, and optionally a values.yaml file for llm-d-deployer).
Load generator (Python code), written using the software facilities available at https://github.com/fmperf-project/fmperf.
Note
This will be expanded with additional load generators in the future (e.g., inference-perf).
FMPerf workload specification, with load profile (e.g., share-gpt vs long-input) and load levels (e.g., QPS values). IMPORTANT: these definitions will be expanded with specifications for other load generators.
Important
The triple <scenario>, <harness>, <workload>, combined with the standup/teardown capabilities provided by llm-d-deployer (https://github.com/llm-d/llm-d-deployer), should provide enough information to allow an experiment to be reproduced.
- llm-d-deployer: https://github.com/llm-d/llm-d-deployer
- fmperf: https://github.com/fmperf-project/fmperf
git clone https://github.com/llm-d/llm-d-benchmark.git
cd llm-d-benchmark
export LLMDBENCH_CLUSTER_HOST="https://api.fmaas-platform-eval.fmaas.res.ibm.com"
export LLMDBENCH_CLUSTER_TOKEN="..."
Tip
You can simply use your current context. After running kubectl/oc login, just set export LLMDBENCH_CLUSTER_HOST=auto
(and leave LLMDBENCH_CLUSTER_TOKEN unconfigured)
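For example, assuming you are already logged in to the cluster (the login command, token, and server URL below are placeholders for your own):

```bash
# Placeholder login; use whatever kubectl/oc context you normally use.
oc login --token=<your-token> --server=https://api.<your-cluster>:6443

# Reuse the current context instead of specifying host/token explicitly.
export LLMDBENCH_CLUSTER_HOST=auto
unset LLMDBENCH_CLUSTER_TOKEN
```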
Important
No matter which method is used (i.e., fully specifying LLMDBENCH_CLUSTER_HOST and LLMDBENCH_CLUSTER_TOKEN, or simply using the current context), there is an additional variable which will always require definition: LLMDBENCH_HF_TOKEN.
Caution
Please make sure the environment variable LLMDBENCH_VLLM_COMMON_PVC_STORAGE_CLASS
points to a storage class specific to your cluster. The default value will most likely fail.
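A minimal sketch covering both variables mentioned above; the token and storage class values are placeholders, so pick a storage class that actually exists on your cluster:

```bash
# List the storage classes available on your cluster first.
kubectl get storageclass

# Placeholder values; substitute your own.
export LLMDBENCH_HF_TOKEN=hf_xxxxxxxxxxxxxxxx
export LLMDBENCH_VLLM_COMMON_PVC_STORAGE_CLASS=<a-storage-class-from-the-list-above>
```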
A complete list of available variables (and their default values) can be found by running
cat setup/env.sh | grep "^export LLMDBENCH_" | sort
Note
The namespace and service account specified by the environment variables LLMDBENCH_VLLM_COMMON_NAMESPACE and LLMDBENCH_FMPERF_SERVICE_ACCOUNT will be created automatically.
Tip
If you want all generated yaml files and all collected data to reside in the same directory, set the environment variable LLMDBENCH_CONTROL_WORK_DIR explicitly before starting execution.
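For instance (the path below is just an illustration):

```bash
# Keep generated yamls and collected data for this run in one timestamped place.
export LLMDBENCH_CONTROL_WORK_DIR=$HOME/llm-d-benchmark-runs/$(date +%Y%m%d_%H%M%S)
```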
Run the command with the option -h to produce a list of steps:
./setup/standup.sh -h
Note
Each individual "step file" is named in a way that briefly describes one of the multiple steps required for a full deployment.
Tip
Steps 0-5 can be considered "preparation" and can be skipped in most deployments.
./setup/standup.sh -n
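If you want to skip the preparation steps explicitly, the step-selection flag (-s, shown in more detail below) can be used; the range here is hypothetical and should be adjusted to the step list printed by -h:

```bash
# Hypothetical example: run only the non-preparation steps (adjust the range
# after checking ./setup/standup.sh -h).
./setup/standup.sh -s 6-8
```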
vLLM instances can be deployed by one of the following methods:
- "standalone" (a simple deployment with services associated to the deployment)
- "deployer" (invoking "llm-d-deployer").
This is controlled by the environment variable LLMDBENCH_DEPLOY_METHODS (default "deployer"). The value of this environment variable can be overridden by the parameter -t/--types (applicable to both teardown.sh and standup.sh).
Warning
At this time, only one deployment method can be used at a time.
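For example, to override the default method for a single invocation (the flag and values come from the list above):

```bash
# Deploy standalone vLLM instead of going through llm-d-deployer.
./setup/standup.sh -t standalone

# Equivalently, via the environment variable.
export LLMDBENCH_DEPLOY_METHODS=standalone
./setup/standup.sh
```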
All available models are listed and controlled by the variable LLMDBENCH_DEPLOY_MODEL_LIST. The value of this environment variable can be overridden by the parameter -m/--model (applicable to both teardown.sh and standup.sh).
Warning
At this time, only one model can be deployed at a time.
Tip
The following aliases can be used in place of the full model name, for convenience:
- llama-3b -> meta-llama/Llama-3.2-3B-Instruct
- llama-8b -> meta-llama/Llama-3.1-8B-Instruct
- llama-70b -> meta-llama/Llama-3.1-70B-Instruct
- llama-17b -> RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic
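For example, using one of the aliases above (-m also accepts the full model name):

```bash
# Stand up the stack with Llama 3.2 3B Instruct, selected by alias.
./setup/standup.sh -m llama-3b

# The same flag works for teardown.
./setup/teardown.sh -m llama-3b
```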
All variables relevant to a particular experiment are stored in a "scenario" (an aptly named folder).
The expectation is that an experiment is run by initially executing:
source scenario/<scenario name>
At this point, with all the environment variables set (tip: env | grep ^LLMDBENCH_ | sort will list them), you should be ready to deploy and test:
./setup/standup.sh
Note
The scenario can also be indicated as part of the command line options for standup.sh (e.g., ./setup/standup.sh -c ocp_H100MIG_deployer_llama-3b).
To re-execute only individual steps (full name or number):
./setup/standup.sh --step 08_smoketest.sh
./setup/standup.sh -s 7
./setup/standup.sh -s 3-5
./setup/standup.sh -s 5,7
Once llm-d is fully deployed, an experiment can be run:
./run.sh
Important
This command will run an experiment, collect data, and perform an initial analysis (generating statistics and plots). One can go straight to the analysis by adding the option -z/--skip to the above command.
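For example, to re-run only the analysis on data already collected under LLMDBENCH_CONTROL_WORK_DIR:

```bash
# Skip the experiment itself and go straight to statistics and plots.
./run.sh -z
```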
Note
The scenario can also be indicated as part of the command line options for run.sh (e.g., ./run.sh -c ocp_L40_standalone_llama-8b).
Finally, clean up everything:
./setup/teardown.sh
Note
The scenario can also be indicated as part of the command line options for teardown.sh (e.g., ./teardown.sh -c kubernetes_H200_deployer_llama-8b).
All the information collected inside the directory pointed to by the environment variable LLMDBENCH_CONTROL_WORK_DIR should be enough to allow others to reproduce the experiment with the same parameters. In particular, all the parameters - always exposed as environment variables - applied to the llm-d or vllm stacks can be found at ${LLMDBENCH_CONTROL_WORK_DIR}/environment/variables.
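For instance, to inspect the parameters that were captured for a given run:

```bash
# All variables that were in effect during the run, sorted for easy diffing.
sort "${LLMDBENCH_CONTROL_WORK_DIR}/environment/variables"
```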
A sample of the contents of ${LLMDBENCH_CONTROL_WORK_DIR} for a very simple experiment is shown here:
./analysis
./analysis/data
./analysis/data/stats.txt
./analysis/plots
./analysis/plots/latency_analysis.png
./analysis/plots/README.md
./analysis/plots/throughput_analysis.png
./setup
./setup/yamls
./setup/yamls/05_pvc_workload-pvc.yaml
./setup/yamls/pod_benchmark-launcher.yaml
./setup/yamls/05_b_service_access_to_fmperf_data.yaml
./setup/yamls/07_deployer_values.yaml
./setup/yamls/05_namespace_sa_rbac_secret.yaml
./setup/yamls/04_prepare_namespace_llama-3b.yaml
./setup/yamls/05_a_pod_access_to_fmperf_data.yaml
./setup/yamls/03_cluster-monitoring-config_configmap.yaml
./setup/commands
./setup/commands/1748350741979704000_command.log
...
./setup/commands/1748350166902915000_command.log
./setup/sed-commands
./results
./results/LMBench_short_input_qps0.5.csv
./results/pod_log_response.txt
./environment
./environment/context.ctx
./environment/variables
./workload
./workload/harnesses
./workload/profiles
./workload/profiles/sanity_short-input.yaml
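A quick way to look at the results of a finished run, using the paths from the sample layout above:

```bash
# Summary statistics and the list of generated plots.
cat "${LLMDBENCH_CONTROL_WORK_DIR}/analysis/data/stats.txt"
ls "${LLMDBENCH_CONTROL_WORK_DIR}/analysis/plots/"
```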
As of today, observability via Grafana dashboards is considered to be outside the scope of llm-d-benchmark. Please refer to the installation guide for llm-d-deployer for instructions on how to enable it.
These automatically generated plots were used to showcase the difference between a baseline vLLM deployment and llm-d (for the models Llama 4 Scout and Llama 3.1 70B).
For a simplified workflow that includes analysis of benchmark results, check out the quickstart-existing-stack-benchmark
launcher. This workflow provides:
- Easy deployment and execution of benchmarks on Kubernetes
- Support for comparing multiple LLM models
- Generation of comprehensive performance visualizations
- Single Model Benchmark: Run benchmarks for a single model with automated analysis
  - See Single Model Quickstart for details
- Multi-Model Comparison: Compare performance across multiple LLM models
  - See Multi-Model Comparison Quickstart for details
To get started, navigate to the quickstart-existing-stack-benchmark
directory and follow the instructions in the respective README files.
- Instructions on how to contribute, including details on our development process and governance.
- We use Slack to discuss development across organizations. Please join: Slack. There is a sig-benchmarking channel there.
This project is licensed under Apache License 2.0. See the LICENSE file for details.