Benchmarking

There are two main metrics of interest: the time to process a prompt (relevant for long prompts), and the time to generate each subsequent token once the initial prompt has been processed.
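As a rough illustration of how the two measurements differ, the following is a minimal sketch; process_prompt and generate_next_token are hypothetical placeholders standing in for the actual model calls made by the llama example, not functions from the codebase.

use std::time::Instant;

// Hypothetical placeholders marking where the model work would happen.
fn process_prompt(_prompt: &str) {
    // forward pass over the whole prompt
}

fn generate_next_token() {
    // forward pass + sampling for a single token
}

fn main() {
    let prompt = "the answer to life in the universe and everything is";

    // Prompt processing time: one pass over the full prompt.
    let start = Instant::now();
    process_prompt(prompt);
    println!("prompt processing: {:.2?}", start.elapsed());

    // Subsequent per-token time: average the generation loop over a few tokens.
    let n_tokens: u32 = 20;
    let start = Instant::now();
    for _ in 0..n_tokens {
        generate_next_token();
    }
    println!("per token: {:.2?}", start.elapsed() / n_tokens);
}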

Prompt Processing Time

Subsequent Per Token Time

CPU Benchmarking

The following command can be used to benchmark the per-token generation time. Note that this uses f16 and a single thread (both the OpenMP and Rayon thread pools are limited to one thread via the environment variables).

OMP_NUM_THREADS=1 RAYON_NUM_THREADS=1 cargo run --release --example llama -- \
    --cpu --npy llama.npz --prompt "the answer to life in the universe and everything is"

On a Ryzen 5 2600X, this results in a time of ~2s per token (see the associated flamegraph).
