Benchmarking

There are two main metrics of interest: the time to process a prompt (relevant for long prompts), and the time to generate each subsequent token once the initial prompt has been processed.
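As a rough illustration of how the two measurements differ, the following is a minimal sketch; process_prompt and generate_next_token are hypothetical placeholders standing in for the actual model calls made by the llama example, not functions from the codebase.

use std::time::Instant;

// Hypothetical placeholders marking where the model work would happen.
fn process_prompt(_prompt: &str) {
    // forward pass over the whole prompt
}

fn generate_next_token() {
    // forward pass + sampling for a single token
}

fn main() {
    let prompt = "the answer to life in the universe and everything is";

    // Prompt processing time: one pass over the full prompt.
    let start = Instant::now();
    process_prompt(prompt);
    println!("prompt processing: {:.2?}", start.elapsed());

    // Subsequent per-token time: average the generation loop over a few tokens.
    let n_tokens: u32 = 20;
    let start = Instant::now();
    for _ in 0..n_tokens {
        generate_next_token();
    }
    println!("per token: {:.2?}", start.elapsed() / n_tokens);
}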

Prompt Processing Time

Subsequent Per Token Time

CPU Benchmarking

The following command can be used to benchmark the per-token generation time. Note that this uses f16 and a single thread (both the OpenMP and Rayon thread pools are limited to one thread via the environment variables).

OMP_NUM_THREADS=1 RAYON_NUM_THREADS=1 cargo run --release --example llama -- \
    --cpu --npy llama.npz --prompt "the answer to life in the universe and everything is"

On a Ryzen 5 2600X, this results in a time of ~2s per token (see the associated flamegraph).
