Hi @hemingkx,
Thanks for this benchmark, it's super easy to get started with.
I am running an experiment with speculative decoding (SpS), using Llama-2-7b as the target model and a compressed version of it as the draft model. I was able to run the codebase successfully and now have two JSONL files: one for the base model's autoregressive decoding and one for SpS.
Commands I used to generate the files (all default settings: temperature 0, float16, etc.):
## Baseline
CUDA_VISIBLE_DEVICES=0 python -m evaluation.inference_baseline --model-path $Base_PATH --model-id ${MODEL_NAME}-vanilla-${torch_dtype}-temp-${TEMP} --bench-name $bench_NAME --temperature $TEMP --dtype $torch_dtype
## Speculative decoding; assume I have a compressed version of the target model as the drafter
CUDA_VISIBLE_DEVICES=0 python -m evaluation.inference_sps --model-path $Base_PATH --drafter-path $Drafter_PATH --model-id ${Drafter_PATH}-${torch_dtype}-temp-${TEMP} --bench-name $bench_NAME --temperature $TEMP --dtype $torch_dtype
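For context, this is the draft-then-verify loop I understand inference_sps to be timing. It's only a sketch of generic greedy (temperature-0) speculative sampling, not the repo's actual implementation; `target` and `drafter` stand for HF-style causal LMs that return `.logits`, `gamma` is the number of draft tokens per step, and batch size 1 is assumed:

```python
import torch

@torch.no_grad()
def sps_generate(target, drafter, input_ids, gamma=5, max_new_tokens=256):
    prompt_len = input_ids.shape[-1]
    while input_ids.shape[-1] - prompt_len < max_new_tokens:
        # 1) Drafter proposes gamma tokens autoregressively (gamma cheap passes).
        draft = input_ids
        for _ in range(gamma):
            next_tok = drafter(draft).logits[:, -1].argmax(-1, keepdim=True)
            draft = torch.cat([draft, next_tok], dim=-1)
        # 2) Target verifies all proposals in a single forward pass
        #    (no KV cache here, for clarity).
        logits = target(draft).logits
        verified = logits[:, input_ids.shape[-1] - 1:-1].argmax(-1)
        proposed = draft[:, input_ids.shape[-1]:]
        # 3) Keep the longest matching prefix, plus one token from the target.
        n_accept = int((verified == proposed).long().cumprod(-1).sum())
        if n_accept == gamma:  # all proposals accepted: take a bonus token
            bonus = logits[:, -1].argmax(-1, keepdim=True)
        else:                  # first mismatch: take the target's correction
            bonus = verified[:, n_accept:n_accept + 1]
        input_ids = torch.cat([input_ids, proposed[:, :n_accept], bonus], dim=-1)
    return input_ids
```

The point being: every generated token still pays for the drafter's forward passes on top of the target's verification pass, which is why the drafter's cost matters for the numbers below.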
Now, when I ran the speed comparison with the command below, I got the metrics I needed, but I have some confusion (and/or have hit a limitation):
python evaluation/speed.py --file-path speculative_decoded_output.jsonl --base-path data/spec_bench/model_answer/llama-2-7b-vanilla-float16-temp-0.0.jsonl --tokenizer-path meta-llama/Llama-2-7b-chat-hf
============================== Task: mt_bench ==============================
#Mean accepted tokens: 3.4082441537851764
Tokens per second: 44.742442571612415
Tokens per second for the baseline: 72.6118430396252
Speedup ratio: 0.6161865709316302
============================== Task: translation ==============================
#Mean accepted tokens: 3.0985100649865274
Tokens per second: 45.278612095378655
Tokens per second for the baseline: 73.46145049723718
Speedup ratio: 0.616358808448542
============================== Task: summarization ==============================
#Mean accepted tokens: 3.0398410434073773
Tokens per second: 41.48063751699865
Tokens per second for the baseline: 72.62313028604987
Speedup ratio: 0.5711766671804649
============================== Task: qa ==============================
#Mean accepted tokens: 3.0859972202918695
Tokens per second: 45.30407242996252
Tokens per second for the baseline: 73.28590707879088
Speedup ratio: 0.6181825979346256
============================== Task: math_reasoning ==============================
#Mean accepted tokens: 3.467482859941234
Tokens per second: 45.61103650979495
Tokens per second for the baseline: 73.51347244171974
Speedup ratio: 0.6204445932812468
============================== Task: rag ==============================
#Mean accepted tokens: 3.3533095723014257
Tokens per second: 41.75790560615537
Tokens per second for the baseline: 72.55333236627544
Speedup ratio: 0.5755477280539834
============================== Task: overall ==============================
#Mean accepted tokens: 3.2440101726676485
Tokens per second: 44.029117788317095
Tokens per second for the baseline: 73.00818928494972
Speedup ratio: 0.6030709461437564
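For reference, this is how I understand speed.py derives these numbers. It's a minimal sketch only, and the record field names (`new_tokens`, `wall_time`) are my assumptions about the schema, not necessarily what the script actually reads:

```python
import json

def tokens_per_second(path):
    """Aggregate throughput over all records in a model_answer .jsonl file."""
    total_tokens, total_time = 0, 0.0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            # Assumed fields: the count of newly generated tokens and the
            # wall-clock time spent generating them for each question.
            total_tokens += record["new_tokens"]
            total_time += record["wall_time"]
    return total_tokens / total_time

speedup = (tokens_per_second("speculative_decoded_output.jsonl")
           / tokens_per_second("data/spec_bench/model_answer/"
                               "llama-2-7b-vanilla-float16-temp-0.0.jsonl"))
```

Under that reading, the speedup ratio is just the ratio of the two throughputs, so the numbers above should be directly comparable.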
Question: I can see that the mean accepted tokens is > 1 on all the datasets, but the tokens per second and the speedup ratio for speculative decoding are far lower than for the baseline. I have tried this multiple times with different models/settings, and the same pattern shows up every time. I have no idea what the reason might be.
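One sanity check I tried is the standard speculative sampling cost model (a back-of-the-envelope sketch with hypothetical numbers, based on the analysis in Leviathan et al., not on this repo's code): each step costs roughly gamma drafter passes plus one target pass, so if the drafter's per-pass cost relative to the target, c, is not small, a speedup below 1 is possible even with mean accepted tokens above 3:

```python
# Back-of-the-envelope speculative decoding cost model (hypothetical numbers).
def expected_speedup(accepted: float, gamma: int, c: float) -> float:
    """Tokens gained per step divided by the step's cost, measured in
    target-forward-pass equivalents: gamma drafter passes (each costing c
    of a target pass) plus one target verification pass."""
    return accepted / (gamma * c + 1)

# With my measured overall mean acceptance (~3.24) and gamma = 5 draft tokens:
print(expected_speedup(3.24, gamma=5, c=0.1))  # cheap drafter  -> ~2.16x
print(expected_speedup(3.24, gamma=5, c=0.5))  # heavy drafter  -> ~0.93x
```

Since my drafter is a compressed version of the 7B target, its c may simply be too close to 1, but I'd love to confirm whether that explains the numbers above or whether something else is going on. Any thoughts and help would be super appreciated!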
P.S.: It's less likely that there is an issue with the speedup computation on your end, but I wanted to bring it to your attention.