
Possible issue with speed computation - Higher #Mean Accepted tokens but lower speedups. #29

@NamburiSrinath

Description

Hi @hemingkx,

Thanks for this benchmark, it's super easy to get started with.

I am running an experiment with speculative decoding (SpS), using Llama-2-7b as the target model and a compressed version of it as the draft model. I was able to run the codebase successfully and now have two JSONL files: one for the autoregressive baseline and one for SpS.

Commands I used to generate the files (all default settings: temperature 0, float16, etc.):

## Baseline
CUDA_VISIBLE_DEVICES=0 python -m evaluation.inference_baseline --model-path $Base_PATH --model-id ${MODEL_NAME}-vanilla-${torch_dtype}-temp-${TEMP} --bench-name $bench_NAME --temperature $TEMP --dtype $torch_dtype

## Speculative decoding; the drafter is a compressed version of the target model
CUDA_VISIBLE_DEVICES=0 python -m evaluation.inference_sps --model-path $Base_PATH --drafter-path $Drafter_PATH --model-id ${Drafter_PATH}-${torch_dtype}-temp-${TEMP} --bench-name $bench_NAME --temperature $TEMP --dtype $torch_dtype 

Next, I ran the speed comparison with the command below. I got the metrics I needed, but I have some confusion (and/or a possible limitation):

python evaluation/speed.py --file-path speculative_decoded_output.jsonl --base-path data/spec_bench/model_answer/llama-2-7b-vanilla-float16-temp-0.0.jsonl --tokenizer-path meta-llama/Llama-2-7b-chat-hf
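For context, here is roughly how I understand the tokens-per-second and speedup numbers to be computed: total generated tokens divided by total wall time for each JSONL, and then the ratio of the two. This is only a minimal sketch of my understanding, not the actual speed.py; the field names ("new_tokens", "wall_time") are my assumption about the JSONL schema.

```python
import json

def tokens_per_second(jsonl_path):
    """Total generated tokens divided by total generation wall time."""
    total_tokens, total_time = 0, 0.0
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            choice = record["choices"][0]
            # Assumed schema: per-turn token counts and wall times for each answer.
            total_tokens += sum(choice["new_tokens"])
            total_time += sum(choice["wall_time"])
    return total_tokens / total_time

sps_tps = tokens_per_second("speculative_decoded_output.jsonl")
base_tps = tokens_per_second("data/spec_bench/model_answer/llama-2-7b-vanilla-float16-temp-0.0.jsonl")
print("Speedup ratio:", sps_tps / base_tps)
```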

============================== Task:  mt_bench ==============================
#Mean accepted tokens:  3.4082441537851764
Tokens per second:  44.742442571612415
Tokens per second for the baseline:  72.6118430396252
Speedup ratio:  0.6161865709316302
============================== Task:  translation ==============================
#Mean accepted tokens:  3.0985100649865274
Tokens per second:  45.278612095378655
Tokens per second for the baseline:  73.46145049723718
Speedup ratio:  0.616358808448542
============================== Task:  summarization ==============================
#Mean accepted tokens:  3.0398410434073773
Tokens per second:  41.48063751699865
Tokens per second for the baseline:  72.62313028604987
Speedup ratio:  0.5711766671804649
============================== Task:  qa ==============================
#Mean accepted tokens:  3.0859972202918695
Tokens per second:  45.30407242996252
Tokens per second for the baseline:  73.28590707879088
Speedup ratio:  0.6181825979346256
============================== Task:  math_reasoning ==============================
#Mean accepted tokens:  3.467482859941234
Tokens per second:  45.61103650979495
Tokens per second for the baseline:  73.51347244171974
Speedup ratio:  0.6204445932812468
============================== Task:  rag ==============================
#Mean accepted tokens:  3.3533095723014257
Tokens per second:  41.75790560615537
Tokens per second for the baseline:  72.55333236627544
Speedup ratio:  0.5755477280539834
============================== Task:  overall ==============================
#Mean accepted tokens:  3.2440101726676485
Tokens per second:  44.029117788317095
Tokens per second for the baseline:  73.00818928494972
Speedup ratio:  0.6030709461437564 
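For what it's worth, the reported speedup ratios look internally consistent with the tokens-per-second numbers; for the overall row, for example:

$$
\text{speedup} = \frac{44.03}{73.01} \approx 0.603
$$

So the ratio computation itself looks fine; it is the absolute SpS throughput that is low.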

Question: The mean number of accepted tokens is > 1 on every dataset, yet the tokens per second and the speedup ratio for speculative decoding are much lower than the baseline. I have tried this multiple times with different models/settings, and the pattern holds consistently.

I can't figure out why this happens. Any thoughts or help would be greatly appreciated!

P.S.: It is unlikely that there is an issue with the speedup computation on your end, but I wanted to bring it to your attention.
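For reference, my rough mental model (following the cost analysis in the original speculative decoding paper by Leviathan et al., under its simplifying assumptions) is

$$
\text{speedup} \approx \frac{\tau}{\gamma c + 1},
$$

where $\tau$ is the mean number of tokens produced per target forward pass (roughly the "#Mean accepted tokens" above), $\gamma$ is the number of tokens drafted per step, and $c$ is the per-token latency of the drafter relative to the target. With $\tau \approx 3.2$, a speedup below 1 would require $\gamma c + 1 > 3.2$, i.e. the drafting overhead outweighing the accepted tokens (which could happen if the compressed drafter is not much faster than the 7B target), but I am not sure whether that is what is going on here, so please correct me if this model does not apply.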
