Hi @hemingkx,
Thanks for this benchmark, it's super easy to get started with.
I am running an experiment with speculative decoding (SpS), using Llama-2-7b as the target model and a compressed version of it as the draft model. I was able to run the codebase successfully and now have two JSONL files: one for the base model's autoregressive decoding and one for SpS.
Commands I used to generate the files (all default settings: temperature 0, float16, etc.):
## Baseline
CUDA_VISIBLE_DEVICES=0 python -m evaluation.inference_baseline --model-path $Base_PATH --model-id ${MODEL_NAME}-vanilla-${torch_dtype}-temp-${TEMP} --bench-name $bench_NAME --temperature $TEMP --dtype $torch_dtype
## Speculative decoding; assume I have a compressed version of the target model as the drafter
CUDA_VISIBLE_DEVICES=0 python -m evaluation.inference_sps --model-path $Base_PATH --drafter-path $Drafter_PATH --model-id ${Drafter_PATH}-${torch_dtype}-temp-${TEMP} --bench-name $bench_NAME --temperature $TEMP --dtype $torch_dtype
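For context, this is the draft-then-verify loop I understand inference_sps to be timing. It's only a sketch of generic greedy (temperature-0) speculative sampling, not the repo's actual implementation; `target` and `drafter` stand for HF-style causal LMs that return `.logits`, `gamma` is the number of draft tokens per step, and batch size 1 is assumed:

```python
import torch

@torch.no_grad()
def sps_generate(target, drafter, input_ids, gamma=5, max_new_tokens=256):
    prompt_len = input_ids.shape[-1]
    while input_ids.shape[-1] - prompt_len < max_new_tokens:
        # 1) Drafter proposes gamma tokens autoregressively (gamma cheap passes).
        draft = input_ids
        for _ in range(gamma):
            next_tok = drafter(draft).logits[:, -1].argmax(-1, keepdim=True)
            draft = torch.cat([draft, next_tok], dim=-1)
        # 2) Target verifies all proposals in a single forward pass
        #    (no KV cache here, for clarity).
        logits = target(draft).logits
        verified = logits[:, input_ids.shape[-1] - 1:-1].argmax(-1)
        proposed = draft[:, input_ids.shape[-1]:]
        # 3) Keep the longest matching prefix, plus one token from the target.
        n_accept = int((verified == proposed).long().cumprod(-1).sum())
        if n_accept == gamma:  # all proposals accepted: take a bonus token
            bonus = logits[:, -1].argmax(-1, keepdim=True)
        else:                  # first mismatch: take the target's correction
            bonus = verified[:, n_accept:n_accept + 1]
        input_ids = torch.cat([input_ids, proposed[:, :n_accept], bonus], dim=-1)
    return input_ids
```

The point being: every generated token still pays for the drafter's forward passes on top of the target's verification pass, which is why the drafter's cost matters for the numbers below.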
Now, when I ran the speed comparison with the command below, I got the metrics I needed, but I have some confusion (and/or have hit a limitation):
python evaluation/speed.py --file-path speculative_decoded_output.jsonl --base-path data/spec_bench/model_answer/llama-2-7b-vanilla-float16-temp-0.0.jsonl --tokenizer-path meta-llama/Llama-2-7b-chat-hf
============================== Task: mt_bench ==============================
#Mean accepted tokens: 3.4082441537851764
Tokens per second: 44.742442571612415
Tokens per second for the baseline: 72.6118430396252
Speedup ratio: 0.6161865709316302
============================== Task: translation ==============================
#Mean accepted tokens: 3.0985100649865274
Tokens per second: 45.278612095378655
Tokens per second for the baseline: 73.46145049723718
Speedup ratio: 0.616358808448542
============================== Task: summarization ==============================
#Mean accepted tokens: 3.0398410434073773
Tokens per second: 41.48063751699865
Tokens per second for the baseline: 72.62313028604987
Speedup ratio: 0.5711766671804649
============================== Task: qa ==============================
#Mean accepted tokens: 3.0859972202918695
Tokens per second: 45.30407242996252
Tokens per second for the baseline: 73.28590707879088
Speedup ratio: 0.6181825979346256
============================== Task: math_reasoning ==============================
#Mean accepted tokens: 3.467482859941234
Tokens per second: 45.61103650979495
Tokens per second for the baseline: 73.51347244171974
Speedup ratio: 0.6204445932812468
============================== Task: rag ==============================
#Mean accepted tokens: 3.3533095723014257
Tokens per second: 41.75790560615537
Tokens per second for the baseline: 72.55333236627544
Speedup ratio: 0.5755477280539834
============================== Task: overall ==============================
#Mean accepted tokens: 3.2440101726676485
Tokens per second: 44.029117788317095
Tokens per second for the baseline: 73.00818928494972
Speedup ratio: 0.6030709461437564
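For reference, this is how I understand speed.py derives these numbers. It's a minimal sketch only, and the record field names (`new_tokens`, `wall_time`) are my assumptions about the schema, not necessarily what the script actually reads:

```python
import json

def tokens_per_second(path):
    """Aggregate throughput over all records in a model_answer .jsonl file."""
    total_tokens, total_time = 0, 0.0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            # Assumed fields: the count of newly generated tokens and the
            # wall-clock time spent generating them for each question.
            total_tokens += record["new_tokens"]
            total_time += record["wall_time"]
    return total_tokens / total_time

speedup = (tokens_per_second("speculative_decoded_output.jsonl")
           / tokens_per_second("data/spec_bench/model_answer/"
                               "llama-2-7b-vanilla-float16-temp-0.0.jsonl"))
```

Under that reading, the speedup ratio is just the ratio of the two throughputs, so the numbers above should be directly comparable.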
Question: I can see that the mean accepted tokens is > 1 on all the datasets, but the tokens per second and the speedup ratio for speculative decoding are far lower than for the baseline. I have tried this multiple times with different models/settings, and the same pattern shows up every time. I have no idea what the reason might be.
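One sanity check I tried is the standard speculative sampling cost model (a back-of-the-envelope sketch with hypothetical numbers, based on the analysis in Leviathan et al., not on this repo's code): each step costs roughly gamma drafter passes plus one target pass, so if the drafter's per-pass cost relative to the target, c, is not small, a speedup below 1 is possible even with mean accepted tokens above 3:

```python
# Back-of-the-envelope speculative decoding cost model (hypothetical numbers).
def expected_speedup(accepted: float, gamma: int, c: float) -> float:
    """Tokens gained per step divided by the step's cost, measured in
    target-forward-pass equivalents: gamma drafter passes (each costing c
    of a target pass) plus one target verification pass."""
    return accepted / (gamma * c + 1)

# With my measured overall mean acceptance (~3.24) and gamma = 5 draft tokens:
print(expected_speedup(3.24, gamma=5, c=0.1))  # cheap drafter  -> ~2.16x
print(expected_speedup(3.24, gamma=5, c=0.5))  # heavy drafter  -> ~0.93x
```

Since my drafter is a compressed version of the 7B target, its c may simply be too close to 1, but I'd love to confirm whether that explains the numbers above or whether something else is going on. Any thoughts and help would be super appreciated!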
P.S.: It's less likely that there is an issue with the speedup computation on your end, but I wanted to bring it to your attention.