Poor/inconsistent results from Phi-3-mini-128k

Phi-3-mini-128k runs, but the result that I'm seeing from TGI are not very good and inconsistent with other inference methods.

For each result below, I supply the same pairwise compare prompt:
```
<|user|>
Consider this information about Johnny: 

Johnny likes apples.  Johnny hates peanuts.

Now consider these options:

Option A: Pecan Pie
Option B: Peach Pie

Consider the information and brainstorm on items Johnny might like.  Be thorough and consider all possibilities.

Next, use the brainstorm to pick which option Johnny would prefer.

End your response with the option you choose (Either **Option A** or **Option B**).<|end|>
<|assistant|>
```

## Running on "Other methods"
I get great results from NIM https://build.nvidia.com/microsoft/phi-3-mini:
```
Given the information, we know Johnny likes apples and hates peanuts. Let's brainstorm on items Johnny might like based on this information:

Apple-based desserts: Since Johnny likes apples, he might enjoy desserts that contain apples. This could include apple pies, apple crumble, apple tarts, apple crisp, apple cake, etc.

Non-peanut based desserts: Since Johnny hates peanuts, he would likely not enjoy any desserts that contain peanuts. This eliminates any desserts that have peanuts as an ingredient, such as peanut butter pie, peanut brittle, peanut cookies, etc.

Now, let's consider the two options:

Option A: Pecan Pie - Pecan pie is a dessert that contains pecans, which are a type of nut. Although Johnny doesn't have a specific dislike for all nuts, he does hate peanuts. It's possible that he might not like pecans as well, but we don't have enough information to be certain.

Option B: Peach Pie - Peach pie is a dessert that contains peaches, which are a type of fruit. Since Johnny likes apples, it's reasonable to assume that he might also enjoy other fruits, such as peaches.

Based on the information provided and the brainstorming, Option B: Peach Pie seems to be the more likely choice for Johnny, as it aligns more closely with his preference for fruit (apples) and doesn't contain any peanuts.
```

I get great results from `Transformers` (running `python` on the TGI:2.0.4 container):
```
Given Johnny's preferences, we know he likes apples and hates peanuts. Let's brainstorm on items Johnny might like based on this information:

1. Apple-based desserts or dishes: Since Johnny likes apples, he might enjoy any dish or dessert that features apples as a primary ingredient. This could include apple pies, apple crisps, apple tarts, apple cakes, apple sauce, and more.

2. Non-peanut-based desserts or dishes: Johnny hates peanuts, so he would likely not enjoy any dish or dessert that contains peanuts or peanut-based ingredients. This eliminates any desserts or dishes that include peanuts, such as peanut butter cookies, peanut butter pies, or peanut brittle.

Now, let's consider the two options provided:

Option A: Pecan Pie
Pecan pie is a dessert that contains pecans, which are a type of nut. Although Johnny doesn't have a stated preference for or against nuts in general, we know he hates peanuts. Since pecans are different from peanuts, it's possible that Johnny might not have an issue with pecan pie. However, without more information about Johnny's feelings towards other nuts, we can't definitively say he would prefer this option.

Option B: Peach Pie
Peach pie is a dessert that features peaches as the primary fruit ingredient. Given that Johnny likes apples, it's reasonable to assume that he might also enjoy other fruits, such as peaches. Since there are no peanuts or peanut-based ingredients in peach pie, this option aligns well with Johnny's preferences.

Based on the information provided and the brainstorming process, **Option B: Peach Pie** seems to be the more likely choice for Johnny, as it aligns with his preference for fruits (apples and potentially peaches) and his dislike for peanuts.
```

## Running on TGI 2.0.4
I get okay-ish results when I run TGI:2.0.4 with the default parameters:
```
Given Johnny's preferences, we know he likes apples and dislikes peanuts. Neither pecan nor peach pies directly relate to apples, but we can infer preferences based on the given information.

- **Option A: Pecan Pie** - Pecans are not directly related to apples, but pecans are a type of nut, and Johnny hates peanuts, which are also nuts. This might make him less likely to prefer a pecan pie due to his aversion to nuts.

- **Option B: Peach Pie** - Peaches are fruits, and there's no direct relation to apples in the information provided. However, since there's no negative association mentioned, it's a safer guess that Johnny might not have an issue with peaches.

Given the information, **Option B: Peach Pie** seems to be the safer choice, as it doesn't directly conflict with any of Johnny's stated preferences.
```

I get poor (but fluent) results when I use increase the token sizes to 5k (input) 7k (total) (but using the same prompt):
```
Given the information provided, Johnny likes apples and dislikes peanuts. Neither pecan nor peach is directly related to apples or peanuts. However, if we consider the options as food items, a peach pie could be more closely related to apples as both are fruits. Therefore, Johnny might have a slight preference for the peach pie. But this is a very weak preference and not a strong preference. The information given does not strongly indicate a clear preference for either option. However, if we must choose one, Option B: Peach Pie could be a slightly more likely choice.
 ```


### Information

- [X] Docker
- [ ] The CLI directly

### Tasks

- [X] An officially supported command
- [ ] My own modifications

### Reproduction

Note: I'm running these commands on an AWS g6.2xlarge instance

### Reproducing the "Transformers" results

1. Run python on a TGI container
```
docker run -it --rm --name tgi --gpus all --shm-size 2g  \
    --entrypoint python \
    ghcr.io/huggingface/text-generation-inference:2.0.4 
```

2. Run this python code:
```
prompt = """<|user|>
Consider this information about Johnny: 

Johnny likes apples.  Johnny hates peanuts.

Now consider these options:

Option A: Pecan Pie
Option B: Peach Pie

Consider the information and brainstorm on items Johnny might like.  Be thorough and consider all possibilities.

Next, use the brainstorm to pick which option Johnny would prefer.

End your response with the option you choose (Either **Option A** or **Option B**).<|end|>
<|assistant|>"""

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-3-mini-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    device_map="cuda",
    torch_dtype="auto", 
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "do_sample": False,
}

output = pipe(prompt, **generation_args)
print(output[0]['generated_text'])
```

### Reproducing the "TGI" results
1. Run TGI
Either:
```
docker run -d --rm --name tgi -p 8080:80 --gpus all --shm-size 2g \
    ghcr.io/huggingface/text-generation-inference:2.0.4 \
    --model-id microsoft/Phi-3-mini-128k-instruct 
```
or
```
docker run -d --rm --name tgi -p 8080:80 --gpus all --shm-size 2g \
    ghcr.io/huggingface/text-generation-inference:2.0.4 \
    --model-id microsoft/Phi-3-mini-128k-instruct \
    --max-batch-prefill-tokens=5000 --max-total-tokens=7000 --max-input-tokens=5000
```
2. Query the endpoint and print the results
```
curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
        "inputs": "Consider this information about Johnny:\n\nJohnny likes apples. Johnny hates peanuts.\n\nNow consider these options:\n\nOption A: Pecan Pie\nOption B: Peach Pie\n\nConsider the information and brainstorm on items Johnny might like. Be thorough and consider all possibilities.\n\nNext, use the brainstorm to pick which option Johnny would prefer.\n\nEnd your response with the option you choose (Either **Option A** or **Option B**).",
        "parameters": {
            "max_new_tokens": 1000,
            "do_sample": false
        }
    }' \
    -H 'Content-Type: application/json' | jq -r '.generated_text'
```

### Expected behavior

I expect TGI to produce results consistent with other inference servers and Transformers

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Poor/inconsistent results from Phi-3-mini-128k #2055

Running on "Other methods"

Running on TGI 2.0.4

Information

Tasks

Reproduction

Reproducing the "Transformers" results

Reproducing the "TGI" results

Expected behavior

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Poor/inconsistent results from Phi-3-mini-128k #2055

Description

Running on "Other methods"

Running on TGI 2.0.4

Information

Tasks

Reproduction

Reproducing the "Transformers" results

Reproducing the "TGI" results

Expected behavior

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions