Generation stopped too early without hitting stop condition #223

minmin-intel · 2024-09-18T18:49:35Z

System Info

tgi-gaudi 2.0.4
Used below docker compose yaml to launch tgi-gaudi
Serve llama3.1-70B-instruct model
--top_k 10
--max_new_tokens 8192
--temperature 0.01
--top_p 0.95
--repetition_penalty 1.03
--return_full_text false

services:
tgi-service:
image: ghcr.io/huggingface/tgi-gaudi:2.0.4
container_name: tgi-server
ports:
- "8085:80"
volumes:
- ${HF_CACHE_DIR}:/data
environment:
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
HF_HUB_DISABLE_PROGRESS_BARS: 1
HF_HUB_ENABLE_HF_TRANSFER: 0
HABANA_VISIBLE_DEVICES: 0,1,2,3
OMPI_MCA_btl_vader_single_copy_mechanism: none
PT_HPU_ENABLE_LAZY_COLLECTIVES: true
runtime: habana
cap_add:
- SYS_NICE
ipc: host
command: --model-id ${LLM_MODEL_ID} --max-input-length 8192 --max-total-tokens 16384 --sharded true --num-shard 4

Information

Docker
The CLI directly

Tasks

An officially supported command
My own modifications

Reproduction

Use this script to reproduce: https://github.com/minmin-intel/GenAIComps/blob/ragagent-v1.1-dev/comps/agent/langchain/test_llama.sh

An example output from tgi-gaudi:
"To answer this question, let's review the execution history. We find two relevant pieces of information:

The Foo Fighters headlined the Reading and Leeds festivals in 2012.
The Foo Fighters have headlined the Reading and Leeds festivals at least once, in 2019.

However, we can see that the knowledge cutoff is 2019, and the question was asked in 2024. So, there might be information that is not available in the execution history.

From the"

The generation stopped at "From the" without hitting max_new_tokens = 4096.

Another example output from tgi-gaudi:
Unfortunately, the tools available do not directly provide information about the number of number one hits on the US Billboard Hot 100 chart for specific artists. To answer this question accurately, I would need to know this information. However, I can try to find a workaround.

My approach is to search the knowledge base for Michael Jackson's number one hits and Elvis Presley's number one hits and then compare the counts.

{"tool":"search_knowledge_base", "args":{"query": "Michael Jackson number one

The generation stopped in the middle of an incomplete json generation.

Expected behavior

generation should continue till max new tokens or hit an apparent stop token.

minmin-intel · 2024-09-18T21:16:51Z

setting TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true does not solve this issue

minmin-intel · 2024-09-20T16:23:44Z

Also tgi-gaudi 2.0.5 has the same issue.

yuanwu2017 · 2024-10-17T02:01:47Z

Try this PR 8ae5d4c.

minmin-intel · 2024-10-23T21:56:09Z

@yuanwu2017 I tried the habana-main branch that has the PR your mentioned above, but I got the same error. The generation stopped too early. Example below, the generation stopped in the middle of generating a SQL query.

Example generation that stopped too early is below, I used 4k max input, 8k max total:

To answer this question, we need to find the school with the lowest average score in reading in Southern California, and then find its telephone number.

However, Southern California is not explicitly defined in the given tables. We can consider counties like Los Angeles, Ventura, Orange, San Diego, and Imperial as part of Southern California for this problem.

Here is the SQL query to solve the problem:

SELECT T2 Phone, T2 Ext
FROM satscores AS T1
INNER JOIN

yuanwu2017 · 2024-10-24T06:30:26Z

You can see that the outputs have different length for different max_new_tokens. You can use the https://platform.openai.com/tokenizer to count the tokens. The tokenizer is different, but the result is similar.

model=meta-llama/Meta-Llama-3.1-70B-Instruct

docker run -p $port:80 \
   --runtime=habana \
   -v $volume:/data \
   -e HABANA_VISIBLE_DEVICES=3,4,5,6 \
   -e HUGGING_FACE_HUB_TOKEN=$hf_token \
   -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
   -e http_proxy=${http_proxy}     -e https_proxy=${https_proxy} -e no_proxy=${no_proxy} \
   -e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true \
   -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
   -e MAX_TOTAL_TOKENS=16384 \
   -e BATCH_BUCKET_SIZE=8 \
   -e PREFILL_BATCH_BUCKET_SIZE=2 \
   -e PAD_SEQUENCE_TO_MULTIPLE_OF=64 \
   -e ENABLE_HPU_GRAPH=true \
   -e LIMIT_HPU_GRAPH=true \
   -e USE_FLASH_ATTENTION=true \
   -e FLASH_ATTENTION_RECOMPUTE=true \
   --cap-add=sys_nice \
   --ipc=host \
   $image \
   --model-id $model \
   --sharded true --num-shard 4 \
   --max-input-length 8192 --max-total-tokens 16384 \
   --max-batch-prefill-tokens 16384 --max-batch-total-tokens 262144 \
   --max-waiting-tokens 7 --waiting-served-ratio 1.2 --max-concurrent-requests 512

minmin-intel · 2024-10-28T16:57:19Z

@yuanwu2017 I tried the same docker commands as the one you provided above with tgi-gaudi 2.0.5 and tgi-gaudi built from source using the habana-main branch. I still got the stop too early problem. Below is the code snippet that I used. I used the ChatHuggingFace API in Langchain to send requests to tgi-gaudi. I used the llama3.1-70B-instruct model running on 4 Gaudi2 cards.
---------------------Code snippet -------------------

def setup_tgi(args):
    from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint
 
    generation_params = {
        "max_new_tokens": args.max_new_tokens,
        "top_k": args.top_k,
        "top_p": args.top_p,
        "temperature": args.temperature,
        "repetition_penalty": args.repetition_penalty,
        "return_full_text": False,
        "streaming": False,
    }
 
    print(generation_params)
 
    llm = HuggingFaceEndpoint(
        endpoint_url=args.llm_endpoint_url,
        task="text-generation",
        **generation_params,
    )
 
    chat_model = ChatHuggingFace(llm=llm, model_id=args.model)
    return chat_model
 
if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, default="meta-llama/Llama-3.1-70B-Instruct")
    parser.add_argument("--llm_endpoint_url", type=str, default="http://localhost:8085")
    parser.add_argument("--max_new_tokens", type=int, default=8192)
    parser.add_argument("--top_k", type=int, default=10)
    parser.add_argument("--top_p", type=float, default=0.95)
    parser.add_argument("--temperature", type=float, default=0.01)
    parser.add_argument("--repetition_penalty", type=float, default=1.03)
 
    args = parser.parse_args()
 
    llm = setup_tgi(args)
    llm.invoke("Tell me about Socrates. Give me a long answer.")

----------------------------The versions of the library that I used:-----------------------------------------
langchain 0.3.4
langchain-community 0.3.3
langchain-core 0.3.12
langchain-huggingface 0.1.0
langchain-openai 0.2.3
langchain-text-splitters 0.3.0
langgraph 0.2.39
langgraph-checkpoint 2.0.1
langgraph-sdk 0.1.33
langsmith 0.1.136
huggingface-hub 0.26.1
langchain-huggingface 0.1.0

-------------------------Gaudi driver version ------------------------------
HL-SMI Version: hl-1.17.0-fw-51.3.0 |
| Driver Version: 1.17.0-28a11ca |

-----------------Observations:-------------------------
tgi-gaudi seemed to have bugs counting tokens with chat completion API.
I set the max new tokens to be 8k, but tgi-gaudi stopped generation at 100 tokens due to the reason "length". See logs below for running the code snippet above.

----------------- Log ---------------------------------
{'max_new_tokens': 8192, 'top_k': 10, 'top_p': 0.95, 'temperature': 0.01, 'repetition_penalty': 1.03, 'return_full_text': False, 'streaming': False}
Input to TGI:
[{'role': 'user', 'content': 'Tell me about Socrates. Give me a long answer.'}]

content='Socrates (469/470 BCE - 399 BCE) was a Greek philosopher from Athens, widely regarded as one of the founders of Western philosophy. He is best known for his contributions to the development of Western philosophy, particularly in the areas of ethics and epistemology. Through his method of questioning, known as the Socratic method, he encouraged critical thinking and explored the nature of knowledge, reality, and human existence.\n\nLife and Background\n\nSocrates was born in Athens, Greece, to' additional_kwargs={} response_metadata={'token_usage': ChatCompletionOutputUsage(
completion_tokens=100, prompt_tokens=5, total_tokens=105
), 'model': '',
'finish_reason': 'length'
} id='run-953a7ac4-ef26-44d3-9b27-92bd96946e11-0'

yuanwu2017 · 2024-10-31T12:02:39Z

langchain issue.
langchain-ai/langchain#27719

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generation stopped too early without hitting stop condition #223

Generation stopped too early without hitting stop condition #223

minmin-intel commented Sep 18, 2024 •

edited

Loading

minmin-intel commented Sep 18, 2024

minmin-intel commented Sep 20, 2024

yuanwu2017 commented Oct 17, 2024 •

edited

Loading

minmin-intel commented Oct 23, 2024 •

edited

Loading

yuanwu2017 commented Oct 24, 2024 •

edited

Loading

minmin-intel commented Oct 28, 2024 •

edited

Loading

yuanwu2017 commented Oct 31, 2024

Generation stopped too early without hitting stop condition #223

Generation stopped too early without hitting stop condition #223

Comments

minmin-intel commented Sep 18, 2024 • edited Loading

System Info

Information

Tasks

Reproduction

Expected behavior

minmin-intel commented Sep 18, 2024

minmin-intel commented Sep 20, 2024

yuanwu2017 commented Oct 17, 2024 • edited Loading

minmin-intel commented Oct 23, 2024 • edited Loading

yuanwu2017 commented Oct 24, 2024 • edited Loading

minmin-intel commented Oct 28, 2024 • edited Loading

-------------------------Gaudi driver version ------------------------------ HL-SMI Version: hl-1.17.0-fw-51.3.0 | | Driver Version: 1.17.0-28a11ca |

----------------- Log --------------------------------- {'max_new_tokens': 8192, 'top_k': 10, 'top_p': 0.95, 'temperature': 0.01, 'repetition_penalty': 1.03, 'return_full_text': False, 'streaming': False} Input to TGI: [{'role': 'user', 'content': 'Tell me about Socrates. Give me a long answer.'}]

yuanwu2017 commented Oct 31, 2024

minmin-intel commented Sep 18, 2024 •

edited

Loading

yuanwu2017 commented Oct 17, 2024 •

edited

Loading

minmin-intel commented Oct 23, 2024 •

edited

Loading

yuanwu2017 commented Oct 24, 2024 •

edited

Loading

minmin-intel commented Oct 28, 2024 •

edited

Loading

-------------------------Gaudi driver version ------------------------------
HL-SMI Version: hl-1.17.0-fw-51.3.0 |
| Driver Version: 1.17.0-28a11ca |

----------------- Log ---------------------------------
{'max_new_tokens': 8192, 'top_k': 10, 'top_p': 0.95, 'temperature': 0.01, 'repetition_penalty': 1.03, 'return_full_text': False, 'streaming': False}
Input to TGI:
[{'role': 'user', 'content': 'Tell me about Socrates. Give me a long answer.'}]