vLLM request crashes if cfg is specified #534

Closed
viktor-ferenczi opened this issue Jan 14, 2024 · 2 comments
viktor-ferenczi commented Jan 14, 2024

### Describe the issue as clearly as possible:

vLLM request crashes with 500 Internal Server Error if the cfg request parameter is specified.

### Steps/code to reproduce the bug:

Follow the vLLM tutorial:
https://outlines-dev.github.io/outlines/reference/vllm/

Since `outlines[serve]` installs old versions, install vLLM 0.2.6 and the latest outlines, plus pydantic 2.0, using pip. (See the actual list of package versions below.)

Host this model (24GB VRAM): deepseek-ai/deepseek-coder-6.7b-instruct

Command:

```shell
python -O -u -m outlines.serve.serve \
  --model=deepseek-ai/deepseek-coder-6.7b-instruct \
  --host=127.0.0.1 \
  --port=8000 \
  --max-model-len=16384 \
  --max-num-seqs=16 \
  --swap-space=8 \
  --gpu-memory-utilization=0.95
```

Request URL: http://127.0.0.1:8000/generate

Send a request with this body:

```json
{
  "prompt": "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite down the first 10 prime numbers as a comma separated list.\n\n### Response:\n",
  "n": 1,
  "best_of": 1,
  "presence_penalty": 0.0,
  "frequency_penalty": 0.0,
  "repetition_penalty": 1.0,
  "temperature": 0.0,
  "top_p": 1.0,
  "top_k": -1,
  "min_p": 0.0,
  "use_beam_search": false,
  "length_penalty": 1.0,
  "early_stopping": false,
  "stop": [],
  "stop_token_ids": [],
  "include_stop_str_in_output": false,
  "ignore_eos": false,
  "max_tokens": 50,
  "logprobs": null,
  "prompt_logprobs": null,
  "skip_special_tokens": true,
  "spaces_between_special_tokens": true,
  "cfg": "\\\n?start: DIGIT+ ( \",\" DIGIT+ )* _WS?\n%import common.DIGIT\n%import common.WS -> _WS\n"
}
```

The above request will crash with 500 Internal Server Error, see the server side exception below.
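For reference, the failing request can be reproduced with a short Python client. This is a sketch that assumes the server started above is listening on 127.0.0.1:8000; only the fields relevant to the crash are shown, and the actual POST is left commented out:

```python
import json
import urllib.request

# Minimal payload: any request that includes "cfg" triggers the 500 error
# on outlines 0.0.23.
payload = {
    "prompt": "Write down the first 10 prime numbers as a comma separated list.",
    "temperature": 0.0,
    "max_tokens": 50,
    "cfg": '\\\n?start: DIGIT+ ( "," DIGIT+ )* _WS?\n'
           '%import common.DIGIT\n%import common.WS -> _WS\n',
}
body = json.dumps(payload).encode()

req = urllib.request.Request(
    "http://127.0.0.1:8000/generate",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req)  # raises HTTPError: 500 Internal Server Error
```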

The same grammar has been tested OK with Lark standalone, without vLLM. The grammar includes optional whitespace at the end because this model seems to require it to stop.



### Expected result:

```shell
2,3,5,7,11,13,17,19,23,29
```

Also, please rename the cfg request parameter to grammar or lark_grammar, thanks!

### Error message:

```shell
INFO:     192.168.1.70:56873 - "POST /generate HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/routing.py", line 762, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/routing.py", line 782, in app
    await route.handle(scope, receive, send)
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/fastapi/routing.py", line 299, in app
    raise e
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/fastapi/routing.py", line 294, in app
    raw_response = await run_endpoint_function(
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/viktor/env/outlines/lib/python3.10/site-packages/outlines/serve/serve.py", line 75, in generate
    sampling_params = SamplingParams(
TypeError: SamplingParams.__init__() got an unexpected keyword argument 'cfg'
```

### Outlines/Python version information:

Package                   Version
------------------------- ------------
accelerate                0.26.1
aiofiles                  23.2.1
aiohttp                   3.9.1
aioprometheus             23.12.0
aiosignal                 1.3.1
altair                    5.2.0
annotated-types           0.6.0
anyio                     4.2.0
appdirs                   1.4.4
asttokens                 2.4.1
async-timeout             4.0.3
attrs                     23.2.0
beartype                  0.16.4
certifi                   2023.11.17
charset-normalizer        3.3.2
click                     8.1.7
cloudpickle               3.0.0
contourpy                 1.2.0
cycler                    0.12.1
docker-pycreds            0.4.0
exceptiongroup            1.2.0
fastapi                   0.109.0
ffmpy                     0.3.1
filelock                  3.13.1
fonttools                 4.47.2
frozenlist                1.4.1
fschat                    0.2.3
fsspec                    2023.12.2
gitdb                     4.0.11
GitPython                 3.1.41
gradio                    3.23.0
h11                       0.14.0
httpcore                  1.0.2
httptools                 0.6.1
httpx                     0.26.0
huggingface-hub           0.20.2
icontract                 2.6.6
idna                      3.6
interegular               0.3.3
Jinja2                    3.1.3
joblib                    1.3.2
jsonschema                4.20.0
jsonschema-specifications 2023.12.1
kiwisolver                1.4.5
lark                      1.1.9
linkify-it-py             2.0.2
llvmlite                  0.41.1
markdown-it-py            2.2.0
markdown2                 2.4.12
MarkupSafe                2.1.3
matplotlib                3.8.2
mdit-py-plugins           0.3.3
mdurl                     0.1.2
mpmath                    1.3.0
msgpack                   1.0.7
multidict                 6.0.4
nest-asyncio              1.5.8
networkx                  3.2.1
ninja                     1.11.1.1
numba                     0.58.1
numpy                     1.26.3
nvidia-cublas-cu12        12.1.3.1
nvidia-cuda-cupti-cu12    12.1.105
nvidia-cuda-nvrtc-cu12    12.1.105
nvidia-cuda-runtime-cu12  12.1.105
nvidia-cudnn-cu12         8.9.2.26
nvidia-cufft-cu12         11.0.2.54
nvidia-curand-cu12        10.3.2.106
nvidia-cusolver-cu12      11.4.5.107
nvidia-cusparse-cu12      12.1.0.106
nvidia-nccl-cu12          2.18.1
nvidia-nvjitlink-cu12     12.3.101
nvidia-nvtx-cu12          12.1.105
orjson                    3.9.10
outlines                  0.0.23
packaging                 23.2
pandas                    2.1.4
perscache                 0.6.1
pillow                    10.2.0
pip                       22.0.2
prompt-toolkit            3.0.43
protobuf                  4.25.2
psutil                    5.9.7
pyarrow                   14.0.2
pydantic                  2.5.3
pydantic_core             2.14.6
pydub                     0.25.1
Pygments                  2.17.2
pyparsing                 3.1.1
python-dateutil           2.8.2
python-dotenv             1.0.0
python-multipart          0.0.6
pytz                      2023.3.post1
PyYAML                    6.0.1
quantile-python           1.1
ray                       2.9.0
referencing               0.32.1
regex                     2023.12.25
requests                  2.31.0
rich                      13.7.0
rpds-py                   0.17.1
safetensors               0.4.1
scipy                     1.11.4
semantic-version          2.10.0
sentencepiece             0.1.99
sentry-sdk                1.39.2
setproctitle              1.3.3
setuptools                59.6.0
shortuuid                 1.0.11
six                       1.16.0
smmap                     5.0.1
sniffio                   1.3.0
starlette                 0.35.1
svgwrite                  1.4.3
sympy                     1.12
tokenizers                0.15.0
toolz                     0.12.0
torch                     2.1.2
tqdm                      4.66.1
transformers              4.36.2
triton                    2.1.0
typing_extensions         4.9.0
tzdata                    2023.4
uc-micro-py               1.0.2
urllib3                   2.1.0
uvicorn                   0.25.0
uvloop                    0.19.0
vllm                      0.2.6
wandb                     0.16.2
watchfiles                0.21.0
wavedrom                  2.0.3.post3
wcwidth                   0.2.13
websockets                12.0
xformers                  0.0.23.post1
yarl                      1.9.4

### Context for the issue:

Due to this bug it is not possible to use a grammar with vLLM at all.

I can only use vLLM, because it has the best throughput and correctness. GBNF grammars seem to have problems in llama.cpp, and its throughput is ~10x worse.

Grammar support would help avoid repeated queries caused by non-conforming LLM output, which is an essential efficiency improvement in most real-world tasks.

@viktor-ferenczi viktor-ferenczi changed the title The cfg request parameter crashes the vLLM request vLLM request crashes if cfg is specified Jan 14, 2024
viktor-ferenczi (Author) commented:

The root cause is that cfg is not implemented in version 0.0.23, only later on main in fde61a80. The instructions on the web page are misleading.
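On 0.0.23 the server forwards every request field straight into SamplingParams, hence the TypeError. A defensive pattern would be to filter out keys the constructor does not accept and handle them separately; this is a hypothetical sketch with a simplified stand-in class, not the actual serve.py or vllm.SamplingParams code:

```python
import inspect

class SamplingParams:
    """Stand-in for vllm.SamplingParams (simplified for illustration)."""
    def __init__(self, temperature=1.0, max_tokens=16):
        self.temperature = temperature
        self.max_tokens = max_tokens

def split_known_kwargs(cls, request_dict):
    # Keep only the keys cls.__init__ actually accepts; return the rest
    # (e.g. "cfg") for separate handling instead of crashing.
    accepted = set(inspect.signature(cls.__init__).parameters) - {"self"}
    known = {k: v for k, v in request_dict.items() if k in accepted}
    extra = {k: v for k, v in request_dict.items() if k not in accepted}
    return known, extra

request = {"temperature": 0.0, "max_tokens": 50, "cfg": "?start: DIGIT+"}
known, extra = split_known_kwargs(SamplingParams, request)
params = SamplingParams(**known)  # no TypeError; "cfg" is in extra
```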


rlouf commented Jan 16, 2024

I'm closing as I reverted this PR. #541 is another attempt at adding grammar-guided generation to Outlines.

What do you use grammar-guided generation for?

rlouf closed this as completed Jan 16, 2024