VLLM tensor-parallel and RegexLogitsProcessor #524
I have the same issue. Check the changes in this vLLM PR (not merged yet): it integrates grammar into vLLM itself and contains code which works with Ray, which may give some ideas on how to solve this. This is the specific code:

```python
if request.grammar:
    if engine.worker_use_ray:
        grammar_logits_processor = RayRemoteGrammarLogitsProcessor(
            tokenizer=tokenizer, grammar=request.grammar)
    else:
        grammar_logits_processor = GrammarLogitsProcessor(
            tokenizer=tokenizer, grammar=request.grammar)
    logits_processors = [grammar_logits_processor]
else:
    logits_processors = []
```

I guess we need a similar class here.
I'm considering closing the vllm PR and moving the work over here, since outlines has a more mature, fleshed-out implementation. I'm seeing a substantial performance degradation with it, though. I will experiment with the actor only taking the tokens as inputs and returning a boolean mask. This would prevent the tensor going back and forth through Ray. https://numpy.org/doc/stable/reference/generated/numpy.ma.MaskedArray.tobytes.html
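The mask-as-bytes idea can be sketched like this (a minimal illustration, not outlines' actual code; the vocabulary size and token ids are made up). The actor serializes a boolean mask with `tobytes()` and the caller rebuilds it with `np.frombuffer`, so only raw bytes cross the Ray boundary instead of a logits tensor:

```python
import numpy as np

VOCAB_SIZE = 32000  # hypothetical vocabulary size

# Actor side: boolean mask of allowed next tokens, shipped as raw bytes
# (one byte per vocabulary entry, no tensor serialization involved).
allowed = np.zeros(VOCAB_SIZE, dtype=np.bool_)
allowed[[2, 13, 29871]] = True      # made-up token ids permitted by the FSM
payload = allowed.tobytes()         # len(payload) == VOCAB_SIZE

# Caller side: rebuild the mask and apply it to the logits.
mask = np.frombuffer(payload, dtype=np.bool_)
logits = np.random.randn(VOCAB_SIZE).astype(np.float32)
logits[~mask] = -np.inf             # disallowed tokens can never be sampled
```

(The link above points at `MaskedArray.tobytes`; the same round-trip works for a plain boolean `ndarray`, as shown.)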
I agree with closing the vLLM ticket. I think vLLM needs another, much smaller PR. That should include only small changes to pass enough information to outlines, so no tricky patches are needed. Namely, vLLM needs to pass the sequence ID. Regarding Ray, it seems to be a good idea, indeed.
We can think of other ways to track the FSM state.
Logit processors aren't associated with one specific sequence, so this is challenging.

However, patching also isn't necessary if you have correct cached-state lookups. It's necessary to send a representation of the token sequence, even a minimal one. The actor can send back the bytes form of the mask.
This is necessary to get concurrent generation within a sequence group in vLLM working. Otherwise the generated tokens won't necessarily be added to the correct sequence. I got beam search working with outlines by using the token_ids as the FSM state key.
That explains why my attempt to use beam search with the outlines regex failed yesterday. I like the approach of taking a hash of the previous tokens.
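The token-ids-as-FSM-state-key idea can be sketched as follows (a toy FSM with made-up transitions, not the outlines implementation): caching state by the full token prefix lets several beams share one logits processor object without mixing states:

```python
from typing import Dict, Tuple

# Toy FSM: (state, token) -> next state; made-up transitions for illustration.
TRANSITIONS: Dict[Tuple[int, int], int] = {
    (0, 1): 1,
    (1, 2): 2,
    (2, 1): 1,
    (1, 3): 3,
}

# One shared cache, keyed by the token prefix itself (a hashable tuple),
# so every sequence/beam recovers its own FSM state.
state_cache: Dict[Tuple[int, ...], int] = {(): 0}

def fsm_state(token_ids: Tuple[int, ...]) -> int:
    if token_ids not in state_cache:
        prev = fsm_state(token_ids[:-1])        # walk back to a cached prefix
        state_cache[token_ids] = TRANSITIONS[(prev, token_ids[-1])]
    return state_cache[token_ids]

# Two beams share the prefix (1,) but diverge afterwards; each gets the
# right state even though the cache (like the processor) is shared.
assert fsm_state((1, 2)) == 2
assert fsm_state((1, 3)) == 3
```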
@viktor-ferenczi could you try this PR and tell me if it also works for you?
Sure, I will try this tomorrow.
Tested both; failed with errors.

Full tracebacks are available for both. The cause looks like `fsm_state` never being initialized on the `RegexLogitsProcessor`. Appending this line to its constructor avoids the crash:

```python
self.fsm_state: DefaultDict[int, int] = defaultdict(int)
```

But the constraint does not work as expected: it generates only 1-2 characters, then stops.

GPUs: 2x4090 (2x24GB)

vLLM command:

```
python -O -u -m outlines.serve.serve \
    --model=TheBloke/deepseek-coder-33B-instruct-AWQ \
    --quantization=awq \
    --dtype=float16 \
    --host=0.0.0.0 \
    --port=8000 \
    --max-model-len=16384 \
    --max-num-seqs=16 \
    --tensor-parallel-size=2 \
    --swap-space=8 \
    --gpu-memory-utilization=0.95 \
    --enforce-eager \
    --disable-log-requests
```

vLLM request:

```
{
    "prompt": "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite down the first 10 prime numbers as a comma separated list.\n\n### Response:\n",
    "n": 1,
    "best_of": 1,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "repetition_penalty": 1.0,
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": -1,
    "min_p": 0.0,
    "use_beam_search": false,
    "length_penalty": 1.0,
    "early_stopping": false,
    "stop": [],
    "stop_token_ids": [],
    "include_stop_str_in_output": false,
    "ignore_eos": false,
    "max_tokens": 50,
    "logprobs": null,
    "prompt_logprobs": null,
    "skip_special_tokens": true,
    "spaces_between_special_tokens": true,
    "regex": "\\d+(\\s*,\\s*\\d+)*\\s*"
}
```

Response:

```
{
    "text": [
        "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite down the first 10 prime numbers as a comma separated list.\n\n### Response:\n1,"
    ]
}
```
Thanks for running. I'll write some more test cases; I think it got messed up on a rebase, my mistake. I believe the early termination is a result of the issue being fixed in #544.
Thank you, please let me know when I can test it again. I will try to cherry-pick the early termination fix as well.
@lapp0 Please find the fixed code; I keep the working code in my branch. UPDATE: Tested OK (with my fix) for multiple sequence generation and beam search as well. All seems to work well.
Do we expect this issue to be solved by #539? If so, we should link it to the PR.
Nope, that PR just prevents corruption when generating multiple sequences concurrently in vLLM. Tensor parallel doesn't work with that PR.
I actually use the changes from #539 with minor modifications (see there).
@viktor-ferenczi Could you test how many tokens are parsed on each step with your recursive solution? My concern is that the entire generated sequence is being re-parsed at each step if you don't have a shared cache between actors.

Regarding using a ray actor, I did some performance testing: https://gist.github.com/lapp0/409a7c3a7f9880b606626bb283f0b01c
Sure, I can test it after work if I have any energy left.
@viktor-ferenczi can confirm, the recursive solution re-parses the entire sequence from the start every time: https://gist.github.com/lapp0/8370bde4d977088487c34bc7501b78af We should ensure the cache is saved in a stable object to address tensor parallel.
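To make the cost difference concrete, here is a toy comparison (illustrative stand-in arithmetic, not the real parser): re-parsing from the start makes the total work quadratic in the number of generated tokens, while a prefix-keyed cache keeps it linear:

```python
# Shared across calls to mimic one long generation loop.
work_recursive = 0   # tokens touched when re-parsing from scratch
work_cached = 0      # tokens touched with a prefix-keyed cache
cache = {(): 0}

def parse_from_scratch(tokens):
    """Re-walk the whole sequence every step (the recursive solution)."""
    global work_recursive
    state = 0
    for t in tokens:
        work_recursive += 1
        state = (state + t) % 7   # stand-in for an FSM transition
    return state

def parse_cached(tokens):
    """Process only the newest token; prefix states come from the cache."""
    global work_cached
    if tokens not in cache:
        prev = parse_cached(tokens[:-1])
        work_cached += 1
        cache[tokens] = (prev + tokens[-1]) % 7
    return cache[tokens]

seq = tuple(range(1, 51))            # 50 generation steps
for i in range(1, len(seq) + 1):     # one call per generated token
    parse_from_scratch(seq[:i])
    parse_cached(seq[:i])

print(work_recursive, work_cached)   # 1275 50
```

The cached variant also only works if the cache object survives between calls, which is exactly the "stable object" requirement above.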
@simon-mo do you have any recommendations for the problem we're facing here? We maintain the parser state within the logits processor. Without making any adjustments, the logits processor object will replicate and have inconsistent state between ray processes. In the vllm grammars PR I resolved this by creating a ray actor, but the implementation had some issues with performance. Do you think wrapping the logits processor with a ray actor is the appropriate solution, or would you suggest an alternative?
@lapp0 Please check out the code in my branch: https://github.com/viktor-ferenczi/outlines/commits/dev/ It contains the changes from your branch, then a solution for caching the FSM state. Tested OK for single and parallel regex and JSON schema constraints. There seems to be negligible performance impact, so the caching appears to be efficient and persistent across sequences. Also double-checked that the logits processor is always called from the same process and the same thread, even if tensor parallel is greater than one, therefore there is no need to care about the thread safety of the cache. Please let me know whether this solution works for you. I have started to use it myself to test it more.
Running benchmarks on both branches. The cost is one extra lookup per step. Thanks for working this out! I'll work on integrating your changes into the PR.

My branch (beam search working, tensor parallel not working):

Viktor's branch (beam search AND tensor parallel working):
@lapp0 in vLLM I recall the sampling procedure is only done in a single process (the driver process). I'm quite confused why this would happen.
Has this been, or is this close to being, resolved? I am still getting the error `TypeError: RegexLogitsProcessor.__call__() missing 1 required positional argument: 'scores'`.
@wdhitchc we just got some important prerequisites merged, I hope to have it ready for review this weekend.
@wdhitchc @BenoitHardier my smoke tests suggest it is working now.
For `outlines/vllm` previously FSM-sequence correspondence was broken, resulting in FSM state being mixed between sequences, corrupting output. To alleviate this, we have `_patched_apply_logits_processor` which passes a stable sequence ID to the logits processor. In this PR we eliminate `_patched_apply_logits_processor` and cache FSM state based on the sequence's input IDs.

Continuation of #539 but much simpler because the vllm upgrade fixed a lot of the issues being addressed there.

Related discussions:
- #624

Fixes:
- Fixes #605
- Fixes #610

Already fixed:
- #524 (this one can be closed, as it was addressed previously by upgrading vllm)

@viktor-ferenczi can you please confirm whether this branch fixes either #610 or #605

# Smoke tests

### basic parallel

passed

<details>

```python
import json
import vllm
from pydantic import BaseModel
from typing import List
import torch
import pandas as pd
from outlines.serve.vllm import JSONLogitsProcessor


class ConceptsList(BaseModel):
    concepts: List[str]


BASE_MODEL = "microsoft/phi-2"
llm = vllm.LLM(model=BASE_MODEL, tensor_parallel_size=1, dtype=torch.float16, max_model_len=2048)

logits_processor = JSONLogitsProcessor(ConceptsList, llm.llm_engine)

full_prompts = [
    f"Provide me a list of {i} strings with key 'concepts'"
    for i in range(20)
]

batch_results = llm.generate(
    full_prompts,
    sampling_params=vllm.SamplingParams(
        max_tokens=2048, logits_processors=[logits_processor]
    ),
)

for result in batch_results:
    for output in result.outputs:
        json.loads(output.text)
```

</details>

### never ending regex

passed

<details>

`python3 -m outlines.serve.serve --model="microsoft/phi-2"`

```
curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "Sequence of numbers and letters:",
        "regex": "([123]-[abc]-([def]-)?)*",
        "n": 7
    }'

{"text":["Sequence of numbers and letters:1-a-1-b-1-c-1-a-","Sequence of numbers and letters:1-a-2-b-3-c-1-a-","Sequence of numbers and letters:1-a-2-b-3-c-d-1-","Sequence of numbers and letters:2-a-1-b-2-c-1-b-","Sequence of numbers and letters:2-b-3-c-d-2-b-3-","Sequence of numbers and letters:2-a-3-b-2-b-1-c-","Sequence of numbers and letters:2-a-3-b-d-2-a-3-"]}

# rules for the above to validate correct FSM-sequence correspondence:
# [123] always followed by [abc], [def] only ever preceded by [abc]

# 1-a-1-b-1-c-1-a-
# 1-a-2-b-3-c-1-a-
# 1-a-2-b-3-c-d-1-
# 2-a-1-b-2-c-1-b-
# 2-b-3-c-d-2-b-3-
# 2-a-3-b-2-b-1-c-
# 2-a-3-b-d-2-a-3-
```

</details>

### sometimes ending early regex

passed

<details>

`python3 -m outlines.serve.serve --model="microsoft/phi-2"`

```
curl http://127.0.0.1:8000/generate \
    -d '{
        "prompt": "Sequence of numbers and letters:",
        "regex": "([123]-[abc]-([def]-)?){3}",
        "n": 16
    }'
```

output

```
{"text":["Sequence of numbers and letters:1-a-2-b-3-c-d-","Sequence of numbers and letters:1-a-2-b-3-c-d-","Sequence of numbers and letters:1-a-2-b-3-c-d-","Sequence of numbers and letters:1-a-2-b-3-c-d-","Sequence of numbers and letters:1-a-2-b-3-c-d-","Sequence of numbers and letters:3-a-1-b-2-c-d-","Sequence of numbers and letters:2-a-1-b-3-c-d-","Sequence of numbers and letters:1-a-1-b-1-c-d-","Sequence of numbers and letters:2-a-3-b-d-1-c-e-","Sequence of numbers and letters:1-b-3-a-2-c-d-","Sequence of numbers and letters:3-a-d-1-b-e-2-c-","Sequence of numbers and letters:1-a-3-b-1-b-d-","Sequence of numbers and letters:3-a-f-2-b-d-1-c-","Sequence of numbers and letters:1-b-d-3-a-e-2-c-","Sequence of numbers and letters:3-c-1-b-d-1-a-e-","Sequence of numbers and letters:1-c-1-c-e-1-b-e-"]}
```

analysis:

```
1-a-2-b-3-c-d-
1-a-2-b-3-c-d-
1-a-2-b-3-c-d-
1-a-2-b-3-c-d-
1-a-2-b-3-c-d-
1-a-2-b-3-c-d-
3-a-1-b-2-c-d-
2-a-1-b-3-c-d-
1-a-1-b-1-c-d-
2-a-3-b-d-1-c-e-
1-b-3-a-2-c-d-
3-a-d-1-b-e-2-c-
1-a-3-b-1-b-d-
3-a-f-2-b-d-1-c-
1-b-d-3-a-e-2-c-
3-c-1-b-d-1-a-e-
1-c-1-c-e-1-b-e-
```

Observations:
- All patterns are correct
- Patterns don't "borrow" FSM state from one-another, they retain their own independent state
- Some patterns produced more tokens than others successfully

</details>

### Viktor's regex

passed

<details>

`python3 -m outlines.serve.serve --model="microsoft/phi-2"`

```
curl http://127.0.0.1:8000/generate \
    -d '{ "prompt": "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite down the first 10 prime numbers as a comma separated list, starting with 2.\n\n### Response:\n", "n": 1, "best_of": 1, "presence_penalty": 0.0, "frequency_penalty": 0.0, "repetition_penalty": 1.0, "temperature": 0.0, "top_p": 1.0, "top_k": -1, "min_p": 0.0, "use_beam_search": false, "length_penalty": 1.0, "early_stopping": false, "stop": [], "stop_token_ids": [], "include_stop_str_in_output": false, "ignore_eos": false, "max_tokens": 50, "logprobs": null, "prompt_logprobs": null, "skip_special_tokens": true, "spaces_between_special_tokens": true, "regex": "\\d+(\\s*,\\s*\\d+)*\\s*" }'
```

output:

```
{"text":["You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite down the first 10 prime numbers as a comma separated list, starting with 2.\n\n### Response:\n2, 3, 5, 7, 11, 13, 17, 19, 23, 29\n"]}
```

</details>

### Viktor's schema

passed

<details>

`python3 -m outlines.serve.serve --model="microsoft/phi-2"`

```
curl http://127.0.0.1:8000/generate \
    -d '{ "prompt": "You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite a JSON describing a random fruit. It must conform to the following JSON schema: {\"properties\": {\"kind\": {\"title\": \"Kind\", \"type\": \"string\"}, \"color\": {\"title\": \"Color\", \"type\": \"string\"}, \"count\": {\"title\": \"Count\", \"type\": \"integer\"}, \"weight\": {\"title\": \"Weight\", \"type\": \"number\"}, \"sweet\": {\"title\": \"Sweet\", \"type\": \"boolean\"}}, \"required\": [\"kind\", \"color\", \"count\", \"weight\", \"sweet\"], \"title\": \"Fruit\", \"type\": \"object\"}\n\n### Response:\n", "n": 5, "best_of": 5, "presence_penalty": 0.0, "frequency_penalty": 0.0, "repetition_penalty": 1.0, "temperature": 1.0, "top_p": 1.0, "top_k": -1, "min_p": 0.0, "use_beam_search": false, "length_penalty": 1.0, "early_stopping": false, "stop": [], "stop_token_ids": [], "include_stop_str_in_output": false, "ignore_eos": false, "max_tokens": 200, "logprobs": null, "prompt_logprobs": null, "skip_special_tokens": true, "spaces_between_special_tokens": true, "schema": { "properties": { "kind": { "title": "Kind", "type": "string" }, "color": { "title": "Color", "type": "string" }, "count": { "title": "Count", "type": "integer" }, "weight": { "title": "Weight", "type": "number" }, "sweet": { "title": "Sweet", "type": "boolean" } }, "required": [ "kind", "color", "count", "weight", "sweet" ], "title": "Fruit", "type": "object" } }'
```

output:

```
{"text":["You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite a JSON describing a random fruit. It must conform to the following JSON schema: {\"properties\": {\"kind\": {\"title\": \"Kind\", \"type\": \"string\"}, \"color\": {\"title\": \"Color\", \"type\": \"string\"}, \"count\": {\"title\": \"Count\", \"type\": \"integer\"}, \"weight\": {\"title\": \"Weight\", \"type\": \"number\"}, \"sweet\": {\"title\": \"Sweet\", \"type\": \"boolean\"}}, \"required\": [\"kind\", \"color\", \"count\", \"weight\", \"sweet\"], \"title\": \"Fruit\", \"type\": \"object\"}\n\n### Response:\n{\n\"kind\": \"Apple\",\n\"color\": \"Red\",\n\"count\": 10,\n\"weight\": 0.2,\n\"sweet\": true\n}","You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite a JSON describing a random fruit. It must conform to the following JSON schema: {\"properties\": {\"kind\": {\"title\": \"Kind\", \"type\": \"string\"}, \"color\": {\"title\": \"Color\", \"type\": \"string\"}, \"count\": {\"title\": \"Count\", \"type\": \"integer\"}, \"weight\": {\"title\": \"Weight\", \"type\": \"number\"}, \"sweet\": {\"title\": \"Sweet\", \"type\": \"boolean\"}}, \"required\": [\"kind\", \"color\", \"count\", \"weight\", \"sweet\"], \"title\": \"Fruit\", \"type\": \"object\"}\n\n### Response:\n{\n \"kind\": \"Apple\",\n \"color\": \"Red\",\n \"count\": 10,\n \"weight\": 0.2,\n \"sweet\": true\n}","You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite a JSON describing a random fruit. It must conform to the following JSON schema: {\"properties\": {\"kind\": {\"title\": \"Kind\", \"type\": \"string\"}, \"color\": {\"title\": \"Color\", \"type\": \"string\"}, \"count\": {\"title\": \"Count\", \"type\": \"integer\"}, \"weight\": {\"title\": \"Weight\", \"type\": \"number\"}, \"sweet\": {\"title\": \"Sweet\", \"type\": \"boolean\"}}, \"required\": [\"kind\", \"color\", \"count\", \"weight\", \"sweet\"], \"title\": \"Fruit\", \"type\": \"object\"}\n\n### Response:\n{\n \"kind\": \"apple\",\n \"color\": \"red\",\n \"count\": 5,\n \"weight\": 0.1,\n \"sweet\": true\n}","You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite a JSON describing a random fruit. It must conform to the following JSON schema: {\"properties\": {\"kind\": {\"title\": \"Kind\", \"type\": \"string\"}, \"color\": {\"title\": \"Color\", \"type\": \"string\"}, \"count\": {\"title\": \"Count\", \"type\": \"integer\"}, \"weight\": {\"title\": \"Weight\", \"type\": \"number\"}, \"sweet\": {\"title\": \"Sweet\", \"type\": \"boolean\"}}, \"required\": [\"kind\", \"color\", \"count\", \"weight\", \"sweet\"], \"title\": \"Fruit\", \"type\": \"object\"}\n\n### Response:\n{\n\"kind\": \"Apple\",\n\"color\": \"Red\",\n\"count\": 10,\n\"weight\": 0.24,\n\"sweet\": true\n}","You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer.\n\nYou are a helpful AI assistant. You give concise answers. If you do not know something, then say so.\n### Instruction:\nWrite a JSON describing a random fruit. It must conform to the following JSON schema: {\"properties\": {\"kind\": {\"title\": \"Kind\", \"type\": \"string\"}, \"color\": {\"title\": \"Color\", \"type\": \"string\"}, \"count\": {\"title\": \"Count\", \"type\": \"integer\"}, \"weight\": {\"title\": \"Weight\", \"type\": \"number\"}, \"sweet\": {\"title\": \"Sweet\", \"type\": \"boolean\"}}, \"required\": [\"kind\", \"color\", \"count\", \"weight\", \"sweet\"], \"title\": \"Fruit\", \"type\": \"object\"}\n\n### Response:\n{\n \"kind\": \"Apple\",\n \"color\": \"red\",\n \"count\": 5,\n \"weight\": 0.3,\n \"sweet\": true\n}"]}
```

</details>

---------

Co-authored-by: Andrew Lapp <[email protected]>
Describe the issue as clearly as possible:
Hi,
I recently tried to use the RegexLogitsProcessor with VLLM introduced by #481.
When using it with a "small" model like a 7B one on a single GPU it works fine, but when I try it with a big one, namely Mixtral, on multiple GPUs with the vLLM engine argument tensor-parallel, I run into several problems (monkey patching not working and fsm_state in the processor not initialized). I suspect the multiple Ray workers to be the cause (the monkey patching may not be propagated to all the workers, and likewise for the fsm_states).
I could have missed some relevant information, but it seems that #481 was only checked without tensor-parallel.
Steps/code to reproduce the bug:
Expected result:
Error message:
Outlines/Python version information:
Context for the issue:
No response