Add Grammars #2105
Conversation
… fixing WS and SIGNED_NUMBER in addition
Thanks for putting this together, @lapp0.
|
Appreciate your review, fix, and interest @xuy. Will integrate that after I'm done with some bug fixes! |
Does this only work with the OpenAI API at the moment? If so, could it be added to the vllm api as well? |
Works nicely so far. I noticed the preprocessing for batching being done on only one core and hence significantly stalling the process. Is that due to grammar implementation? And is there a way to fix that, to either use GPU or more than a single core? |
@lapp0 Could you post your multiprocessing branch, even if it's incomplete? I've been trying to implement it myself, but it seems I can't get it quite right. |
@brucethemoose It's pretty poorly implemented, but here you go: https://github.com/lapp0/vllm/tree/grammar-multiprocessing I've been working on integrating some of my caching changes into https://github.com/outlines-dev/outlines which already has regex-based guidance for vLLM. |
Tested the grammar support from your branch. Additional changes I made:
Model:
Without the grammar, the model gives this response: "2, 3, 5, 7, 11, 13, 17, 19, 23, 29"
In the grammar I intentionally denied any use of white-space, so the expected output is: "2,3,5,7,11,13,17,19,23,29"
Grammar:
?start: SIGNED_NUMBER ( "," SIGNED_NUMBER )*
%import common.SIGNED_NUMBER
While the output conforms to the grammar, it fails to produce the two-digit prime numbers: "2,3,5,7,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1," It may be that the grammar code somehow prevents the model from writing them.
Changed the grammar to be stricter and simpler:
?start: DIGIT+ ( "," DIGIT+ )*
%import common.DIGIT
With this grammar the model produces the primes, but cannot stop: "2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61," So there is a problem in the code preventing it from generating the EOS token. Generating the stop token should be allowed wherever it is consistent with the grammar.
Grammar support would be really awesome to have for my use case. I had actually started implementing Lark support and had already figured out this PR's algorithm in my own tests outside vLLM when I found this PR. It is really great that you have made this much progress already, so there is a chance to have grammar support soon. Even if we cannot use regex in our grammars, having any kind of grammar support would still be a huge win. Grammar support would also allow for reliable function calling, which is also in the works (#2360); that first PR refactors the spaghetti code in the OpenAI-compatible server. In llama.cpp there is a grammar format named GBNF, an EBNF variant, which already works; its integration with the sampling code can give us some ideas on how to optimize this in vLLM. |
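For reference, these Lark grammars can be exercised standalone with the lark package, independent of vLLM and of this PR's code. A minimal check of the first grammar:

```python
from lark import Lark

grammar = r"""
    ?start: SIGNED_NUMBER ( "," SIGNED_NUMBER )*
    %import common.SIGNED_NUMBER
"""

parser = Lark(grammar, parser="lalr")
parser.parse("2,3,5,7,11,13,17,19,23,29")   # whitespace-free output: accepted
# parser.parse("2, 3, 5, 7")                # spaces after commas: rejected by the lexer
```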
When I change the grammar to allow whitespace, it can generate the primes properly: "2, 3, 5, 7, 11, 13, 17, 19, 23, 29"
Grammar:
?start: _WS? DIGIT+ ( _WS? "," _WS? DIGIT+ )* _WS?
%import common.DIGIT
%import common.WS -> _WS
It worked without code changes and could stop. I do not know why it cannot stop when no white-space is allowed. The model writes spaces only after the commas, and no newline is generated at the end of the completion.
Narrowed it down to this grammar. It works, but has to produce a newline at the end so it can stop:
?start: DIGIT+ ( "," DIGIT+ )* _WS?
%import common.DIGIT
%import common.WS -> _WS
Output as a Python string: "2,3,5,7,11,13,17,19,23,29\n"
So the actual bug is that the grammar does not let the LLM generate a stop token if the grammar does not allow white-space at the very end; at least that seems to be the case based on the few tests I've done. I'm not sure whether it is a limitation of this specific LLM due to how it was trained (it must write a newline before EOS) or a bug in the sampler integration of the grammar (it does not allow EOS in that case).
The CPU overhead of the grammar is indeed horrible: speed is down from 32 T/s to 8.4 T/s with the above very simple grammar. |
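The fix being asked for here boils down to: whenever the text generated so far is already a complete sentence of the grammar, EOS must remain in the allowed set. A tiny sketch of that rule, where is_accepting() is an assumed hook rather than anything from this PR:

```python
def allowed_token_ids(parser_state, grammar_valid_ids: set, eos_token_id: int) -> set:
    """Return the token ids the sampler may pick next.

    parser_state.is_accepting() is assumed to mean "the text generated so far
    is already a complete sentence of the grammar"; in that case EOS must stay
    available, otherwise the model is forced to keep generating (the
    'cannot stop' behaviour described above).
    """
    allowed = set(grammar_valid_ids)
    if parser_state.is_accepting():
        allowed.add(eos_token_id)
    return allowed
```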
The parser doesn't handle ambiguous terminals well. Could you try converting them to a rule? Something along the lines of
And yes, the speed is bad. Outlines addresses this by precompiling the regex FSM and using Numba. I'm leaning heavily towards thinking vLLM should be a strong, simple inference engine and outlines should be a wrapper on top for grammars. Outlines' vLLM CFG implementation was merged yesterday: dottxt-ai/outlines#517 |
The grammar you suggested crashes vLLM with this exception:
The traceback is useless because of the use of Ray (2 GPUs). Performance: I ran vLLM with cProfile and executed the completion some 50 times in about 2 minutes. The grammar was responsible for only ~550 ms of CPU runtime, so I cannot see from the profiling data where the observed slowdown comes from. The grammar's CPU load is 60-70% of a core, so it does not seem to be CPU bound there. I guess the overhead does not show up in the Python profiler, but is introduced by the use of Tensors (GPU RAM access?) or similar. I don't know enough Torch and CUDA yet to tell exactly.
Where would you put the grammar support? If we keep it inside vLLM, then it can be used via the REST APIs. That's what I prefer, at least for my use case. It allows for hosting the LLM separately from the application and better scalability, all without having to write a custom server for each application or forcing the application to run the LLM directly in-process. |
The exception due to the grammar:
So it cannot relay Lark's |
Are you using multiple GPUs? I'm seeing a substantial slowdown when passing the tensors to the logits processor ray actor.
|
@lapp0 Tried the |
There are already libraries actively maintained for guided generation that can integrate with vLLM, like Outlines. I would be wary of introducing code that is tangentially related to this library and will require a substantial amount of maintenance when this can be solved by an import. Why not contribute this code to these libraries and import them here instead? |
@jqueguiner The custom logits processors need some more information to be passed to avoid having to patch vLLM the hard way. A primary example is a way to identify the sequence ( |
The |
This experimental change where the state is cached by the hash of the prior token ids is working for me so far: dottxt-ai/outlines@8b1ff9a#diff-f65ffb5f52b2e358c713ccb8f32a700769426c6c8b655f689e3cdccae07d22ac |
A hash on preceding tokens is even better than |
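For anyone following along, the caching idea from the linked commit amounts to keying the incremental parser state on the tokens generated so far, so a logits processor can recover its state without needing a sequence id. A rough sketch of the pattern (hypothetical class names; this is not the code in that commit):

```python
from typing import Sequence

class ParserState:
    """Toy stand-in for an incremental grammar-parser state (hypothetical)."""
    def __init__(self, token_ids: tuple = ()):
        self.token_ids = token_ids

    def advance(self, token_id: int) -> "ParserState":
        return ParserState(self.token_ids + (token_id,))

class StateCache:
    """Cache parser states keyed by a hash of the preceding token ids."""
    def __init__(self):
        self._states = {hash(()): ParserState()}

    def get(self, token_ids: Sequence[int]) -> ParserState:
        key = hash(tuple(token_ids))
        if key not in self._states:
            # Miss: extend from the cached state for the prefix. During
            # generation the common case is "everything but the newest token".
            prev = self.get(tuple(token_ids[:-1])) if token_ids else ParserState()
            self._states[key] = prev.advance(token_ids[-1])
        return self._states[key]
```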
Hi everyone, thank you so much for the very active discussion here. As a vLLM maintainer, I want to express my sincere thanks for your enthusiasm. vLLM as a project is focused on optimizing LLM inference and providing a fully compatible OpenAI API; constrained decoding is not our strong suit, and we don't have the expertise to maintain it. @lapp0, would you be able to consider closing this PR and merging into outlines instead? I think you mentioned it here. I would very much like to use outlines directly in vLLM after #2488 is merged (or sooner; adding it to the completion API is another option). @lapp0 and @viktor-ferenczi, please let us know what interface and scheduling changes on the vLLM side are needed to better support this functionality.
Sure @simon-mo, I will follow up with you on any changes to vLLM which are necessary. Thanks for your enthusiastic support! Closing in favor of outlines. A few changes are necessary in outlines to consider guidance ready for vLLM:
Fixes #1229
Implement incremental LALR / Regex parser to determine legal-next-token set.
Rendered documentation
Try it
I smoke tested with
dockerfile: https://hub.docker.com/r/lapp0/vllm_grammar_branch (commit 2b2b024)
TODO
InteractivePredictiveLALRParser
TokenTrie
NextTokenValidator
GrammarLogitProcessor implements def __call__(self, token_ids, logits), which updates the parser's state with new tokens and filters based on NextTokenValidator's valid token IDs.
IncrementalParserState
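For orientation, the GrammarLogitProcessor entry point named above lines up with vLLM's per-sequence logits_processor callback shape. A hedged sketch of what such a body plausibly does; the step() and valid_token_ids() methods here are assumptions, not this PR's actual API:

```python
import torch

class GrammarLogitProcessor:
    """Sketch: feed newly generated tokens to an incremental parser, then
    mask out every token the grammar would reject."""

    def __init__(self, next_token_validator, tokenizer):
        self.validator = next_token_validator  # assumed: step(), valid_token_ids()
        self.tokenizer = tokenizer
        self.num_seen = 0

    def __call__(self, token_ids: list, logits: torch.Tensor) -> torch.Tensor:
        # Advance the parser with any tokens produced since the last call.
        for token_id in token_ids[self.num_seen:]:
            self.validator.step(self.tokenizer.decode([token_id]))
        self.num_seen = len(token_ids)

        # Keep only grammar-legal continuations.
        allowed = torch.tensor(sorted(self.validator.valid_token_ids()),
                               dtype=torch.long, device=logits.device)
        masked = torch.full_like(logits, float("-inf"))
        masked[allowed] = logits[allowed]
        return masked
```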
Ramblings from previous implementation:
Grammar Token Filter Algorithm
The GTF algorithm involves calculating the set of valid next-tokens given an incomplete sequence and a grammar.
Its core components are
Lazy Approach
The simplest algorithm is
This approach has two core inefficiencies:
the parser reprocesses base_sequence once for every token
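In spirit, the lazy baseline is just a loop over the entire vocabulary; a sketch, assuming an is_valid_prefix() helper that re-parses from scratch each time:

```python
def lazy_valid_next_tokens(base_sequence: str,
                           token_vocabulary: list,
                           is_valid_prefix) -> list:
    """Lazy baseline: re-parse base_sequence + token for every single token.

    is_valid_prefix(text) is assumed to return True when `text` can still be
    completed into a sentence of the grammar.
    """
    valid = []
    for token in token_vocabulary:
        # The parser re-consumes base_sequence on every iteration, which is
        # exactly the inefficiency described above.
        if is_valid_prefix(base_sequence + token):
            valid.append(token)
    return valid
```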
This PR's Approach
The current GTF algorithm improves on the lazy approach in two aspects:
token_vocabulary is a trie, allowing us to check "foo", and if invalid, we know "foobar", "foobaz", and "foobarbaz" are also invalid.
parser is interactive, meaning it doesn't need to recalculate the base_sequence each time.
The current approach's algorithm is a depth-first search of the token trie with a base_sequence-warmed parser.
The main weakness of this implementation involves regular expressions. If a terminal rule is a regular expression, an incomplete match must be searched for redundantly.
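A compressed sketch of that depth-first search; the trie node and parser interfaces here are assumptions, not the PR's actual classes:

```python
def trie_valid_next_tokens(warmed_parser, trie_root) -> list:
    """DFS over the token trie with a parser already 'warmed' on base_sequence.

    Each trie node holds one extra character; warmed_parser.copy() and
    advance(ch) are assumed. Pruning a node prunes every token extending it.
    """
    valid = []

    def dfs(node, parser):
        if node.token is not None:        # a complete vocabulary token ends here
            valid.append(node.token)
        for ch, child in node.children.items():
            branch = parser.copy()
            if branch.advance(ch):        # False => "foo" invalid, so skip "foobar" etc.
                dfs(child, branch)

    dfs(trie_root, warmed_parser)
    return valid
```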
Optimal (Future) Approach
The optimal GTF approach involves all terminal rules being a single character. All terminals, including regular expressions must be decomposed.
For example, the regular expression \d{5}(-\d{4})? must be decomposed into single-character terminals.
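One plausible single-character decomposition of that pattern, written as a Lark grammar; this is my illustration, and the PR's own decomposition may differ:

```python
from lark import Lark

# Every terminal below consumes exactly one character, so the parser can take
# one state transition per character instead of matching the regex wholesale.
zip_grammar = r"""
    start: DIGIT DIGIT DIGIT DIGIT DIGIT plus4?
    plus4: DASH DIGIT DIGIT DIGIT DIGIT
    DIGIT: /[0-9]/
    DASH: "-"
"""

parser = Lark(zip_grammar, parser="lalr")
parser.parse("12345")
parser.parse("12345-6789")
```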
Additionally we can use a helper function legal_chars(character_expr), which retrieves all characters legal within a character regexp, e.g.:
legal_chars("\d") = set(["0", "1", "2", ...])
legal_chars("[ae]") = set(["a", "e"])
With this optimization the GTF algorithm would be as follows:
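Reading between the lines, the character-level algorithm would walk the token trie one character at a time and intersect each step with legal_chars of the currently legal character expressions. A sketch under those assumptions; allowed_exprs() and advance() are hypothetical:

```python
def optimal_valid_next_tokens(char_level_parser, trie_root, legal_chars) -> list:
    """Character-level GTF sketch.

    char_level_parser.allowed_exprs() is assumed to yield the atomic character
    expressions legal in the current state (e.g. "\\d", "[ae]"), and
    legal_chars(expr) expands one of them into a concrete character set.
    """
    valid = []

    def dfs(node, parser):
        if node.token is not None:
            valid.append(node.token)
        # One transition per legal character, shared by every token that
        # continues through that character.
        allowed = set().union(*(legal_chars(e) for e in parser.allowed_exprs()))
        for ch, child in node.children.items():
            if ch in allowed:
                dfs(child, parser.advance(ch))

    dfs(trie_root, char_level_parser)
    return valid
```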
This function requires applying a state transition only once for every transition that is legal within the token set, as opposed to the current implementation, which applies a state transition once for each unique token trie node.
Breaking down into single-character terminals provides another advantage: we don't have to recompute a regular expression partial redundantly. If foo matches (foo|bar)(bazbif), we don't need to recalculate the entire regex for foobaz again. In fact, we don't compute regular expressions at all; we simply generate the valid character set for a given atomic character expression and intersect it with the trie's valid token prefix set.
Example
I use a simple Thompson's-style regex to generate the eNFA dict via automata_toolkit.
Sample code which assigns random values to logits and generates a grammar-constrained completion:
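The original sample (built around automata_toolkit's eNFA) isn't shown above; as a rough stand-in, the snippet below gets the same constrained-sampling effect with the third-party regex module's partial matching and purely random logits, so it is only illustrative:

```python
import random
import regex  # third-party; supports partial matching, unlike stdlib `re`

pattern = regex.compile(r"the large dog")
vocabulary = ["the ", "la", "rg", "e", "large", " dog", "cat", "<eos>"]

text = ""
while True:
    logits = {tok: random.random() for tok in vocabulary}  # stand-in for model logits
    allowed = []
    for tok in vocabulary:
        if tok == "<eos>":
            if pattern.fullmatch(text):                        # grammar already satisfied
                allowed.append(tok)
        elif pattern.fullmatch(text + tok, partial=True):      # still a valid prefix
            allowed.append(tok)
    choice = max(allowed, key=logits.get)
    if choice == "<eos>":
        break
    text += choice

print(text)  # "the large dog", reached via "la"+"rg"+"e" or via "large"
```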
Output:
Please observe that ["la", "rg", "e"] and ["large"] are both valid token sequences within the grammar, and either may be generated.