
OpenAI API refactoring + Functions calling #2210


Closed
wants to merge 16 commits into from

Conversation

FlorianJoncour (Contributor) commented Dec 20, 2023

Hello,

I had to implement function call handling, so I implemented it directly in vLLM.
To do this, there are two major changes.

Firstly, the OpenAI API has been refactored to separate the server and generation. This makes the whole thing a bit clearer and also allows for creating an OpenAI server outside of Uvicorn.
Next, I implemented something similar to the "tools" in the OpenAI API. (Tools are more general replacements for direct function calls, which are now deprecated).
For now, only function calls are supported.

The system (like OpenAI apparently) works by injecting a prompt and capturing the result.
https://platform.openai.com/docs/guides/function-calling

I developed with the NeuralHermes-2.5-Mistral-7B model, and it works very well.
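
For reference, a minimal sketch of the kind of request this enables, using the official openai Python client pointed at a local vLLM server (the base_url, api_key and served model name are assumptions for a local deployment, and the server must be started with the enable-api-tools flag mentioned in the notes below):

from openai import OpenAI

# Assumed local vLLM OpenAI-compatible endpoint; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="NeuralHermes-2.5-Mistral-7B",  # assumed served model name
    messages=[{"role": "user", "content": "Quel temps fait il à Bordeaux ?"}],
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)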

Some notes:

  • It is disabled by default to avoid unexpected issues. An enable-api-tools parameter has been added.
  • Besides the refactoring and the added code, there are no regressions; the rest of the code hasn't changed much.
  • This only applies to chat/completions.
  • The result depends on the model's ability to process the prompt.
  • Can handle multiple function calls.
  • If a function call is requested by the model but there's an error, regular generation takes over.
  • Tested in a multilingual situation (French in my case), the English injected prompt does not affect user instructions (although this may depend on the quality of the model).
  • Adds about 250 tokens to the prompt, plus the tokens for the user-declared functions.

Improvement points:
Most current models are not fine-tuned for making function calls, and I haven't found a reference for a specific token indicating a function call by the model.
So I defined something that should work well.
But since prompts now use templates, maybe a future update could use them to define a function call token and thus force future models to also use one, which should make capturing function calls more reliable.

And maybe the injected prompt could also be put into a template.

esmeetu (Member) commented Dec 20, 2023

Hey, @FlorianJoncour. Thanks for your work! I have tested this, but it doesn't work with OpenAI's official example, because the current API server doesn't support messages containing objects, which the second request in that example uses.

esmeetu (Member) commented Dec 20, 2023

For the min_p typo, I think you could commit that in a separate PR.

FlorianJoncour (Contributor, Author) commented Dec 20, 2023

I'm not sure exactly which example you're referring to, so I'm not sure if I've solved your issue, but there was indeed an error when generation was in stream mode; it should return an array.

By the way, it should work now (using the official OpenAI library):

Prompt:

{'role': 'assistant', 'content': 'Tu es un assistant intelligent qui répond aux questions.'}
{'role': 'user', 'content': 'Quel temps fait il à Bordeaux ?'}
{'type': 'function', 'function': {'name': 'get_current_weather', 'description': 'Get the current weather in a given location', 'parameters': {'type': 'object', 'properties': {'location': {'type': 'string', 'description': 'The city and state, e.g. San Francisco, CA'}, 'unit': {'type': 'string', 'enum': ['celsius', 'fahrenheit']}}, 'required': ['location']}}}

Batch mode:

Request (stream=False) : Quel temps fait il à Bordeaux ?
Tool call : ChatCompletionMessageToolCall(id='call_get_current_weather', function=Function(arguments='{"location": "Bordeaux, FR"}', name='get_current_weather'), type='function')

Stream mode:

Request (stream=True) : Quel temps fait il à Bordeaux ?
Tool call : ChoiceDeltaToolCall(index=0, id='call_get_current_weather', function=ChoiceDeltaToolCallFunction(arguments='{"location": "Bordeaux, FR"}', name='get_current_weather'), type='function')

May I add an example, since there are already examples to test the API?
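
For reference, a minimal sketch of how the stream-mode tool call deltas shown above could be collected client-side (it assumes a client and tools set up as in the earlier sketch; the accumulation logic is illustrative, while the field names follow the official openai-python types):

stream = client.chat.completions.create(
    model="NeuralHermes-2.5-Mistral-7B",  # assumed served model name
    messages=[{"role": "user", "content": "Quel temps fait il à Bordeaux ?"}],
    tools=tools,
    tool_choice="auto",
    stream=True,
)

tool_calls = {}  # tool call index -> accumulated id/name/arguments
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    for call in delta.tool_calls or []:
        entry = tool_calls.setdefault(call.index, {"id": call.id, "name": "", "arguments": ""})
        if call.function and call.function.name:
            entry["name"] += call.function.name
        if call.function and call.function.arguments:
            entry["arguments"] += call.function.arguments
print(tool_calls)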

esmeetu (Member) commented Dec 20, 2023

@FlorianJoncour This code is copied from https://platform.openai.com/docs/guides/function-calling:

from openai import OpenAI
import json

client = OpenAI()

# Example dummy function hard coded to return the same weather
# In production, this could be your backend API or an external API
def get_current_weather(location, unit="fahrenheit"):
    """Get the current weather in a given location"""
    if "tokyo" in location.lower():
        return json.dumps({"location": "Tokyo", "temperature": "10", "unit": unit})
    elif "san francisco" in location.lower():
        return json.dumps({"location": "San Francisco", "temperature": "72", "unit": unit})
    elif "paris" in location.lower():
        return json.dumps({"location": "Paris", "temperature": "22", "unit": unit})
    else:
        return json.dumps({"location": location, "temperature": "unknown"})

def run_conversation():
    # Step 1: send the conversation and available functions to the model
    messages = [{"role": "user", "content": "What's the weather like in San Francisco, Tokyo, and Paris?"}]
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get the current weather in a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                    },
                    "required": ["location"],
                },
            },
        }
    ]
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=messages,
        tools=tools,
        tool_choice="auto",  # auto is default, but we'll be explicit
    )
    response_message = response.choices[0].message
    tool_calls = response_message.tool_calls
    # Step 2: check if the model wanted to call a function
    if tool_calls:
        # Step 3: call the function
        # Note: the JSON response may not always be valid; be sure to handle errors
        available_functions = {
            "get_current_weather": get_current_weather,
        }  # only one function in this example, but you can have multiple
        messages.append(response_message)  # extend conversation with assistant's reply
        # Step 4: send the info for each function call and function response to the model
        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_to_call = available_functions[function_name]
            function_args = json.loads(tool_call.function.arguments)
            function_response = function_to_call(
                location=function_args.get("location"),
                unit=function_args.get("unit"),
            )
            messages.append(
                {
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": function_name,
                    "content": function_response,
                }
            )  # extend conversation with function response
        second_response = client.chat.completions.create(
            model="gpt-3.5-turbo-1106",
            messages=messages,
        )  # get a new response from the model where it can see the function response
        return second_response
print(run_conversation())

Can you execute this example smoothly? The initial request is fine, but the second one isn't supported.

FlorianJoncour (Contributor, Author) commented Dec 20, 2023

So you were right in your first message! Thank you.

I added all the types for ChatCompletionRequest.messages as described in the OpenAI docs, and requests don't crash anymore.
But we had trouble with the tokenizer: it raised an exception because message.content may be empty when message.tool_calls contains the call data.
So I added a trick to format and copy that data into message.content.

It should work now.
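
For illustration only, a minimal sketch of that kind of workaround, i.e. flattening tool_calls into textual content before tokenizer.apply_chat_template (the exact formatting used in the PR may differ):

def flatten_tool_calls(message: dict) -> dict:
    # If an assistant message carries tool_calls but no text, synthesize a
    # textual content so the chat template / tokenizer doesn't fail on None.
    if not message.get("content") and message.get("tool_calls"):
        lines = []
        for call in message["tool_calls"]:
            fn = call["function"]
            lines.append(f"{fn['name']} was called with arguments : {fn['arguments']}")
        message = dict(message, content="\n".join(lines))
    return message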

esmeetu (Member) commented Dec 21, 2023

Great! It seems this PR fixes #1869.
Another thought: it would be better to separate the OpenAI server refactor and the function call feature into two PRs, since this one is too difficult to review.
cc @simon-mo

esmeetu (Member) commented Dec 21, 2023

Hey, @FlorianJoncour. I tested the latest commit again. The second_response seems wrong, and I found that the server didn't handle the tool results in messages because the chat template doesn't recognize the tool role. Could you show me your chat template?

FlorianJoncour (Contributor, Author)

I use the default template.
However, there was an issue with Pydantic, where the wrong type was selected during queries.
Now, roles are literals, and elements of messages are a union (and not a union of lists, which was of course an error).
So, now the messages have the correct types.

When I display the prompt server-side, right after tokenizer.apply_chat_template, I get this:

<|im_start|>user
What's the weather like in San Francisco, Tokyo, and Paris?<|im_end|>
<|im_start|>assistant
call_get_current_weather was called with arguments : {"location": "San Francisco"}
call_get_current_weather was called with arguments : {"location": "Tokyo"}
call_get_current_weather was called with arguments : {"location": "Paris"}
<|im_end|>
<|im_start|>tool
call_get_current_weather -> {"location": "San Francisco", "temperature": "72", "unit": null}<|im_end|>
<|im_start|>tool
call_get_current_weather -> {"location": "Tokyo", "temperature": "10", "unit": null}<|im_end|>
<|im_start|>tool
call_get_current_weather -> {"location": "Paris", "temperature": "22", "unit": null}<|im_end|>
<|im_start|>assistant

And so the final response is:

ChatCompletion(id='cmpl-2d769e9116354c8595feb2bc4672a657', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The current weather in San Francisco is 72 degrees Fahrenheit, Tokyo is 10 degrees Fahrenheit, and Paris is 22 degrees Fahrenheit.', role='assistant', function_call=None, tool_calls=None))], created=12717, model='gpt-3.5-turbo', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=37, prompt_tokens=199, total_tokens=236))

Don't mind the model="gpt-3.5-turbo"; I set this name to keep things simple with various tools.
If this is not what you expect, can you give me more information?

esmeetu (Member) commented Dec 21, 2023

@FlorianJoncour I made some modifications to your code, and it works now. I might post my changes in the future. As for LLM models, I think codellama-sft models are better (like phind-codellama).
I love this feature, thanks, and I hope it gets merged ASAP. 🤩

leoterry-ulrica

Has the merge been completed? @FlorianJoncour @esmeetu

schema = json.dumps(tool.function.parameters, indent=4)
text_inject += f"```\njsonschema\n{schema}\n```"
text_inject += (
Contributor

Should this be hard-coded? Depending on how the model was trained for function calling, the specific injected text could have a large impact on the model's performance. It seems like a good idea to make this configurable, IMO.

Contributor

Following what was suggested by @Tostino, I think it could be very handy to incorporate a tool_template.json path to personalize the tool prompt (probably preserving the current one as the default).

In particular, the way the json_schema_params is constructed is fine, IMO; I wouldn't change it.
Nevertheless, incorporating a tool_template.json like:

{
    "prefix": "str",
    "suffix": "str",
    "order": "top",
    "func_call_token": "str"
}

could provide enough freedom for adapting to different models when inject_prompt is performed. (For ease of understanding, please refer to the image below.)

What do you think, @simon-mo?

[Image: tool_template]
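
A minimal sketch of how such a template file could be consumed when building the injected text (the field names follow the JSON above and are only a proposal, not existing vLLM code):

import json

def build_tool_injection(template_path: str, json_schema_params: str) -> tuple:
    # Load the proposed tool_template.json and wrap the generated schema text.
    with open(template_path) as f:
        template = json.load(f)
    text = template["prefix"] + json_schema_params + template["suffix"]
    # "order" would decide whether the block goes before or after the user prompt;
    # "func_call_token" is what the server would look for in the model output.
    return text, template["order"], template["func_call_token"]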

FlorianJoncour (Contributor, Author) commented Jan 4, 2024

It's a good idea and easy to implement.

If we need to implement templates, it means that ideally we should remove all text instructions in serving.py.
So, we need to define a path to store a default tool_template.json.

The examples/ folder is not suitable.

There are four options:

  • Hardcode it as a string in serving.py (which I don't like)
  • The vllm/entrypoints/openai/ folder (I'm undecided)
  • Define a vllm/data/ folder or something similar (I'm undecided)
  • Add a command-line argument to allow users to define a configuration directory (or $HOME/.config/vllm/openai by default) and store data there.

I prefer the last option, knowing that a general-purpose directory might be used for other types of templates.
Besides function calls, LLMs (starting with ChatGPT) can be extended with embeddings (which could potentially be implemented in vLLM in the future), a dynamic Python interpreter, etc.

Again, I am not sure if this is the primary goal of vLLM, but having an OpenAI-compatible server with most features ready to use seems as important as inference performance!

OpenAI's documentation gives an idea of what could potentially be implemented.
https://platform.openai.com/docs/assistants/tools

Edit: Another option could be a command-line argument defining the path to a single file, as is already the case with chat-template. Maybe that would be preferred, but I think it's less general, and if we take this approach, in a year we might end up with 4 or 5 similar arguments of the same kind.

Contributor

I'd follow the same approach as chat-template, command-line arg with a path.
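
For example, something along these lines, where --tool-template is hypothetical and --chat-template is the existing precedent:

python -m vllm.entrypoints.openai.api_server \
    --model mlabonne/NeuralHermes-2.5-Mistral-7B \
    --chat-template ./chat_template.jinja \
    --tool-template ./tool_template.json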

@@ -52,9 +52,60 @@ class UsageInfo(BaseModel):
    completion_tokens: Optional[int] = 0


class FunctionCall(BaseModel):
Contributor

It would probably be cleaner to use the same class names as OAI for both FunctionCall and ToolCallsMessage:

https://github.com/openai/openai-python/blob/main/src/openai/types/chat/chat_completion_message_tool_call.py
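
For reference, the linked OpenAI types look roughly like this (a sketch of the shapes the comment suggests mirroring; the actual PR classes may differ):

from typing import Literal
from pydantic import BaseModel

class Function(BaseModel):
    # JSON-encoded arguments produced by the model, plus the function name.
    arguments: str
    name: str

class ChatCompletionMessageToolCall(BaseModel):
    id: str
    function: Function
    type: Literal["function"]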

FlorianJoncour (Contributor, Author)

I apologize for the delay, but I am French and vacations are important!

I have significantly improved the parsing to capture function calls even if the model doesn't start its message with !function_call, which greatly reduces the number of false negatives.

I also added a usage example based on OpenAI's documentation (mentioned by @esmeetu), extended to use two functions.

Finally, I also renamed some types to match the types defined in OpenAI's implementation, as suggested by @AguirreNicolas.

The current implementation seems good to me, but I'm not sure if it will be merged, as it's not the primary goal of vLLM. If it's not merged, I will publish it in a separate repository.

simon-mo self-assigned this Jan 3, 2024
simon-mo (Collaborator) commented Jan 3, 2024

Thank you for your contribution! I will make a pass soon.

The current implementation seems good to me, but I'm not sure if it will be merged, as it's not the primary goal of vLLM. If it's not merged, I will publish it in a separate repository.

Yes, I will make sure this gets into vLLM. In fact, it is one of our priorities.

joennlae added a commit to joennlae/vllm that referenced this pull request Jan 3, 2024
simon-mo (Collaborator) commented Jan 5, 2024

it would be better to separate refactor openai server and function call feature with two PRs since it's too difficult to review.

I agree with @esmeetu's suggestion here. Can you keep this PR to just the function call implementation? It is complex enough to review on its own. The refactoring makes it hard to see the diff brought by the function call feature.

If we need to implement templates, it means that ideally we should remove all text instructions in serving.py.

I think vLLM can definitely ship with a default set of templates if needed, as part of the package data. A user-configurable override should always be possible on top of that. Note that other models tuned on tool use have different formats, e.g. ChatGLM: https://github.com/THUDM/ChatGLM3/blob/main/PROMPT_en.md#tool-calling. Airoboros also has slightly different input acceptance: https://github.com/cpacker/MemGPT/blob/febee38db315b17da393bc0849392225ee5cd7b4/memgpt/local_llm/llm_chat_completion_wrappers/airoboros.py#L293-L316
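
A minimal sketch of the packaged-default-plus-override idea (the package path and helper name are assumptions, not the actual vLLM layout):

import json
from importlib import resources
from typing import Optional

def load_tool_template(override_path: Optional[str] = None) -> dict:
    # Prefer a user-supplied template file; fall back to the packaged default.
    if override_path is not None:
        with open(override_path) as f:
            return json.load(f)
    # Hypothetical package-data location for a default template.
    default = resources.files("vllm.entrypoints.openai").joinpath("tool_template.json")
    return json.loads(default.read_text())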

FlorianJoncour (Contributor, Author) commented Jan 6, 2024

OK, so I'll close this PR and do it in multiple stages.

Edit: #2360

simon-mo (Collaborator) commented Jan 6, 2024

Thank you ❤️

viktor-ferenczi (Contributor)

This PR should help with generating function calls that are always valid according to their schema:

#2105: Add grammars

wybartel commented Jan 16, 2024

@FlorianJoncour it is worth noting that, trying this out yesterday with Autogen, I had to manually add "tool_choice": "auto" to my llm_config (using the latest autogen 0.2.6) in order to get it to pick up the tools and inject them into the prompt. It seems Autogen doesn't pass a tool choice by default.

if request.tool_choice is not None

# Had to add tool_choice for Autogen 0.2.6
llm_config = {
    "seed": 42,
    "config_list": config_list,
    "temperature": 0.0,
    "timeout": 3000,
    "tool_choice": "auto",
}

AguirreNicolas (Contributor) commented Jan 17, 2024

@FlorianJoncour I was facing the same problem, but using the ChatOpenAI class in LangChain. Until @wybartel's comment (by the way, thanks a lot!) I couldn't find the reason why your tool feature was not being called.

def inject_prompt(self, request: ChatCompletionRequest):
    """ Tested with :
    https://huggingface.co/mlabonne/NeuralHermes-2.5-Mistral-7B/discussions/3 """
    if request.tool_choice is not None and request.tools is not None and request.tool_choice == "auto":
Contributor

It would be nice to have support for tool_choice with a function name, to force the model to call that function. It'd probably need a different prompt for that.

The tool_choice definition from https://platform.openai.com/docs/api-reference/chat/create#chat-create-tool_choice:

Controls which (if any) function is called by the model. none means the model will not call a function and instead generates a message. auto means the model can pick between generating a message or calling a function. Specifying a particular function via {"type": "function", "function": {"name": "my_function"}} forces the model to call that function.
none is the default when no functions are present. auto is the default if functions are present.
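
For reference, forcing a specific function with the openai client would then look like this (assuming the client and tools from the earlier examples; the served model name is an assumption):

response = client.chat.completions.create(
    model="NeuralHermes-2.5-Mistral-7B",  # assumed served model name
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    # Per the OpenAI API, this forces the model to call get_current_weather.
    tool_choice={"type": "function", "function": {"name": "get_current_weather"}},
)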

FlorianJoncour (Contributor, Author)

@Pernekhan @AguirreNicolas @wybartel The new PR fixes all these issues: #2488
