
OpenAI API refactoring + Functions calling #2210


Closed
wants to merge 16 commits into from

Conversation

FlorianJoncour (Contributor) commented Dec 20, 2023

Hello,

I had to implement function call handling, so I implemented it directly in vLLM.
To do this, there are two major changes.

Firstly, the OpenAI API has been refactored to separate the server and generation. This makes the whole thing a bit clearer and also allows for creating an OpenAI server outside of Uvicorn.
Next, I implemented something similar to the "tools" in the OpenAI API. (Tools are more general replacements for direct function calls, which are now deprecated).
For now, only function calls are supported.

The system (like OpenAI apparently) works by injecting a prompt and capturing the result.
https://platform.openai.com/docs/guides/function-calling

I developed with the NeuralHermes-2.5-Mistral-7B model, and it works very well.
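
For reference, a minimal sketch of the kind of request this enables, using the official openai Python client pointed at a local vLLM server (the base_url, api_key and served model name are assumptions for a local deployment, and the server must be started with the enable-api-tools flag mentioned in the notes below):

from openai import OpenAI

# Assumed local vLLM OpenAI-compatible endpoint; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="NeuralHermes-2.5-Mistral-7B",  # assumed served model name
    messages=[{"role": "user", "content": "Quel temps fait il à Bordeaux ?"}],
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)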

Some notes:

  • It is disabled by default to avoid unexpected issues. An enable-api-tools parameter has been added.
  • Besides the refactoring and the added code, there are no regressions; the rest of the code hasn't changed much.
  • This only applies to chat/completions.
  • The result depends on the model's ability to process the prompt.
  • Can handle multiple function calls.
  • If a function call is requested by the model but there's an error, regular generation takes over.
  • Tested in a multilingual situation (French in my case), the English injected prompt does not affect user instructions (although this may depend on the quality of the model).
  • Adds about 250 tokens to the prompt, plus the tokens for the user-declared functions.

Improvement points:
Most current models are not fine-tuned for making function calls, and I haven't found a reference for a specific token indicating a function call by the model.
So I defined something that should work well.
But since prompts now use templates, maybe a future update could use them to define a function call token and thus force future models to also use one, which should make capturing function calls more reliable.

And maybe the injected prompt could also be put into a template.

esmeetu (Member) commented Dec 20, 2023

Hey, @FlorianJoncour. Thanks for your work! I have tested this, but it doesn't work with OpenAI's official example, because the current API server doesn't support messages containing objects, which the second request in that example uses.

esmeetu (Member) commented Dec 20, 2023

For the min_p typo, I think you could commit that in a separate PR.

FlorianJoncour (Contributor, Author) commented Dec 20, 2023

I'm not sure exactly which example you're referring to, so I'm not sure if I've solved your issue, but there was indeed an error when generation was in stream mode; it should return an array.

By the way, it should work now (using the official OpenAI library):

Prompt:

{'role': 'assistant', 'content': 'Tu es un assistant intelligent qui répond aux questions.'}
{'role': 'user', 'content': 'Quel temps fait il à Bordeaux ?'}
{'type': 'function', 'function': {'name': 'get_current_weather', 'description': 'Get the current weather in a given location', 'parameters': {'type': 'object', 'properties': {'location': {'type': 'string', 'description': 'The city and state, e.g. San Francisco, CA'}, 'unit': {'type': 'string', 'enum': ['celsius', 'fahrenheit']}}, 'required': ['location']}}}

Batch mode:

Request (stream=False) : Quel temps fait il à Bordeaux ?
Tool call : ChatCompletionMessageToolCall(id='call_get_current_weather', function=Function(arguments='{"location": "Bordeaux, FR"}', name='get_current_weather'), type='function')

Stream mode:

Request (stream=True) : Quel temps fait il à Bordeaux ?
Tool call : ChoiceDeltaToolCall(index=0, id='call_get_current_weather', function=ChoiceDeltaToolCallFunction(arguments='{"location": "Bordeaux, FR"}', name='get_current_weather'), type='function')

May I add an example, since there are already examples to test the API?
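
For reference, a minimal sketch of how the stream-mode tool call deltas shown above could be collected client-side (it assumes a client and tools set up as in the earlier sketch; the accumulation logic is illustrative, while the field names follow the official openai-python types):

stream = client.chat.completions.create(
    model="NeuralHermes-2.5-Mistral-7B",  # assumed served model name
    messages=[{"role": "user", "content": "Quel temps fait il à Bordeaux ?"}],
    tools=tools,
    tool_choice="auto",
    stream=True,
)

tool_calls = {}  # tool call index -> accumulated id/name/arguments
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    for call in delta.tool_calls or []:
        entry = tool_calls.setdefault(call.index, {"id": call.id, "name": "", "arguments": ""})
        if call.function and call.function.name:
            entry["name"] += call.function.name
        if call.function and call.function.arguments:
            entry["arguments"] += call.function.arguments
print(tool_calls)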

esmeetu (Member) commented Dec 20, 2023

@FlorianJoncour This code is copied from https://platform.openai.com/docs/guides/function-calling:

from openai import OpenAI
import json

client = OpenAI()

# Example dummy function hard coded to return the same weather
# In production, this could be your backend API or an external API
def get_current_weather(location, unit="fahrenheit"):
    """Get the current weather in a given location"""
    if "tokyo" in location.lower():
        return json.dumps({"location": "Tokyo", "temperature": "10", "unit": unit})
    elif "san francisco" in location.lower():
        return json.dumps({"location": "San Francisco", "temperature": "72", "unit": unit})
    elif "paris" in location.lower():
        return json.dumps({"location": "Paris", "temperature": "22", "unit": unit})
    else:
        return json.dumps({"location": location, "temperature": "unknown"})

def run_conversation():
    # Step 1: send the conversation and available functions to the model
    messages = [{"role": "user", "content": "What's the weather like in San Francisco, Tokyo, and Paris?"}]
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get the current weather in a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                    },
                    "required": ["location"],
                },
            },
        }
    ]
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=messages,
        tools=tools,
        tool_choice="auto",  # auto is default, but we'll be explicit
    )
    response_message = response.choices[0].message
    tool_calls = response_message.tool_calls
    # Step 2: check if the model wanted to call a function
    if tool_calls:
        # Step 3: call the function
        # Note: the JSON response may not always be valid; be sure to handle errors
        available_functions = {
            "get_current_weather": get_current_weather,
        }  # only one function in this example, but you can have multiple
        messages.append(response_message)  # extend conversation with assistant's reply
        # Step 4: send the info for each function call and function response to the model
        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_to_call = available_functions[function_name]
            function_args = json.loads(tool_call.function.arguments)
            function_response = function_to_call(
                location=function_args.get("location"),
                unit=function_args.get("unit"),
            )
            messages.append(
                {
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": function_name,
                    "content": function_response,
                }
            )  # extend conversation with function response
        second_response = client.chat.completions.create(
            model="gpt-3.5-turbo-1106",
            messages=messages,
        )  # get a new response from the model where it can see the function response
        return second_response
print(run_conversation())

Can you execute this example smoothly? The initial request is fine, but the second one isn't supported.

FlorianJoncour (Contributor, Author) commented Dec 20, 2023

So you were right in your first message! Thank you.

I added all the types for ChatCompletionRequest.messages as described in the OpenAI docs, and requests don't crash anymore.
But we had trouble with the tokenizer: it raised an exception because message.content may be empty when message.tool_calls contains the call data.
So I added a trick to format and copy that data into message.content.

It should work now.
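
For illustration only, a minimal sketch of that kind of workaround, i.e. flattening tool_calls into textual content before tokenizer.apply_chat_template (the exact formatting used in the PR may differ):

def flatten_tool_calls(message: dict) -> dict:
    # If an assistant message carries tool_calls but no text, synthesize a
    # textual content so the chat template / tokenizer doesn't fail on None.
    if not message.get("content") and message.get("tool_calls"):
        lines = []
        for call in message["tool_calls"]:
            fn = call["function"]
            lines.append(f"{fn['name']} was called with arguments : {fn['arguments']}")
        message = dict(message, content="\n".join(lines))
    return message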

esmeetu (Member) commented Dec 21, 2023

Great! It seems this PR fixes #1869.
Another thought: it would be better to separate the OpenAI server refactor and the function call feature into two PRs, since this one is too difficult to review.
cc @simon-mo

esmeetu (Member) commented Dec 21, 2023

Hey, @FlorianJoncour. I tested the latest commit again. The second_response seems wrong, and I found that the server didn't handle the tool results in messages because the chat template doesn't recognize the tool role. Could you show me your chat template?

FlorianJoncour (Contributor, Author)

I use the default template.
However, there was an issue with Pydantic, where the wrong type was selected during queries.
Now, roles are literals, and elements of messages are a union (and not a union of lists, which was of course an error).
So, now the messages have the correct types.

When I display the prompt server-side, right after tokenizer.apply_chat_template, I get this:

<|im_start|>user
What's the weather like in San Francisco, Tokyo, and Paris?<|im_end|>
<|im_start|>assistant
call_get_current_weather was called with arguments : {"location": "San Francisco"}
call_get_current_weather was called with arguments : {"location": "Tokyo"}
call_get_current_weather was called with arguments : {"location": "Paris"}
<|im_end|>
<|im_start|>tool
call_get_current_weather -> {"location": "San Francisco", "temperature": "72", "unit": null}<|im_end|>
<|im_start|>tool
call_get_current_weather -> {"location": "Tokyo", "temperature": "10", "unit": null}<|im_end|>
<|im_start|>tool
call_get_current_weather -> {"location": "Paris", "temperature": "22", "unit": null}<|im_end|>
<|im_start|>assistant

And so the final response is:

ChatCompletion(id='cmpl-2d769e9116354c8595feb2bc4672a657', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The current weather in San Francisco is 72 degrees Fahrenheit, Tokyo is 10 degrees Fahrenheit, and Paris is 22 degrees Fahrenheit.', role='assistant', function_call=None, tool_calls=None))], created=12717, model='gpt-3.5-turbo', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=37, prompt_tokens=199, total_tokens=236))

Don't mind the model="gpt-3.5-turbo"; I set this name to keep things simple with various tools.
If this is not what you expect, can you give me more information?

esmeetu (Member) commented Dec 21, 2023

@FlorianJoncour I made some modifications to your code, and it works now. I might post my changes in the future. As for LLM models, I think codellama-sft models are better (like phind-codellama).
I love this feature, thanks, and I hope it gets merged ASAP. 🤩

leoterry-ulrica

Has the merge been completed? @FlorianJoncour @esmeetu

schema = json.dumps(tool.function.parameters, indent=4)
text_inject += f"```\njsonschema\n{schema}\n```"
text_inject += (
Contributor

Should this be hard-coded? Depending on how the model was trained for function calling, the specific injected text could have a large impact on the model's performance. It seems like a good idea to make this configurable, IMO.

Contributor

Following what was suggested by @Tostino, I think it could be very handy to incorporate a tool_template.json path to personalize the tool prompt (probably preserving the current one as the default).

In particular, the way the json_schema_params is constructed is fine, IMO; I wouldn't change it.
Nevertheless, incorporating a tool_template.json like:

{
    "prefix": "str",
    "suffix": "str",
    "order": "top",
    "func_call_token": "str"
}

could provide enough freedom for adapting to different models when inject_prompt is performed. (For ease of understanding, please refer to the image below.)

What do you think, @simon-mo?

[Image: tool_template]
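
A minimal sketch of how such a template file could be consumed when building the injected text (the field names follow the JSON above and are only a proposal, not existing vLLM code):

import json

def build_tool_injection(template_path: str, json_schema_params: str) -> tuple:
    # Load the proposed tool_template.json and wrap the generated schema text.
    with open(template_path) as f:
        template = json.load(f)
    text = template["prefix"] + json_schema_params + template["suffix"]
    # "order" would decide whether the block goes before or after the user prompt;
    # "func_call_token" is what the server would look for in the model output.
    return text, template["order"], template["func_call_token"]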

FlorianJoncour (Contributor, Author) commented Jan 4, 2024

It's a good idea and easy to implement.

If we need to implement templates, it means that ideally we should remove all text instructions in serving.py.
So, we need to define a path to store a default tool_template.json.

The examples/ folder is not suitable.

There are four options:

  • Hardcode it as a string in serving.py (which I don't like)
  • The vllm/entrypoints/openai/ folder (I'm undecided)
  • Define a vllm/data/ folder or something similar (I'm undecided)
  • Add a command-line argument to allow users to define a configuration directory (or $HOME/.config/vllm/openai by default) and store data there.

I prefer the last option, knowing that a general-purpose directory might be used for other types of templates.
Besides function calls, LLMs (starting with ChatGPT) can be extended with embeddings (which could potentially be implemented in vLLM in the future), a dynamic Python interpreter, etc.

Again, I am not sure if this is the primary goal of vLLM, but having an OpenAI-compatible server with most features ready to use seems as important as inference performance!

OpenAI's documentation gives an idea of what could potentially be implemented.
https://platform.openai.com/docs/assistants/tools

Edit: Another option could be a command-line argument defining the path to a single file, as is already the case with chat-template. Maybe that would be preferred, but I think it's less general, and if we take this approach, in a year we might end up with 4 or 5 similar arguments of the same kind.

Contributor

I'd follow the same approach as chat-template, command-line arg with a path.
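
For example, something along these lines, where --tool-template is hypothetical and --chat-template is the existing precedent:

python -m vllm.entrypoints.openai.api_server \
    --model mlabonne/NeuralHermes-2.5-Mistral-7B \
    --chat-template ./chat_template.jinja \
    --tool-template ./tool_template.json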

@@ -52,9 +52,60 @@ class UsageInfo(BaseModel):
    completion_tokens: Optional[int] = 0


class FunctionCall(BaseModel):
Contributor

It would probably be cleaner to use the same class names as OAI for both FunctionCall and ToolCallsMessage:

https://github.com/openai/openai-python/blob/main/src/openai/types/chat/chat_completion_message_tool_call.py
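
For reference, the linked OpenAI types look roughly like this (a sketch of the shapes the comment suggests mirroring; the actual PR classes may differ):

from typing import Literal
from pydantic import BaseModel

class Function(BaseModel):
    # JSON-encoded arguments produced by the model, plus the function name.
    arguments: str
    name: str

class ChatCompletionMessageToolCall(BaseModel):
    id: str
    function: Function
    type: Literal["function"]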

FlorianJoncour (Contributor, Author)

I apologize for the delay, but I am French and vacations are important!

I have significantly improved the parsing to capture function calls even if the model doesn't start its message with !function_call, which greatly reduces the number of false negatives.

I also added a usage example based on OpenAI's documentation (mentioned by @esmeetu), extended to use two functions.

Finally, I also renamed some types to match the types defined in OpenAI's implementation, as suggested by @AguirreNicolas.

The current implementation seems good to me, but I'm not sure if it will be merged, as it's not the primary goal of vLLM. If it's not merged, I will publish it in a separate repository.

simon-mo self-assigned this Jan 3, 2024
simon-mo (Collaborator) commented Jan 3, 2024

Thank you for your contribution! I will make a pass soon.

The current implementation seems good to me, but I'm not sure if it will be merged, as it's not the primary goal of vLLM. If it's not merged, I will publish it in a separate repository.

Yes, I will make sure this gets into vLLM. In fact, it is one of our priorities.

joennlae added a commit to joennlae/vllm that referenced this pull request Jan 3, 2024
simon-mo (Collaborator) commented Jan 5, 2024

it would be better to separate refactor openai server and function call feature with two PRs since it's too difficult to review.

I agree with @esmeetu's suggestion here. Can you keep this PR to just the function call implementation? It is complex enough to review on its own. The refactoring makes it hard to see the diff brought by the function call feature.

If we need to implement templates, it means that ideally we should remove all text instructions in serving.py.

I think vLLM can definitely ship with a default set of templates if needed, as part of the package data. A user-configurable override should always be possible on top of that. Note that other models tuned on tool use have different formats, e.g. ChatGLM: https://github.com/THUDM/ChatGLM3/blob/main/PROMPT_en.md#tool-calling. Airoboros also has slightly different input acceptance: https://github.com/cpacker/MemGPT/blob/febee38db315b17da393bc0849392225ee5cd7b4/memgpt/local_llm/llm_chat_completion_wrappers/airoboros.py#L293-L316
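
A minimal sketch of the packaged-default-plus-override idea (the package path and helper name are assumptions, not the actual vLLM layout):

import json
from importlib import resources
from typing import Optional

def load_tool_template(override_path: Optional[str] = None) -> dict:
    # Prefer a user-supplied template file; fall back to the packaged default.
    if override_path is not None:
        with open(override_path) as f:
            return json.load(f)
    # Hypothetical package-data location for a default template.
    default = resources.files("vllm.entrypoints.openai").joinpath("tool_template.json")
    return json.loads(default.read_text())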

FlorianJoncour (Contributor, Author) commented Jan 6, 2024

OK, so I'll close this PR and do it in multiple stages.

Edit: #2360

simon-mo (Collaborator) commented Jan 6, 2024

Thank you ❤️

viktor-ferenczi (Contributor)

This PR should help with generating function calls that are always valid according to their schema:

#2105: Add grammars

wybartel commented Jan 16, 2024

@FlorianJoncour it is worth noting that, trying this out yesterday with Autogen, I had to manually add "tool_choice": "auto" to my llm_config (using the latest autogen 0.2.6) in order to get it to pick up the tools and inject them into the prompt. It seems Autogen doesn't pass a tool choice by default.

if request.tool_choice is not None

# Had to add tool_choice for Autogen 0.2.6
llm_config = {
    "seed": 42,
    "config_list": config_list,
    "temperature": 0.0,
    "timeout": 3000,
    "tool_choice": "auto",
}

AguirreNicolas (Contributor) commented Jan 17, 2024

@FlorianJoncour I was facing the same problem, but using the ChatOpenAI class in LangChain. Until @wybartel's comment (by the way, thanks a lot!) I couldn't find the reason why your tool feature was not being called.

def inject_prompt(self, request: ChatCompletionRequest):
    """ Tested with :
    https://huggingface.co/mlabonne/NeuralHermes-2.5-Mistral-7B/discussions/3 """
    if request.tool_choice is not None and request.tools is not None and request.tool_choice == "auto":
Contributor

It would be nice to have support for tool_choice with a function name, to force the model to call that function. It'd probably need a different prompt for that.

The tool_choice definition from https://platform.openai.com/docs/api-reference/chat/create#chat-create-tool_choice:

Controls which (if any) function is called by the model. none means the model will not call a function and instead generates a message. auto means the model can pick between generating a message or calling a function. Specifying a particular function via {"type": "function", "function": {"name": "my_function"}} forces the model to call that function.
none is the default when no functions are present. auto is the default if functions are present.
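
For reference, forcing a specific function with the openai client would then look like this (assuming the client and tools from the earlier examples; the served model name is an assumption):

response = client.chat.completions.create(
    model="NeuralHermes-2.5-Mistral-7B",  # assumed served model name
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    # Per the OpenAI API, this forces the model to call get_current_weather.
    tool_choice={"type": "function", "function": {"name": "get_current_weather"}},
)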

FlorianJoncour (Contributor, Author)

@Pernekhan @AguirreNicolas @wybartel The new PR fixes all these issues: #2488
