Python: Introducing Realtime Clients for OpenAI and Azure OpenAI #10127

Merged: 51 commits merged into main from realtime on Mar 4, 2025
Changes from all commits (51 commits)
798d082
draft initial implementation of Realtime API
eavanvalkenburg Jan 8, 2025
7308bcb
major update
eavanvalkenburg Jan 9, 2025
d9ce937
updated note
eavanvalkenburg Jan 9, 2025
20ea3dc
reverted some changes
eavanvalkenburg Jan 9, 2025
eff4765
WIP ADR
eavanvalkenburg Jan 10, 2025
7cde9da
small updates
eavanvalkenburg Jan 10, 2025
fe1be54
webrtc WIP
eavanvalkenburg Jan 14, 2025
6faf93f
updated ADR
eavanvalkenburg Jan 16, 2025
43bc2f3
webrtc working!
eavanvalkenburg Jan 17, 2025
8132882
added dependency
eavanvalkenburg Jan 17, 2025
b5c5443
added dep
eavanvalkenburg Jan 17, 2025
6120ba1
added nd
eavanvalkenburg Jan 17, 2025
ecdb16a
renamed
eavanvalkenburg Jan 17, 2025
8a2a525
changed import
eavanvalkenburg Jan 17, 2025
4bef21a
restructured
eavanvalkenburg Jan 20, 2025
a6d317d
fix import
eavanvalkenburg Jan 20, 2025
b8ff264
small optimization in code
eavanvalkenburg Jan 21, 2025
89f1988
updates to the ADR
eavanvalkenburg Jan 22, 2025
ee1ce02
import improvements
eavanvalkenburg Jan 23, 2025
da370c3
updated code and ADR
eavanvalkenburg Jan 28, 2025
9fb0eb7
wip on redoing the api
eavanvalkenburg Jan 29, 2025
f02e5d8
WIP
eavanvalkenburg Jan 30, 2025
7434c70
removed built-in audio players, split for websocket and rtc
eavanvalkenburg Jan 31, 2025
0911c04
add image event import
eavanvalkenburg Jan 31, 2025
d9e5fe6
naming updates and added call
eavanvalkenburg Feb 12, 2025
43e5fb1
redid realtimeevents
eavanvalkenburg Feb 13, 2025
b4d5482
WIP azure
eavanvalkenburg Feb 13, 2025
ca80839
working azure realtime websockets
eavanvalkenburg Feb 14, 2025
1643196
added call automation sample
eavanvalkenburg Feb 14, 2025
363f9db
added function calling sample with azure
eavanvalkenburg Feb 17, 2025
ba7e312
much improvement to the call automation sample
eavanvalkenburg Feb 17, 2025
b0334f2
remove computed field
eavanvalkenburg Feb 17, 2025
b35661a
cleanup
eavanvalkenburg Feb 17, 2025
acc7e20
small fix in sample
eavanvalkenburg Feb 17, 2025
35ce793
fix for binary content
eavanvalkenburg Feb 17, 2025
7c6ad49
additional experimental markers
eavanvalkenburg Feb 17, 2025
ac856d4
fixed mypy
eavanvalkenburg Feb 18, 2025
6314756
binary content fix
eavanvalkenburg Feb 18, 2025
9d26bfa
addressed comments
eavanvalkenburg Feb 20, 2025
7e4c88f
moved events into a file
eavanvalkenburg Feb 21, 2025
7903a13
updated lock
eavanvalkenburg Feb 21, 2025
f5e24ec
fix typo
eavanvalkenburg Feb 21, 2025
9249877
restructured realtime
eavanvalkenburg Feb 24, 2025
aed7380
first set of tests
eavanvalkenburg Feb 25, 2025
d398695
added more tests
eavanvalkenburg Feb 25, 2025
39739b7
added audio callback to receive
eavanvalkenburg Feb 26, 2025
db01504
added tests and improved samples
eavanvalkenburg Feb 28, 2025
f238ee8
updated names of the samples and added readme
eavanvalkenburg Mar 3, 2025
83dbe68
typo
eavanvalkenburg Mar 3, 2025
ec90ca7
updated sample instructions
eavanvalkenburg Mar 3, 2025
1b4f3ef
Merge branch 'main' into realtime
moonbox3 Mar 4, 2025
3 changes: 2 additions & 1 deletion python/.cspell.json
@@ -47,6 +47,7 @@
"logprobs",
"mistralai",
"mongocluster",
"nd",
"ndarray",
"nopep",
"NOSQL",
@@ -73,4 +74,4 @@
"vertexai",
"Weaviate"
]
}
}
2 changes: 1 addition & 1 deletion python/.vscode/launch.json
@@ -10,7 +10,7 @@
"request": "launch",
"program": "${file}",
"console": "integratedTerminal",
"justMyCode": true
"justMyCode": false
},
{
"name": "Python FastAPI app with Dapr",
Expand Down
6 changes: 4 additions & 2 deletions python/pyproject.toml
@@ -128,6 +128,10 @@ dapr = [
"dapr-ext-fastapi>=1.14.0",
"flask-dapr>=1.14.0"
]
realtime = [
"websockets >= 13, < 15",
"aiortc>=1.9.0",
]

[tool.uv]
prerelease = "if-necessary-or-explicit"
@@ -225,5 +229,3 @@ name = "semantic_kernel"
[build-system]
requires = ["flit-core >= 3.9,<4.0"]
build-backend = "flit_core.buildapi"
-
-
50 changes: 50 additions & 0 deletions python/samples/concepts/realtime/README.md
@@ -0,0 +1,50 @@
# Realtime Multi-modal API Samples

These samples are more complex than most because of the nature of these APIs. They are designed to run in real time and require a microphone and speaker connected to your computer.

To run these samples, you will need the following setup:

- Environment variables for OpenAI (websocket or WebRTC), with your key and OPENAI_REALTIME_MODEL_ID set.
- Environment variables for Azure (websocket only), with your endpoint, optionally a key, and AZURE_OPENAI_REALTIME_DEPLOYMENT_NAME set. The API version needs to be at least `2024-10-01-preview`. A sample `.env` sketch follows this list.
- To run the samples with a simple class that handles the incoming and outgoing sound, install the following packages in your environment:
  - semantic-kernel[realtime]
  - pyaudio
  - sounddevice
  - pydub

  e.g. `pip install pyaudio sounddevice pydub semantic_kernel[realtime]`
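
For reference, a minimal `.env` sketch. The key and endpoint variable names are assumptions based on the usual Semantic Kernel settings, and the model ID is only an example:

```
OPENAI_API_KEY=sk-...
OPENAI_REALTIME_MODEL_ID=gpt-4o-realtime-preview

AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com
AZURE_OPENAI_API_KEY=...   # optional when using keyless auth
AZURE_OPENAI_REALTIME_DEPLOYMENT_NAME=<your-deployment>
AZURE_OPENAI_API_VERSION=2024-10-01-preview
```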

The samples all run as Python scripts that can be started directly or through your IDE.

All demos have similar output: the instructions are printed, and each new *response item* from the API is placed on a new `Mosscap (transcript):` line. The nature of these APIs is such that the transcript arrives before the spoken audio, so if you interrupt the audio, the transcript will not match the audio.

The realtime APIs work by the server sending events to you and you sending events back to the server; this is fully asynchronous. The samples show how you can listen to the events sent by the server; some are handled by the code in the samples, others are not. For instance, one could add a clause to the match statement in the receive loop that logs the usage data included in the `response.done` event, as sketched below.
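
A minimal sketch of such a clause, assuming `ListenEvents` also exposes a `RESPONSE_DONE` member for the `response.done` event (the samples only handle `RESPONSE_CREATED` and `ERROR`) and that the usage data sits on the raw service event:

```python
async for event in realtime_client.receive(audio_output_callback=audio_player.client_callback):
    match event.service_type:
        case ListenEvents.RESPONSE_DONE:
            # response.done carries the usage statistics for the completed response
            usage = getattr(event.service_event.response, "usage", None)
            logger.info(f"Usage: {usage}")
        case ListenEvents.ERROR:
            logger.error(event.service_event)
```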

For more info on the events, see our documentation, as well as the documentation of [OpenAI](https://platform.openai.com/docs/guides/realtime) and [Azure](https://learn.microsoft.com/en-us/azure/ai-services/openai/realtime-audio-quickstart?tabs=keyless%2Cmacos&pivots=programming-language-python).

## Simple chat samples

### [Simple chat with realtime websocket](./simple_realtime_chat_websocket.py)

This sample uses the websocket API with Azure OpenAI to run a simple voice-based interaction. If you want to use this sample with OpenAI, just change `AzureRealtimeWebsocket` into `OpenAIRealtimeWebsocket`, as shown below.
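
A minimal sketch of that swap, assuming `OpenAIRealtimeWebsocket` is exported alongside `AzureRealtimeWebsocket`:

```python
from semantic_kernel.connectors.ai.open_ai import OpenAIRealtimeWebsocket

# instead of: realtime_client = AzureRealtimeWebsocket()
realtime_client = OpenAIRealtimeWebsocket()  # uses the OpenAI key and OPENAI_REALTIME_MODEL_ID
```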

### [Simple chat with realtime WebRTC](./simple_realtime_chat_webrtc.py)

This sample uses the WebRTC API with OpenAI to run a simple voice-based interaction. Because of the way the WebRTC protocol works, it needs a different player and recorder than the websocket version.

## Function calling samples

The following two samples use function calling with these functions:

- get_weather: This function returns the weather for a given city; it is randomly generated and not based on any real data.
- get_date_time: This function returns the current date and time.
- goodbye: This function ends the conversation.

A line is logged whenever one of these functions is called.

### [Chat with function calling Websocket](./realtime_chat_with_function_calling_websocket.py)

This sample uses the websocket API with Azure OpenAI to run the interaction with the voice model, now with function calling.

### [Chat with function calling WebRTC](./realtime_chat_with_function_calling_webrtc.py)

This sample uses the WebRTC API with OpenAI to run the interaction with the voice model, now with function calling.
143 changes: 143 additions & 0 deletions python/samples/concepts/realtime/realtime_chat_with_function_calling_webrtc.py
@@ -0,0 +1,143 @@
# Copyright (c) Microsoft. All rights reserved.

import asyncio
import logging
from datetime import datetime
from random import randint

from samples.concepts.realtime.utils import AudioPlayerWebRTC, AudioRecorderWebRTC, check_audio_devices
from semantic_kernel import Kernel
from semantic_kernel.connectors.ai import FunctionChoiceBehavior
from semantic_kernel.connectors.ai.open_ai import (
    ListenEvents,
    OpenAIRealtimeExecutionSettings,
    OpenAIRealtimeWebRTC,
    TurnDetection,
)
from semantic_kernel.contents import ChatHistory
from semantic_kernel.contents.realtime_events import RealtimeTextEvent
from semantic_kernel.functions import kernel_function

logging.basicConfig(level=logging.WARNING)
utils_log = logging.getLogger("samples.concepts.realtime.utils")
utils_log.setLevel(logging.INFO)
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

"""
This simple sample demonstrates how to use the OpenAI Realtime API to create
a chat bot that can listen and respond directly through audio.
It requires installing:
- semantic-kernel[realtime]
- pyaudio
- sounddevice
- pydub
e.g. pip install pyaudio sounddevice pydub semantic_kernel[realtime]

For more details of the exact setup, see the README.md in the realtime folder.
"""

# The characteristics of your speaker and microphone are a big factor in a smooth conversation,
# so you may need to try out different devices for each.
# You can also play around with the turn_detection settings to get the best results.
# Device IDs are set on the AudioRecorderWebRTC and AudioPlayerWebRTC classes,
# so you may need to adjust these for your system.
# You can disable the check for available devices by commenting out the line below.
check_audio_devices()


@kernel_function
def get_weather(location: str) -> str:
    """Get the weather for a location."""
    weather_conditions = ("sunny", "hot", "cloudy", "raining", "freezing", "snowing")
    weather = weather_conditions[randint(0, len(weather_conditions) - 1)]  # nosec
    logger.info(f"@ Getting weather for {location}: {weather}")
    return f"The weather in {location} is {weather}."


@kernel_function
def get_date_time() -> str:
    """Get the current date and time."""
    logger.info("@ Getting current datetime")
    return f"The current date and time is {datetime.now().isoformat()}."


@kernel_function
def goodbye():
    """When the user is done, say goodbye and then call this function."""
    logger.info("@ Goodbye has been called!")
    raise KeyboardInterrupt


async def main() -> None:
    print_transcript = True
    # create the Kernel and add some simple functions for function calling
    kernel = Kernel()
    kernel.add_functions(plugin_name="helpers", functions=[goodbye, get_weather, get_date_time])

    # create the audio player and audio track
    # both take a device_id parameter, which is the index of the device to use; if None, the default device is used
    audio_player = AudioPlayerWebRTC()
    # create the realtime client with the audio track;
    # the audio output callback can optionally be added here as well, or passed to the receive method
    realtime_client = OpenAIRealtimeWebRTC(audio_track=AudioRecorderWebRTC())

    # Create the settings for the session.
    # The realtime API does not use a system message, but takes instructions as a parameter for a session.
    # Another important setting to tune is the server_vad turn detection:
    # if this is turned off (by setting turn_detection=None), you will have to send
    # the "input_audio_buffer.commit" and "response.create" events to the realtime API
    # to signal the end of the user's turn and start the response.
    # Manual VAD is not part of this sample.
    # for more info: https://platform.openai.com/docs/api-reference/realtime-sessions/create#realtime-sessions-create-turn_detection
    settings = OpenAIRealtimeExecutionSettings(
        instructions="""
    You are a chat bot. Your name is Mosscap and
    you have one goal: figure out what people need.
    Your full name, should you need to know it, is
    Splendid Speckled Mosscap. You communicate
    effectively, but you tend to answer with long
    flowery prose.
    """,
        voice="alloy",
        turn_detection=TurnDetection(type="server_vad", create_response=True, silence_duration_ms=800, threshold=0.8),
        function_choice_behavior=FunctionChoiceBehavior.Auto(),
    )
    # and we can add a chat history to the conversation after starting it
    chat_history = ChatHistory()
    chat_history.add_user_message("Hi there, who are you?")
    chat_history.add_assistant_message("I am Mosscap, a chat bot. I'm trying to figure out what people need.")

    # the context manager calls the create_session method on the client and starts listening to the audio stream
    async with (
        audio_player,
        realtime_client(
            settings=settings,
            chat_history=chat_history,
            kernel=kernel,
            create_response=True,
        ),
    ):
        async for event in realtime_client.receive(audio_output_callback=audio_player.client_callback):
            match event:
                case RealtimeTextEvent():
                    if print_transcript:
                        print(event.text.text, end="")
                case _:
                    # OpenAI-specific events
                    match event.service_type:
                        case ListenEvents.RESPONSE_CREATED:
                            if print_transcript:
                                print("\nMosscap (transcript): ", end="")
                        case ListenEvents.ERROR:
                            logger.error(event.service_event)


if __name__ == "__main__":
    print(
        "Instructions: The model will start speaking immediately; "
        "this can be turned off by removing `create_response=True` above. "
        "The model will detect when you stop and automatically generate a response. "
        "Press ctrl + c to stop the program."
    )
    asyncio.run(main())
141 changes: 141 additions & 0 deletions python/samples/concepts/realtime/realtime_chat_with_function_calling_websocket.py
@@ -0,0 +1,141 @@
# Copyright (c) Microsoft. All rights reserved.

import asyncio
import logging
from datetime import datetime
from random import randint

from samples.concepts.realtime.utils import AudioPlayerWebsocket, AudioRecorderWebsocket
from semantic_kernel import Kernel
from semantic_kernel.connectors.ai import FunctionChoiceBehavior
from semantic_kernel.connectors.ai.open_ai import (
    AzureRealtimeExecutionSettings,
    AzureRealtimeWebsocket,
    ListenEvents,
    TurnDetection,
)
from semantic_kernel.contents import ChatHistory
from semantic_kernel.contents.realtime_events import RealtimeTextEvent
from semantic_kernel.functions import kernel_function

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

"""
This simple sample demonstrates how to use the OpenAI Realtime API to create
a chat bot that can listen and respond directly through audio.
It requires installing:
- semantic-kernel[realtime]
- pyaudio
- sounddevice
- pydub
e.g. pip install pyaudio sounddevice pydub semantic_kernel[realtime]

For more details of the exact setup, see the README.md in the realtime folder.
"""


@kernel_function
def get_weather(location: str) -> str:
    """Get the weather for a location."""
    weather_conditions = ("sunny", "hot", "cloudy", "raining", "freezing", "snowing")
    weather = weather_conditions[randint(0, len(weather_conditions) - 1)]  # nosec
    logger.info(f"@ Getting weather for {location}: {weather}")
    return f"The weather in {location} is {weather}."


@kernel_function
def get_date_time() -> str:
    """Get the current date and time."""
    logger.info("@ Getting current datetime")
    return f"The current date and time is {datetime.now().isoformat()}."


@kernel_function
def goodbye():
    """When the user is done, say goodbye and then call this function."""
    logger.info("@ Goodbye has been called!")
    raise KeyboardInterrupt


async def main() -> None:
    print_transcript = True
    # create the Kernel and add some simple functions for function calling
    kernel = Kernel()
    kernel.add_functions(plugin_name="helpers", functions=[goodbye, get_weather, get_date_time])

    # create the realtime client, in this case the Azure websocket client;
    # there are also OpenAI websocket and WebRTC clients
    # (see realtime_chat_with_function_calling_webrtc.py for an example of the WebRTC client)
    realtime_client = AzureRealtimeWebsocket()
    # create the audio player and audio recorder
    # both take a device_id parameter, which is the index of the device to use; if None, the default device is used
    audio_player = AudioPlayerWebsocket()
    audio_recorder = AudioRecorderWebsocket(realtime_client=realtime_client)

    # Create the settings for the session.
    # The realtime API does not use a system message, but takes instructions as a parameter for a session.
    # Another important setting to tune is the server_vad turn detection:
    # if this is turned off (by setting turn_detection=None), you will have to send
    # the "input_audio_buffer.commit" and "response.create" events to the realtime API
    # to signal the end of the user's turn and start the response.
    # Manual VAD is not part of this sample.
    # for more info: https://platform.openai.com/docs/api-reference/realtime-sessions/create#realtime-sessions-create-turn_detection
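    # NOTE: a hypothetical sketch of manual turn handling, not exercised by this sample;
    # the exact send API and event class are assumptions. With turn_detection=None you
    # would, after the user stops speaking, send something like:
    #   await realtime_client.send(RealtimeEvent(service_type="input_audio_buffer.commit"))
    #   await realtime_client.send(RealtimeEvent(service_type="response.create"))
    # to close the user's turn and ask the model to respond.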
    settings = AzureRealtimeExecutionSettings(
        instructions="""
    You are a chat bot. Your name is Mosscap and
    you have one goal: figure out what people need.
    Your full name, should you need to know it, is
    Splendid Speckled Mosscap. You communicate
    effectively, but you tend to answer with long
    flowery prose.
    """,
        # see https://platform.openai.com/docs/api-reference/realtime-sessions/create#realtime-sessions-create-voice for the full list of voices  # noqa: E501
        voice="alloy",
        turn_detection=TurnDetection(type="server_vad", create_response=True, silence_duration_ms=800, threshold=0.8),
        function_choice_behavior=FunctionChoiceBehavior.Auto(),
    )
    # and we can add a chat history to seed the conversation
    chat_history = ChatHistory()
    chat_history.add_user_message("Hi there, I'm based in Amsterdam.")
    chat_history.add_assistant_message(
        "I am Mosscap, a chat bot. I'm trying to figure out what people need, "
        "I can tell you what the weather is or the time."
    )

    # the context manager calls the create_session method on the client and starts listening to the audio stream
    async with (
        audio_player,
        audio_recorder,
        realtime_client(
            settings=settings,
            chat_history=chat_history,
            kernel=kernel,
            create_response=True,
        ),
    ):
        # the audio_output_callback can be added here or in the client constructor;
        # using it here gives the smoothest experience
        async for event in realtime_client.receive(audio_output_callback=audio_player.client_callback):
            match event:
                case RealtimeTextEvent():
                    if print_transcript:
                        print(event.text.text, end="")
                case _:
                    # OpenAI-specific events
                    match event.service_type:
                        case ListenEvents.RESPONSE_CREATED:
                            if print_transcript:
                                print("\nMosscap (transcript): ", end="")
                        case ListenEvents.ERROR:
                            print(event.service_event)
                            logger.error(event.service_event)


if __name__ == "__main__":
    print(
        "Instructions: The model will start speaking immediately; "
        "this can be turned off by removing `create_response=True` above. "
        "The model will detect when you stop and automatically generate a response. "
        "Press ctrl + c to stop the program."
    )
    asyncio.run(main())