Introduction
The "gpt-4o-realtime-preview" model has been released. In addition to text and audio input/output, it also supports calling custom functions via function calling.
As of October 2, 2024, there are issues such as 403 errors, and it seems the API is not usable. This article will be updated once it becomes available.
OpenAI has provided a JavaScript code sample on its website, and Azure has published a Python code sample on GitHub.
In this article, we will analyze Azure's sample code, "low_level_sample.py," to understand how it works.
Libraries
The required libraries are as follows:
python-dotenv
soundfile
numpy
scipy
Code Explanation
main Function
In the main function, it first loads the dotenv file to retrieve the API key and endpoint:
load_dotenv()
Next, it checks the arguments. This file is executed using the command python low_level_sample.py <audio file> <azure|openai>. You can choose either OpenAI or Azure OpenAI as the API:
if len(sys.argv) < 2:
    print("Usage: python sample.py <audio file> <azure|openai>")
    print("If second argument is not provided, it will default to azure")
    sys.exit(1)
Then, it uses asyncio to run the process asynchronously:
file_path = sys.argv[1]
if len(sys.argv) == 3 and sys.argv[2] == "openai":
    asyncio.run(with_openai(file_path))
else:
    asyncio.run(with_azure_openai(file_path))
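The argument handling above can be isolated into a small helper that is easy to test without starting a session. This is a minimal sketch, not part of the Azure sample: parse_args is a hypothetical name, and it raises SystemExit instead of printing and calling sys.exit(1).

```python
def parse_args(argv):
    """Return (file_path, backend) from a sys.argv-style list.

    Mirrors the checks described above: the audio file is required, and the
    backend is "azure" unless the second argument is exactly "openai".
    parse_args is a hypothetical helper, not a function from the sample.
    """
    if len(argv) < 2:
        raise SystemExit("Usage: python low_level_sample.py <audio file> <azure|openai>")
    file_path = argv[1]
    backend = "openai" if len(argv) == 3 and argv[2] == "openai" else "azure"
    return file_path, backend
```

Calling parse_args(sys.argv) would then give the file path and the backend name, which can be used to pick between with_openai and with_azure_openai.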
Next, let's look at the with_openai function.
with_openai Function
The API key and model name are retrieved from environment variables.
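As a rough illustration of that step, the sketch below reads both values and fails early if either is missing. The variable names OPENAI_API_KEY and OPENAI_MODEL and the helper read_openai_settings are assumptions made for this example; check the sample's own .env template for the names it actually expects.

```python
import os


def read_openai_settings(env=None):
    """Read the API key and model name from a mapping (os.environ by default).

    OPENAI_API_KEY and OPENAI_MODEL are assumed names used for illustration.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")
    model = env.get("OPENAI_MODEL")
    # Collect the names of any variables that are unset or empty.
    missing = [
        name
        for name, value in (("OPENAI_API_KEY", key), ("OPENAI_MODEL", model))
        if not value
    ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return key, model
```

Failing before the connection is opened makes a missing key easier to tell apart from a server-side error.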
Then, an instance of RTLowLevelClient is created:
async with RTLowLevelClient(key_credential=AzureKeyCredential(key), model=model) as client:
Next, a session.update message is sent:
await client.send(
    SessionUpdateMessage(session=SessionUpdateParams(turn_detection=ServerVAD(type="server_vad")))
)
Here, we specify "server_vad" for Voice Activity Detection (VAD). Although "server_vad" is the only option currently available, you can set options like detection threshold and allowable silence duration:
class ServerVAD(BaseModel):
    type: Literal["server_vad"] = "server_vad"
    threshold: Optional[Annotated[float, Field(strict=True, ge=0.0, le=1.0)]] = None
    prefix_padding_ms: Optional[int] = None
    silence_duration_ms: Optional[int] = None
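To make the options concrete without depending on Pydantic, here is a standard-library sketch that builds the same four fields as a plain dict. make_server_vad is a hypothetical helper, and it only mirrors the 0.0 to 1.0 range check on threshold, not Pydantic's strict type checking.

```python
def make_server_vad(threshold=None, prefix_padding_ms=None, silence_duration_ms=None):
    """Build a turn_detection dict with the same fields as ServerVAD.

    Hypothetical helper for illustration; only the threshold range is validated.
    """
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return {
        "type": "server_vad",
        "threshold": threshold,
        "prefix_padding_ms": prefix_padding_ms,
        "silence_duration_ms": silence_duration_ms,
    }
```

For example, make_server_vad(threshold=0.5, silence_duration_ms=600) sets a detection threshold and an allowable silence duration. The specific values here are arbitrary, not recommendations.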
The message is then converted to JSON before being sent:
async def send(self, message: UserMessageType):
    message_json = message.model_dump_json()
    await self.ws.send_str(message_json)
The model_dump_json method is defined in Pydantic's BaseModel and converts the model into a JSON string. The resulting JSON looks like this:
{
  "event_id": null,
  "type": "session.update",
  "session": {
    "model": null,
    "modalities": null,
    "voice": null,
    "instructions": null,
    "input_audio_format": null,
    "output_audio_format": null,
    "input_audio_transcription": null,
    "turn_detection": {
      "type": "server_vad",
      "threshold": null,
      "prefix_padding_ms": null,
      "silence_duration_ms": null
    },
    "tools": null,
    "tool_choice": null,
    "temperature": null,
    "max_response_output_tokens": null
  }
}
This is sent as a session.update event to configure the session. You can specify system instructions in the "instructions" field. For example, to set a system prompt, you can modify the code like this:
await client.send(
    SessionUpdateMessage(
        session=SessionUpdateParams(
            instructions="<your system instructions>",
            turn_detection=ServerVAD(type="server_vad")
        )
    )
)
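The serialized session.update message carries many null fields. If a more compact payload is wanted, for example for logging, the unset fields can be dropped before serializing. The sketch below does this on a plain dict with the standard json module. drop_none is a hypothetical helper, and whether the server treats a null field and an absent field identically is an assumption that should be verified. Pydantic's model_dump_json also accepts exclude_none=True, though the sample's send method does not pass it.

```python
import json


def drop_none(value):
    """Recursively remove dict keys whose value is None (hypothetical helper)."""
    if isinstance(value, dict):
        return {k: drop_none(v) for k, v in value.items() if v is not None}
    return value


# A trimmed-down session.update message using field names from the JSON above.
session_update = {
    "event_id": None,
    "type": "session.update",
    "session": {
        "instructions": "<your system instructions>",
        "voice": None,
        "turn_detection": {"type": "server_vad", "threshold": None},
    },
}

compact_json = json.dumps(drop_none(session_update))
print(compact_json)
```

Only the fields that were explicitly set remain in compact_json.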
Next, asyncio.gather is used to run both send_audio and receive_messages functions simultaneously:
await asyncio.gather(send_audio(client, audio_file_path), receive_messages(client))
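The pattern can be tried in isolation with two stand-in coroutines. In this sketch an asyncio.Queue plays the role of the connection, and send_chunks, receive_chunks, and run_both are hypothetical names. It illustrates asyncio.gather running a sender and a receiver concurrently, not the sample's actual network code.

```python
import asyncio


async def send_chunks(queue, chunks):
    """Stand-in for send_audio: push each chunk, then a None sentinel."""
    for chunk in chunks:
        await queue.put(chunk)
        await asyncio.sleep(0)  # yield control so the receiver can run
    await queue.put(None)


async def receive_chunks(queue):
    """Stand-in for receive_messages: collect items until the sentinel arrives."""
    received = []
    while True:
        item = await queue.get()
        if item is None:
            break
        received.append(item)
    return received


async def run_both(chunks):
    queue = asyncio.Queue()
    # gather returns results in the order the awaitables were passed.
    _, received = await asyncio.gather(send_chunks(queue, chunks), receive_chunks(queue))
    return received


result = asyncio.run(run_both(["a", "b", "c"]))
```

Because both coroutines run on the same event loop, the receiver can start processing while the sender is still pushing data, which is what lets responses arrive before the whole audio file has been sent.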
In the send_audio function, the audio file is read using soundfile, base64 encoded, and then sent as InputAudioBufferAppendMessage:
...
audio_data, original_sample_rate = sf.read(audio_file_path, dtype="int16", **extra_params)
...
audio_bytes = audio_data.tobytes()
for i in range(0, len(audio_bytes), bytes_per_chunk):
    chunk = audio_bytes[i : i + bytes_per_chunk]
    base64_audio = base64.b64encode(chunk).decode("utf-8")
    await client.send(InputAudioBufferAppendMessage(audio=base64_audio))
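The excerpt above elides how bytes_per_chunk is derived. One plausible way to compute it, shown here as a sketch under stated assumptions rather than the sample's exact code, is from the sample rate, a chunk duration, and the sample width. The 24 kHz rate, 100 ms duration, and the helper name iter_base64_chunks are all assumptions for this example.

```python
import base64


def iter_base64_chunks(audio_bytes, sample_rate=24000, chunk_ms=100, bytes_per_sample=2):
    """Yield base64 strings covering audio_bytes in fixed-duration chunks.

    Assumes mono 16-bit PCM; all defaults are illustrative assumptions.
    """
    samples_per_chunk = sample_rate * chunk_ms // 1000
    bytes_per_chunk = samples_per_chunk * bytes_per_sample
    for i in range(0, len(audio_bytes), bytes_per_chunk):
        chunk = audio_bytes[i : i + bytes_per_chunk]
        yield base64.b64encode(chunk).decode("utf-8")
```

Each yielded string could then be passed as the audio argument of InputAudioBufferAppendMessage. The final chunk is simply shorter when the data does not divide evenly.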
The audio data is sent as input_audio_buffer.append messages.
In the receive_messages function, responses based on the processed audio data from the send_audio function are received.
The session is established at "/openai/realtime", and messages are received asynchronously:
message = await client.recv()
The case structure handles different message types. The message types are explained here. Below are some of the important ones:
input_audio_buffer.committed
When the server-side Voice Activity Detection (VAD) detects that the user's speech has ended, the input_audio_buffer.committed message is sent.
input_audio_buffer.speech_started
When the server-side VAD detects the start of the user's speech in the input audio, input_audio_buffer.speech_started is sent. You can retrieve the start time using message.audio_start_ms.
input_audio_buffer.speech_stopped
When the server-side VAD detects that the user's speech has stopped, input_audio_buffer.speech_stopped is sent. You can retrieve the end time using message.audio_end_ms.
By monitoring speech events, it's possible to trigger spontaneous responses. For instance, using response.create, the AI can generate a response without waiting for further user input when a period of silence is detected.
conversation.item.created
This can be used to manage conversation history.
response.created
When a response is created, response.created is sent. For streaming processing, you can use response.text.delta and response.audio.delta.
The low_level_sample.py script does not handle audio output. To output audio, you need to retrieve the audio data and use tools like pyaudio for playback. Here's how to handle the audio data:
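# Inside the loop over incoming audio chunks: decode each base64 chunk and accumulate it.
audio_bytes = base64.b64decode(chunk.data)
audio_data.extend(audio_bytes)
# After the last chunk: convert the accumulated bytes to 16-bit samples and save them.
if audio_data is not None:
    print(prefix, f"Audio received with length: {len(audio_data)}")
    with open(os.path.join(out_dir, f"{item.id}.wav"), "wb") as out:
        audio_array = np.frombuffer(audio_data, dtype=np.int16)
        # Assumed completion: write the samples with soundfile, assuming 24 kHz PCM16 output.
        sf.write(out, audio_array, samplerate=24000)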
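If you only need to save the received audio rather than play it back, the standard library is enough. The sketch below collects base64-encoded PCM16 chunks and wraps them in a WAV container using the wave module. pcm16_deltas_to_wav is a hypothetical helper, and the 24 kHz mono format is an assumption about the output audio.

```python
import base64
import io
import wave


def pcm16_deltas_to_wav(deltas, sample_rate=24000, channels=1):
    """Decode base64 PCM16 chunks and return them as WAV file bytes.

    Hypothetical helper; sample_rate and channels are assumed defaults.
    """
    pcm = bytearray()
    for delta in deltas:
        pcm.extend(base64.b64decode(delta))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(pcm))
    return buffer.getvalue()
```

The returned bytes can be written to a .wav file or handed to a playback library. Real-time playback with pyaudio would instead feed each decoded chunk to an output stream as it arrives.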
I hope this article helps with your development.
If you found it useful, I would appreciate a positive rating.
Thank you!