diff --git a/docs/decisions/0065-realtime-api-clients.md b/docs/decisions/0065-realtime-api-clients.md new file mode 100644 index 000000000000..a27987aeaf00 --- /dev/null +++ b/docs/decisions/0065-realtime-api-clients.md @@ -0,0 +1,1770 @@ +--- +# These are optional elements. Feel free to remove any of them. +status: proposed +contact: eavanvalkenburg +date: 2025-01-31 +deciders: eavanvalkenburg, markwallace, alliscode, sphenry +consulted: westey-m, rbarreto, alliscode, markwallace, sergeymenshykh, moonbox3 +informed: taochenosu, dmytrostruk +--- + +# Multi-modal Realtime API Clients + +## Context and Problem Statement + +Multiple model providers are starting to enable realtime voice-to-voice or even multi-modal, realtime, two-way communication with their models, this includes OpenAI with their [Realtime API][openai-realtime-api] and [Google Gemini][google-gemini]. These API's promise some very interesting new ways of using LLM's for different scenario's, which we want to enable with Semantic Kernel. + +The key feature that Semantic Kernel brings into this system is the ability to (re)use Semantic Kernel function as tools with these API's. There are also options for Google to use video and images as input, this will likely not be implemented first, but the abstraction should be able to deal with it. + +> [!IMPORTANT] +> Both the OpenAI and Google realtime api's are in preview/beta, this means there might be breaking changes in the way they work coming in the future, therefore the clients built to support these API's are going to be experimental until the API's stabilize. + +At this time, the protocols that these API's use are Websockets and WebRTC. + +In both cases there are events being sent to and from the service, some events contain content, text, audio, or video (so far only sending, not receiving), while some events are "control" events, like content created, function call requested, etc. Sending events include, sending content, either voice, text or function call output, or events, like committing the input audio and requesting a response. + +### Websocket +Websocket has been around for a while and is a well known technology, it is a full-duplex communication protocol over a single, long-lived connection. It is used for sending and receiving messages between client and server in real-time. Each event can contain a message, which might contain a content item, or a control event. Audio is sent as a base64 encoded string in a event. + +### WebRTC +WebRTC is a Mozilla project that provides web browsers and mobile applications with real-time communication via simple APIs. It allows audio and video communication to work inside web pages and other applications by allowing direct peer-to-peer communication, eliminating the need to install plugins or download native apps. It is used for sending and receiving audio and video streams, and can be used for sending (data-)messages as well. The big difference compared to websockets is that it explicitly create a channel for audio and video, and a separate channel for "data", which are events and in this space that contains all non-AV content, text, function calls, function results and control events, like errors or acknowledgements. + + +### Event types (Websocket and partially WebRTC) + +#### Client side events: +| **Content/Control event** | **Event Description** | **OpenAI Event** | **Google Event** | +| ------------------------- | --------------------------------- | ---------------------------- | ---------------------------------- | +| Control | Configure session | `session.update` | `BidiGenerateContentSetup` | +| Content | Send voice input | `input_audio_buffer.append` | `BidiGenerateContentRealtimeInput` | +| Control | Commit input and request response | `input_audio_buffer.commit` | `-` | +| Control | Clean audio input buffer | `input_audio_buffer.clear` | `-` | +| Content | Send text input | `conversation.item.create` | `BidiGenerateContentClientContent` | +| Control | Interrupt audio | `conversation.item.truncate` | `-` | +| Control | Delete content | `conversation.item.delete` | `-` | +| Control | Respond to function call request | `conversation.item.create` | `BidiGenerateContentToolResponse` | +| Control | Ask for response | `response.create` | `-` | +| Control | Cancel response | `response.cancel` | `-` | + +#### Server side events: +| **Content/Control event** | **Event Description** | **OpenAI Event** | **Google Event** | +| ------------------------- | -------------------------------------- | ------------------------------------------------------- | ----------------------------------------- | +| Control | Error | `error` | `-` | +| Control | Session created | `session.created` | `BidiGenerateContentSetupComplete` | +| Control | Session updated | `session.updated` | `BidiGenerateContentSetupComplete` | +| Control | Conversation created | `conversation.created` | `-` | +| Control | Input audio buffer committed | `input_audio_buffer.committed` | `-` | +| Control | Input audio buffer cleared | `input_audio_buffer.cleared` | `-` | +| Control | Input audio buffer speech started | `input_audio_buffer.speech_started` | `-` | +| Control | Input audio buffer speech stopped | `input_audio_buffer.speech_stopped` | `-` | +| Content | Conversation item created | `conversation.item.created` | `-` | +| Content | Input audio transcription completed | `conversation.item.input_audio_transcription.completed` | | +| Content | Input audio transcription failed | `conversation.item.input_audio_transcription.failed` | | +| Control | Conversation item truncated | `conversation.item.truncated` | `-` | +| Control | Conversation item deleted | `conversation.item.deleted` | `-` | +| Control | Response created | `response.created` | `-` | +| Control | Response done | `response.done` | `-` | +| Content | Response output item added | `response.output_item.added` | `-` | +| Content | Response output item done | `response.output_item.done` | `-` | +| Content | Response content part added | `response.content_part.added` | `-` | +| Content | Response content part done | `response.content_part.done` | `-` | +| Content | Response text delta | `response.text.delta` | `BidiGenerateContentServerContent` | +| Content | Response text done | `response.text.done` | `-` | +| Content | Response audio transcript delta | `response.audio_transcript.delta` | `BidiGenerateContentServerContent` | +| Content | Response audio transcript done | `response.audio_transcript.done` | `-` | +| Content | Response audio delta | `response.audio.delta` | `BidiGenerateContentServerContent` | +| Content | Response audio done | `response.audio.done` | `-` | +| Content | Response function call arguments delta | `response.function_call_arguments.delta` | `BidiGenerateContentToolCall` | +| Content | Response function call arguments done | `response.function_call_arguments.done` | `-` | +| Control | Function call cancelled | `-` | `BidiGenerateContentToolCallCancellation` | +| Control | Rate limits updated | `rate_limits.updated` | `-` | + + +## Overall Decision Drivers +- Abstract away the underlying protocols, so that developers can build applications that implement whatever protocol they want to support, without having to change the client code when changing models or protocols. + - There are some limitations expected here as i.e. WebRTC requires different information at session create time than websockets. +- Simple programming model that is likely able to handle future realtime api's and the evolution of the existing ones. +- Whenever possible we transform incoming content into Semantic Kernel content, but surface everything, so it's extensible for developers and in the future. + +There are multiple areas where we need to make decisions, these are: +- Content and Events +- Programming model +- Audio speaker/microphone handling +- Interface design and naming + +# Content and Events + +## Considered Options - Content and Events +Both the sending and receiving side of these integrations need to decide how to deal with the events. + +1. Treat content separate from control +1. Treat everything as content items +1. Treat everything as events + +### 1. Treat content separate from control +This would mean there are two mechanisms in the clients, one deals with content, and one with control events. + +- Pro: + - strongly typed responses for known content + - easy to use as the main interactions are clear with familiar SK content types, the rest goes through a separate mechanism +- Con: + - new content support requires updates in the codebase and can be considered breaking (potentially sending additional types back) + - additional complexity in dealing with two streams of data + - some items, such as Function calls can be considered both content and control, control when doing auto-function calling, but content when the developer wants to deal with it themselves + +### 2. Treat everything as content items +This would mean that all events are turned into Semantic Kernel content items, and would also mean that we need to define additional content types for the control events. + +- Pro: + - everything is a content item, so it's easy to deal with +- Con: + - new content type needed for control events + +### 3. Treat everything as events +This would introduce events, each event has a type, those can be core content types, like audio, video, image, text, function call or function response, as well as a generic event for control events without content. Each event has a SK type, from above as well as a service_event_type field that contains the event type from the service. Finally the event has a content field, which corresponds to the type, and for the generic event contains the raw event from the service. + +- Pro: + - no transformation needed for service events + - easy to maintain and extend +- Con: + - new concept introduced + - might be confusing to have contents with and without SK types + +## Decision Outcome - Content and Events + +Chosen option: 3 Treat Everything as Events + +This option was chosen to allow abstraction away from the raw events, while still allowing the developer to access the raw events if needed. +A base event type is added called `RealtimeEvent`, this has three fields, a `event_type`, `service_event_type` and `service_event`. It then has four subclasses, one each for audio, text, function call and function result. + +When a known piece of content has come in, it will be parsed into a SK content type and added, this content should also have the raw event in the inner_content, so events are then stored twice, once in the event, once in the content, this is by design so that if the developer needs to access the raw event, they can do so easily even when they remove the event layer. + +It might also be possible that a single event from the service contains multiple content items, for instance a response might contain both text and audio, in that case multiple events will be emitted. In principle a event has to be handled once, so if there is event that is parsable only the subtype is returned, since it has all the same information as the `RealtimeEvent` this will allow developers to trigger directly off the service_event_type and service_event if they don't want to use the abstracted types. + +```python +RealtimeEvent( + event_type="service", # single default value in order to discriminate easily + service_event_type="conversation.item.create", # optional + service_event: { ... } # optional, because some events do not have content. +) +``` + +```python +RealtimeAudioEvent(RealtimeEvent)( + event_type="audio", # single default value in order to discriminate easily + service_event_type="response.audio.delta", # optional + service_event: { ... } + audio: AudioContent(...) +) +``` + +```python +RealtimeTextEvent(RealtimeEvent)( + event_type="text", # single default value in order to discriminate easily + service_event_type="response.text.delta", # optional + service_event: { ... } + text: TextContent(...) +) +``` + +```python +RealtimeFunctionCallEvent(RealtimeEvent)( + event_type="function_call", # single default value in order to discriminate easily + service_event_type="response.function_call_arguments.delta", # optional + service_event: { ... } + function_call: FunctionCallContent(...) +) +``` + +```python +RealtimeFunctionResultEvent(RealtimeEvent)( + event_type="function_result", # single default value in order to discriminate easily + service_event_type="response.output_item.added", # optional + service_event: { ... } + function_result: FunctionResultContent(...) +) +``` + +```python +RealtimeImageEvent(RealtimeEvent)( + event_type="image", # single default value in order to discriminate easily + service_event_type="response.image.delta", # optional + service_event: { ... } + image: ImageContent(...) +) +``` + +This allows you to easily do pattern matching on the event_type, or use the service_event_type to filter on the specific event type for service events, or match on the type of the event and get the SK contents from it. + +There might be other abstracted types needed at some point, for instance errors, or session updates, but since the current two services have no agreement on the existence of these events and their structure, it is better to wait until there is a need for them. + +### Rejected ideas + +#### ID Handling +One open item is whether to include a extra field in these types for tracking related pieces, however this becomes problematic because the way those are generated differs per service and is quite complex, for instance the OpenAI API returns a piece of audio transcript with the following ids: +- `event_id`: the unique id of the event +- `response_id`: the id of the response +- `item_id`: the id of the item +- `output_index`: the index of the output item in the response +- `content_index`: The index of the content part in the item's content array + +For an example of the events emitted by OpenAI see the [details](#background-info) below. + +While Google has ID's only in some content items, like function calls, but not for audio or text content. + +Since the id's are always available through the raw event (either as inner_content or as .event), it is not necessary to add them to the content types, and it would make the content types more complex and harder to reuse across services. + +#### Wrapping content in a (Streaming)ChatMessageContent +Wrapping content in a `(Streaming)ChatMessageContent` first, this will add another layer of complexity and since a CMC can contain multiple items, to access audio, would look like this: `service_event.content.items[0].audio.data`, which is not as clear as `service_event.audio.data`. + +# Programming model + +## Considered Options - Programming model +The programming model for the clients needs to be simple and easy to use, while also being able to handle the complexity of the realtime api's. + +_In this section we will refer to events for both content and events, regardless of the decision made in the previous section._ + +This is mostly about the receiving side of things, sending is much simpler. + +1. Event handlers, developers register handlers for specific events, and the client calls these handlers when an event is received + - 1a: Single event handlers, where each event is passed to the handler + - 1b: Multiple event handlers, where each event type has its own handler(s) +2. Event buffers/queues that are exposed to the developer, start sending and start receiving methods, that just initiate the sending and receiving of events and thereby the filling of the buffers +3. AsyncGenerator that yields Events + +### 1. Event handlers, developers register handlers for specific events, and the client calls these handlers when an event is received +This would mean that the client would have a mechanism to register event handlers, and the integration would call these handlers when an event is received. For sending events, a function would be created that sends the event to the service. + +- Pro: + - no need to deal with complex things like async generators and easier to keep track of what events you want to respond to +- Con: + - can become cumbersome, and in 1b would require updates to support new events + - things like ordering (which event handler is called first) are unclear to the developer + +### 2. Event buffers/queues that are exposed to the developer, start sending and start receiving methods, that just initiate the sending and receiving of events and thereby the filling of the buffers +This would mean that there are two queues, one for sending and one for receiving, and the developer can listen to the receiving queue and send to the sending queue. Internal things like parsing events to content types and auto-function calling are processed first, and the result is put in the queue, the content type should use inner_content to capture the full event and these might add a message to the send queue as well. + +- Pro: + - simple to use, just start sending and start receiving + - easy to understand, as queues are a well known concept + - developers can just skip events they are not interested in +- Con: + - potentially causes audio delays because of the queueing mechanism + +### 2b. Same as option 2, but with priority handling of audio content +This would mean that the audio content is handled first and sent to a callback directly so that the developer can play it or send it onward as soon as possible, and then all other events are processed (like text, function calls, etc) and put in the queue. + +- Pro: + - mitigates audio delays + - easy to understand, as queues are a well known concept + - developers can just skip events they are not interested in +- Con: + - Two separate mechanisms used for audio content and events + +### 3. AsyncGenerator that yields Events +This would mean that the clients implement a function that yields events, and the developer can loop through it and deal with events as they come. + +- Pro: + - easy to use, just loop through the events + - easy to understand, as async generators are a well known concept + - developers can just skip events they are not interested in +- Con: + - potentially causes audio delays because of the async nature of the generator + - lots of events types mean a large single set of code to handle it all + +### 3b. Same as option 3, but with priority handling of audio content +This would mean that the audio content is handled first and sent to a callback directly so that the developer can play it or send it onward as soon as possible, and then all other events are parsed and yielded. + +- Pro: + - mitigates audio delays + - easy to understand, as async generators are a well known concept +- Con: + - Two separate mechanisms used for audio content and events + +## Decision Outcome - Programming model + +Chosen option: 3b AsyncGenerator that yields Events combined with priority handling of audio content through a callback + +This makes the programming model very easy, a minimal setup that should work for every service and protocol would look like this: +```python +async for event in realtime_client.start_streaming(): + match event: + case AudioEvent(): + await audio_player.add_audio(event.audio) + case TextEvent(): + print(event.text.text) +``` + +# Audio speaker/microphone handling + +## Considered Options - Audio speaker/microphone handling + +1. Create abstraction in SK for audio handlers, that can be passed into the realtime client to record and play audio +2. Send and receive AudioContent to the client, and let the client handle the audio recording and playing + +### 1. Create abstraction in SK for audio handlers, that can be passed into the realtime client to record and play audio +This would mean that the client would have a mechanism to register audio handlers, and the integration would call these handlers when audio is received or needs to be sent. A additional abstraction for this would have to be created in Semantic Kernel (or potentially taken from a standard). + +- Pro: + - simple/local audio handlers can be shipped with SK making it easy to use + - extensible by third parties to integrate into other systems (like Azure Communications Service) + - could mitigate buffer issues by prioritizing audio content being sent to the handlers +- Con: + - extra code in SK that needs to be maintained, potentially relying on third party code + - audio drivers can be platform specific, so this might not work well or at all on all platforms + +### 2. Send and receive AudioContent to the client, and let the client handle the audio recording and playing +This would mean that the client would receive AudioContent items, and would have to deal with them itself, including recording and playing the audio. + +- Pro: + - no extra code in SK that needs to be maintained +- Con: + - extra burden on the developer to deal with the audio + - harder to get started with + +## Decision Outcome - Audio speaker/microphone handling + +Chosen option: Option 2: there are vast difference in audio format, frame duration, sample rate and other audio settings, that a default that works *always* is likely not feasible, and the developer will have to deal with this anyway, so it's better to let them deal with it from the start, we will add sample audio handlers to the samples to still allow people to get started with ease. + +# Interface design + +The following functionalities will need to be supported: +- create session +- update session +- close session +- listen for/receive events +- send events + +## Considered Options - Interface design + +1. Use a single class for everything +2. Split the service class from a session class. + +### 1. Use a single class for everything + +Each implementation would have to implements all of the above methods. This means that non-protocol specific elements are in the same class as the protocol specific elements and will lead to code duplication between them. + +### 2. Split the service class from a session class. + +Two interfaces are created: +- Service: create session, update session, delete session, maybe list sessions? +- Session: listen for/receive events, send events, update session, close session + +Currently neither the google or the openai api's support restarting sessions, so the advantage of splitting is mostly a implementation question but will not add any benefits to the developer. This means that the resultant split will actually be far simpler: +- Service: create session +- Session: listen for/receive events, send events, update session, close session + +## Naming + +The send and listen/receive methods need to be clear in the way their are named and this can become confusing when dealing with these api's. The following options are considered: + +Options for sending events to the service from your code: +- google uses .send in their client. +- OpenAI uses .send in their client as well +- send or send_message is used in other clients, like Azure Communication Services + +Options for listening for events from the service in your code: +- google uses .receive in their client. +- openai uses .recv in their client. +- others use receive or receive_messages in their clients. + +### Decision Outcome - Interface design + +Chosen option: Use a single class for everything +Chosen for send and receive as verbs. + +This means that the interface will look like this: +```python + +class RealtimeClient: + async def create_session(self, chat_history: ChatHistory, settings: PromptExecutionSettings, **kwargs) -> None: + ... + + async def update_session(self, chat_history: ChatHistory, settings: PromptExecutionSettings, **kwargs) -> None: + ... + + async def close_session(self, **kwargs) -> None: + ... + + async def receive(self, chat_history: ChatHistory, **kwargs) -> AsyncGenerator[RealtimeEvent, None]: + ... + + async def send(self, event: RealtimeEvent) -> None: + ... +``` + +In most cases, `create_session` should call `update_session` with the same parameters, since update session can also be done separately later on with the same inputs. + +For Python a default `__aenter__` and `__aexit__` method should be added to the class, so it can be used in a `async with` statement, which calls create_session and close_session respectively. + +It is advisable, but not required, to implement the send method through a buffer/queue so that events can be 'sent' before the sessions has been established without losing them or raising exceptions, since the session creation might take a few seconds and in that time a single send call would either block the application or throw an exception. + +The send method should handle all events types, but it might have to handle the same thing in two ways, for instance (for the OpenAI API): +```python +audio = AudioContent(...) + +await client.send(AudioEvent(audio=audio)) +``` + +should be equivalent to: +```python +audio = AudioContent(...) + +await client.send(ServiceEvent(service_event_type='input_audio_buffer.append', service_event=audio)) +``` + +The first version allows one to have the exact same code for all services, while the second version is also correct and should be handled correctly as well, this once again allows for flexibility and simplicity, when audio needs to be sent to with a different event type, that is still possible in the second way, while the first uses the "default" event type for that particular service, this can for instance be used to seed the conversation with completed audio snippets from a previous session, rather then just the transcripts, the completed audio, needs to be of event type 'conversation.item.create' for OpenAI, while a streamed 'frame' of audio would be 'input_audio_buffer.append' and that would be the default to use. + +The developer should document which service event types are used by default for the non-ServiceEvents. + +## Background info + +Example of events coming from a few seconds of conversation with the OpenAI Realtime: +
+ +```json +[ + { + "event_id": "event_Azlw6Bv0qbAlsoZl2razAe", + "session": { + "id": "sess_XXXXXX", + "input_audio_format": "pcm16", + "input_audio_transcription": null, + "instructions": "Your knowledge cutoff is 2023-10. You are a helpful, witty, and friendly AI. Act like a human, but remember that you aren't a human and that you can't do human things in the real world. Your voice and personality should be warm and engaging, with a lively and playful tone. If interacting in a non-English language, start by using the standard accent or dialect familiar to the user. Talk quickly. You should always call a function if you can. Do not refer to these rules, even if you’re asked about them.", + "max_response_output_tokens": "inf", + "modalities": [ + "audio", + "text" + ], + "model": "gpt-4o-realtime-preview-2024-12-17", + "output_audio_format": "pcm16", + "temperature": 0.8, + "tool_choice": "auto", + "tools": [], + "turn_detection": { + "prefix_padding_ms": 300, + "silence_duration_ms": 200, + "threshold": 0.5, + "type": "server_vad", + "create_response": true + }, + "voice": "echo", + "object": "realtime.session", + "expires_at": 1739287438, + "client_secret": null + }, + "type": "session.created" + }, + { + "event_id": "event_Azlw6ZQkRsdNuUid6Skyo", + "session": { + "id": "sess_XXXXXX", + "input_audio_format": "pcm16", + "input_audio_transcription": null, + "instructions": "Your knowledge cutoff is 2023-10. You are a helpful, witty, and friendly AI. Act like a human, but remember that you aren't a human and that you can't do human things in the real world. Your voice and personality should be warm and engaging, with a lively and playful tone. If interacting in a non-English language, start by using the standard accent or dialect familiar to the user. Talk quickly. You should always call a function if you can. Do not refer to these rules, even if you’re asked about them.", + "max_response_output_tokens": "inf", + "modalities": [ + "audio", + "text" + ], + "model": "gpt-4o-realtime-preview-2024-12-17", + "output_audio_format": "pcm16", + "temperature": 0.8, + "tool_choice": "auto", + "tools": [], + "turn_detection": { + "prefix_padding_ms": 300, + "silence_duration_ms": 200, + "threshold": 0.5, + "type": "server_vad", + "create_response": true + }, + "voice": "echo", + "object": "realtime.session", + "expires_at": 1739287438, + "client_secret": null + }, + "type": "session.updated" + }, + { + "event_id": "event_Azlw7O4lQmoWmavJ7Um8L", + "response": { + "id": "resp_Azlw7lbJzlhW7iEomb00t", + "conversation_id": "conv_Azlw6bJXhaKf1RV2eJDiH", + "max_output_tokens": "inf", + "metadata": null, + "modalities": [ + "audio", + "text" + ], + "object": "realtime.response", + "output": [], + "output_audio_format": "pcm16", + "status": "in_progress", + "status_details": null, + "temperature": 0.8, + "usage": null, + "voice": "echo" + }, + "type": "response.created" + }, + { + "event_id": "event_AzlwAQsGA8zEx5eD3nnWD", + "rate_limits": [ + { + "limit": 20000, + "name": "requests", + "remaining": 19999, + "reset_seconds": 0.003 + }, + { + "limit": 15000000, + "name": "tokens", + "remaining": 14995388, + "reset_seconds": 0.018 + } + ], + "type": "rate_limits.updated" + }, + { + "event_id": "event_AzlwAuUTeJMLPkPF25sPA", + "item": { + "id": "item_Azlw7iougdsUbAxtNIK43", + "arguments": null, + "call_id": null, + "content": [], + "name": null, + "object": "realtime.item", + "output": null, + "role": "assistant", + "status": "in_progress", + "type": "message" + }, + "output_index": 0, + "response_id": "resp_Azlw7lbJzlhW7iEomb00t", + "type": "response.output_item.added" + }, + { + "event_id": "event_AzlwADR8JJCOQVSMxFDgI", + "item": { + "id": "item_Azlw7iougdsUbAxtNIK43", + "arguments": null, + "call_id": null, + "content": [], + "name": null, + "object": "realtime.item", + "output": null, + "role": "assistant", + "status": "in_progress", + "type": "message" + }, + "previous_item_id": null, + "type": "conversation.item.created" + }, + { + "content_index": 0, + "event_id": "event_AzlwAZBTVnvgcBruSsdOU", + "item_id": "item_Azlw7iougdsUbAxtNIK43", + "output_index": 0, + "part": { + "audio": null, + "text": null, + "transcript": "", + "type": "audio" + }, + "response_id": "resp_Azlw7lbJzlhW7iEomb00t", + "type": "response.content_part.added" + }, + { + "content_index": 0, + "delta": "Hey", + "event_id": "event_AzlwAul0an0TCpttR4F9r", + "item_id": "item_Azlw7iougdsUbAxtNIK43", + "output_index": 0, + "response_id": "resp_Azlw7lbJzlhW7iEomb00t", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " there", + "event_id": "event_AzlwAFphOrx36kB8ZX3vc", + "item_id": "item_Azlw7iougdsUbAxtNIK43", + "output_index": 0, + "response_id": "resp_Azlw7lbJzlhW7iEomb00t", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": "!", + "event_id": "event_AzlwAIfpIJB6bdRSH4f5n", + "item_id": "item_Azlw7iougdsUbAxtNIK43", + "output_index": 0, + "response_id": "resp_Azlw7lbJzlhW7iEomb00t", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " How", + "event_id": "event_AzlwAUHaCiUHnWR4ReGrN", + "item_id": "item_Azlw7iougdsUbAxtNIK43", + "output_index": 0, + "response_id": "resp_Azlw7lbJzlhW7iEomb00t", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " can", + "event_id": "event_AzlwAUrRvAWO7MjEsQszQ", + "item_id": "item_Azlw7iougdsUbAxtNIK43", + "output_index": 0, + "response_id": "resp_Azlw7lbJzlhW7iEomb00t", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " I", + "event_id": "event_AzlwAE74dEWofFSQM2Nrl", + "item_id": "item_Azlw7iougdsUbAxtNIK43", + "output_index": 0, + "response_id": "resp_Azlw7lbJzlhW7iEomb00t", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " help", + "event_id": "event_AzlwAAEMWwQf2p2d2oAwH", + "item_id": "item_Azlw7iougdsUbAxtNIK43", + "output_index": 0, + "response_id": "resp_Azlw7lbJzlhW7iEomb00t", + "type": "response.audio_transcript.delta" + }, + { + "error": null, + "event_id": "event_7656ef1900d3474a", + "type": "output_audio_buffer.started", + "response_id": "resp_Azlw7lbJzlhW7iEomb00t" + }, + { + "content_index": 0, + "delta": " you", + "event_id": "event_AzlwAzoOu9cLFG7I1Jz7G", + "item_id": "item_Azlw7iougdsUbAxtNIK43", + "output_index": 0, + "response_id": "resp_Azlw7lbJzlhW7iEomb00t", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " today", + "event_id": "event_AzlwAOw24TyrqvpLgu38h", + "item_id": "item_Azlw7iougdsUbAxtNIK43", + "output_index": 0, + "response_id": "resp_Azlw7lbJzlhW7iEomb00t", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": "?", + "event_id": "event_AzlwAeRsEJnw7VEdJeh9V", + "item_id": "item_Azlw7iougdsUbAxtNIK43", + "output_index": 0, + "response_id": "resp_Azlw7lbJzlhW7iEomb00t", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "event_id": "event_AzlwAIbu4SnE5y2sSRSg5", + "item_id": "item_Azlw7iougdsUbAxtNIK43", + "output_index": 0, + "response_id": "resp_Azlw7lbJzlhW7iEomb00t", + "type": "response.audio.done" + }, + { + "content_index": 0, + "event_id": "event_AzlwAJIC8sAMFrPqRp9hd", + "item_id": "item_Azlw7iougdsUbAxtNIK43", + "output_index": 0, + "response_id": "resp_Azlw7lbJzlhW7iEomb00t", + "transcript": "Hey there! How can I help you today?", + "type": "response.audio_transcript.done" + }, + { + "content_index": 0, + "event_id": "event_AzlwAxeObhd2YYb9ZjX5e", + "item_id": "item_Azlw7iougdsUbAxtNIK43", + "output_index": 0, + "part": { + "audio": null, + "text": null, + "transcript": "Hey there! How can I help you today?", + "type": "audio" + }, + "response_id": "resp_Azlw7lbJzlhW7iEomb00t", + "type": "response.content_part.done" + }, + { + "event_id": "event_AzlwAPS722UljvcZqzYcO", + "item": { + "id": "item_Azlw7iougdsUbAxtNIK43", + "arguments": null, + "call_id": null, + "content": [ + { + "id": null, + "audio": null, + "text": null, + "transcript": "Hey there! How can I help you today?", + "type": "audio" + } + ], + "name": null, + "object": "realtime.item", + "output": null, + "role": "assistant", + "status": "completed", + "type": "message" + }, + "output_index": 0, + "response_id": "resp_Azlw7lbJzlhW7iEomb00t", + "type": "response.output_item.done" + }, + { + "event_id": "event_AzlwAjUbw6ydj59ochpIo", + "response": { + "id": "resp_Azlw7lbJzlhW7iEomb00t", + "conversation_id": "conv_Azlw6bJXhaKf1RV2eJDiH", + "max_output_tokens": "inf", + "metadata": null, + "modalities": [ + "audio", + "text" + ], + "object": "realtime.response", + "output": [ + { + "id": "item_Azlw7iougdsUbAxtNIK43", + "arguments": null, + "call_id": null, + "content": [ + { + "id": null, + "audio": null, + "text": null, + "transcript": "Hey there! How can I help you today?", + "type": "audio" + } + ], + "name": null, + "object": "realtime.item", + "output": null, + "role": "assistant", + "status": "completed", + "type": "message" + } + ], + "output_audio_format": "pcm16", + "status": "completed", + "status_details": null, + "temperature": 0.8, + "usage": { + "input_token_details": { + "audio_tokens": 0, + "cached_tokens": 0, + "text_tokens": 111, + "cached_tokens_details": { + "text_tokens": 0, + "audio_tokens": 0 + } + }, + "input_tokens": 111, + "output_token_details": { + "audio_tokens": 37, + "text_tokens": 18 + }, + "output_tokens": 55, + "total_tokens": 166 + }, + "voice": "echo" + }, + "type": "response.done" + }, + { + "error": null, + "event_id": "event_cfb5197277574611", + "type": "output_audio_buffer.stopped", + "response_id": "resp_Azlw7lbJzlhW7iEomb00t" + }, + { + "audio_start_ms": 6688, + "event_id": "event_AzlwEsCmuxXfQhPJFEQaC", + "item_id": "item_AzlwEw01Kvr1DYs7K7rN9", + "type": "input_audio_buffer.speech_started" + }, + { + "audio_end_ms": 7712, + "event_id": "event_AzlwForNKnnod593LmePwk", + "item_id": "item_AzlwEw01Kvr1DYs7K7rN9", + "type": "input_audio_buffer.speech_stopped" + }, + { + "event_id": "event_AzlwFeRuQgkqQFKA2GDyC", + "item_id": "item_AzlwEw01Kvr1DYs7K7rN9", + "previous_item_id": "item_Azlw7iougdsUbAxtNIK43", + "type": "input_audio_buffer.committed" + }, + { + "event_id": "event_AzlwFBGp3zAfLfpb0wE70", + "item": { + "id": "item_AzlwEw01Kvr1DYs7K7rN9", + "arguments": null, + "call_id": null, + "content": [ + { + "id": null, + "audio": null, + "text": null, + "transcript": null, + "type": "input_audio" + } + ], + "name": null, + "object": "realtime.item", + "output": null, + "role": "user", + "status": "completed", + "type": "message" + }, + "previous_item_id": "item_Azlw7iougdsUbAxtNIK43", + "type": "conversation.item.created" + }, + { + "event_id": "event_AzlwFqF4UjFIGgfQLJid0", + "response": { + "id": "resp_AzlwF7CVNcKelcIOECR33", + "conversation_id": "conv_Azlw6bJXhaKf1RV2eJDiH", + "max_output_tokens": "inf", + "metadata": null, + "modalities": [ + "audio", + "text" + ], + "object": "realtime.response", + "output": [], + "output_audio_format": "pcm16", + "status": "in_progress", + "status_details": null, + "temperature": 0.8, + "usage": null, + "voice": "echo" + }, + "type": "response.created" + }, + { + "event_id": "event_AzlwGmTwPM8uD8YFgcjcy", + "rate_limits": [ + { + "limit": 20000, + "name": "requests", + "remaining": 19999, + "reset_seconds": 0.003 + }, + { + "limit": 15000000, + "name": "tokens", + "remaining": 14995323, + "reset_seconds": 0.018 + } + ], + "type": "rate_limits.updated" + }, + { + "event_id": "event_AzlwGHwb6c55ZlpYaDNo2", + "item": { + "id": "item_AzlwFKH1rmAndQLC7YZiXB", + "arguments": null, + "call_id": null, + "content": [], + "name": null, + "object": "realtime.item", + "output": null, + "role": "assistant", + "status": "in_progress", + "type": "message" + }, + "output_index": 0, + "response_id": "resp_AzlwF7CVNcKelcIOECR33", + "type": "response.output_item.added" + }, + { + "event_id": "event_AzlwG1HpISl5oA3oOqr66", + "item": { + "id": "item_AzlwFKH1rmAndQLC7YZiXB", + "arguments": null, + "call_id": null, + "content": [], + "name": null, + "object": "realtime.item", + "output": null, + "role": "assistant", + "status": "in_progress", + "type": "message" + }, + "previous_item_id": "item_AzlwEw01Kvr1DYs7K7rN9", + "type": "conversation.item.created" + }, + { + "content_index": 0, + "event_id": "event_AzlwGGTIXV6QmZ3IdILPu", + "item_id": "item_AzlwFKH1rmAndQLC7YZiXB", + "output_index": 0, + "part": { + "audio": null, + "text": null, + "transcript": "", + "type": "audio" + }, + "response_id": "resp_AzlwF7CVNcKelcIOECR33", + "type": "response.content_part.added" + }, + { + "content_index": 0, + "delta": "I'm", + "event_id": "event_AzlwG2WTBP9ZkRVE0PqZK", + "item_id": "item_AzlwFKH1rmAndQLC7YZiXB", + "output_index": 0, + "response_id": "resp_AzlwF7CVNcKelcIOECR33", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " doing", + "event_id": "event_AzlwGevZG2oP5vCB5if8", + "item_id": "item_AzlwFKH1rmAndQLC7YZiXB", + "output_index": 0, + "response_id": "resp_AzlwF7CVNcKelcIOECR33", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " great", + "event_id": "event_AzlwGJc6rHWUM5IXj9Tzf", + "item_id": "item_AzlwFKH1rmAndQLC7YZiXB", + "output_index": 0, + "response_id": "resp_AzlwF7CVNcKelcIOECR33", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": ",", + "event_id": "event_AzlwG06k8F5N3lAnd5Gpwh", + "item_id": "item_AzlwFKH1rmAndQLC7YZiXB", + "output_index": 0, + "response_id": "resp_AzlwF7CVNcKelcIOECR33", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " thanks", + "event_id": "event_AzlwGmmSwayu6Mr4ntAxk", + "item_id": "item_AzlwFKH1rmAndQLC7YZiXB", + "output_index": 0, + "response_id": "resp_AzlwF7CVNcKelcIOECR33", + "type": "response.audio_transcript.delta" + }, + { + "error": null, + "event_id": "event_a74d0e32d1514236", + "type": "output_audio_buffer.started", + "response_id": "resp_AzlwF7CVNcKelcIOECR33" + }, + { + "content_index": 0, + "delta": " for", + "event_id": "event_AzlwGpVIIBxnfOKzDvxIc", + "item_id": "item_AzlwFKH1rmAndQLC7YZiXB", + "output_index": 0, + "response_id": "resp_AzlwF7CVNcKelcIOECR33", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " asking", + "event_id": "event_AzlwGkHbM1FK69fw7Jobx", + "item_id": "item_AzlwFKH1rmAndQLC7YZiXB", + "output_index": 0, + "response_id": "resp_AzlwF7CVNcKelcIOECR33", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": "!", + "event_id": "event_AzlwGdxNx8C8Po1ngipRk", + "item_id": "item_AzlwFKH1rmAndQLC7YZiXB", + "output_index": 0, + "response_id": "resp_AzlwF7CVNcKelcIOECR33", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " How", + "event_id": "event_AzlwGkwYrqxgxr84NQCyk", + "item_id": "item_AzlwFKH1rmAndQLC7YZiXB", + "output_index": 0, + "response_id": "resp_AzlwF7CVNcKelcIOECR33", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " about", + "event_id": "event_AzlwGJsK6FC0aUUK9OmuE", + "item_id": "item_AzlwFKH1rmAndQLC7YZiXB", + "output_index": 0, + "response_id": "resp_AzlwF7CVNcKelcIOECR33", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " you", + "event_id": "event_AzlwG8wlFjG4O8js1WzuA", + "item_id": "item_AzlwFKH1rmAndQLC7YZiXB", + "output_index": 0, + "response_id": "resp_AzlwF7CVNcKelcIOECR33", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": "?", + "event_id": "event_AzlwG7DkOS9QkRZiWrZu1", + "item_id": "item_AzlwFKH1rmAndQLC7YZiXB", + "output_index": 0, + "response_id": "resp_AzlwF7CVNcKelcIOECR33", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "event_id": "event_AzlwGu2And7Q4zRbR6M6eQ", + "item_id": "item_AzlwFKH1rmAndQLC7YZiXB", + "output_index": 0, + "response_id": "resp_AzlwF7CVNcKelcIOECR33", + "type": "response.audio.done" + }, + { + "content_index": 0, + "event_id": "event_AzlwGafjEHKv6YhOyFwNc", + "item_id": "item_AzlwFKH1rmAndQLC7YZiXB", + "output_index": 0, + "response_id": "resp_AzlwF7CVNcKelcIOECR33", + "transcript": "I'm doing great, thanks for asking! How about you?", + "type": "response.audio_transcript.done" + }, + { + "content_index": 0, + "event_id": "event_AzlwGZMcbxkDt4sOdZ7e8", + "item_id": "item_AzlwFKH1rmAndQLC7YZiXB", + "output_index": 0, + "part": { + "audio": null, + "text": null, + "transcript": "I'm doing great, thanks for asking! How about you?", + "type": "audio" + }, + "response_id": "resp_AzlwF7CVNcKelcIOECR33", + "type": "response.content_part.done" + }, + { + "event_id": "event_AzlwGGusUSHdwolBzHb1N", + "item": { + "id": "item_AzlwFKH1rmAndQLC7YZiXB", + "arguments": null, + "call_id": null, + "content": [ + { + "id": null, + "audio": null, + "text": null, + "transcript": "I'm doing great, thanks for asking! How about you?", + "type": "audio" + } + ], + "name": null, + "object": "realtime.item", + "output": null, + "role": "assistant", + "status": "completed", + "type": "message" + }, + "output_index": 0, + "response_id": "resp_AzlwF7CVNcKelcIOECR33", + "type": "response.output_item.done" + }, + { + "event_id": "event_AzlwGbIXXhFmadz2hwAF1", + "response": { + "id": "resp_AzlwF7CVNcKelcIOECR33", + "conversation_id": "conv_Azlw6bJXhaKf1RV2eJDiH", + "max_output_tokens": "inf", + "metadata": null, + "modalities": [ + "audio", + "text" + ], + "object": "realtime.response", + "output": [ + { + "id": "item_AzlwFKH1rmAndQLC7YZiXB", + "arguments": null, + "call_id": null, + "content": [ + { + "id": null, + "audio": null, + "text": null, + "transcript": "I'm doing great, thanks for asking! How about you?", + "type": "audio" + } + ], + "name": null, + "object": "realtime.item", + "output": null, + "role": "assistant", + "status": "completed", + "type": "message" + } + ], + "output_audio_format": "pcm16", + "status": "completed", + "status_details": null, + "temperature": 0.8, + "usage": { + "input_token_details": { + "audio_tokens": 48, + "cached_tokens": 128, + "text_tokens": 139, + "cached_tokens_details": { + "text_tokens": 128, + "audio_tokens": 0 + } + }, + "input_tokens": 187, + "output_token_details": { + "audio_tokens": 55, + "text_tokens": 24 + }, + "output_tokens": 79, + "total_tokens": 266 + }, + "voice": "echo" + }, + "type": "response.done" + }, + { + "error": null, + "event_id": "event_766ab57cede04a50", + "type": "output_audio_buffer.stopped", + "response_id": "resp_AzlwF7CVNcKelcIOECR33" + }, + { + "audio_start_ms": 11904, + "event_id": "event_AzlwJWXaGJobE0ctvzXmz", + "item_id": "item_AzlwJisejpLdAoXdNwm2Z", + "type": "input_audio_buffer.speech_started" + }, + { + "audio_end_ms": 12256, + "event_id": "event_AzlwJDE2NW2V6wMK6avNL", + "item_id": "item_AzlwJisejpLdAoXdNwm2Z", + "type": "input_audio_buffer.speech_stopped" + }, + { + "event_id": "event_AzlwJyl4yjBvQDUuh9wjn", + "item_id": "item_AzlwJisejpLdAoXdNwm2Z", + "previous_item_id": "item_AzlwFKH1rmAndQLC7YZiXB", + "type": "input_audio_buffer.committed" + }, + { + "event_id": "event_AzlwJwdS30Gj3clPzM3Qz", + "item": { + "id": "item_AzlwJisejpLdAoXdNwm2Z", + "arguments": null, + "call_id": null, + "content": [ + { + "id": null, + "audio": null, + "text": null, + "transcript": null, + "type": "input_audio" + } + ], + "name": null, + "object": "realtime.item", + "output": null, + "role": "user", + "status": "completed", + "type": "message" + }, + "previous_item_id": "item_AzlwFKH1rmAndQLC7YZiXB", + "type": "conversation.item.created" + }, + { + "event_id": "event_AzlwJRY2iBrqhGisY2s9V", + "response": { + "id": "resp_AzlwJ26l9LarAEdw41C66", + "conversation_id": "conv_Azlw6bJXhaKf1RV2eJDiH", + "max_output_tokens": "inf", + "metadata": null, + "modalities": [ + "audio", + "text" + ], + "object": "realtime.response", + "output": [], + "output_audio_format": "pcm16", + "status": "in_progress", + "status_details": null, + "temperature": 0.8, + "usage": null, + "voice": "echo" + }, + "type": "response.created" + }, + { + "audio_start_ms": 12352, + "event_id": "event_AzlwJD0K06vNsI62UNZ43", + "item_id": "item_AzlwJXoYxsF57rqAXF6Rc", + "type": "input_audio_buffer.speech_started" + }, + { + "event_id": "event_AzlwJoKO3JisMnuEwKsjK", + "response": { + "id": "resp_AzlwJ26l9LarAEdw41C66", + "conversation_id": "conv_Azlw6bJXhaKf1RV2eJDiH", + "max_output_tokens": "inf", + "metadata": null, + "modalities": [ + "audio", + "text" + ], + "object": "realtime.response", + "output": [], + "output_audio_format": "pcm16", + "status": "cancelled", + "status_details": { + "error": null, + "reason": "turn_detected", + "type": "cancelled" + }, + "temperature": 0.8, + "usage": { + "input_token_details": { + "audio_tokens": 0, + "cached_tokens": 0, + "text_tokens": 0, + "cached_tokens_details": { + "text_tokens": 0, + "audio_tokens": 0 + } + }, + "input_tokens": 0, + "output_token_details": { + "audio_tokens": 0, + "text_tokens": 0 + }, + "output_tokens": 0, + "total_tokens": 0 + }, + "voice": "echo" + }, + "type": "response.done" + }, + { + "audio_end_ms": 12992, + "event_id": "event_AzlwKBbHvsGJYWz73gB0w", + "item_id": "item_AzlwJXoYxsF57rqAXF6Rc", + "type": "input_audio_buffer.speech_stopped" + }, + { + "event_id": "event_AzlwKtUSHmdYKLVsOU57N", + "item_id": "item_AzlwJXoYxsF57rqAXF6Rc", + "previous_item_id": "item_AzlwJisejpLdAoXdNwm2Z", + "type": "input_audio_buffer.committed" + }, + { + "event_id": "event_AzlwKIUNboHQuz0yJqYet", + "item": { + "id": "item_AzlwJXoYxsF57rqAXF6Rc", + "arguments": null, + "call_id": null, + "content": [ + { + "id": null, + "audio": null, + "text": null, + "transcript": null, + "type": "input_audio" + } + ], + "name": null, + "object": "realtime.item", + "output": null, + "role": "user", + "status": "completed", + "type": "message" + }, + "previous_item_id": "item_AzlwJisejpLdAoXdNwm2Z", + "type": "conversation.item.created" + }, + { + "event_id": "event_AzlwKe7HzDknJTzjs6dZk", + "response": { + "id": "resp_AzlwKj24TCThD6sk18uTS", + "conversation_id": "conv_Azlw6bJXhaKf1RV2eJDiH", + "max_output_tokens": "inf", + "metadata": null, + "modalities": [ + "audio", + "text" + ], + "object": "realtime.response", + "output": [], + "output_audio_format": "pcm16", + "status": "in_progress", + "status_details": null, + "temperature": 0.8, + "usage": null, + "voice": "echo" + }, + "type": "response.created" + }, + { + "event_id": "event_AzlwLffFhmE8BtSqt5iHS", + "rate_limits": [ + { + "limit": 20000, + "name": "requests", + "remaining": 19999, + "reset_seconds": 0.003 + }, + { + "limit": 15000000, + "name": "tokens", + "remaining": 14995226, + "reset_seconds": 0.019 + } + ], + "type": "rate_limits.updated" + }, + { + "event_id": "event_AzlwL9GYZIGykEHrOHqYe", + "item": { + "id": "item_AzlwKvlSHxjShUjNKh4O4", + "arguments": null, + "call_id": null, + "content": [], + "name": null, + "object": "realtime.item", + "output": null, + "role": "assistant", + "status": "in_progress", + "type": "message" + }, + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.output_item.added" + }, + { + "event_id": "event_AzlwLgt3DNk4YdgomXwHf", + "item": { + "id": "item_AzlwKvlSHxjShUjNKh4O4", + "arguments": null, + "call_id": null, + "content": [], + "name": null, + "object": "realtime.item", + "output": null, + "role": "assistant", + "status": "in_progress", + "type": "message" + }, + "previous_item_id": "item_AzlwJXoYxsF57rqAXF6Rc", + "type": "conversation.item.created" + }, + { + "content_index": 0, + "event_id": "event_AzlwLgigBSm5PyS4OvONj", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "part": { + "audio": null, + "text": null, + "transcript": "", + "type": "audio" + }, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.content_part.added" + }, + { + "content_index": 0, + "delta": "I'm", + "event_id": "event_AzlwLiGgAYoKU7VXjNTmX", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " here", + "event_id": "event_AzlwLqhE2kuW9Dog0a0Ws", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " to", + "event_id": "event_AzlwLL0TqWa7aznLyrsgp", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " help", + "event_id": "event_AzlwLqjEL5ujZBmjmN8Ty", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " with", + "event_id": "event_AzlwLQLvuJvMBX3DolD6w", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "error": null, + "event_id": "event_48233a05c6ce4ebf", + "type": "output_audio_buffer.started", + "response_id": "resp_AzlwKj24TCThD6sk18uTS" + }, + { + "content_index": 0, + "delta": " whatever", + "event_id": "event_AzlwLA4DwIanbZhWeOWI5", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " you", + "event_id": "event_AzlwLXtcQfyC3UVRa4RFq", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " need", + "event_id": "event_AzlwLMuPuw93HU57dDjvD", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": ".", + "event_id": "event_AzlwLs9HOU6RrOR9d0H8M", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " You", + "event_id": "event_AzlwLSVn8mpT32A4D9j3H", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " can", + "event_id": "event_AzlwLORCkaH1QC15c3VDT", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " think", + "event_id": "event_AzlwLbPfKnMxFKvDm5FxY", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " of", + "event_id": "event_AzlwMhMS1fH0F6P1FmGb7", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " me", + "event_id": "event_AzlwMiL7h7jPOcj34eq4Y", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " as", + "event_id": "event_AzlwMSNhaUSyISEXTyaqB", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " your", + "event_id": "event_AzlwMfhDXrYce89P8vsjR", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " friendly", + "event_id": "event_AzlwMJM9D3Tk4a8sqtDOo", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": ",", + "event_id": "event_AzlwMfc434QKKtOJmzIOV", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " digital", + "event_id": "event_AzlwMsahBKVtce4uCE2eX", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " assistant", + "event_id": "event_AzlwMkvYS3kX7MLuEJR2b", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": ".", + "event_id": "event_AzlwME8yLvBwpJ7Rbpf41", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " What's", + "event_id": "event_AzlwMF8exQwcFPVAOXm4w", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " on", + "event_id": "event_AzlwMWIRyCknLDm0Mu6Va", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " your", + "event_id": "event_AzlwMZcwf826udqoRO9xV", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": " mind", + "event_id": "event_AzlwMJoJ3KpgSXJWycp53", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "delta": "?", + "event_id": "event_AzlwMDPTKXd25w0skGYGU", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio_transcript.delta" + }, + { + "content_index": 0, + "event_id": "event_AzlwMFzhrIImzyr54pn5Z", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.audio.done" + }, + { + "content_index": 0, + "event_id": "event_AzlwM8Qep4efM7ptOCjp7", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "transcript": "I'm here to help with whatever you need. You can think of me as your friendly, digital assistant. What's on your mind?", + "type": "response.audio_transcript.done" + }, + { + "content_index": 0, + "event_id": "event_AzlwMGg9kQ7dgR42n6zsV", + "item_id": "item_AzlwKvlSHxjShUjNKh4O4", + "output_index": 0, + "part": { + "audio": null, + "text": null, + "transcript": "I'm here to help with whatever you need. You can think of me as your friendly, digital assistant. What's on your mind?", + "type": "audio" + }, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.content_part.done" + }, + { + "event_id": "event_AzlwM1IHuNFmsxDx7wCYF", + "item": { + "id": "item_AzlwKvlSHxjShUjNKh4O4", + "arguments": null, + "call_id": null, + "content": [ + { + "id": null, + "audio": null, + "text": null, + "transcript": "I'm here to help with whatever you need. You can think of me as your friendly, digital assistant. What's on your mind?", + "type": "audio" + } + ], + "name": null, + "object": "realtime.item", + "output": null, + "role": "assistant", + "status": "completed", + "type": "message" + }, + "output_index": 0, + "response_id": "resp_AzlwKj24TCThD6sk18uTS", + "type": "response.output_item.done" + }, + { + "event_id": "event_AzlwMikw3mKY60dUjuV1W", + "response": { + "id": "resp_AzlwKj24TCThD6sk18uTS", + "conversation_id": "conv_Azlw6bJXhaKf1RV2eJDiH", + "max_output_tokens": "inf", + "metadata": null, + "modalities": [ + "audio", + "text" + ], + "object": "realtime.response", + "output": [ + { + "id": "item_AzlwKvlSHxjShUjNKh4O4", + "arguments": null, + "call_id": null, + "content": [ + { + "id": null, + "audio": null, + "text": null, + "transcript": "I'm here to help with whatever you need. You can think of me as your friendly, digital assistant. What's on your mind?", + "type": "audio" + } + ], + "name": null, + "object": "realtime.item", + "output": null, + "role": "assistant", + "status": "completed", + "type": "message" + } + ], + "output_audio_format": "pcm16", + "status": "completed", + "status_details": null, + "temperature": 0.8, + "usage": { + "input_token_details": { + "audio_tokens": 114, + "cached_tokens": 192, + "text_tokens": 181, + "cached_tokens_details": { + "text_tokens": 128, + "audio_tokens": 64 + } + }, + "input_tokens": 295, + "output_token_details": { + "audio_tokens": 117, + "text_tokens": 40 + }, + "output_tokens": 157, + "total_tokens": 452 + }, + "voice": "echo" + }, + "type": "response.done" + } +] +``` +
+ + + +[openai-realtime-api]: https://platform.openai.com/docs/guides/realtime +[google-gemini]: https://ai.google.dev/api/multimodal-live \ No newline at end of file