* Add the ability to include multiple modalities in a single message.
* Remove several "shorthand" API formats. These can cause confusion between multiple messages and a single message with multiple parts. In doing so, we align with the emerging de-facto industry standards. See discussion in #89 (comment).
* Remove the restriction on audio and image prompts with the system role. (Keep the restriction with the assistant role.)
Closes #89 by superseding it. #89 was an attempt at adding support for multiple modalities in a single message while preserving all our existing shorthands, but was deemed too confusing.
    { type: "text", content: "My response to your critique:" },
    { type: "audio", content: audioBlob }
  ]
}]);
```
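The snippet above is the tail of a single user message whose `content` is an array of typed parts. The overall shape can be sketched with plain objects (here `audioBlob` is just a placeholder string, not real audio data):

```javascript
// Sketch of the multi-part message shape, using plain objects.
// "audioBlob" is a placeholder; in a real page it would be a Blob or AudioBuffer.
const audioBlob = "<audio data>";

const messages = [{
  role: "user",
  content: [
    { type: "text", content: "My response to your critique:" },
    { type: "audio", content: audioBlob }
  ]
}];

// One message containing two pieces of content, not two messages.
console.log(messages.length, messages[0].content.length); // 1 2
```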
Note how once we move to multimodal prompting, the prompt format becomes more explicit:

* We must always pass an array of messages, instead of a single string value.
* Each message must have a `role` property: unlike with the string shorthand, `"user"` is no longer assumed.
* The `content` property must be an array of content if it contains any multimodal content.
This extra ceremony is necessary to make it clear that we are sending a single message that contains multimodal content, versus sending multiple messages, one per piece of content. To avoid such confusion, the multimodal format has fewer defaults and shorthands than the text-only form of the API. (See some discussion in [issue #89](https://github.com/webmachinelearning/prompt-api/pull/89).)
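The difference can be made concrete with plain data. In this sketch the image values are placeholder strings rather than real `ImageBitmap`s:

```javascript
// One message whose content is three pieces (text + image + image)...
const singleMessage = [{
  role: "user",
  content: [
    { type: "text", content: "Compare these:" },
    { type: "image", content: "<image A>" },
    { type: "image", content: "<image B>" }
  ]
}];

// ...versus three separate messages carrying the same pieces.
const multipleMessages = [
  { role: "user", content: "Compare these:" },
  { role: "user", content: [{ type: "image", content: "<image A>" }] },
  { role: "user", content: [{ type: "image", content: "<image B>" }] }
];

console.log(singleMessage.length, multipleMessages.length); // 1 3
```

The model treats these differently: the first is one utterance with inline context, the second is three turns from the same role.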
To illustrate, the following extension of our above [multi-user example](#customizing-the-role-per-prompt) has a similar sequence of text + image + image values compared to our artistic critique example. However, it uses a multi-message structure instead of the artistic critique example's single-message structure, so the model will interpret it differently:
```js
const response = await session.prompt([
  {
    role: "user",
    content: "Your compromise just made the discussion more heated. The two departments drew up posters to illustrate their strategies' advantages:"
  },
  // (the example continues with one message per image)
]);
```
* For `HTMLVideoElement`, even a single frame might not yet be downloaded when the prompt API is called. In such cases, calling into the prompt API will force at least a single frame's worth of video to download. (The intent is to behave the same as `createImageBitmap(videoEl)`.)
* Attempting to supply an invalid combination, e.g. `{ type: "audio", content: anImageBitmap }`, `{ type: "image", content: anAudioBuffer }`, or `{ type: "text", content: anArrayBuffer }`, will reject with a `TypeError`.
* For now, using the `"assistant"` role with an image or audio prompt will reject with a `"NotSupportedError"` `DOMException`. (As we explore multimodal outputs, this restriction might be lifted in the future.)
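The type/content pairing rules above can be sketched as a small check. This is a hypothetical illustration, not the spec's algorithm; real content values are objects like `ImageBitmap` or `AudioBuffer`, approximated here by `kind` tags:

```javascript
// Hypothetical sketch of the invalid-combination check. Each part carries a
// "kind" tag standing in for the real JS type of its content value.
const validKinds = {
  text: ["string"],
  image: ["image-bitmap", "blob"],
  audio: ["audio-buffer", "blob"]
};

function validateContentPart(part) {
  const allowed = validKinds[part.type] || [];
  if (!allowed.includes(part.kind)) {
    // Mirrors the TypeError described above for mismatched combinations.
    throw new TypeError(`"${part.kind}" is not valid content for type "${part.type}"`);
  }
}

validateContentPart({ type: "audio", kind: "audio-buffer" }); // accepted
// validateContentPart({ type: "audio", kind: "image-bitmap" }); // would throw TypeError
```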
Future extensions may include more ambitious multimodal inputs, such as video clips, or realtime audio or video. (Realtime might require a different API design, more based around events or streams instead of messages.)