
Commit 7d7f1e9

Update multimodal prompting API surface
* Add the ability to include multiple modalities in a single message.
* Remove several "shorthand" API formats. These can cause confusion between multiple messages and a single message with multiple parts. In doing so, we align with the emerging de-facto industry standards. See discussion in #89 (comment).
* Remove the restriction on audio and image prompts with the system role. (Keep the restriction with the assistant role.)

Closes #89 by superseding it. #89 was an attempt at adding support for multiple modalities in a single message, while preserving all our existing shorthands, but was deemed too confusing.
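For a concrete sense of the change, a minimal before/after sketch of the calling convention (assuming a `session` and `referenceImage` as in the README's artistic-critique example):

```js
// Before this commit, a bare content array was ambiguous between one message
// with several parts and several separate messages:
//
//   await session.prompt([
//     "Give a helpful artistic critique:",
//     { type: "image", content: referenceImage }
//   ]);
//
// After this commit, multimodal content always lives inside an explicit message:
await session.prompt([{
  role: "user",
  content: [
    { type: "text", content: "Give a helpful artistic critique:" },
    { type: "image", content: referenceImage }
  ]
}]);
```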
1 parent f784983 · commit 7d7f1e9

File tree

1 file changed (+87, -31 lines changed)

README.md

Lines changed: 87 additions & 31 deletions
````diff
@@ -158,7 +158,7 @@ async function promptWithCalculator(prompt) {
   const mathResult = evaluateMathExpression(expression);
 
   // Add the result to the session so it's in context going forward.
-  await session.prompt({ role: "assistant", content: mathResult });
+  await session.prompt([{ role: "assistant", content: mathResult }]);
 
   // Return it as if that's what the assistant said to the user.
   return mathResult;
````
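In other words, `prompt()` no longer accepts a bare message object: even a single message is written as a one-element array. A minimal illustration, assuming the `session` and `mathResult` values from the surrounding example:

```js
// Removed shorthand (now rejected): the argument must be an array of
// messages, or a bare string.
// await session.prompt({ role: "assistant", content: mathResult });

// Canonical form: a single message is a one-element message array.
await session.prompt([{ role: "assistant", content: mathResult }]);
```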
````diff
@@ -198,23 +198,54 @@ const session = await LanguageModel.create({
 const referenceImage = await (await fetch("/reference-image.jpeg")).blob();
 const userDrawnImage = document.querySelector("canvas");
 
-const response1 = await session.prompt([
-  "Give a helpful artistic critique of how well the second image matches the first:",
-  { type: "image", content: referenceImage },
-  { type: "image", content: userDrawnImage }
-]);
+const response1 = await session.prompt([{
+  role: "user",
+  content: [
+    { type: "text", content: "Give a helpful artistic critique of how well the second image matches the first:" },
+    { type: "image", content: referenceImage },
+    { type: "image", content: userDrawnImage }
+  ]
+}]);
 
 console.log(response1);
 
 const audioBlob = await captureMicrophoneInput({ seconds: 10 });
 
-const response2 = await session.prompt([
-  "My response to your critique:",
-  { type: "audio", content: audioBlob }
-]);
+const response2 = await session.prompt([{
+  role: "user",
+  content: [
+    { type: "text", content: "My response to your critique:" },
+    { type: "audio", content: audioBlob }
+  ]
+}]);
 ```
 
-Future extensions may include more ambitious multimodal inputs, such as video clips, or realtime audio or video. (Realtime might require a different API design, more based around events or streams instead of messages.)
+Note how once we move to multimodal prompting, the prompt format becomes more explicit:
+
+* We must always pass an array of messages, instead of a single string value.
+* Each message must have a `role` property: unlike with the string shorthand, `"user"` is no longer assumed.
+* The `content` property must be an array of content, if it contains any multimodal content.
+
+This extra ceremony is necessary to make it clear that we are sending a single message that contains multimodal content, versus sending multiple messages, one per piece of content. To avoid such confusion, the multimodal format has fewer defaults and shorthands than if you interact with the API using only text. (See some discussion in [pull request #89](https://github.com/webmachinelearning/prompt-api/pull/89).)
+
+To illustrate, the following extension of our above [multi-user example](#customizing-the-role-per-prompt) has a sequence of text + image + image values similar to our artistic critique example. However, it uses a multi-message structure instead of the artistic critique example's single-message structure, so the model will interpret it differently:
+
+```js
+const response = await session.prompt([
+  {
+    role: "user",
+    content: "Your compromise just made the discussion more heated. The two departments drew up posters to illustrate their strategies' advantages:"
+  },
+  {
+    role: "user",
+    content: [{ type: "image", content: brochureFromTheMarketingDepartment }]
+  },
+  {
+    role: "user",
+    content: [{ type: "image", content: brochureFromTheFinanceDepartment }]
+  }
+]);
+```
 
 Details:
 
````
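For text-only prompting, the shorthands retained in the IDL (later in this diff) still expand to the canonical form, so the three calls below are equivalent. A sketch assuming an existing `session`; the prompt text is illustrative:

```js
// Bare string shorthand:
const a = await session.prompt("Tell me a joke.");

// Message shorthand: role is explicit, content is a plain string.
const b = await session.prompt([{ role: "user", content: "Tell me a joke." }]);

// Canonical form: content is an array of typed parts.
const c = await session.prompt([
  { role: "user", content: [{ type: "text", content: "Tell me a joke." }] }
]);
```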
````diff
@@ -226,11 +257,11 @@ Details:
 
 * For `HTMLVideoElement`, even a single frame might not yet be downloaded when the prompt API is called. In such cases, calling into the prompt API will force at least a single frame's worth of video to download. (The intent is to behave the same as `createImageBitmap(videoEl)`.)
 
-* Text prompts can also be done via `{ type: "text", content: aString }`, instead of just `aString`. This can be useful for generic code.
-
 * Attempting to supply an invalid combination, e.g. `{ type: "audio", content: anImageBitmap }`, `{ type: "image", content: anAudioBuffer }`, or `{ type: "text", content: anArrayBuffer }`, will reject with a `TypeError`.
 
-* As described [above](#customizing-the-role-per-prompt), you can also supply a `role` value in these objects, so that the full form is `{ role, type, content }`. However, for now, using any role besides the default `"user"` role with an image or audio prompt will reject with a `"NotSupportedError"` `DOMException`. (As we explore multimodal outputs, this restriction might be lifted in the future.)
+* For now, using the `"assistant"` role with an image or audio prompt will reject with a `"NotSupportedError"` `DOMException`. (As we explore multimodal outputs, this restriction might be lifted in the future.)
+
+Future extensions may include more ambitious multimodal inputs, such as video clips, or realtime audio or video. (Realtime might require a different API design, more based around events or streams instead of messages.)
 
 ### Structured output or JSON output
 
````
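The rejection behavior described in these bullets could be observed as follows (a sketch; `session` and `imageBitmap` are assumed to exist):

```js
// A mismatched type/content pair rejects with a TypeError:
try {
  await session.prompt([
    { role: "user", content: [{ type: "audio", content: imageBitmap }] }
  ]);
} catch (e) {
  console.assert(e instanceof TypeError);
}

// Image or audio content with the "assistant" role rejects with a
// "NotSupportedError" DOMException, for now:
try {
  await session.prompt([
    { role: "assistant", content: [{ type: "image", content: imageBitmap }] }
  ]);
} catch (e) {
  console.assert(e.name === "NotSupportedError");
}
```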
````diff
@@ -547,16 +578,16 @@ interface LanguageModel : EventTarget {
 
   // These will throw "NotSupportedError" DOMExceptions if role = "system"
   Promise<DOMString> prompt(
-    LanguageModelPromptInput input,
+    LanguageModelPrompt input,
     optional LanguageModelPromptOptions options = {}
   );
   ReadableStream promptStreaming(
-    LanguageModelPromptInput input,
+    LanguageModelPrompt input,
     optional LanguageModelPromptOptions options = {}
   );
 
   Promise<double> measureInputUsage(
-    LanguageModelPromptInput input,
+    LanguageModelPrompt input,
     optional LanguageModelPromptOptions options = {}
   );
   readonly attribute double inputUsage;
````
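Since `prompt()`, `promptStreaming()`, and `measureInputUsage()` now share the same `LanguageModelPrompt` input type, the same value can be measured before being sent. A sketch, assuming an existing `session` (the message content is illustrative):

```js
const messages = [{
  role: "user",
  content: [{ type: "text", content: "Summarize our discussion so far." }]
}];

const usage = await session.measureInputUsage(messages); // Promise<double>
const reply = await session.prompt(messages);            // Promise<DOMString>
const stream = session.promptStreaming(messages);        // ReadableStream
```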
````diff
@@ -597,7 +628,7 @@ dictionary LanguageModelCreateOptions : LanguageModelCreateCoreOptions {
   AICreateMonitorCallback monitor;
 
   DOMString systemPrompt;
-  sequence<LanguageModelPrompt> initialPrompts;
+  LanguageModelInitialPrompts initialPrompts;
 };
 
 dictionary LanguageModelPromptOptions {
````
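A creation call using the new `initialPrompts` type might look like the following sketch (the prompt text is illustrative; note that, per the `LanguageModelInitialPrompts` typedef below, the bare-string shorthand is not accepted here):

```js
const session = await LanguageModel.create({
  systemPrompt: "You are a helpful assistant.",
  initialPrompts: [
    // The per-message string shorthand remains valid here:
    { role: "user", content: "Hello!" },
    { role: "assistant", content: "Hello! How can I help?" }
  ]
});
```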
````diff
@@ -610,37 +641,62 @@ dictionary LanguageModelCloneOptions {
 };
 
 dictionary LanguageModelExpectedInput {
-  required LanguageModelPromptType type;
+  required LanguageModelMessageType type;
   sequence<DOMString> languages;
 };
 
 // The argument to the prompt() method and others like it
 
-typedef (LanguageModelPrompt or sequence<LanguageModelPrompt>) LanguageModelPromptInput;
-
-// Prompt lines
-
 typedef (
-  DOMString // interpreted as { role: "user", type: "text", content: providedValue }
-  or LanguageModelPromptDict // canonical form
+  // Canonical format
+  sequence<LanguageModelMessage>
+  // Shorthand per the comment on LanguageModelMessageShorthand below
+  or sequence<LanguageModelMessageShorthand>
+  // Shorthand for [{ role: "user", content: [{ type: "text", content: providedValue }] }]
+  or DOMString
 ) LanguageModelPrompt;
 
+// The initialPrompts value omits the single string shorthand
+typedef (
+  // Canonical format
+  sequence<LanguageModelMessage>
+  // Shorthand per the comment on LanguageModelMessageShorthand below
+  or sequence<LanguageModelMessageShorthand>
+) LanguageModelInitialPrompts;
+
+
+dictionary LanguageModelMessage {
+  required LanguageModelMessageRole role;
+  required sequence<LanguageModelMessageContent> content;
+};
+
+// Shorthand for { role: providedValue.role, content: [{ type: "text", content: providedValue.content }] }
+dictionary LanguageModelMessageShorthand {
+  required LanguageModelMessageRole role;
+  required DOMString content;
+};
+
+dictionary LanguageModelMessageContent {
+  required LanguageModelMessageType type;
+  required LanguageModelMessageContentValue content;
+};
+
 dictionary LanguageModelPromptDict {
-  LanguageModelPromptRole role = "user";
-  LanguageModelPromptType type = "text";
-  required LanguageModelPromptContent content;
+  LanguageModelMessageRole role = "user";
+  LanguageModelMessageType type = "text";
+  required LanguageModelMessageContent content;
 };
 
-enum LanguageModelPromptRole { "system", "user", "assistant" };
+enum LanguageModelMessageRole { "system", "user", "assistant" };
 
-enum LanguageModelPromptType { "text", "image", "audio" };
+enum LanguageModelMessageType { "text", "image", "audio" };
 
 typedef (
   ImageBitmapSource
   or AudioBuffer
   or BufferSource
   or DOMString
-) LanguageModelPromptContent;
+) LanguageModelMessageContentValue;
 ```
 
 ### Instruction-tuned versus base models
````
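Reading the new types together: each `LanguageModelMessageContent` pairs a `type` with a compatible `LanguageModelMessageContentValue`. A sketch of valid pairings (assuming `session`, `imageBlob`, and `audioBuffer` exist):

```js
await session.prompt([{
  role: "user",
  content: [
    { type: "text", content: "Describe these:" }, // DOMString
    { type: "image", content: imageBlob },        // ImageBitmapSource (a Blob qualifies)
    { type: "audio", content: audioBuffer }       // AudioBuffer
  ]
}]);
```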
