Add multimodal API such as using image as part of prompt #40
+1. The current OpenAI API looks like this:

```js
import OpenAI from "openai";

const openai = new OpenAI();

async function main() {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "What’s in this image?" },
          {
            type: "image_url",
            image_url: {
              url: "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            },
          },
        ],
      },
    ],
  });
}
```

And Claude's:

```js
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// image_media_type and image_data (a base64-encoded string) are assumed
// to be defined elsewhere; see the sketch below for one way to produce them.
const message = await anthropic.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "base64",
            media_type: image_media_type,
            data: image_data,
          },
        },
      ],
    },
  ],
});
```

I think this will also be (partially) the solution for #8; let's not reinvent the wheel too much.
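For context, here is a minimal sketch of one way the `image_media_type` and `image_data` variables in the Claude snippet above could be produced, assuming Node.js 18+ (global `fetch` and `Buffer`); the URL and MIME type here are illustrative assumptions, not part of the original comment:

```js
// Fetch an image and base64-encode it for Anthropic's messages API.
const image_url =
  "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg";
const image_media_type = "image/jpeg";
const image_array_buffer = await fetch(image_url).then((res) => res.arrayBuffer());
const image_data = Buffer.from(image_array_buffer).toString("base64");
```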
My initial design here was to follow the usual HTTP APIs (e.g. OpenAI, Anthropic, Gemini, etc.). That ended up looking something like

```js
await session.prompt([
  "This is a user text message",
  { role: "system", content: "This is a system text message" },
  { type: "image", value: image }, // A user image message
  { role: "assistant", content: { type: "image", value: image } }, // An assistant image message
]);
```

Other cosmetic variations might be using MIME types instead of strings like `"image"`. I didn't like the shape that we saw in OpenAI/Anthropic of having different fields per type, i.e. `image_url` vs. `text`.

...But then I realized this was all unnecessary. Strings are different than images and audio! We can just use the type of the input. So my current plan is the following:

```js
await session.prompt([
  "This is a user text message",
  { role: "system", content: "This is a system text message" },
  image, // A user image message
  { role: "assistant", content: image }, // An assistant image message
]);
```
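To illustrate the "use the type of the input" idea, here is a minimal sketch of how an implementation might classify each prompt entry by its JavaScript type; the specific classes checked (`ImageBitmap`, `AudioBuffer`) are illustrative assumptions, not part of the proposal:

```js
// Hypothetical sketch: infer the content type of a prompt entry from its
// JavaScript type instead of an explicit { type: ... } field.
function inferContentType(input) {
  if (typeof input === "string") return "text"; // plain user text
  if (input instanceof ImageBitmap) return "image"; // decoded image
  if (input instanceof AudioBuffer) return "audio"; // decoded audio
  throw new TypeError("Unsupported prompt input");
}
```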
Wait, no, that doesn't work, because then we can't distinguish an image […]
Gemini Nano XS claims to be multimodal, but I did not find any corresponding API in Chrome on desktop. Could you add such APIs? Thank you.