
Add multimodal API such as using image as part of prompt #40

Open
yaoyaoumbc opened this issue Sep 6, 2024 · 3 comments · May be fixed by #71
Labels
enhancement New feature or request

Comments

@yaoyaoumbc

Gemini Nano XS claims to be multimodal, but I did not find any corresponding API in Chrome on desktop. Could you add such APIs? Thank you.

domenic added the enhancement (New feature or request) label Oct 9, 2024
@basvandorst

+1

The current languageModel context/prefix is too specific and doesn't account for future AI capabilities like image/voice/video interactions. I'd suggest having a look at OpenAI (and others) to see how they aren't strictly tied to a "language model".

OpenAI

import OpenAI from "openai";

const openai = new OpenAI();

async function main() {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "What's in this image?" },
          {
            type: "image_url",
            image_url: {
              url: "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            },
          },
        ],
      },
    ],
  });
  console.log(response.choices[0].message.content);
}

main();

Claude

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

const message = await anthropic.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "base64",
            media_type: image_media_type, // e.g. "image/jpeg"
            data: image_data, // base64-encoded image bytes
          },
        },
      ],
    },
  ],
});

I think this will also be (partially) the solution for #8; let's not reinvent the wheel too much.

@domenic
Collaborator

domenic commented Jan 20, 2025

My initial design here was to follow the usual HTTP APIs (e.g. OpenAI, Anthropic, Gemini, etc.). That ended up looking something like

await session.prompt([
  "This is a user text message",
  { role: "system", content: "This is a system text message" },
  { type: "image", value: image }, // A user image message
  { role: "assistant", content: { type: "image", value: image } }, // An assistant image message
]);

Other cosmetic variations might be using MIME types instead of strings like "text" or "image" (but this seems kind of dumb: why make the web developer care about the difference between image/png and image/jpeg?). Or using data instead of value.

I didn't like the shape that we saw in OpenAI/Anthropic of having different fields per type, i.e. { type: "image", image: image } or { type: "audio", audio: audio }. That seemed unnecessary.

...But then I realized this was all unnecessary. Strings are different from images and audio! We can just use the type of the input.

So my current plan is the following:

await session.prompt([
  "This is a user text message",
  { role: "system", content: "This is a system text message" },
  image, // A user image message
  { role: "assistant", content: image }, // An assistant image message
]);

@domenic
Collaborator

domenic commented Jan 20, 2025

Wait, no, that doesn't work, because then we can't distinguish an image Blob from an audio Blob. OK, back to the initial design.

domenic added a commit that referenced this issue Jan 20, 2025
Closes #40. Somewhat helps with #70.
domenic linked a pull request Jan 20, 2025 that will close this issue

3 participants