
Add multimodal API such as using image as part of prompt #40

Open
yaoyaoumbc opened this issue Sep 6, 2024 · 3 comments · May be fixed by #71
Labels
enhancement New feature or request

Comments

@yaoyaoumbc

Gemini Nano XS claims to be multimodal, but I did not find any corresponding API in Chrome on desktop. Could you add such APIs? Thank you.

domenic added the enhancement (New feature or request) label Oct 9, 2024
@basvandorst

+1

The current languageModel context/prefix is too specific and doesn't account for future AI capabilities like image/voice/video interactions. I'd suggest having a look at OpenAI (and others) to see how they aren't strictly tied to a "language model".

OpenAI

import OpenAI from "openai";

const openai = new OpenAI();

async function main() {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "What's in this image?" },
          {
            type: "image_url",
            image_url: {
              url: "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
            },
          },
        ],
      },
    ],
  });
  console.log(response.choices[0].message.content);
}

main();

Claude

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

const message = await anthropic.messages.create({
  model: "claude-3-5-sonnet-20241022",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "image",
          source: {
            type: "base64",
            media_type: image_media_type, // e.g. "image/jpeg"
            data: image_data, // base64-encoded image bytes
          },
        },
      ],
    },
  ],
});

I think this will also be (partially) the solution for #8; let's not reinvent the wheel too much.

@domenic
Collaborator

domenic commented Jan 20, 2025

My initial design here was to follow the usual HTTP APIs (e.g. OpenAI, Anthropic, Gemini, etc.). That ended up looking something like

await session.prompt([
  "This is a user text message",
  { role: "system", content: "This is a system text message" },
  { type: "image", value: image }, // A user image message
  { role: "assistant", content: { type: "image", value: image } }, // An assistant image message
]);

Other cosmetic variations might be using MIME types instead of strings like "text" or "image" (but this seems kind of dumb: why make the web developer care about the difference between image/png and image/jpeg?). Or using data instead of value.

I didn't like the shape that we saw in OpenAI/Anthropic of having different fields per type, i.e. { type: "image", image: image } or { type: "audio", audio: audio }. That seemed unnecessary.

...But then I realized this was all unnecessary. Strings are different from images and audio! We can just use the type of the input.

So my current plan is the following:

await session.prompt([
  "This is a user text message",
  { role: "system", content: "This is a system text message" },
  image, // A user image message
  { role: "assistant", content: image }, // An assistant image message
]);

@domenic
Collaborator

domenic commented Jan 20, 2025

Wait, no, that doesn't work, because then we can't distinguish an image Blob from an audio Blob. OK, back to the initial design.

domenic added a commit that referenced this issue Jan 20, 2025
Closes #40. Somewhat helps with #70.
domenic linked a pull request Jan 20, 2025 that will close this issue

3 participants