Task Clarification: Multimodal Audio-text-to-text #1638

camscottie · 2025-07-17T14:41:42Z

camscottie
Jul 17, 2025

Hi all, I saw that this page encouraging contributions, however it seems unclear to me what this task is supposed to represent. There is already audio only tasks, which I think would be represented by this, as they often combine text annotation and audio data (e.g. audio classification and speech recognition).
Could someone clarify what this task is supposed to represent? I'd be happy to contribute to the page so that others can utilise this task more effectively.

Answered by pcuenca

Jul 28, 2025

audio-text-to-text models can take audio + text inputs, and generate text output. They are in a way the equivalent of VLMs (vision-language-models) in the audio domain. You can browse through the list of models that support this task to get a better sense of what they are capable of.

Feel free to open a PR if you feel inspired!

View full answer

Wauplin · 2025-07-18T08:52:02Z

Wauplin
Jul 18, 2025
Collaborator

cc @merveenoyan ?

0 replies

pcuenca · 2025-07-28T16:07:00Z

pcuenca
Jul 28, 2025
Collaborator

audio-text-to-text models can take audio + text inputs, and generate text output. They are in a way the equivalent of VLMs (vision-language-models) in the audio domain. You can browse through the list of models that support this task to get a better sense of what they are capable of.

Feel free to open a PR if you feel inspired!

0 replies

merveenoyan · 2025-07-29T09:56:24Z

merveenoyan
Jul 29, 2025
Collaborator

very sorry I missed this issue! indeed those are models that take e.g. voice recordings and question about the recording like "summarize this recording" or "what topic is this recording talking about" and returns text answer.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Task Clarification: Multimodal Audio-text-to-text #1638

Uh oh!

{{title}}

Uh oh!

Replies: 3 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Task Clarification: Multimodal Audio-text-to-text #1638

Uh oh!

camscottie Jul 17, 2025

Replies: 3 comments

Uh oh!

Wauplin Jul 18, 2025 Collaborator

Uh oh!

pcuenca Jul 28, 2025 Collaborator

Uh oh!

merveenoyan Jul 29, 2025 Collaborator

camscottie
Jul 17, 2025

Wauplin
Jul 18, 2025
Collaborator

pcuenca
Jul 28, 2025
Collaborator

merveenoyan
Jul 29, 2025
Collaborator