Task Clarification: Multimodal Audio-text-to-text #1638
-
Hi all, I saw that this page encouraging contributions, however it seems unclear to me what this task is supposed to represent. There is already audio only tasks, which I think would be represented by this, as they often combine text annotation and audio data (e.g. audio classification and speech recognition). |
Beta Was this translation helpful? Give feedback.
Replies: 3 comments
-
cc @merveenoyan ? |
Beta Was this translation helpful? Give feedback.
-
audio-text-to-text models can take audio + text inputs, and generate text output. They are in a way the equivalent of VLMs (vision-language-models) in the audio domain. You can browse through the list of models that support this task to get a better sense of what they are capable of. Feel free to open a PR if you feel inspired! |
Beta Was this translation helpful? Give feedback.
-
very sorry I missed this issue! indeed those are models that take e.g. voice recordings and question about the recording like "summarize this recording" or "what topic is this recording talking about" and returns text answer. |
Beta Was this translation helpful? Give feedback.
audio-text-to-text models can take audio + text inputs, and generate text output. They are in a way the equivalent of VLMs (vision-language-models) in the audio domain. You can browse through the list of models that support this task to get a better sense of what they are capable of.
Feel free to open a PR if you feel inspired!