diff --git a/images/MMAUdio.png b/images/MMAUdio.png
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/images/MMAUdio.png
@@ -0,0 +1 @@
+
diff --git a/images/MMAudio2.png b/images/MMAudio2.png
new file mode 100644
index 0000000..8b13789
--- /dev/null
+++ b/images/MMAudio2.png
@@ -0,0 +1 @@
+
diff --git a/summaries/MMAudio.md b/summaries/MMAudio.md
new file mode 100644
index 0000000..715d3a5
--- /dev/null
+++ b/summaries/MMAudio.md
@@ -0,0 +1,43 @@
# MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing. **NeurIPS 2024**

## Summary

The paper tackles the generation of high-quality, semantically aligned audio for video. The core idea is simple but powerful: instead of learning only from limited video-audio pairs (which are hard to collect at scale), the model trains jointly on both video-audio and text-audio data. In addition, audio-video synchrony is improved by a conditional synchronization module that aligns video conditions with audio latents at the frame level. The model also performs well on plain text-to-audio generation, showing that joint training does not hinder single-modality performance.

## Contributions

- Multimodal Joint Training Paradigm.
- Conditional Synchronization Module.
- Aligned RoPE Positional Embeddings.
- Competitive Text-to-Audio Generation.

## Method

MMAudio generates audio with a flow-based model under multimodal conditioning. The approach combines three key components: a multimodal transformer architecture, a conditional synchronization module, and a joint training strategy. Minimal sketches of the main mechanisms appear at the end of this summary.

1. Conditional flow matching framework.
2. Multimodal transformer architecture. Key components:

   a. RoPE-based positional encoding, aligned across modalities.

   b. Concatenation of Q, K, V across modalities for joint attention, then splitting the output back into per-modality streams.

3. Conditional synchronization module. The pipeline: extract high-frame-rate visual features, then project and upsample them to the audio latent rate, applying token-level conditioning.

## Results

MMAudio achieves state-of-the-art performance across all metrics among public models, with the smallest variant (S-16kHz, 157M parameters) outperforming significantly larger baselines.

## Two-Cents

The model delivers exceptional synchronization, with SOTA results on video-to-audio. That said, the "joint training" story feels somewhat oversold: the ablation study suggests it is not revolutionary under the hood. Another limitation the authors discuss is generating intelligible sounds; human vocalizations are more complex than the Foley-style audio the model targets, and the model may fail to accommodate them.

## Resources
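
## Minimal Sketches

The snippets below are illustrative sketches, not the authors' code; all function names, dimensions, and rates are assumptions made for exposition. First, conditional flow matching: the model regresses a velocity field that transports Gaussian noise to audio latents given the multimodal conditioning, and generation integrates the learned ODE (plain Euler steps here). `velocity_net` is a hypothetical stand-in for MMAudio's multimodal transformer.

```python
import torch

def fm_training_step(velocity_net, x1, cond):
    """One flow-matching step: regress the velocity that moves noise to data."""
    b = x1.shape[0]
    t = torch.rand(b, device=x1.device)            # random time in [0, 1]
    x0 = torch.randn_like(x1)                      # Gaussian noise endpoint
    tb = t.view(b, *([1] * (x1.dim() - 1)))        # broadcast t over latent dims
    xt = (1 - tb) * x0 + tb * x1                   # point on the straight-line path
    target = x1 - x0                               # constant velocity along that path
    pred = velocity_net(xt, t, cond)
    return torch.nn.functional.mse_loss(pred, target)

@torch.no_grad()
def fm_sample(velocity_net, shape, cond, steps=25, device="cpu"):
    """Euler integration of the learned ODE from noise (t=0) to data (t=1)."""
    x = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + dt * velocity_net(x, t, cond)
    return x
```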
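
Next, the joint-attention idea from the Method section: per-modality Q/K/V are concatenated along the token axis so every token attends across all modalities, and the output is split back into per-modality streams. Shapes are illustrative, assuming tensors of shape `(B, L_modality, H, D)`.

```python
import torch
import torch.nn.functional as F

def joint_attention(q_list, k_list, v_list):
    """Joint attention over concatenated modalities; one list entry per modality."""
    lengths = [q.shape[1] for q in q_list]          # token count per modality
    # Concatenate modalities along the token axis, move heads forward for SDPA.
    q = torch.cat(q_list, dim=1).transpose(1, 2)    # (B, H, L_total, D)
    k = torch.cat(k_list, dim=1).transpose(1, 2)
    v = torch.cat(v_list, dim=1).transpose(1, 2)
    out = F.scaled_dot_product_attention(q, k, v)   # every token sees all modalities
    out = out.transpose(1, 2)                       # (B, L_total, H, D)
    return list(out.split(lengths, dim=1))          # back to per-modality streams
```

Usage would look like `audio_o, video_o, text_o = joint_attention([qa, qv, qt], [ka, kv, kt], [va, vv, vt])`, after which each stream goes through its own feed-forward path.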
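
The aligned-RoPE idea: audio latents and video features arrive at different token rates, so indexing rotary embeddings by raw token index would make the phases of co-occurring tokens drift apart. Expressing positions on a shared time axis keeps audio and video tokens that happen at the same instant at the same rotary phase. The token rates below are made up for illustration.

```python
import torch

def rope_angles(positions, dim, base=10000.0):
    """Rotary-embedding angles for (possibly fractional) positions."""
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return positions[:, None] * freqs[None, :]      # (L, dim // 2)

# Positions in seconds rather than token indices (rates are illustrative):
audio_pos = torch.arange(256) / 32.0                # e.g. 32 audio latents per second
video_pos = torch.arange(64) / 8.0                  # e.g. 8 video features per second
audio_angles = rope_angles(audio_pos, dim=64)       # tokens at the same wall-clock time
video_angles = rope_angles(video_pos, dim=64)       # now share the same rotary phase
```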
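
Finally, the conditional synchronization module's pipeline (project high-frame-rate visual features, upsample to the audio rate, condition per token) could be sketched as below. `SyncConditioner` and all dimensions are hypothetical, and the real module sits on top of a dedicated high-frame-rate visual encoder rather than raw frames.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SyncConditioner(nn.Module):
    """Frame-level conditioning sketch: project visual features, then upsample
    them to the audio-latent rate so each audio token gets its own condition."""
    def __init__(self, vis_dim=768, audio_dim=512):
        super().__init__()
        self.proj = nn.Linear(vis_dim, audio_dim)

    def forward(self, vis_feats, num_audio_tokens):
        # vis_feats: (B, T_video, vis_dim) from a high-frame-rate visual encoder
        x = self.proj(vis_feats)                    # (B, T_video, audio_dim)
        x = x.transpose(1, 2)                       # (B, audio_dim, T_video)
        x = F.interpolate(x, size=num_audio_tokens, mode="nearest")
        return x.transpose(1, 2)                    # (B, T_audio, audio_dim)
```

The resulting `(B, T_audio, audio_dim)` tensor can then be added to the audio tokens before each transformer block, which is one plausible way to realize the token-level conditioning the summary describes.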