MAVL: Multilingual Audio-Video Lyrics Dataset for Animated Song Translation

This repository provides code and resources for working with the Multilingual Audio-Video Lyrics (MAVL) dataset, available on Hugging Face (https://huggingface.co/datasets/Noename/MAVL). The MAVL dataset is described in detail in the paper "MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation".

Dataset Description

The MAVL dataset contains a collection of YouTube URLs corresponding to songs from animated musicals. It includes meticulously aligned lyrics (English, Spanish, French, Korean, and Japanese) with corresponding timestamps, song titles, and artist information. MAVL is designed as the first multilingual, multimodal benchmark for singable lyrics translation.

Crucially, due to copyright considerations, this dataset does not contain the actual audio, video, or full lyric text. Instead, it provides structured metadata and URLs, a compact representation of the lyrics (e.g., first letters of words, first/last words of lines), syllable counts, and precise timestamps. The provided scripts use this information to reconstruct the original lyrics and download the associated multimedia content.
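To make the compact representation concrete, here is a minimal sketch of how one lyric line could be reduced to such a clue. The function name `compact_clue` and the whitespace-based tokenization are assumptions for illustration; the repository's own logic lives in dataset/process_lyrics.py and may differ.

```python
def compact_clue(line: str) -> list[str]:
    """Return [first letters of every word, first word, last word] for one line.

    Hypothetical helper: words are split on whitespace only, which is an
    assumption and not necessarily how the MAVL scripts tokenize lyrics.
    """
    words = line.split()
    if not words:
        return ["", "", ""]
    initials = "".join(word[0] for word in words)
    return [initials, words[0], words[-1]]


print(compact_clue("There's squirrels in my pants"))
# prints ['Tsimp', "There's", 'pants']
```

The clue keeps enough information to identify a line in an externally obtained lyrics file without reproducing the line itself.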

Dataset Structure

The dataset is provided in JSON format. Each entry represents a song and contains:

Song Title: The name of the animated song.
lyrics: A nested structure containing lyric lines for each language (US_og for original English, ES for Spanish, FR for French, KR for Korean, JP for Japanese). Each line includes:
- text: A compact representation of the original lyric line (e.g., [["Tsimp", "There's", "pants"]], which encodes the first letters of the words, the first word, and the last word of a line from "Squirrels in my pants"). This is designed to allow for the reconstruction of the full lyric text using external resources.
- line_number: The sequential number of the lyric line.
- syllable_count: The syllable count of the lyric line.
- start: Start timestamp of the lyric line in the audio/video.
- end: End timestamp of the lyric line in the audio/video.
youtube_url: Dictionary containing YouTube URLs for the original and dubbed versions in different languages.
lyrics_url: Dictionary containing URLs to external websites where the full lyrics can be found for each language.
video: Boolean flags indicating the availability of video for each language.
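As a rough illustration of this structure, the sketch below builds one hypothetical entry and summarizes it per language. The key `song_title`, the list-of-lines nesting under each language code, the numeric timestamps, and all values are assumptions made for the example; check them against the actual mavl_dataset.json before relying on them.

```python
# Hypothetical entry. Field names follow the list above, but the exact
# nesting, the "song_title" key, and every value here are assumptions.
entry = {
    "song_title": "Example Song",
    "lyrics": {
        "US_og": [
            {"text": [["Tsimp", "There's", "pants"]], "line_number": 1,
             "syllable_count": 6, "start": 12.0, "end": 14.5},
            {"text": [["Tsimp", "There's", "pants"]], "line_number": 2,
             "syllable_count": 6, "start": 14.5, "end": 17.0},
        ],
        "ES": [],
    },
    "youtube_url": {"US_og": "https://example.com/video", "ES": ""},
    "lyrics_url": {"US_og": "https://example.com/lyrics", "ES": ""},
    "video": {"US_og": True, "ES": False},
}


def summarize_entry(entry: dict) -> dict:
    """Count lyric lines and syllables per language and report video flags."""
    summary = {}
    for lang, lines in entry["lyrics"].items():
        summary[lang] = {
            "lines": len(lines),
            "syllables": sum(line["syllable_count"] for line in lines),
            "has_video": bool(entry["video"].get(lang, False)),
        }
    return summary


print(summarize_entry(entry))
```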

Usage Instructions

Prerequisites

Install required dependencies:
```
pip install -r requirements.txt
```

Download mavl_dataset.json from Hugging Face (https://huggingface.co/datasets/Noename/MAVL) and place it in your working directory.

Step 1: Download Full Lyrics

The first step is to download the complete lyrics from the URLs provided in the dataset:

python dataset/download_lyrics.py --input mavl_dataset.json --output mavl_datasets

Parameters:

--input: Path to the MAVL dataset JSON file (default: mavl_dataset.json)
--output: Base directory where lyrics will be saved (default: mavl_datasets)

This creates a directory structure like:

mavl_datasets/
├── Song Title 1/
│   ├── US_og/
│   │   └── lyrics.txt
│   ├── ES/
│   │   └── lyrics.txt
│   └── ...
└── Song Title 2/
    └── ...

Note: This process may take some time as it respects rate limits for lyrics websites. Failed downloads are logged to download_failed.jsonl.
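The output layout and failure log described above can be sketched as follows. The helper names and the fields written to each JSON line are hypothetical; the actual record format used by dataset/download_lyrics.py is not documented here.

```python
import json
import tempfile
from pathlib import Path


def lyrics_path(base_dir: str, song_title: str, lang: str) -> Path:
    """Build <base_dir>/<song title>/<language>/lyrics.txt, as in the tree above."""
    return Path(base_dir) / song_title / lang / "lyrics.txt"


def log_failure(log_file: Path, song_title: str, lang: str, url: str, reason: str) -> None:
    """Append one JSON object per line (JSONL). The field names are assumptions."""
    record = {"song_title": song_title, "lang": lang, "url": url, "reason": reason}
    with open(log_file, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


# Usage: write one failure record to a temporary log and read it back.
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "download_failed.jsonl"
    log_failure(log_file, "Song Title 1", "ES", "https://example.com/lyrics", "timeout")
    logged = [json.loads(line) for line in log_file.read_text(encoding="utf-8").splitlines()]

print(lyrics_path("mavl_datasets", "Song Title 1", "US_og").as_posix())
print(logged)
```

A JSONL log like this can be re-read line by line to retry only the entries that failed.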

Step 2: Restore Complete Lyrics

After downloading the lyrics, restore the full text from the compact representation:

python dataset/restore_lyrics.py

This script:

Uses the compact clues in mavl_dataset.json
Matches them with the downloaded lyrics in mavl_datasets/
Outputs complete lyrics to mavl_dataset_restored.json

The restored dataset contains the full lyric text instead of the compact representation.
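The matching idea behind restoration can be sketched as below: a clue of the form [first letters, first word, last word] is compared against each downloaded line until one fits. This is an illustrative simplification with hypothetical function names; dataset/restore_lyrics.py may normalize text or resolve ambiguous matches differently.

```python
from typing import Optional


def matches_clue(clue: list[str], line: str) -> bool:
    """Check whether a full lyric line is consistent with a compact clue."""
    initials, first, last = clue
    words = line.split()  # assumption: whitespace tokenization
    return (
        bool(words)
        and words[0] == first
        and words[-1] == last
        and "".join(word[0] for word in words) == initials
    )


def restore_line(clue: list[str], candidates: list[str]) -> Optional[str]:
    """Return the first downloaded line that matches the clue, or None."""
    for line in candidates:
        if matches_clue(clue, line):
            return line
    return None


downloaded = [
    "Step right over and watch me put it down",
    "There's squirrels in my pants",
]
print(restore_line(["Tsimp", "There's", "pants"], downloaded))
# prints There's squirrels in my pants
```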

Step 3: Download YouTube Videos and Audio

To download the multimedia content:

python dataset/download_yt.py --input mavl_dataset.json --output mavl_datasets

Parameters:

--input: Path to the MAVL dataset JSON file (default: mavl_dataset.json)
--output: Base directory where videos and audio will be saved (default: mavl_datasets)

This downloads:

video.mp4: The YouTube video file
audio.wav: Extracted audio in WAV format

Directory structure after this step:

mavl_datasets/
├── Song Title 1/
│   ├── US_og/
│   │   ├── lyrics.txt
│   │   ├── video.mp4
│   │   └── audio.wav
│   └── ...
└── ...

Requirements: yt-dlp and ffmpeg must be installed on your system and available on your PATH.
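Before running this step, it can help to confirm that both tools are reachable. The sketch below checks for them and builds the kind of commands such a download might use; the exact options passed by dataset/download_yt.py are not documented here, so the command lines are assumptions.

```python
import shutil
from pathlib import Path


def missing_tools(tools=("yt-dlp", "ffmpeg")) -> list[str]:
    """Return the names of required executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def build_commands(url: str, out_dir: str) -> tuple[list[str], list[str]]:
    """Build (download, extract) commands. The option choices are assumptions."""
    video = str(Path(out_dir) / "video.mp4")
    audio = str(Path(out_dir) / "audio.wav")
    # yt-dlp: write to a fixed output name and merge streams into an mp4 container.
    download = ["yt-dlp", "--merge-output-format", "mp4", "-o", video, url]
    # ffmpeg: overwrite existing output (-y), drop the video stream (-vn), write WAV.
    extract = ["ffmpeg", "-y", "-i", video, "-vn", audio]
    return download, extract


print("Missing tools:", missing_tools())
download_cmd, extract_cmd = build_commands("https://example.com/watch", "mavl_datasets/Song Title 1/US_og")
print(download_cmd)
print(extract_cmd)
# To execute, pass each list to subprocess.run(cmd, check=True).
```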

Step 4: Process Lyrics for Copyright Protection (Optional)

If you need to create a copyright-protected version of full lyrics:

python dataset/process_lyrics.py --lyrics_path mavl_dataset_restored.json --output_path copyright_protected_lyrics.json

This reverses the restoration process, converting full lyrics back to compact representations.

Code Descriptions

dataset/download_lyrics.py: This script downloads the full lyrics from the lyrics_url provided in the MAVL dataset.
dataset/restore_lyrics.py: This script reconstructs the original lyrics from the compact representation provided in the dataset.
dataset/download_yt.py: This script downloads the video and audio files from the YouTube URLs specified in the dataset.
dataset/process_lyrics.py: This script processes the full lyrics to generate a copyright-protected representation, as used in the MAVL dataset (e.g., first letters of words, syllable counts).
ipa_converter/: This directory contains code for converting lyrics to the International Phonetic Alphabet (IPA) using the epitran library. It includes:
- epitran_utils.py: Utility functions for IPA conversion.
- valid_mappings.csv: Mappings for handling specific characters or words during IPA conversion.
- test.py: Test scripts for the IPA converter.
- The code in this directory handles numerical and other non-alphabetic characters by converting them to their written-out forms in the target language before IPA conversion. It also includes logic to handle embedded English words within other languages.
language_processors/: This directory contains language-specific transliteration tools used by the ipa_converter.
process_syllable/: This directory contains code for counting syllables in each language. It includes:
- Language-specific syllable counting scripts (e.g., english.py, korean.py).
- A syllabifier/ subdirectory with resources and code for syllabification, including:
  - CMU dictionary files for English.
  - Scripts for parsing the CMU dictionary and performing syllabification.

Technical Attributions for Processing Tools

The MAVL dataset's detailed annotations, including syllable counts and IPA transcriptions, rely on several external tools and libraries. We acknowledge and appreciate the work of their respective creators. The specific tools used for each language for syllable counting and IPA conversion are detailed in Table 20 of our accompanying paper and are summarized below:

Syllable Counting Tools

English: Uses the Syllabifier tool, which relies on the CMU Pronouncing Dictionary for phoneme-based syllabification.
Spanish: Employs the python-ported version of count-syllables-in-spanish.
French: Leverages the python-ported version of syllabify-fr.
Korean: The syllable count is the length of the text, i.e., the number of Hangul syllable blocks (characters).
Japanese: Uses the enhanced, python-ported version of japanese-mora-counter to count morae, which serve as the rhythmic unit in Japanese.
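The Korean rule is simple enough to sketch directly. The version below counts only precomposed Hangul syllable blocks (U+AC00 to U+D7A3) and ignores spaces, punctuation, and Latin text; whether process_syllable/korean.py filters characters in the same way is an assumption.

```python
def count_hangul_syllables(text: str) -> int:
    """Count precomposed Hangul syllable blocks (U+AC00..U+D7A3) in text."""
    return sum(1 for ch in text if "\uac00" <= ch <= "\ud7a3")


print(count_hangul_syllables("안녕하세요, world!"))
# prints 5
```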

IPA Conversion Tools

All Languages (English, Spanish, French, Korean, Japanese): All IPA transcriptions are generated using epitran (https://github.com/dmort27/epitran), a Python library for transcribing orthographic text into IPA.
- For English, epitran can optionally leverage the CMU Pronouncing Dictionary for improved accuracy.
- The language_processors directory in this repository includes custom logic to preprocess numbers and handle embedded English words before epitran conversion for better cross-lingual consistency.
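The number preprocessing mentioned above can be illustrated with a deliberately crude sketch that spells out each digit separately before the text would be handed to epitran. The `DIGIT_WORDS` table and `spell_out_digits` function are hypothetical; the repository's processors presumably handle full numbers (e.g., "12" as "twelve"), which this example does not.

```python
import re

# Hypothetical per-language digit names; only single digits are covered.
DIGIT_WORDS = {
    "en": ["zero", "one", "two", "three", "four",
           "five", "six", "seven", "eight", "nine"],
    "es": ["cero", "uno", "dos", "tres", "cuatro",
           "cinco", "seis", "siete", "ocho", "nueve"],
}


def spell_out_digits(text: str, lang: str) -> str:
    """Replace each ASCII digit with its written-out form in the given language."""
    words = DIGIT_WORDS[lang]
    spelled = re.sub(r"[0-9]", lambda m: f" {words[int(m.group())]} ", text)
    return " ".join(spelled.split())  # collapse the extra spaces added above


print(spell_out_digits("Take 5 steps", "en"))
# prints Take five steps
print(spell_out_digits("Dame 3 besos", "es"))
# prints Dame tres besos
```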

Intended Uses & Limitations

This dataset is intended solely for non-commercial research purposes, such as lyrics translation, music-lyrics alignment, music information retrieval, and multimodal language model development.

IMPORTANT LIMITATIONS:

Users must independently ensure their use of the content linked via URLs complies with copyright law and YouTube's Terms of Service.
The dataset provides only URLs and research-generated metadata/annotations; it does not grant any rights to the underlying copyrighted content.
YouTube links and external lyrics links may become invalid over time.
The dataset primarily focuses on animated musicals and may not generalize to all musical genres or styles.
The current lyric representation requires external processing to reconstruct full lyric text.

License

This dataset is distributed under the CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International) license (see the LICENSE file for full details). Key points include:

Attribution (BY): You must give appropriate credit to the original creators (Woohyun Cho/MIRLAB).
NonCommercial (NC): You may not use the dataset for commercial purposes.
Underlying Content Ownership: It is crucial to understand that this dataset does not grant any rights to the copyrighted songs/videos linked via YouTube URLs or the full lyric texts obtained from external sources. Users are solely responsible for ensuring their use of this content complies with applicable copyright laws and YouTube's Terms of Service.
Dataset Compilation & Annotations: Permission is granted to use, copy, and modify the URL compilation and any original annotations (such as timestamps, syllable counts, and IPA transcriptions) for non-commercial research, provided attribution is given to the dataset creators (Woohyun Cho/MIRLAB) and the license terms are followed. This metadata, compiled as part of our research, can be freely adapted for non-commercial purposes with proper acknowledgment.
NO WARRANTY: The dataset is provided "AS IS" without warranty. Links may become broken over time.
Liability: The dataset creators are not liable for any issues arising from the use of the dataset or the linked content.

Citation

If you use the MAVL dataset in your research, please cite our paper:

@misc{cho2025mavlmultilingualaudiovideolyrics,
      title={MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation},
      author={Woohyun Cho and Youngmin Kim and Sunghyun Lee and Youngjae Yu},
      year={2025},
      eprint={2505.18614},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.18614},
}
