
Lhotse Manifest Preparation Stuck and Incomplete for MLS English Train Set #1403

Open
mubtasimahasan opened this issue Oct 20, 2024 · 2 comments


@mubtasimahasan

I am attempting to prepare the Multilingual LibriSpeech (MLS) dataset using the lhotse.recipes.mls recipe:

lhotse prepare mls $corpus_dir $output_dir --flac --num-jobs 40
  • $corpus_dir contains only the mls_english directory.
  • $output_dir is the directory where I expect the output manifests to be saved.
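For reference, the equivalent Python-API call would look roughly like the sketch below; the exact parameter names (opus in particular) are my assumption based on the CLI's --flac/--opus flags rather than something verified against the source.

from lhotse.recipes.mls import prepare_mls

# Assumed to mirror the CLI invocation above; opus=False stands in for --flac.
manifests = prepare_mls(
    corpus_dir="/path/to/corpus",      # directory containing only mls_english/
    output_dir="/path/to/manifests",
    opus=False,
    num_jobs=40,
)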

After running this command for more than 72 hours, the process seems to be stuck. I can see the following files in the $output_dir:

  • mls-english_recordings_dev.jsonl.gz
  • mls-english_recordings_test.jsonl.gz
  • mls-english_supervisions_dev.jsonl.gz
  • mls-english_supervisions_test.jsonl.gz

However, the following files are missing:

  • mls-english_recordings_train.jsonl.gz
  • mls-english_supervisions_train.jsonl.gz

The output log shows the process stuck at:

Scanning audio files (*.flac): 10807259it [15:50, 7377.79it/s]

The log has shown this same line since early in the run, and there doesn't seem to be any further progress.

Questions:

  1. How can I resolve this issue?
    The command appears to be hanging when scanning the train set. Could this be a bug or an issue with handling large datasets?

  2. Is my use of an HDD causing slow processing?
    I am using an HDD for storage, and the train set of the mls_english subset is 2.4 TB in size. Could the HDD's performance be causing the extreme slowness?

  3. Is there a way to speed up manifest preparation for large datasets?
    Are there optimizations or alternative approaches I could try to handle the manifest preparation more efficiently for such a large dataset?

Any guidance on these issues would be greatly appreciated! Thank you for your help.

@pzelasko
Collaborator

The MLS recipe was the first one we added for very large datasets, and it's implemented less efficiently than the others. You'd need to modify it to use incremental manifest writers so it avoids blowing up CPU memory. See how it's done in the GigaSpeech recipe, for example:

from lhotse import CutSet, RecordingSet, SupervisionSet

# Each writer streams manifest entries to a .jsonl.gz file as they are created,
# instead of keeping the whole manifest in memory.
with RecordingSet.open_writer(
    output_dir / f"gigaspeech_recordings_{part}.jsonl.gz"
) as rec_writer, SupervisionSet.open_writer(
    output_dir / f"gigaspeech_supervisions_{part}.jsonl.gz"
) as sup_writer, CutSet.open_writer(
    output_dir / f"gigaspeech_cuts_{part}.jsonl.gz"
) as cut_writer:
    ...
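For MLS specifically, a rough sketch of that incremental-writer approach might look like the following; the train/audio/<speaker>/<book>/*.flac layout and the transcripts.txt format here are assumptions based on the published MLS structure, not the recipe's actual code:

from pathlib import Path

from lhotse import Recording, RecordingSet, SupervisionSegment, SupervisionSet

corpus_dir = Path("mls_english")   # hypothetical path to the extracted corpus
output_dir = Path("manifests")
part = "train"

# transcripts.txt is assumed to hold one "<utt_id>\t<text>" pair per line.
# A dict of ~10.8M transcripts fits in RAM far more easily than full manifest objects.
with open(corpus_dir / part / "transcripts.txt") as f:
    texts = dict(line.rstrip("\n").split("\t", maxsplit=1) for line in f)

# Stream each entry straight to .jsonl.gz instead of accumulating millions of
# Recording/SupervisionSegment objects in memory before writing.
with RecordingSet.open_writer(
    output_dir / f"mls-english_recordings_{part}.jsonl.gz"
) as rec_writer, SupervisionSet.open_writer(
    output_dir / f"mls-english_supervisions_{part}.jsonl.gz"
) as sup_writer:
    for flac in (corpus_dir / part / "audio").rglob("*.flac"):
        utt_id = flac.stem                      # e.g. "10214_10108_000000"
        recording = Recording.from_file(flac, recording_id=utt_id)
        rec_writer.write(recording)
        sup_writer.write(
            SupervisionSegment(
                id=utt_id,
                recording_id=utt_id,
                start=0.0,
                duration=recording.duration,
                text=texts.get(utt_id),
                speaker=utt_id.split("_")[0],
                language="English",
            )
        )

This version is single-process; in practice you'd still want to parallelize the Recording.from_file calls (each one opens the audio header to read the duration, which is the slow part on an HDD), while only the writing needs to stay sequential.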

@pzelasko
Collaborator

But yeah, generally expect it to take a while, as English MLS is quite sizeable. It may be possible to implement it differently to accommodate distributed compute environments and speed things up, e.g., process one directory per worker and write to manifest chunks instead of a single manifest. A sketch follows below.
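Here is a minimal sketch of that sharded approach; the speaker-level directory layout, chunk naming, and worker count are assumptions, not part of the current recipe:

from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

from lhotse import Recording, RecordingSet

corpus_dir = Path("mls_english")   # hypothetical path to the extracted corpus
output_dir = Path("manifests")

def scan_speaker_dir(speaker_dir: Path) -> Path:
    """Scan one speaker directory and write its own manifest chunk."""
    out = output_dir / f"mls-english_recordings_train.{speaker_dir.name}.jsonl.gz"
    with RecordingSet.open_writer(out) as writer:
        for flac in sorted(speaker_dir.rglob("*.flac")):
            writer.write(Recording.from_file(flac, recording_id=flac.stem))
    return out

if __name__ == "__main__":
    output_dir.mkdir(parents=True, exist_ok=True)
    speaker_dirs = sorted(
        p for p in (corpus_dir / "train" / "audio").iterdir() if p.is_dir()
    )
    # One speaker directory per task; this relies on fork-based multiprocessing
    # (the Linux default) so the module-level paths are visible in the workers.
    with ProcessPoolExecutor(max_workers=16) as ex:
        chunks = list(ex.map(scan_speaker_dir, speaker_dirs))
    print(f"Wrote {len(chunks)} manifest chunks")

Supervisions could be written the same way from transcripts.txt, and the per-speaker .jsonl.gz chunks can afterwards be concatenated (gzip streams concatenate cleanly) or simply loaded together without merging.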
