
Lhotse Manifest Preparation Stuck and Incomplete for MLS English Train Set #1403

Open
mubtasimahasan opened this issue Oct 20, 2024 · 2 comments


@mubtasimahasan

I am attempting to prepare the Multilingual LibriSpeech (MLS) dataset using the lhotse.recipes.mls recipe:

lhotse prepare mls $corpus_dir $output_dir --flac --num-jobs 40
  • $corpus_dir contains only the mls_english directory.
  • $output_dir is the directory where I expect the output manifests to be saved.
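For reference, the equivalent Python-API call would look roughly like the sketch below; the exact parameter names (opus in particular) are my assumption based on the CLI's --flac/--opus flags rather than something verified against the source.

from lhotse.recipes.mls import prepare_mls

# Assumed to mirror the CLI invocation above; opus=False stands in for --flac.
manifests = prepare_mls(
    corpus_dir="/path/to/corpus",      # directory containing only mls_english/
    output_dir="/path/to/manifests",
    opus=False,
    num_jobs=40,
)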

After running this command for more than 72 hours, the process seems to be stuck. I can see the following files in the $output_dir:

  • mls-english_recordings_dev.jsonl.gz
  • mls-english_recordings_test.jsonl.gz
  • mls-english_supervisions_dev.jsonl.gz
  • mls-english_supervisions_test.jsonl.gz

However, the following files are missing:

  • mls-english_recordings_train.jsonl.gz
  • mls-english_supervisions_train.jsonl.gz

The output log shows the process stuck at:

Scanning audio files (*.flac): 10807259it [15:50, 7377.79it/s]

The log has shown this same line since early in the run, and there doesn't seem to be any further progress.

Questions:

  1. How can I resolve this issue?
    The command appears to be hanging when scanning the train set. Could this be a bug or an issue with handling large datasets?

  2. Is my use of an HDD causing slow processing?
    I am using an HDD for storage, and the train set of the mls_english subset is 2.4 TB in size. Could the HDD's performance be causing the extreme slowness?

  3. Is there a way to speed up manifest preparation for large datasets?
    Are there optimizations or alternative approaches I could try to handle the manifest preparation more efficiently for such a large dataset?

Any guidance on these issues would be greatly appreciated! Thank you for your help.

@pzelasko
Collaborator

The MLS recipe was the first one we added for very large datasets, and it's implemented less efficiently than the others. You'd need to modify it to use incremental manifest writers so it avoids blowing up CPU memory. See how it's done in the GigaSpeech recipe, for example:

from lhotse import CutSet, RecordingSet, SupervisionSet

# Each writer streams manifest entries to a .jsonl.gz file as they are created,
# instead of keeping the whole manifest in memory.
with RecordingSet.open_writer(
    output_dir / f"gigaspeech_recordings_{part}.jsonl.gz"
) as rec_writer, SupervisionSet.open_writer(
    output_dir / f"gigaspeech_supervisions_{part}.jsonl.gz"
) as sup_writer, CutSet.open_writer(
    output_dir / f"gigaspeech_cuts_{part}.jsonl.gz"
) as cut_writer:
    ...
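For MLS specifically, a rough sketch of that incremental-writer approach might look like the following; the train/audio/<speaker>/<book>/*.flac layout and the transcripts.txt format here are assumptions based on the published MLS structure, not the recipe's actual code:

from pathlib import Path

from lhotse import Recording, RecordingSet, SupervisionSegment, SupervisionSet

corpus_dir = Path("mls_english")   # hypothetical path to the extracted corpus
output_dir = Path("manifests")
part = "train"

# transcripts.txt is assumed to hold one "<utt_id>\t<text>" pair per line.
# A dict of ~10.8M transcripts fits in RAM far more easily than full manifest objects.
with open(corpus_dir / part / "transcripts.txt") as f:
    texts = dict(line.rstrip("\n").split("\t", maxsplit=1) for line in f)

# Stream each entry straight to .jsonl.gz instead of accumulating millions of
# Recording/SupervisionSegment objects in memory before writing.
with RecordingSet.open_writer(
    output_dir / f"mls-english_recordings_{part}.jsonl.gz"
) as rec_writer, SupervisionSet.open_writer(
    output_dir / f"mls-english_supervisions_{part}.jsonl.gz"
) as sup_writer:
    for flac in (corpus_dir / part / "audio").rglob("*.flac"):
        utt_id = flac.stem                      # e.g. "10214_10108_000000"
        recording = Recording.from_file(flac, recording_id=utt_id)
        rec_writer.write(recording)
        sup_writer.write(
            SupervisionSegment(
                id=utt_id,
                recording_id=utt_id,
                start=0.0,
                duration=recording.duration,
                text=texts.get(utt_id),
                speaker=utt_id.split("_")[0],
                language="English",
            )
        )

This version is single-process; in practice you'd still want to parallelize the Recording.from_file calls (each one opens the audio header to read the duration, which is the slow part on an HDD), while only the writing needs to stay sequential.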

@pzelasko
Collaborator

But yeah, generally expect it to take a while, as English MLS is quite sizeable. It may be possible to implement it differently to accommodate distributed compute environments and speed things up, e.g., process one directory per worker and write to manifest chunks instead of a single manifest. A sketch follows below.
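Here is a minimal sketch of that sharded approach; the speaker-level directory layout, chunk naming, and worker count are assumptions, not part of the current recipe:

from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

from lhotse import Recording, RecordingSet

corpus_dir = Path("mls_english")   # hypothetical path to the extracted corpus
output_dir = Path("manifests")

def scan_speaker_dir(speaker_dir: Path) -> Path:
    """Scan one speaker directory and write its own manifest chunk."""
    out = output_dir / f"mls-english_recordings_train.{speaker_dir.name}.jsonl.gz"
    with RecordingSet.open_writer(out) as writer:
        for flac in sorted(speaker_dir.rglob("*.flac")):
            writer.write(Recording.from_file(flac, recording_id=flac.stem))
    return out

if __name__ == "__main__":
    output_dir.mkdir(parents=True, exist_ok=True)
    speaker_dirs = sorted(
        p for p in (corpus_dir / "train" / "audio").iterdir() if p.is_dir()
    )
    # One speaker directory per task; this relies on fork-based multiprocessing
    # (the Linux default) so the module-level paths are visible in the workers.
    with ProcessPoolExecutor(max_workers=16) as ex:
        chunks = list(ex.map(scan_speaker_dir, speaker_dirs))
    print(f"Wrote {len(chunks)} manifest chunks")

Supervisions could be written the same way from transcripts.txt, and the per-speaker .jsonl.gz chunks can afterwards be concatenated (gzip streams concatenate cleanly) or simply loaded together without merging.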
