I am attempting to prepare the Multilingual LibriSpeech (MLS) dataset using the lhotse.recipes.mls recipe. $corpus_dir contains only the mls_english directory, and $output_dir is the directory where I expect the output manifests to be saved.
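The preparation call looks roughly like this (a sketch of the Python API; the exact arguments I used may differ):

```python
# Sketch of preparing MLS with lhotse.recipes.mls; the two paths stand in for
# $corpus_dir and $output_dir above, and no other options are shown.
from lhotse.recipes import prepare_mls

prepare_mls(
    corpus_dir="/path/to/corpus_dir",  # contains only the mls_english directory
    output_dir="/path/to/output_dir",  # where the *.jsonl.gz manifests are written
)
```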
After running this command for more than 72 hours, the process seems to be stuck. I can see the following files in $output_dir:
mls-english_recordings_dev.jsonl.gz
mls-english_recordings_test.jsonl.gz
mls-english_supervisions_dev.jsonl.gz
mls-english_supervisions_test.jsonl.gz
However, the following files are missing:
mls-english_recordings_train.jsonl.gz
mls-english_supervisions_train.jsonl.gz
The output log shows the process stuck at:
This has been the status since the very beginning, and there doesn't seem to be any further progress.
Questions:
How can I resolve this issue?
The command appears to be hanging when scanning the train set. Could this be a bug or an issue with handling large datasets?
Is my use of an HDD causing slow processing?
I am using an HDD for storage, and the train set of the mls_english subset is 2.4 TB in size. Could the HDD's performance be causing the extreme slowness?
Is there a way to speed up manifest preparation for large datasets?
Are there optimizations or alternative approaches I could try to handle the manifest preparation more efficiently for such a large dataset?
Any guidance on these issues would be greatly appreciated! Thank you for your help.
Reply:
The MLS recipe was the first one we added for very large datasets, and it's implemented less efficiently than the others. You'd need to modify it to use incremental manifest writers so it avoids blowing up CPU memory; see how it's done in the GigaSpeech recipe, for example.
But generally, expect it to take a while, as English MLS is quite sizeable. It may be possible to implement it differently to accommodate distributed compute environments and speed things up, e.g., by processing one directory per worker and writing to manifest chunks instead of a single manifest. A rough sketch of these ideas follows.
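A minimal sketch of the incremental-writer idea, processing one speaker directory per worker. It assumes the usual MLS layout (mls_english/train/audio/&lt;speaker&gt;/&lt;book&gt;/*.opus plus a tab-separated transcripts.txt per split) and that your lhotse version provides RecordingSet.open_writer / SupervisionSet.open_writer, the sequential jsonl.gz writers used by recipes such as GigaSpeech; the function names here are illustrative, not the recipe's actual code:

```python
# Illustrative sketch (not the lhotse MLS recipe): scan the MLS train split one
# speaker directory per worker and write manifests incrementally, so CPU memory
# stays flat even for a 2.4 TB split.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
from typing import List

from lhotse import Recording, RecordingSet, SupervisionSegment, SupervisionSet


def scan_speaker_dir(speaker_dir: Path) -> List[Recording]:
    # One worker scans one speaker directory; Recording.from_file reads the
    # audio header to fill in duration, sampling rate, etc.
    return [Recording.from_file(p) for p in sorted(speaker_dir.rglob("*.opus"))]


def prepare_mls_train_incremental(corpus_dir: Path, output_dir: Path, num_jobs: int = 8) -> None:
    split_dir = corpus_dir / "mls_english" / "train"

    # Assumed transcript format: "<utterance_id>\t<text>" per line.
    texts = {}
    with open(split_dir / "transcripts.txt") as f:
        for line in f:
            utt_id, text = line.rstrip("\n").split("\t", maxsplit=1)
            texts[utt_id] = text

    speaker_dirs = sorted(d for d in (split_dir / "audio").iterdir() if d.is_dir())

    # Incremental writers append each item to the .jsonl.gz file as soon as it is
    # created, instead of accumulating the whole manifest in memory first.
    with RecordingSet.open_writer(output_dir / "mls-english_recordings_train.jsonl.gz") as rec_writer, \
         SupervisionSet.open_writer(output_dir / "mls-english_supervisions_train.jsonl.gz") as sup_writer:
        with ProcessPoolExecutor(num_jobs) as ex:
            for recordings in ex.map(scan_speaker_dir, speaker_dirs):
                for recording in recordings:
                    rec_writer.write(recording)
                    sup_writer.write(
                        SupervisionSegment(
                            id=recording.id,
                            recording_id=recording.id,
                            start=0.0,
                            duration=recording.duration,
                            # MLS ids look like <speaker>_<book>_<index>.
                            speaker=recording.id.split("_")[0],
                            text=texts.get(recording.id, ""),
                            language="English",
                        )
                    )
```

In a distributed setup, each worker could instead open its own writers on per-chunk files (e.g., mls-english_recordings_train.&lt;worker_id&gt;.jsonl.gz) and the chunks could be combined, or loaded together lazily, afterwards.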