Diarization too slow #274

Open
MitPitt opened this issue May 25, 2023 · 19 comments

Comments

@MitPitt

MitPitt commented May 25, 2023

1 hour 30 minutes of audio has been processing for over 1 hour in the diarization... stage. I'm using an RTX 3090.

I'm guessing --batch_size doesn't affect pyannote. A setting for pyannote's batch size would be very nice to have.

@jzeller2011

I'm having the same issue. From what I'm reading, the pyannote/speaker-diarization model is slow, but word-level segmentation may be slowing it down even more. I assume some factors matter more than others (I think the number of speakers or the number of segments influences this the most, but that's just a guess). Looking at hardware usage during runtime, it looks like it's batching either one segment at a time or one word at a time (this would make sense, since we're chasing word-level timestamps with whisperx). The pyannote model reports a 2.5% real-time factor, which has definitely NOT been my experience, but that may be the case if you ran the raw audio through without segmentation. Maybe there's a way to count individual calls to the GPU to verify. I haven't found a workaround yet, let me know if you find something out.

@moritzbrantner
Contributor

I have the same issue.

@DigilConfianz

#159 (comment)

@m-bain
Owner

m-bain commented May 26, 2023

1 hour 30 minutes of audio has been processing for over 1 hour in the diarization... stage. I'm using an RTX 3090.

That's very strange, it should not be that long, I would expect 5-10mins max. I suspect some bug here.

I'm guessing --batch_size doesn't affect pyannote. A setting for pyannote's batch size would be very nice to have.

I would assume most of the time is spent in the clustering step, which can be recursive and can take a long time if it's not finding satisfactory cluster sizes.

From what i'm reading, the pyannote/speaker-diarization model is slow, but word-level segmentation may be slowing it down even more.

Nah, the ASR and word-level segmentation are run independently of the diarization. The diarization is just running a standard pyannote pipeline, so word-level segmentation / whisperx batching shouldn't affect this.
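
As a minimal sketch (assuming the README-style Python API, with "audio.mp3" and YOUR_HF_TOKEN as placeholders), you can time each stage separately to confirm where the time actually goes:

import time
import whisperx

device = "cuda"
hf_token = "YOUR_HF_TOKEN"  # placeholder Hugging Face token for the pyannote models
audio = whisperx.load_audio("audio.mp3")  # placeholder path

t0 = time.time()
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)
print(f"transcribe: {time.time() - t0:.1f}s")

t0 = time.time()
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)
print(f"align: {time.time() - t0:.1f}s")

t0 = time.time()
diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
print(f"diarize: {time.time() - t0:.1f}s")

If the diarize step dominates, the bottleneck is inside the pyannote pipeline (segmentation, embedding, or clustering) rather than in the whisperx transcription or alignment.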

@geoglrb

geoglrb commented May 30, 2023

@m-bain I'm also having extremely slow diarization. Using CLI.

Just now, to explore further, I also tried setting the --threads parameter to 50 to see if that would do something (I would prefer GPU!), and it is now making use of a variable number of threads, but well above four, which is what it had seemed to be limited to by default. There is still some GPU memory allocated even in the diarization stage, but not a ton. Very naive question: could things be slow because all of us have pyannote using the CPU for some reason? Is there a way to specify that whisperx's pyannote must use the GPU?

For reference, in case it helps:

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
2
>>> torch.version.cuda
'11.7'

@sorgfresser
Contributor

There is an issue regarding pyannote not using the GPU, but it should not occur with whisperx. To read more on this, see pyannote/pyannote-audio#1354.
It might have something to do with the device index though. Are both of your GPUs the same size? We're currently not passing device_index to the diarization, so we simply call to('cuda') when loading the diarization model. This might be a problem when multiple GPUs are available.
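
A minimal sketch of forcing diarization onto a specific GPU, assuming your pyannote.audio version supports Pipeline.to() (this is the workaround discussed in pyannote/pyannote-audio#1354; the token and paths are placeholders):

import torch
from pyannote.audio import Pipeline

# Load the diarization pipeline directly and move it to an explicit GPU index.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder Hugging Face token
)
pipeline.to(torch.device("cuda:1"))  # pin to the second GPU explicitly

diarization = pipeline("audio.wav")  # placeholder path

If GPU utilization stays near zero after this, the pipeline is most likely still running on the CPU.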

@goneill

goneill commented Jun 7, 2023

I am also having an extremely long (i.e., overnight) diarization on the command line. The transcription completes, I get two failures in the alignment stage, then diarization starts, and I get the following errors:

Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.2. To apply the upgrade to your files permanently, run python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin
Model was trained with pyannote.audio 0.0.1, yours is 2.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1. Bad things might happen unless you revert torch to 1.x.

I left it running overnight and it was still in the same state.

@davidas1
Contributor

davidas1 commented Aug 1, 2023

Please try my suggestion in #399 and see if it helps you too.
I'm getting around 30 sec for diarization of a 30-minute video using the standard embedding model in the pyannote/speaker-diarization pipeline (speechbrain/spkrec-ecapa-voxceleb), and around 15 sec if I change the embedding model to pyannote/embedding.
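
A rough sketch of the idea behind #399, assuming the patched DiarizationPipeline accepts the pre-loaded waveform from whisperx.load_audio rather than only a file path:

import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.mp3")  # decode once to a 16 kHz mono array (placeholder path)

diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)

# Passing the already-decoded waveform instead of the path avoids pyannote
# re-reading and resampling the file internally.
diarize_segments = diarize_model(audio)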

@DigilConfianz

@davidas1 There is a speed improvement when passing the whisper-loaded audio instead of the raw audio file, as you suggested. Thanks for that. How do I change the embedding model in code?

@davidas1
Contributor

davidas1 commented Aug 1, 2023

Changing the pyannote pipeline is a bit more involved - I'm using an offline pipeline as described in https://github.com/pyannote/pyannote-audio/blob/develop/tutorials/applying_a_pipeline.ipynb
I had to patch whisperx a bit to allow working with a custom local pipeline.
Using this method you can customize the pipeline by editing the config.yaml (change the "embedding" configuration to the desired model).
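
A minimal sketch of the offline-pipeline approach, assuming the config layout from the pyannote tutorial (the local path is a placeholder and the exact config keys may differ between pipeline versions):

from pyannote.audio import Pipeline

# Point from_pretrained() at a locally downloaded config.yaml instead of the
# Hugging Face model id. In that config.yaml, an entry like
#   embedding: speechbrain/spkrec-ecapa-voxceleb
# can be changed to
#   embedding: pyannote/embedding
# to swap the embedding model.
pipeline = Pipeline.from_pretrained("path/to/local/config.yaml")  # placeholder path

diarization = pipeline("audio.wav")  # placeholder path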

@datacurse

Please try my suggestion in #399 and see if it helps you too. I'm getting around 30sec for diarization of 30 minute video using the standard model in the pyannote/speaker-diarization pipeline (speechbrain/spkrec-ecapa-voxceleb), and around 15sec if I change the embedding model to pyannote/embedding

What??? That's crazy! Here are my timings for a 30-minute-long mp3:
transcribe time: 69 seconds
align time: 10 seconds
diarization: 24 seconds
Around 100 seconds in total, like 3 times longer than yours, and that's excluding the initial model loading.

Could you please suggest something like a checklist for speeding things up? I also updated to get your recent patch and it did speed up my diarization dramatically.

@davidas1
Contributor

davidas1 commented Aug 3, 2023

I wrote that diarization takes 30 sec, not the entire pipeline - before the change, the diarization took almost 2 minutes.
Your timing looks great, other than the transcribe step, which is faster on my setup, but that's probably due to the GPU you're using.

@datacurse

Oooh, I see, that clears things up. I've got a 4090 though.

@dantheman0207

I'm looking for some help or insight into why diarization is so slow for me.

I have a recording that is 1 minute and 14 seconds long with two native English speakers, and diarization takes 11 minutes and 49 seconds (transcription took 6 seconds). I'm running on a Mac mini with an M2 chip and 8GB of RAM. I assume in this case it's running on the CPU, although I'm not sure with Apple silicon. I'm basically using the default example from the README for transcribing and diarizing a file.

With a longer file (27 minutes and 39 seconds) with multiple speakers, it takes 2 minutes and 47 seconds to transcribe and 1 minute and 6 seconds to align, but 12 hours and 48 minutes to diarize!

@awhillas

Same here. I'm getting 2-3% GPU utilization and 0.9 GB of GPU memory usage.

@SergeiKarulin

Same issue. Almost no GPU utilization and 1.5 hours of diarization per 60 minutes of audio.

@eplinux

eplinux commented Apr 11, 2024

same issue. Almost no GPU utilization and 1.5 hour of diarization per 60 minutes audio.

same here

@eplinux

eplinux commented Apr 16, 2024

I also noticed that there seems to be some throttling affecting GPU utilization on Windows 11. As soon as the terminal window is in the background, the GPU utilization drops dramatically.

@prkumar112451

@m-bain Diarization is a key requirement when multiple speakers are having a conversation. I've been exploring different ways to speed up the transcription & diarization pipeline.

I can see lots of different options for speeding up transcription, like CTranslate2, batching, Flash Attention, Distil-Whisper, and compute type (float32/float16),

but I'm finding very limited options for diarization speedup.

For a 20-minute audio file, with optimizations we are able to get transcriptions in around 35 seconds.
But diarizing a 20-minute audio file takes roughly 1 minute via NeMo and around 45 seconds via pyannote.

Could you please share any direction we could follow to speed up the diarization process?
