Identify noisy embeddings with standard deviation or range of vector to improve diarization? #1011

Open
jdeast opened this issue Jan 22, 2025 · 1 comment

jdeast commented Jan 22, 2025

I'm using a modified version of WhisperX that just returns the speaker embeddings from pyannote.audio so I can match speakers across multiple videos (similar to PR 997).

I've noticed a recurring bad speaker label that lumps together several kinds of failure: undetected speaker changes, unmic'ed speakers, Zoom-distorted audio, one-word exchanges, etc. The failure mode differs widely across videos, and even within a video, yet the cosine similarity between the embedding of that failed speaker and those of other failed speakers is high. I'm guessing these are all just different flavors of noisy/uncertain data, but I was surprised that noise could match other noise in such a high fraction (14/24) of videos.

Doing some simple stats on the embedding arrays, the garbage ones appear to have a large minimum, a small maximum, and therefore an especially tight range (max - min); their mean is very near zero and their standard deviation is small. I think the range and/or the standard deviation are signatures that could probably be used to identify and filter out noise prior to diarization to improve its quality.
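To make the statistics above concrete, here is a minimal sketch with NumPy. It assumes the embeddings are available as a 2-D array of shape `(n_speakers, dim)`; the function names and the `std_ratio` cutoff are hypothetical and would need tuning on real data.

```python
import numpy as np

def embedding_stats(embeddings):
    """Per-embedding summary statistics for an (n_speakers, dim) array."""
    emb = np.asarray(embeddings, dtype=float)
    lo = emb.min(axis=1)
    hi = emb.max(axis=1)
    return {
        "min": lo,
        "max": hi,
        "range": hi - lo,          # max - min per embedding
        "mean": emb.mean(axis=1),
        "std": emb.std(axis=1),
    }

def flag_noisy(embeddings, std_ratio=0.5):
    """Flag embeddings whose standard deviation is well below the median.

    std_ratio is an illustrative threshold, not a calibrated value.
    """
    std = embedding_stats(embeddings)["std"]
    return std < std_ratio * np.median(std)
```

A relative cutoff (a fraction of the median) is used here only so the sketch does not depend on the absolute scale of the embeddings; an absolute threshold on the range or standard deviation may work just as well.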

I have 14 different videos that each have a speaker matching the same failed speaker, likely because their embedding vectors all occupy a small volume around the origin.

SPEAKER_43 here:
https://medford-transcripts.github.io/2024-10-05_3gvhm0AovZU/2024-10-05_3gvhm0AovZU.html

It matches (cosine similarity > 0.7) a garbage speaker in 13 of the 23 other videos (search each page for "3gvhm0AovZU_SPEAKER_43"):

https://medford-transcripts.github.io/2018-03-20_3N-X2ResFqI/2018-03-20_3N-X2ResFqI.html
https://medford-transcripts.github.io/2020-08-04_n50NtLaAUqY/2020-08-04_n50NtLaAUqY.html
https://medford-transcripts.github.io/2021-05-20_fvIk50DtTTc/2021-05-20_fvIk50DtTTc.html
https://medford-transcripts.github.io/2021-06-07_tZVXN6zzUHw/2021-06-07_tZVXN6zzUHw.html
https://medford-transcripts.github.io/2022-11-16_Azob8X18NRY/2022-11-16_Azob8X18NRY.html
https://medford-transcripts.github.io/2022-11-21_7D6c0Dkkm94/2022-11-21_7D6c0Dkkm94.html
https://medford-transcripts.github.io/2023-01-05_3oP-OTu9DFs/2023-01-05_3oP-OTu9DFs.html
https://medford-transcripts.github.io/2023-02-04_mBOS9fhmkww/2023-02-04_mBOS9fhmkww.html
https://medford-transcripts.github.io/2023-02-06_BSAjmRA8UYk/2023-02-06_BSAjmRA8UYk.html
https://medford-transcripts.github.io/2023-03-02_goLe37yQgNQ/2023-03-02_goLe37yQgNQ.html
https://medford-transcripts.github.io/2023-04-26_Y0_Ezb06bvc/2023-04-26_Y0_Ezb06bvc.html
https://medford-transcripts.github.io/2023-05-23_-GdGrA4wKuQ/2023-05-23_-GdGrA4wKuQ.html
https://medford-transcripts.github.io/2025-01-20_IByfBf6FgY8/2025-01-20_IByfBf6FgY8.html
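
The matching rule used for the list above (cosine similarity > 0.7) can be sketched as follows. This is an illustration, not the actual matching code: the function names are hypothetical, and it assumes each video's speakers are held in a dict mapping a label to a 1-D embedding vector.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors (0.0 if either is all zeros)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)

def matching_speakers(reference, candidates, threshold=0.7):
    """Return labels whose embedding exceeds the similarity threshold.

    candidates: dict of label -> embedding vector (assumed layout).
    """
    return [
        label
        for label, vec in candidates.items()
        if cosine_similarity(reference, vec) > threshold
    ]
```

Because cosine similarity ignores vector length, magnitude-based statistics such as range or standard deviation would need to be checked separately from this match.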

You can replace the HTML file name in the URL with "model.pkl" (e.g., https://medford-transcripts.github.io/2024-10-05_3gvhm0AovZU/model.pkl) for the transcribed, aligned, and diarized result, and with "embeddings.pkl" (e.g., https://medford-transcripts.github.io/2024-10-05_3gvhm0AovZU/embeddings.pkl) for the speaker embeddings. The HTML file has timestamped links to the underlying YouTube video.

Does this belong in the pyannote.audio issues?

jdeast commented Jan 26, 2025

It's also worth noting that, outside of these noisy embeddings, the diarization is actually quite good (near perfect).

It would be great to fix them, but even just flagging them as suspect or overwriting them as unknown would be a huge improvement.
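
As a rough sketch of the "flag or overwrite" idea, assuming the diarized result is a list of segment dicts that each carry a `"speaker"` key (an assumption about the data layout; the function name and the extra keys are hypothetical):

```python
def mark_suspect_speakers(segments, suspect_labels, replacement="UNKNOWN"):
    """Return a copy of segments with suspect speakers relabeled and flagged.

    The input list and its dicts are left unmodified.
    """
    suspect = set(suspect_labels)
    marked = []
    for seg in segments:
        seg = dict(seg)  # shallow copy so the caller's data is untouched
        if seg.get("speaker") in suspect:
            seg["original_speaker"] = seg["speaker"]  # keep the old label
            seg["speaker"] = replacement
            seg["suspect"] = True
        marked.append(seg)
    return marked
```

Keeping the original label alongside the replacement means the relabeling can be reviewed or undone later.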
