I'm using a modified version of WhisperX that just returns the speaker embeddings from pyannote.audio so I can match speakers across multiple videos (similar to PR 997).
I've noticed a recurring bad speaker that is a combination of failures: undetected speaker changes, un-mic'ed speakers, Zoom-distorted audio, one-word exchanges, etc. The failures differ a lot across videos, and even within a video, yet the cosine similarity between the embedding for that failed speaker and those of other failed speakers is high. I'm guessing these are all just different flavors of noisy/uncertain data, but I was surprised that noise could match other noise in such a high fraction (14/24) of videos.
Doing some simple stats on the embedding arrays, the garbage ones appear to have a large minimum, a small maximum, an especially tight range (max - min), a mean very near zero, and a small standard deviation. I think the range and/or the standard deviation are signatures that could probably be used to identify and filter out noise prior to diarization to improve its quality.
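To make the proposed check concrete, here is a minimal sketch of the per-embedding statistics described above. It assumes the embeddings are available as a 2-D NumPy array with one row per speaker; the function names and the `std_threshold` value are hypothetical and would need tuning on real data. This is not how pyannote.audio or WhisperX filter anything today.

```python
import numpy as np

def embedding_stats(embeddings):
    """Per-speaker summary statistics for an (n_speakers, dim) array."""
    emb = np.asarray(embeddings, dtype=float)
    emb_min = emb.min(axis=1)
    emb_max = emb.max(axis=1)
    return {
        "min": emb_min,
        "max": emb_max,
        "range": emb_max - emb_min,  # max - min of each embedding
        "mean": emb.mean(axis=1),
        "std": emb.std(axis=1),
    }

def flag_suspect(embeddings, std_threshold):
    """Boolean mask: True where the standard deviation is below the
    (hypothetical, data-dependent) threshold."""
    return embedding_stats(embeddings)["std"] < std_threshold
```

Usage would look something like `mask = flag_suspect(embeddings, std_threshold=0.05)`, where 0.05 is only a placeholder; the same pattern works with `"range"` in place of `"std"`.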
I have 14 different videos that all have a speaker that matches to the same failed speaker, likely because their embedding vectors just occupy a small volume around the origin.
SPEAKER_43 here:
https://medford-transcripts.github.io/2024-10-05_3gvhm0AovZU/2024-10-05_3gvhm0AovZU.html
It matches (cosine similarity > 0.7) a garbage speaker in 13 of the 23 other videos (search each page for "3gvhm0AovZU_SPEAKER_43"):
https://medford-transcripts.github.io/2018-03-20_3N-X2ResFqI/2018-03-20_3N-X2ResFqI.html
https://medford-transcripts.github.io/2020-08-04_n50NtLaAUqY/2020-08-04_n50NtLaAUqY.html
https://medford-transcripts.github.io/2021-05-20_fvIk50DtTTc/2021-05-20_fvIk50DtTTc.html
https://medford-transcripts.github.io/2021-06-07_tZVXN6zzUHw/2021-06-07_tZVXN6zzUHw.html
https://medford-transcripts.github.io/2022-11-16_Azob8X18NRY/2022-11-16_Azob8X18NRY.html
https://medford-transcripts.github.io/2022-11-21_7D6c0Dkkm94/2022-11-21_7D6c0Dkkm94.html
https://medford-transcripts.github.io/2023-01-05_3oP-OTu9DFs/2023-01-05_3oP-OTu9DFs.html
https://medford-transcripts.github.io/2023-02-04_mBOS9fhmkww/2023-02-04_mBOS9fhmkww.html
https://medford-transcripts.github.io/2023-02-06_BSAjmRA8UYk/2023-02-06_BSAjmRA8UYk.html
https://medford-transcripts.github.io/2023-03-02_goLe37yQgNQ/2023-03-02_goLe37yQgNQ.html
https://medford-transcripts.github.io/2023-04-26_Y0_Ezb06bvc/2023-04-26_Y0_Ezb06bvc.html
https://medford-transcripts.github.io/2023-05-23_-GdGrA4wKuQ/2023-05-23_-GdGrA4wKuQ.html
https://medford-transcripts.github.io/2025-01-20_IByfBf6FgY8/2025-01-20_IByfBf6FgY8.html
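For reference, the cross-video matching behind the list above can be sketched roughly as follows. This is a simplified illustration, not my actual code: the function names and the label strings are hypothetical, and it assumes one embedding vector per speaker, with the 0.7 threshold mentioned above.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Guard against division by zero for all-zero vectors.
    a_norm = np.clip(np.linalg.norm(a, axis=1, keepdims=True), 1e-12, None)
    b_norm = np.clip(np.linalg.norm(b, axis=1, keepdims=True), 1e-12, None)
    return (a / a_norm) @ (b / b_norm).T

def matching_speakers(query, candidates, labels, threshold=0.7):
    """Return (label, similarity) pairs whose similarity to the query
    embedding exceeds the threshold."""
    query = np.asarray(query, dtype=float)
    sims = cosine_similarity_matrix(query[None, :], candidates)[0]
    return [(label, float(s)) for label, s in zip(labels, sims) if s > threshold]
```

Here `query` would be the embedding for SPEAKER_43 and `candidates` the speaker embeddings from another video. Note that cosine similarity only compares direction, so rescaling an embedding does not change its score.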
You can replace the HTML file in the URL with "model.pkl" (e.g., https://medford-transcripts.github.io/2024-10-05_3gvhm0AovZU/model.pkl) for the transcribed, aligned, and diarized result, and with "embeddings.pkl" (e.g., https://medford-transcripts.github.io/2024-10-05_3gvhm0AovZU/embeddings.pkl) for the speaker embeddings. The HTML file has timestamped links to the underlying YouTube video.
Does this belong in the pyannote.audio issues?