Identify noisy embeddings with standard deviation or range of vector to improve diarization? #1011

jdeast · 2025-01-22T06:19:30Z

I'm using a modified version of WhisperX that just returns the speaker embeddings from pyannote.audio so I can match speakers across multiple videos (similar to PR 997).

I've noticed a recurring bad speaker that comes from a mix of failures: missed speaker changes, speakers without a mic, Zoom-distorted audio, one-word exchanges, etc. The failure mode varies a lot across videos, and even within a single video, yet the cosine similarity between the embedding for that failed speaker and the embeddings of other failed speakers is high. I'm guessing these are all just different flavors of noisy/uncertain data, but I was surprised that noise could match other noise in such a high fraction (14/24) of videos.
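For reference, the cross-video comparison described above can be sketched roughly as follows. This is only an illustration, not the code used for the linked transcripts: it assumes each video's speaker embeddings are available as an `(n_speakers, dim)` array, the function names `cosine_similarity_matrix` and `match_speakers` are made up for this example, and the default threshold of 0.7 is the one quoted later in this post.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Normalize each row to unit length; the clip guards against zero vectors.
    a = a / np.clip(np.linalg.norm(a, axis=1, keepdims=True), 1e-12, None)
    b = b / np.clip(np.linalg.norm(b, axis=1, keepdims=True), 1e-12, None)
    return a @ b.T

def match_speakers(a, b, threshold=0.7):
    """Return (index_in_a, index_in_b, similarity) for pairs above threshold."""
    sim = cosine_similarity_matrix(a, b)
    rows, cols = np.where(sim > threshold)
    return [(int(i), int(j), float(sim[i, j])) for i, j in zip(rows, cols)]
```

Since cosine similarity ignores vector length, any pair flagged here matches in direction only; the magnitude of the embeddings has to be checked separately.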

Doing some simple stats on the embedding arrays, the garbage ones have a large minimum, a small maximum, an especially tight range (max - min), a mean very near zero, and a small standard deviation. I think the range and/or the standard deviation are signatures that could be used to identify and filter out noise prior to diarization to improve its quality.
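A minimal sketch of those statistics, assuming the embeddings are stacked into an `(n_speakers, dim)` NumPy array. The helper names `embedding_stats` and `flag_noisy` are hypothetical, and the cutoff (standard deviation below half the median standard deviation) is an arbitrary illustrative choice, not a tuned or validated threshold.

```python
import numpy as np

def embedding_stats(embeddings):
    """Per-vector summary statistics for an (n_speakers, dim) array."""
    emb = np.asarray(embeddings, dtype=float)
    return {
        "min": emb.min(axis=1),
        "max": emb.max(axis=1),
        "range": emb.max(axis=1) - emb.min(axis=1),
        "mean": emb.mean(axis=1),
        "std": emb.std(axis=1),
    }

def flag_noisy(embeddings, std_ratio=0.5):
    """Flag vectors whose std falls below std_ratio times the median std.

    The rule and the default std_ratio are assumptions for illustration.
    """
    std = embedding_stats(embeddings)["std"]
    return std < std_ratio * np.median(std)
```

The same comparison could be made on `"range"` instead of `"std"`. Using a cutoff relative to the median keeps the rule independent of the absolute scale of a given embedding model, but it only works if most speakers in the array are clean.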

I have 14 different videos that all have a speaker that matches to the same failed speaker, likely because their embedding vectors just occupy a small volume around the origin.

SPEAKER_43 here:
https://medford-transcripts.github.io/2024-10-05_3gvhm0AovZU/2024-10-05_3gvhm0AovZU.html

Matches (cosine similarity > 0.7) garbage in 13/23 other videos (search for "3gvhm0AovZU_SPEAKER_43"):

You can replace the HTML file in the URL with "model.pkl" (e.g., https://medford-transcripts.github.io/2024-10-05_3gvhm0AovZU/model.pkl) for the transcribed, aligned, and diarized result, and with "embeddings.pkl" (e.g., https://medford-transcripts.github.io/2024-10-05_3gvhm0AovZU/embeddings.pkl) for the speaker embeddings. The HTML file has timestamped links to the underlying YouTube video.

Does this belong in the pyannote.audio issues?

jdeast · 2025-01-26T16:04:59Z

It's also worth noting that, outside of these noisy embeddings, the diarization is actually quite good (near perfect).

It would be great to fix them, but even just flagging them as suspect or overwriting them as unknown would be a huge improvement.
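As a rough sketch of the "overwrite as unknown" option, assuming the diarized result is a list of segment dicts that each carry a `"speaker"` key, and that the set of suspect speaker labels has already been determined somehow. The function name `mask_suspect_speakers`, the `"UNKNOWN"` label, and the `"suspect_speaker"` key are all made up for this example and are not part of WhisperX or pyannote.audio.

```python
def mask_suspect_speakers(segments, suspect, label="UNKNOWN"):
    """Return copies of segments with suspect speaker labels overwritten.

    The original label is kept under "suspect_speaker" so the change can
    be reviewed or undone later.
    """
    suspect = set(suspect)
    masked = []
    for seg in segments:
        seg = dict(seg)  # shallow copy, leaves the input untouched
        if seg.get("speaker") in suspect:
            seg["suspect_speaker"] = seg["speaker"]
            seg["speaker"] = label
        masked.append(seg)
    return masked
```

For example, `mask_suspect_speakers(segments, {"SPEAKER_43"})` would relabel only the segments attributed to that speaker and leave the rest as they are.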
