feat(diarizer): expose per-chunk embeddings on DiarizationResult #1
Merged
adamsro merged 1 commit into feat/embedding-skip-strategy-v0.13.4 on May 2, 2026
Add an opt-in `exposeChunkEmbeddings` flag on OfflineDiarizerConfig that surfaces per-chunk speaker embeddings (already computed internally) on DiarizationResult.chunkEmbeddings. Enables chunk-granularity post-processing — e.g. cluster-purity correction via centroid migration — without re-running the embedding model. Default off; no behavior change for existing callers.
cc9f41f into feat/embedding-skip-strategy-v0.13.4
1 of 2 checks passed
Why is this change needed?
Surfaces the per-chunk speaker embeddings the offline pipeline already computes internally (currently only reachable via the `--export-embeddings` JSON dump in the CLI), so consumers can implement chunk-granularity post-processing without re-running the embedding model.

The motivating use case is downstream cluster-purity correction: the clustering step occasionally lands a fraction of one speaker's chunks in another speaker's cluster (mega-cluster contamination on long files, bidirectional smearing between similar voices). With per-chunk embeddings plus their cluster assignments exposed on `DiarizationResult`, downstream code can compute per-cluster centroids and migrate chunks whose cosine similarity to their own cluster's centroid is dominated by that of an alternate centroid. This is pure NumPy/BLAS-equivalent post-processing with no extra model calls.

What changed
- `ChunkEmbedding` struct (`Sendable`, `Codable`) carrying `speakerId`, `chunkIndex`, `speakerIndex`, `startTimeSeconds`, `endTimeSeconds`, `embedding256`, and `rho128`. Speaker IDs follow the same `"S\(cluster + 1)"` convention as `TimedSpeakerSegment.speakerId` so chunk embeddings align to segments by string equality. `rho128` is non-optional and empty when no PLDA model is loaded, matching the internal `TimedEmbedding.rho128` shape.
- `chunkEmbeddings: [ChunkEmbedding]?` field on `DiarizationResult` (defaults to `nil`, populated only when opted in).
- `OfflineDiarizerConfig.exposeChunkEmbeddings: Bool` flag (defaults to `false`). When enabled, `OfflineDiarizerManager.process(...)` maps the internal `[TimedEmbedding] + assignments` to the public `[ChunkEmbedding]` array via `buildPublicChunkEmbeddings(...)`.
- `OfflineModuleTests.swift` covering default values, opt-in behavior, `DiarizationResult` initializer round-trip, field round-trip, the cluster-int → `"S\(N)"` mapping, length-mismatch handling (returns empty + warns), empty input, and `Codable` round-tripping.

Backwards compatibility & performance
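To illustrate the opt-in behavior and the cluster-to-speaker-ID mapping listed under "What changed", here is a minimal sketch using simplified stand-in types. Everything in it is an assumption made for illustration: the field types, `InternalTimedEmbedding`, `ConfigSketch`, `buildChunkEmbeddingsSketch`, and `chunkEmbeddingsForResult` are hypothetical names, not the library's actual declarations or signatures.

```swift
import Foundation

// Hypothetical stand-in for the public struct; field types are assumed.
struct ChunkEmbedding: Sendable, Codable, Equatable {
    let speakerId: String
    let chunkIndex: Int
    let speakerIndex: Int
    let startTimeSeconds: Double
    let endTimeSeconds: Double
    let embedding256: [Float]
    let rho128: [Float]  // empty when no PLDA model is loaded
}

// Hypothetical simplification of the internal per-chunk record.
struct InternalTimedEmbedding {
    let chunkIndex: Int
    let speakerIndex: Int
    let start: Double
    let end: Double
    let embedding256: [Float]
    let rho128: [Float]
}

// Hypothetical config with the flag defaulting to off.
struct ConfigSketch {
    var exposeChunkEmbeddings = false
}

// Maps internal embeddings plus zero-based cluster assignments to public
// values. On a length mismatch it warns and returns an empty array.
func buildChunkEmbeddingsSketch(
    _ embeddings: [InternalTimedEmbedding],
    assignments: [Int]
) -> [ChunkEmbedding] {
    guard embeddings.count == assignments.count else {
        let message = "warning: embeddings/assignments length mismatch\n"
        FileHandle.standardError.write(Data(message.utf8))
        return []
    }
    return zip(embeddings, assignments).map { (embedding, cluster) in
        ChunkEmbedding(
            speakerId: "S\(cluster + 1)",  // same convention as segment IDs
            chunkIndex: embedding.chunkIndex,
            speakerIndex: embedding.speakerIndex,
            startTimeSeconds: embedding.start,
            endTimeSeconds: embedding.end,
            embedding256: embedding.embedding256,
            rho128: embedding.rho128
        )
    }
}

// Returns nil unless the caller opted in, so default callers see no change.
func chunkEmbeddingsForResult(
    config: ConfigSketch,
    embeddings: [InternalTimedEmbedding],
    assignments: [Int]
) -> [ChunkEmbedding]? {
    guard config.exposeChunkEmbeddings else { return nil }
    return buildChunkEmbeddingsSketch(embeddings, assignments: assignments)
}

let sample = [
    InternalTimedEmbedding(
        chunkIndex: 0, speakerIndex: 0, start: 0.0, end: 10.0,
        embedding256: [0.5, 0.25], rho128: []),
    InternalTimedEmbedding(
        chunkIndex: 1, speakerIndex: 1, start: 1.0, end: 11.0,
        embedding256: [0.25, 0.5], rho128: []),
]
let optedIn = chunkEmbeddingsForResult(
    config: ConfigSketch(exposeChunkEmbeddings: true),
    embeddings: sample,
    assignments: [0, 2]
)
print(optedIn?.map(\.speakerId) ?? [])
```

With the flag left at its default the helper returns `nil`, which mirrors the stated intent that existing callers see no behavior change.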
Tests & lint
```
$ swift test
Executed 1352 tests, with 24 tests skipped and 0 failures (0 unexpected)
$ swift format lint --recursive --configuration .swift-format Sources/ Tests/
(only pre-existing warnings on Fa/Fb in OfflineDiarizerConfig and
Sortformer/CLI files; none introduced by this change)
```
Out of scope
The downstream chunk-cluster migration consumer that motivated this API lives outside this repo and is not part of this PR.
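For readers unfamiliar with the technique, the following is a rough, single-pass sketch of what centroid-based chunk migration could look like on top of the exposed embeddings. It is not the actual downstream consumer: the `Chunk` type, the function names, and the `margin` threshold are all hypothetical, and short two-dimensional vectors stand in for the real 256-dimensional embeddings.

```swift
// Hypothetical minimal record: only the fields this sketch needs.
struct Chunk {
    var speakerId: String
    let embedding: [Float]
}

// Scales a vector to unit length; zero vectors are returned unchanged.
func l2Normalized(_ v: [Float]) -> [Float] {
    let norm = v.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
    return norm > 0 ? v.map { $0 / norm } : v
}

// Dot product; equals cosine similarity when both inputs are unit vectors.
func dot(_ a: [Float], _ b: [Float]) -> Float {
    zip(a, b).reduce(Float(0)) { $0 + $1.0 * $1.1 }
}

// Per-speaker centroid: normalized sum of the unit-length chunk embeddings.
func centroids(of chunks: [Chunk]) -> [String: [Float]] {
    var sums: [String: [Float]] = [:]
    for chunk in chunks {
        let unit = l2Normalized(chunk.embedding)
        if let current = sums[chunk.speakerId] {
            sums[chunk.speakerId] = zip(current, unit).map { $0 + $1 }
        } else {
            sums[chunk.speakerId] = unit
        }
    }
    return sums.mapValues(l2Normalized)
}

// Reassigns a chunk when some other centroid beats its own centroid's
// cosine similarity by more than `margin` (an assumed tuning knob).
// Centroids are computed once from the original assignments.
func migrate(_ chunks: [Chunk], margin: Float = 0.1) -> [Chunk] {
    let cents = centroids(of: chunks)
    return chunks.map { chunk in
        guard let own = cents[chunk.speakerId] else { return chunk }
        let unit = l2Normalized(chunk.embedding)
        var bestId = chunk.speakerId
        var bestScore = dot(unit, own) + margin
        for (id, centroid) in cents where id != chunk.speakerId {
            let score = dot(unit, centroid)
            if score > bestScore {
                bestId = id
                bestScore = score
            }
        }
        var moved = chunk
        moved.speakerId = bestId
        return moved
    }
}

let chunks = [
    Chunk(speakerId: "S1", embedding: [1.0, 0.0]),
    Chunk(speakerId: "S1", embedding: [1.0, 0.1]),
    Chunk(speakerId: "S1", embedding: [0.9, 0.0]),
    Chunk(speakerId: "S1", embedding: [0.0, 1.0]),  // contaminating chunk
    Chunk(speakerId: "S2", embedding: [0.0, 1.0]),
    Chunk(speakerId: "S2", embedding: [0.1, 1.0]),
]
let migrated = migrate(chunks)
print(migrated.map(\.speakerId))
```

A real consumer would likely iterate until assignments stabilize and guard against emptying a cluster; this sketch deliberately leaves both out.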