feat(diarizer): expose per-chunk embeddings on DiarizationResult #1

Merged
adamsro merged 1 commit into feat/embedding-skip-strategy-v0.13.4 from feat/expose-chunk-embeddings on May 2, 2026

@adamsro adamsro commented May 2, 2026

Why is this change needed?

Surfaces the per-chunk speaker embeddings the offline pipeline already computes internally (currently only reachable via --export-embeddings JSON dump in the CLI), so consumers can implement chunk-granularity post-processing without re-running the embedding model.

The motivating use case is downstream cluster-purity correction: the clustering step occasionally places a fraction of one speaker's chunks in another speaker's cluster (mega-cluster contamination on long files, bidirectional smearing between similar voices). With per-chunk embeddings and their cluster assignments exposed on DiarizationResult, downstream code can compute per-cluster centroids and migrate any chunk whose cosine similarity to another cluster's centroid clearly exceeds its similarity to its own cluster's centroid. This is pure NumPy/BLAS-equivalent post-processing with no extra model calls.
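
A minimal sketch of that centroid-migration idea, for illustration only. The `Chunk` type, the function names, and the `margin` threshold are hypothetical and not part of this PR; the real consumer lives outside this repo. Centroids are computed once from the current assignments, so a contaminated cluster's centroid still includes its misplaced chunks.

```swift
import Foundation

// Hypothetical stand-in for the public chunk embedding type; only the
// two fields this sketch needs.
struct Chunk {
    let speakerId: String
    let embedding: [Float]
}

/// Cosine similarity of two equal-length vectors; 0 if either has zero norm.
func cosine(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have equal length")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denom = (normA * normB).squareRoot()
    return denom > 0 ? dot / denom : 0
}

/// Mean embedding per speaker ID (assumes all embeddings share one length).
func centroids(of chunks: [Chunk]) -> [String: [Float]] {
    var sums: [String: [Float]] = [:]
    var counts: [String: Int] = [:]
    for chunk in chunks {
        if var sum = sums[chunk.speakerId] {
            precondition(sum.count == chunk.embedding.count)
            for i in 0..<sum.count { sum[i] += chunk.embedding[i] }
            sums[chunk.speakerId] = sum
        } else {
            sums[chunk.speakerId] = chunk.embedding
        }
        counts[chunk.speakerId, default: 0] += 1
    }
    var result: [String: [Float]] = [:]
    for (id, sum) in sums {
        let n = Float(counts[id] ?? 1)
        result[id] = sum.map { $0 / n }
    }
    return result
}

/// Reassign a chunk when another centroid beats its own by more than `margin`.
/// `margin` is an assumed tuning knob, not a value taken from the PR.
func migrate(_ chunks: [Chunk], margin: Float = 0.1) -> [Chunk] {
    let cents = centroids(of: chunks)
    return chunks.map { chunk -> Chunk in
        guard let own = cents[chunk.speakerId] else { return chunk }
        let ownSim = cosine(chunk.embedding, own)
        var bestId = chunk.speakerId
        var bestSim = ownSim
        for (id, centroid) in cents where id != chunk.speakerId {
            let sim = cosine(chunk.embedding, centroid)
            if sim > bestSim {
                bestId = id
                bestSim = sim
            }
        }
        return bestSim - ownSim > margin
            ? Chunk(speakerId: bestId, embedding: chunk.embedding)
            : chunk
    }
}
```

A single pass like this is the simplest variant; an iterative version would recompute centroids after each round of migrations until assignments stop changing.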

What changed

  • New public `ChunkEmbedding` struct (`Sendable`, `Codable`) carrying `speakerId`, `chunkIndex`, `speakerIndex`, `startTimeSeconds`, `endTimeSeconds`, `embedding256`, and `rho128`. Speaker IDs follow the same `"S\(cluster + 1)"` convention as `TimedSpeakerSegment.speakerId` so chunk embeddings align to segments by string equality. `rho128` is non-optional and empty when no PLDA model is loaded, matching the internal `TimedEmbedding.rho128` shape.
  • New optional `chunkEmbeddings: [ChunkEmbedding]?` field on `DiarizationResult` (defaults to `nil`, populated only when opted in).
  • New `OfflineDiarizerConfig.exposeChunkEmbeddings: Bool` flag (defaults to `false`). When enabled, `OfflineDiarizerManager.process(...)` maps the internal `[TimedEmbedding]` + assignments to the public `[ChunkEmbedding]` array via `buildPublicChunkEmbeddings(...)`.
  • 9 new unit tests in `OfflineModuleTests.swift` covering default values, opt-in behavior, `DiarizationResult` initializer round-trip, field round-trip, the cluster-int → `"S\(N)"` mapping, length-mismatch handling (returns empty + warns), empty input, and `Codable` round-tripping.
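
The mapping described above could look roughly like the following. This is a sketch under stated assumptions, not the code from the diff: the type and function names carry a `Sketch` suffix because the real signatures are not shown here, the field types (`Int`, `Double`, `[Float]`) are guesses, and the internal record is a hypothetical stand-in for `TimedEmbedding`.

```swift
import Foundation

// Sketch of the public struct; field names come from the PR description,
// field types are assumptions.
struct ChunkEmbeddingSketch: Sendable, Codable, Equatable {
    let speakerId: String
    let chunkIndex: Int
    let speakerIndex: Int
    let startTimeSeconds: Double
    let endTimeSeconds: Double
    let embedding256: [Float]
    let rho128: [Float]  // empty when no PLDA model is loaded
}

// Hypothetical stand-in for the internal per-chunk record.
struct InternalEmbeddingSketch {
    let chunkIndex: Int
    let speakerIndex: Int
    let start: Double
    let end: Double
    let embedding256: [Float]
    let rho128: [Float]
}

/// Pairs each internal embedding with its cluster assignment.
/// On a length mismatch it warns on stderr and returns an empty array,
/// mirroring the behavior the unit tests are described as covering.
func buildChunkEmbeddingsSketch(
    _ embeddings: [InternalEmbeddingSketch],
    assignments: [Int]
) -> [ChunkEmbeddingSketch] {
    guard embeddings.count == assignments.count else {
        let message = "warning: \(embeddings.count) embeddings vs \(assignments.count) assignments\n"
        FileHandle.standardError.write(Data(message.utf8))
        return []
    }
    return zip(embeddings, assignments).map { item, cluster in
        ChunkEmbeddingSketch(
            speakerId: "S\(cluster + 1)",  // zero-based cluster to "S1", "S2", ...
            chunkIndex: item.chunkIndex,
            speakerIndex: item.speakerIndex,
            startTimeSeconds: item.start,
            endTimeSeconds: item.end,
            embedding256: item.embedding256,
            rho128: item.rho128
        )
    }
}
```

Because the speaker IDs are plain strings built the same way as segment IDs, a consumer can group the resulting array by `speakerId` and join it against segments without any extra lookup table.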

Backwards compatibility & performance

  • Fully opt-in. With `exposeChunkEmbeddings = false` (the default), the new code path is one boolean check that lands in the `nil` branch — no extra allocation, no extra compute, no memory cost.
  • `DiarizationResult.init(...)` adds `chunkEmbeddings:` between `speakerDatabase:` and `timings:` with default `nil`. All existing callers in this repo use named arguments (`DiarizerManager.swift:220`, with one positional-`segments`-only call at `:223`), so no call site changes.
  • `OfflineDiarizerConfig.init(...)` adds `exposeChunkEmbeddings:` with default `false`. The community/community-1 presets and existing init overloads keep their behavior.
  • When opted in: ~1–2 MB per hour of audio for the embedding + PLDA payload. Helper is O(n) over already-computed data — no model calls, no audio access, ~1 ms even on 90-min files.
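
As a rough sanity check on the memory figure, the arithmetic can be sketched as below. It assumes each vector element is stored as a 4-byte `Float` and counts only the two vectors, ignoring the string, index, and timestamp fields and any array overhead; the helper name is hypothetical.

```swift
// Assumption: embedding256 and rho128 are arrays of 4-byte Float values.
let bytesPerFloat = MemoryLayout<Float>.size
let bytesPerEntry = (256 + 128) * bytesPerFloat  // vector payload only

/// Number of chunk embeddings that fit in a given payload size (decimal MB).
func estimatedEntries(forMegabytes megabytes: Double) -> Int {
    Int(megabytes * 1_000_000 / Double(bytesPerEntry))
}

print(bytesPerEntry)                        // 1536
print(estimatedEntries(forMegabytes: 1.0))  // 651
print(estimatedEntries(forMegabytes: 2.0))  // 1302
```

Under those assumptions, 1–2 MB per hour corresponds to roughly 650 to 1,300 chunk embeddings per hour of audio when PLDA is loaded.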

Tests & lint

```
$ swift test
Executed 1352 tests, with 24 tests skipped and 0 failures (0 unexpected)

$ swift format lint --recursive --configuration .swift-format Sources/ Tests/
(only pre-existing warnings on Fa/Fb in OfflineDiarizerConfig and
Sortformer/CLI files; none introduced by this change)
```

Out of scope

The downstream chunk-cluster migration consumer that motivated this API lives outside this repo and is not part of this PR.

Add an opt-in `exposeChunkEmbeddings` flag on OfflineDiarizerConfig that
surfaces per-chunk speaker embeddings (already computed internally) on
DiarizationResult.chunkEmbeddings. Enables chunk-granularity
post-processing — e.g. cluster-purity correction via centroid migration —
without re-running the embedding model. Default off; no behavior change
for existing callers.
@adamsro adamsro merged commit cc9f41f into feat/embedding-skip-strategy-v0.13.4 May 2, 2026
1 of 2 checks passed