Add TTSKit with Qwen3-TTS support#425

Open
ZachNagengast wants to merge 6 commits into main from ttskit

ZachNagengast (Contributor) commented Feb 19, 2026

WhisperKit is expanding into text-to-speech!

TTSKit adds a new library for on-device text-to-speech using Core ML-accelerated Qwen3-TTS models (CustomVoice 0.6B and 1.7B in this first release) with real-time streaming playback on Apple Silicon. This first PR introduces the library into the WhisperKit package as an optional import (WhisperKit will be renamed to reflect the new multi-Kit nature of the Argmax open-source SDK). It adds real-time TTS capabilities with a state-of-the-art open-source model, usable either on its own or as a complement to WhisperKit speech-to-text.

This PR is still in the final phases of development, but here are a few highlights:

TTSKit Library

  • Download, load, generate, and stream playback in ~3 lines of code.
  • Protocol-based component architecture (6 swappable Core ML components: TextProjecting, CodeEmbedding, MultiCodeEmbedding, CodeDecoding, MultiCodeDecoding, SpeechDecoding) for plugging in new model backends.
  • Qwen3-TTS implementation with 9 built-in voices, 10 languages, and style instruction support (1.7B variant only).
  • Automatic text chunking for long-form generations with concurrent chunk generation and cross-fade stitching.
  • Adaptive streaming playback (TTSPlaybackStrategy.auto) that measures first-step latency to pre-buffer just enough audio.
  • Seedable RNG for reproducible generation.
  • WAV and M4A (AAC) audio export.

Example usage playing audio in real-time out of the default speaker:

    let ttsKit = try await TTSKit()
    try await ttsKit.playSpeech(text: "Hello from TTSKit!")
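
The cross-fade stitching mentioned in the highlights can be illustrated with a small sketch. This is not TTSKit's implementation: `crossFade` and its `overlap` parameter are hypothetical names, and a simple linear fade over mono Float samples is assumed.

```swift
/// Joins two chunks of mono PCM samples, blending `overlap` samples at the
/// seam with a linear fade. Hypothetical helper, not part of TTSKit.
func crossFade(_ a: [Float], _ b: [Float], overlap: Int) -> [Float] {
    // Clamp the overlap so it never exceeds either chunk.
    let n = max(0, min(overlap, a.count, b.count))
    guard n > 0 else { return a + b }

    var out = Array(a.dropLast(n))
    out.reserveCapacity(a.count + b.count - n)
    for i in 0..<n {
        // The weight of `b` ramps up across the overlap region.
        let t = Float(i + 1) / Float(n + 1)
        out.append(a[a.count - n + i] * (1 - t) + b[i] * t)
    }
    out.append(contentsOf: b.dropFirst(n))
    return out
}

// Two 4-sample chunks with a 2-sample overlap produce 6 samples.
let stitched = crossFade([1, 1, 1, 1], [0, 0, 0, 0], overlap: 2)
print(stitched.count) // 6
```

The overlap shortens the stitched result by `overlap` samples, so a real pipeline would need to account for that when reporting audio duration.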

New target: ArgmaxCore

  • Extracted a shared target with various utilities from WhisperKit so TTSKit can share them without depending on it directly

CLI

  • For now, we plan to ship this as a new tts subcommand of whisperkit-cli, which can be used like this:
    • swift run whisperkit-cli tts --text "Hello from TTSKit" --play
    • Full control over speaker, language, model, style instruction, temperature, chunking, compute units, and seed.

TTSKit Example app

  • macOS and iOS example app with model management, real-time waveform visualization, generation history persisted as M4A files, and more. Use this as a quick way to try it out!
[Attachment: ttskit-example-app]

Roadmap

We plan to keep adding support for state-of-the-art models and improving inference latency for TTSKit over the next few weeks. The immediate follow-ups are the voice cloning feature from Qwen3-TTS and a 2x reduction in time-to-first-byte (TTFB), so that this on-device project consistently achieves sub-100 ms TTFB, providing a latency edge over cloud deployments of the same model. In the meantime, we encourage anyone reading this to check out this PR, give it a spin, and let us know how it goes!


@MainActor
@Observable
final class ViewModel: @unchecked Sendable {

Contributor

I would break this down into smaller view models if it grows too long, e.g. DownloadViewModel and TTSViewModel.

@@ -1,5 +1,5 @@
{

Contributor

why do we need to change this?

Contributor Author

Had an old swift-transformers resolved

/// Thin wrapper around `os_unfair_lock` that exposes a Swift-friendly
/// `withLock` helper. This lock is non-reentrant and optimized for low
/// contention, matching the semantics of Core Foundation's unfair lock.
public final class UnfairLock: @unchecked Sendable {

Contributor

I think we want a generic class name here to future-proof it for Swift 6; os_unfair_lock does not seem to be the recommended way to lock in Swift 6.
We could rename it to Mutex so we can reimplement it later with Swift's own Mutex (from the Synchronization module).

Now:

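import os

public final class Mutex: @unchecked Sendable {
    private let lock = OSAllocatedUnfairLock()

    public init() {}

    // Not @inlinable: an inlinable body cannot reference the private `lock`.
    // `withLockUnchecked` accepts a non-Sendable closure; plain `withLock`
    // requires a `@Sendable` one.
    public func withLock<T>(_ body: () throws -> T) rethrows -> T {
        try lock.withLockUnchecked(body)
    }
}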


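Later:

// Sketch only: needs a Swift 6 toolchain and an OS that ships Synchronization.
import Synchronization

public final class Mutex<Value: Sendable>: Sendable {
    // Fully qualified so it does not resolve to this wrapper class.
    private let mutex: Synchronization.Mutex<Value>

    public init(_ value: Value) {
        self.mutex = Synchronization.Mutex(value)
    }

    public func withLock<T: Sendable>(_ body: (inout Value) throws -> T) rethrows -> T {
        try mutex.withLock { try body(&$0) }
    }
}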

Contributor

should we consider adding another package under ArgmaxCore? like ArgmaxCore/CoreML

///
/// Downloads only the files matching the configured component variants.
/// Files are cached locally by the Hub library.
open class func download(

Contributor

should we decouple model download from TTSKit? ArgmaxCore could provide a downloader for this

Contributor Author

Yep have some todos relating to this

// Copyright © 2026 Argmax, Inc. All rights reserved.

import Accelerate
@_exported import ArgmaxCore

Contributor

why @_exported?

)

XCTAssertGreaterThan(result.audio.count, 0, "Audio samples should be non-empty")
XCTAssertGreaterThan(result.audioDuration, 1.0, "Expect at least 1s of speech")

Contributor

Will a seed guarantee that the audio length is always deterministic?
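
For context on the question above, here is a minimal sketch of what seeded sampling makes reproducible. It assumes a SplitMix64 generator, chosen only for illustration since the PR does not say which algorithm TTSKit uses; `SeededGenerator` and `sampleTokens` are hypothetical names. A fixed seed reproduces the random draws, but whether the resulting audio length is also identical may additionally depend on the model producing the same outputs on every run.

```swift
/// SplitMix64: a small deterministic generator, used here for illustration only.
struct SeededGenerator: RandomNumberGenerator {
    private var state: UInt64
    init(seed: UInt64) { state = seed }

    mutating func next() -> UInt64 {
        state &+= 0x9E37_79B9_7F4A_7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58_476D_1CE4_E5B9
        z = (z ^ (z >> 27)) &* 0x94D0_49BB_1331_11EB
        return z ^ (z >> 31)
    }
}

/// Draws `count` token ids in 0..<vocabSize, a stand-in for sampling from
/// model logits. Hypothetical helper, not part of TTSKit.
func sampleTokens(seed: UInt64, count: Int, vocabSize: Int) -> [Int] {
    var rng = SeededGenerator(seed: seed)
    return (0..<count).map { _ in Int.random(in: 0..<vocabSize, using: &rng) }
}

let first = sampleTokens(seed: 42, count: 8, vocabSize: 1024)
let second = sampleTokens(seed: 42, count: 8, vocabSize: 1024)
print(first == second) // true: same seed, same draws
```

Giving each concurrent task its own generator, seeded from a derived value, keeps the draws reproducible regardless of how the tasks are scheduled.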

// For licensing see accompanying LICENSE.md file.
// Copyright © 2024 Argmax, Inc. All rights reserved.

import ArgmaxCore

Contributor

I think we want to break these tests down into isolated per-class tests.

E.g. 1: TTSKitTest.swift injects a Config with mocked components and verifies that
TTSKitTest.generateSpeech interacts with the components correctly (tasks created, etc.).

E.g. 2: Qwen3TTSGenerateTaskTest.swift injects mocked components and verifies that run interacts with them correctly.
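
The mocked-component approach suggested above could look roughly like the sketch below. All names here (`TokenGenerating`, `MockTokenGenerator`, `SpeechPipeline`) are hypothetical stand-ins; the real TTSKit protocols have their own signatures, and a real test would use XCTest assertions instead of `print`.

```swift
/// Hypothetical component protocol standing in for one of the swappable components.
protocol TokenGenerating {
    func generate(for text: String) -> [Int]
}

/// Mock that records every call so a test can verify the interaction.
final class MockTokenGenerator: TokenGenerating {
    private(set) var receivedTexts: [String] = []

    func generate(for text: String) -> [Int] {
        receivedTexts.append(text)
        return [1, 2, 3] // canned output, no model involved
    }
}

/// Hypothetical pipeline that receives its component by injection.
struct SpeechPipeline {
    let generator: any TokenGenerating

    func run(chunks: [String]) -> [Int] {
        chunks.flatMap { generator.generate(for: $0) }
    }
}

let mock = MockTokenGenerator()
let tokens = SpeechPipeline(generator: mock).run(chunks: ["Hello.", "World."])
print(mock.receivedTexts) // ["Hello.", "World."]
print(tokens.count)       // 6
```

Because the mock returns canned output, such tests run without downloading or loading any model and can assert on call order and arguments alone.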

/// owns its own sampler (derived seed) so concurrent tasks don't share RNG state.
/// Model components are shared read-only references - `MLModel.prediction()` is
/// thread-safe. The class is `@unchecked Sendable` to permit `open` subclassing.
open class TTSGenerateTask: @unchecked Sendable, TTSGenerating {

Contributor

Should the class be renamed to Qwen3TTSGenerateTask? Ditto for the other files under Qwen3TTS.

argmaxinc deleted a comment from chen-argmax on Feb 19, 2026

/// Serializes access to a value with an `os_unfair_lock` so mutation stays
/// thread-safe. Useful for properties on types marked `@unchecked Sendable`.
@propertyWrapper
public struct PropertyLock<Value: Codable & Sendable>: Sendable, Codable {

Contributor

TL;DR: @ZachNagengast @chen-argmax, this doesn't make reference-type or value-type properties truly thread-safe.

I was playing around with this and trying to move Sendable and Codable conformances outside. Did some verifications on the current implementation and mine. Ran the snippet below with different variations

  1. Reference type property Ref
  2. Value type property Ref
  3. Plain property of type Int

None of them was safe. Locking the accessors isn't enough; the whole mutation has to happen under the lock.

final class Ref: Codable, @unchecked Sendable {
    var count: Int

    init(count: Int = 0) { self.count = count }

    enum CodingKeys: String, CodingKey { case count }

    required init(from decoder: Decoder) throws {
        let c = try decoder.container(keyedBy: CodingKeys.self)
        self.count = try c.decode(Int.self, forKey: .count)
    }

    func encode(to encoder: Encoder) throws {
        var c = encoder.container(keyedBy: CodingKeys.self)
        try c.encode(count, forKey: .count)
    }
}

final class Holder: @unchecked Sendable {
    @TranscriptionPropertyLock var ref = Ref()
}

@main
struct Main {
    static func main() async {
        let workers = max(2, ProcessInfo.processInfo.activeProcessorCount * 2)
        let perWorker = 50_000
        let expected = workers * perWorker

        print("workers=\(workers), perWorker=\(perWorker), expected=\(expected)")

        for run in 1...10 {
            let holder = Holder()

            await withTaskGroup(of: Void.self) { group in
                for _ in 0..<workers {
                    group.addTask {
                        for _ in 0..<perWorker {
                            holder.ref.count += 1
                        }
                    }
                }
            }

            let final = holder.ref.count
            print("run \(run): expected=\(expected) actual=\(final)")
        }
    }
}

Contributor

The approach used in AudioProcessor PR (also the WhisperKit PR) works. Snippet below:

import Foundation
import os.lock

@usableFromInline
final class UnfairLock: @unchecked Sendable {
    @usableFromInline
    var lock = os_unfair_lock()

    @inlinable
    func withLock<T>(_ body: () throws -> T) rethrows -> T {
        os_unfair_lock_lock(&lock)
        defer { os_unfair_lock_unlock(&lock) }
        return try body()
    }
}

final class Ref {
    var count = 0
}

final class HolderInt: @unchecked Sendable {
    private let stateLock = UnfairLock()
    private var countStorage = 0

    var count: Int {
        get {
            stateLock.withLock { countStorage }
        }
        set {
            stateLock.withLock { countStorage = newValue }
        }
    }

    func increment() {
        stateLock.withLock {
            countStorage += 1
        }
    }
}

final class HolderRef: @unchecked Sendable {
    private let stateLock = UnfairLock()
    private let refStorage = Ref()

    var refCount: Int {
        stateLock.withLock { refStorage.count }
    }

    func incrementRef() {
        stateLock.withLock {
            refStorage.count += 1
        }
    }
}

@main
struct Main {
    static func main() async {
        let workers = max(2, ProcessInfo.processInfo.activeProcessorCount * 2)
        let perWorker = 50_000
        let expected = workers * perWorker

        print("[Int] workers=\(workers), perWorker=\(perWorker), expected=\(expected)")
        for run in 1...10 {
            let holder = HolderInt()
            await withTaskGroup(of: Void.self) { group in
                for _ in 0..<workers {
                    group.addTask {
                        for _ in 0..<perWorker {
                            holder.increment()
                        }
                    }
                }
            }
            let final = holder.count
            print("[Int] run \(run): expected=\(expected) actual=\(final)")
        }

        print("[Ref] workers=\(workers), perWorker=\(perWorker), expected=\(expected)")
        for run in 1...10 {
            let holder = HolderRef()
            await withTaskGroup(of: Void.self) { group in
                for _ in 0..<workers {
                    group.addTask {
                        for _ in 0..<perWorker {
                            holder.incrementRef()
                        }
                    }
                }
            }
            let final = holder.refCount
            print("[Ref] run \(run): expected=\(expected) actual=\(final)")
        }
    }
}

Contributor

This is a valid concern. Essentially, if the wrapped property has properties of its own, reads and writes to those won't be thread-safe.

E.g., this is thread-safe:

holder.ref = otherRef

This is not thread-safe:

holder.ref.count += 1

@ZachNagengast we may want to add documentation for this wrapper.
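
One way to make the read-modify-write case safe is to expose a method that holds the lock for the whole mutation. The sketch below is an assumption for illustration, not the PR's PropertyLock: `Locked` and `mutate` are hypothetical names, and NSLock is used instead of `os_unfair_lock` to keep the example short.

```swift
import Foundation

/// Hypothetical wrapper that adds a locked read-modify-write entry point
/// alongside the locked accessors.
@propertyWrapper
final class Locked<Value>: @unchecked Sendable {
    private let lock = NSLock()
    private var value: Value

    init(wrappedValue: Value) { value = wrappedValue }

    /// Each get or set is atomic on its own, but `x += 1` is a get followed
    /// by a set, so another thread can interleave between the two.
    var wrappedValue: Value {
        get { lock.lock(); defer { lock.unlock() }; return value }
        set { lock.lock(); defer { lock.unlock() }; value = newValue }
    }

    var projectedValue: Locked<Value> { self }

    /// Runs `body` with the lock held for the whole read-modify-write.
    func mutate<T>(_ body: (inout Value) throws -> T) rethrows -> T {
        lock.lock()
        defer { lock.unlock() }
        return try body(&value)
    }
}

final class Counter: @unchecked Sendable {
    @Locked var count = 0
    func increment() { $count.mutate { $0 += 1 } }
}

let counter = Counter()
DispatchQueue.concurrentPerform(iterations: 8) { _ in
    for _ in 0..<10_000 { counter.increment() }
}
print(counter.count) // 80000
```

Writing `counter.count += 1` directly would still compile and still race, so callers have to go through `mutate` for any compound update; that constraint is the kind of thing the wrapper's documentation would need to spell out.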

Contributor

I don't think it is safe for pure value-type properties such as Int either. We probably need to use _modify instead of set.

I am checking these resources:

Contributor Author

Making a note in this PR but will leave the fix to a followup 👍

@zaidbren

I am trying to run the 1.7B model on a MacBook Air M1. The 0.6B version worked fine, but with the 1.7B model it first specializes the model for the device, then loads it, and then stops during generation and throws this error: Unable to compute the prediction using ML Program. It can be an invalid input data or broken/unsupported model.

Contributor

chen-argmax left a comment

Approved, with a comment to add docs to PropertyLock.
