TBB worker threads scheduled to efficiency cores on Apple Silicon #3277

@ssp3nc3r

Issue: TBB worker threads scheduled to efficiency cores on Apple Silicon

Summary

On Apple Silicon Macs, TBB worker threads are created with the default QoS (Quality of Service) class. macOS interprets this as "not user-facing work" and may schedule these threads to efficiency (E) cores even when performance (P) cores are available. This significantly degrades Stan's parallel performance: in the case reported here, per-chain CPU usage fell from ~800% to ~100-300% after about four minutes.

Description

Apple Silicon chips have heterogeneous cores:

  • Performance (P) cores: Fast, for compute-intensive work
  • Efficiency (E) cores: Roughly 3x slower, intended for background tasks

macOS uses QoS classes to decide core scheduling. The default QoS class signals "this work isn't urgent," allowing macOS to prefer E-cores to save power. For Stan's compute workloads, this is the wrong signal.

Observed behavior

  • Environment: macOS 26.2 (Tahoe), Apple M3 Ultra (24 P-cores, 8 E-cores)
  • Stan model using reduce_sum with 12 threads per chain, 2 chains
  • Initial CPU usage: ~800% per chain (threads on P-cores)
  • After ~4 minutes: CPU usage drops to ~100-300% per chain (threads demoted to E-cores)
  • P-cores sit idle while E-cores are saturated

This behavior appears more aggressive in macOS 26 (Tahoe) but affects all Apple Silicon Macs.

Root cause

TBB creates worker threads without setting a QoS class, so they inherit the default. macOS sees long-running default-QoS threads as background work and demotes them to E-cores.
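One way to check this diagnosis is to log the QoS class a thread actually has when it enters the scheduler. The following is a minimal sketch, not part of Stan Math or TBB: the helper name current_thread_qos_name is hypothetical, and it assumes the Darwin call pthread_get_qos_class_np is available (on other platforms it simply reports "unsupported").

```cpp
#include <string>

#if defined(__APPLE__)
#include <pthread.h>
#include <pthread/qos.h>
#endif

// Hypothetical helper: returns a readable name for the calling thread's
// QoS class, or "unsupported" where the Darwin QoS API does not exist.
inline std::string current_thread_qos_name() {
#if defined(__APPLE__)
  qos_class_t qos = QOS_CLASS_UNSPECIFIED;
  int relative_priority = 0;
  // Returns zero on success and an error number otherwise.
  if (pthread_get_qos_class_np(pthread_self(), &qos, &relative_priority) != 0) {
    return "error";
  }
  switch (qos) {
    case QOS_CLASS_USER_INTERACTIVE: return "user-interactive";
    case QOS_CLASS_USER_INITIATED:   return "user-initiated";
    case QOS_CLASS_DEFAULT:          return "default";
    case QOS_CLASS_UTILITY:          return "utility";
    case QOS_CLASS_BACKGROUND:       return "background";
    case QOS_CLASS_UNSPECIFIED:      return "unspecified";
  }
  return "unknown";
#else
  return "unsupported";
#endif
}
```

Calling this from a worker thread before and after any QoS change would show whether the class really is what the scheduler is assumed to be acting on.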

Proposed fix

Stan Math already has a task_scheduler_observer in stan/math/rev/core/init_chainablestack.hpp that runs when TBB worker threads are created. Adding a call to pthread_set_qos_class_self_np(QOS_CLASS_USER_INITIATED, 0) in on_scheduler_entry() would signal to macOS that these are user-initiated compute threads.

This doesn't prevent E-core usage; it tells macOS to prefer P-cores when they are available. If all P-cores are busy (e.g., when running more threads than there are P-cores), macOS can still use E-cores. The fix ensures P-cores aren't left idle while work runs slowly on E-cores.

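#if defined(__APPLE__) && (defined(__arm64__) || defined(__aarch64__))
#include <pthread/qos.h>  // pthread_set_qos_class_self_np
#endif

void on_scheduler_entry(bool worker) {
#if defined(__APPLE__) && (defined(__arm64__) || defined(__aarch64__))
    // Prefer performance cores for compute threads
    pthread_set_qos_class_self_np(QOS_CLASS_USER_INITIATED, 0);
#endif
    // ... existing AD tape initialization ...
}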
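Since pthread_set_qos_class_self_np returns an error number rather than throwing, it may be worth wrapping the call so a failure can be detected or logged. Below is a minimal, self-contained sketch under that assumption; the helper name request_user_initiated_qos is hypothetical and is not an existing Stan Math or TBB function. On builds that are not Apple Silicon it is a no-op that reports success.

```cpp
#if defined(__APPLE__) && (defined(__arm64__) || defined(__aarch64__))
#include <pthread/qos.h>
#endif

// Hypothetical helper: asks macOS to treat the calling thread as
// user-initiated work. Returns 0 on success, or when there is nothing to do
// (builds that are not Apple Silicon); otherwise returns the error number
// reported by pthread_set_qos_class_self_np.
inline int request_user_initiated_qos() {
#if defined(__APPLE__) && (defined(__arm64__) || defined(__aarch64__))
  return pthread_set_qos_class_self_np(QOS_CLASS_USER_INITIATED, 0);
#else
  return 0;
#endif
}
```

Inside on_scheduler_entry() the return value could be ignored, as in the snippet above, or checked once and reported so that a silent failure does not go unnoticed.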

Environment

  • macOS: 26.2 (Tahoe), also affects earlier versions
  • Hardware: Apple M3 Ultra (also affects M1, M2, M4 series)
  • Stan Math: 2.38.0 (TBB 2020.3)
