@uriyage commented Dec 4, 2025

Summary

This PR redesigns the IO threading communication model, replacing the inefficient client-list polling approach with a high-performance, lock-free queue architecture. This change improves throughput by 8–17% across various workloads and lays the groundwork for offloading command execution to IO threads in follow-up PRs.

Performance Comparison: Unstable vs New IO Queues

| Type | Operation | Unstable Branch (M TPS) | New IO Queues (M TPS) | Difference (%) |
|------|-----------|-------------------------|-----------------------|----------------|
| CME  | SET       | 1.02                    | 1.19                  | +16.67%        |
| CME  | GET       | 1.30                    | 1.47                  | +13.08%        |
| CMD  | SET       | 1.15                    | 1.35                  | +17.39%        |
| CMD  | GET       | 1.52                    | 1.64                  | +7.89%         |
  • Test Configuration: 8 IO threads • 400 clients • 512-byte values • 3M keys

Motivation

The previous IO model had several limitations that created performance bottlenecks:

  • Inefficient Polling: The main thread lacked a direct notification mechanism for completed work. Instead, it had to constantly iterate through a list of all pending clients to check their state, wasting significant CPU cycles.
  • Manual Load Balancing: Jobs were assigned to specific threads upfront. This required the main thread to predict which thread to use, often leaving some threads idle while others were overloaded.
  • Static Scaling: Thread activation relied on a fixed heuristic (e.g., 1 thread per 2 events). This approach failed to adapt to varying workloads, such as TLS connections or differing read/write sizes.

The Solution

To address these inefficiencies, this PR replaces the current per-thread SPSC queues with three specialized queues that handle communication and load balancing more effectively.

1. Main > IO: Shared Queue (Single-Producer Multi-Consumer)

A single queue from the main thread to all IO threads.

  • Automatic Load Balancing: All threads pull from the same source. Busy threads take less work, and idle threads take more, so we don't need to manually select a thread.
  • Adaptive Scaling: We now use the queue depth to decide when to add or remove threads. If the queue is full, we scale up; if it's empty, we scale down.
    • Ignition: To get things started before the queue fills up, we monitor the main thread's CPU. If usage goes over 30%, we wake up the first IO thread.
  • Implementation: To prevent contention among consumers, each item in the ring buffer is padded to reside in its own cache line. Sequence numbers indicate whether a cell is empty or populated, allowing threads to safely claim work.

2. IO > Main: The Response Channel (MPSC Queue)

We replaced the old polling loop with a response queue.

  • Faster Completion: IO threads push completed jobs into this queue. The main thread detects new data simply by checking if the queue is not empty, removing the need to scan pending clients.
  • Contention Management: To avoid lock contention, each thread reserves a slot by atomically incrementing the tail index. In the rare event that the queue is full, pending jobs are buffered in a local temporary list until space becomes available.

3. Main > IO (Thread-Specific): Private Inbox (SPSC Queue)

We kept the existing Single-Producer Single-Consumer (SPSC) queues for tasks that must happen on a specific thread (like freeing memory allocated by that thread). IO threads always check their private inbox before looking at the shared queue.

Changes Required

  • Async client release
    The main thread no longer busy-waits for IO threads to finish with a client. Since the client must be popped from the multi-producer queue before it can be released, clients with pending IO are now marked for asynchronous closure.

  • Client eviction logic
    Updated evictClients() to account for memory pending release (clients marked close_asap). freeClient() now returns a status code (1 for freed, 0 for async-close) to ensure the eviction loop does not over-evict by ignoring memory that is about to be reclaimed.

  • events-per-io-thread config
    Replaced the events-per-io-thread configuration with io-threads-always-active, since we no longer track events. Because this config is used only for tests, no backward-compatibility issue arises.

  • Packed jobs instead of handlers
    Jobs are now represented as tagged pointers (using lower 3 bits for job type) instead of separate {handler, data} structs. This reduces memory overhead and allows jobs to be passed through the queues as single pointers.

  • Head caching in SPSC queue
    The SPSC queue now caches the head index on the producer side (head_cache) to avoid frequent atomic loads. The producer only refreshes from the atomic head when the cache indicates the queue might be full, reducing cross-thread cache-line bouncing.

  • Deferred commit in SPSC queue
    spscEnqueue() supports batching via a commit flag. Multiple jobs can be enqueued with commit=false, then flushed with a single spscCommit() call, reducing atomic operations and cache-line bouncing.

  • Rollback on enqueue failure
    When spmcEnqueue() fails due to a full queue, the client state is rolled back (e.g., io_write_state reset to CLIENT_IDLE). This removes the need for an expensive isFull check before every enqueue: we simply attempt the enqueue and revert if it fails.

  • Epoll offloading via SPSC at high thread counts
    When active_io_threads_num > 9, poll jobs are sent to per-thread SPSC queues (round-robin). Since threads check their private queue first, this ensures poll jobs are processed promptly without waiting behind jobs in the shared SPMC queue.

  • Avoid offloading a write before the read completes
    Added a check if (c->io_read_state == CLIENT_PENDING_IO) return C_OK in trySendWriteToIOThreads(). In the previous per-thread SPSC implementation, we could send consecutive read and write jobs for the same client knowing a single thread would handle them in order. With the shared SPMC queue, different threads may pick up the jobs, so we must wait for the read to complete before sending a write, to avoid two threads handling the same client.

  • Removed pending_read_list_node from client and pending-IO lists from server
    Removed pending_read_list_node from the client struct and the clients_pending_io_read/clients_pending_io_write lists from valkeyServer, as the new MPSC queue eliminates the need for these tracking structures.

  • Added inst metrics for pending IO jobs
    Added instantaneous_io_pending_jobs metric via STATS_METRIC_IO_WAIT to track average queue depth over time.

  • Added stat for current active thread count
    Added active_io_threads_num to the INFO stats output for better visibility.

  • Added internal inst metric for main-thread CPU (not Apple-compatible)
    Added STATS_METRIC_MAIN_THREAD_CPU_SYS to track main-thread CPU usage via getrusage(RUSAGE_THREAD). This powers the "ignition" policy: when CPU exceeds 30%, the first IO thread is activated. RUSAGE_THREAD is Linux-specific, so macOS falls back to event-count heuristics.

  • Added stats for pending IO reads and writes
    Added io_threaded_reads_pending and io_threaded_writes_pending stats to track how many read/write jobs are currently in-flight to IO threads.

  • Added volatile for server.crashed
    Changed server.crashed from int to volatile int so the crash flag is immediately visible across threads, allowing IO threads to detect a crash and stop sending responses to the main thread, which avoids a deadlock during crash handling.

Co-authored-by: Dan Touitou [email protected]
Signed-off-by: Uri Yagelnik [email protected]

@uriyage force-pushed the io-threads-queues branch 2 times, most recently from dd619ee to 9af5548 on December 4, 2025
codecov bot commented Dec 4, 2025

Codecov Report

❌ Patch coverage is 30.53279% with 339 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.34%. Comparing base (04d0bba) to head (e377b7b).

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| src/io_threads.c | 25.44% | 208 Missing ⚠️ |
| src/io_queues.h  | 35.91% | 91 Missing ⚠️ |
| src/networking.c | 29.82% | 40 Missing ⚠️ |
Additional details and impacted files
```
@@             Coverage Diff              @@
##           unstable    #2909      +/-   ##
============================================
- Coverage     72.41%   72.34%   -0.07%     
============================================
  Files           129      130       +1     
  Lines         70528    70773     +245     
============================================
+ Hits          51076    51204     +128     
- Misses        19452    19569     +117     
```
| Files with missing lines | Coverage Δ |
|--------------------------|------------|
| src/config.c     | 78.44% <ø> (ø) |
| src/server.c     | 88.42% <100.00%> (+0.02%) ⬆️ |
| src/server.h     | 100.00% <ø> (ø) |
| src/networking.c | 89.13% <29.82%> (+0.69%) ⬆️ |
| src/io_queues.h  | 35.91% <35.91%> (ø) |
| src/io_threads.c | 30.13% <25.44%> (-5.45%) ⬇️ |

... and 12 files with indirect coverage changes


