feat: add threaded I/O pipeline for video processing #1997
Conversation
Hey @AnonymDevOSS, this PR looks pretty good. I think there are plans to create a new video API in #1924. What are your thoughts on adding threading for the new Video API?
Sure, I'll take a look at it this week.
Hi @AnonymDevOSS! The benchmark results look very promising. Is there any reason you decided to add a new
The main reason was to include both of them in the benchmark for comparison purposes. I cannot think of any scenario where it would introduce issues or drawbacks.
@AnonymDevOSS In that case, would you have time to replace the old API with the new API today? I'm pushing towards
I pushed it now. Sorry, I read your message a bit too late.
Implements pipeline with bounded queues to overlap decode, compute and encode. Reduces I/O stalls.
Description
Process a video using a threaded pipeline that asynchronously
reads frames, applies a callback to each, and writes the results
to an output file.
This function implements a three-stage pipeline designed to maximize
frame throughput.
Reader thread: reads frames from disk into a bounded queue ('read_q')
until full, then blocks. This ensures we never load more than 'prefetch'
frames into memory at once.
Main thread: dequeues frames, applies the 'callback(frame, idx)',
and enqueues the processed result into 'write_q'.
This is the compute stage. It's important to note that it's not threaded,
so you can safely use any detectors, trackers, or other stateful objects
without synchronization issues.
Writer thread: dequeues frames and writes them to disk.
Both queues are bounded to enforce back-pressure: if one stage falls behind, its producer blocks instead of letting frames accumulate in memory without limit.
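This blocking behavior is exactly what Python's standard `queue.Queue(maxsize=...)` provides out of the box; a minimal illustration (not code from this PR):

```python
import queue

q = queue.Queue(maxsize=2)   # bounded queue, like read_q / write_q
q.put("frame-0")
q.put("frame-1")
print(q.full())              # True: a blocking put() would now wait for a get()

try:
    q.put_nowait("frame-2")  # non-blocking variant raises instead of waiting
except queue.Full:
    print("queue full, producer must wait")
```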
Summary:
It's thread-safe: because the callback runs only in the main thread, using a single stateful detector/tracker inside the callback does not require synchronization with the reader/writer threads.
While the main thread processes frame N, the reader is already decoding frame N+1,
and the writer is encoding frame N-1. They operate concurrently without blocking
each other.
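The three-stage design described above can be sketched as follows. This is a simplified, self-contained version with hypothetical names (`process_frames_threaded`, `_SENTINEL`); the real `process_video_threads` additionally handles OpenCV decode/encode from and to disk:

```python
import queue
import threading

_SENTINEL = object()  # marks end-of-stream on each queue

def process_frames_threaded(frames, callback, write, prefetch=8):
    """Three-stage pipeline: reader thread -> main-thread compute -> writer thread.

    frames:   iterable of frames (stands in for a video decoder)
    callback: callback(frame, idx) -> processed frame, runs on the main thread only
    write:    output sink for processed frames (stands in for an encoder)
    prefetch: bound on both queues, limiting in-flight frames (back-pressure)
    """
    read_q = queue.Queue(maxsize=prefetch)
    write_q = queue.Queue(maxsize=prefetch)

    def reader():
        for frame in frames:
            read_q.put(frame)        # blocks when read_q is full
        read_q.put(_SENTINEL)

    def writer():
        while True:
            item = write_q.get()
            if item is _SENTINEL:
                break
            write(item)

    t_read = threading.Thread(target=reader, daemon=True)
    t_write = threading.Thread(target=writer, daemon=True)
    t_read.start()
    t_write.start()

    # Compute stage: single-threaded, so stateful callbacks need no locking.
    idx = 0
    while True:
        frame = read_q.get()
        if frame is _SENTINEL:
            break
        write_q.put(callback(frame, idx))
        idx += 1

    write_q.put(_SENTINEL)
    t_read.join()
    t_write.join()
```

Because the callback only ever executes on the main thread, it can safely hold a stateful tracker while the reader and writer threads keep the I/O path busy.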
How has this change been tested? Please provide a testcase or example of how you tested the change.
I created a benchmark script to measure the performance impact of these changes. (benchmark_process_video.py;
full_results.txt)
I created 3 functions to benchmark: opencv (short), opencv (long), and tracker, running both the current process_video and the new process_video_threads.
Results below, 5 executions for each case:
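The benchmark script itself is attached to the PR rather than reproduced here, but a harness along these lines (hypothetical `bench` helper, not the actual benchmark_process_video.py) is enough to collect per-case timings over repeated runs:

```python
import time

def bench(fn, runs=5):
    """Run fn `runs` times; return (best, mean) wall-clock seconds."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times), sum(times) / len(times)

# Example: time a trivial workload over 5 runs.
best, mean = bench(lambda: sum(range(100_000)))
print(f"best={best:.4f}s mean={mean:.4f}s")
```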
Initially, I explored using threads and processes to parallelize process_video (I can push some of those prototypes if needed), but this design wasn’t thread-safe for stateful callbacks (e.g. trackers) and showed little improvement in profiling; most of the total time was spent on disk I/O rather than computation.
This optimization instead focuses on improving the I/O path, yielding a more generic and safe performance gain.