Conversation

jbylicki

This PR implements parallelized partition finding in the extract_fa step. We have tested the speedup on proprietary designs that we can't share, and found significant improvements. Partition finding only needs to read from the global state, and thus requires little synchronization. We also tested parallelizing the later loops, but found no improvement there due to frequent writes.

Run times of the extract_fa step compared to main (all runs performed on 8 cores):

| branch | design | median extract_fa time [s] |
| --- | --- | --- |
| main | design-1 | 45.111 |
| extract-fa-parallelization | design-1 | 25.255 |
| main | design-2 | 0.813 |
| extract-fa-parallelization | design-2 | 0.605 |
| main | design-3 | 17.464 |
| extract-fa-parallelization | design-3 | 10.008 |
| main | design-4 | 101.922 |
| extract-fa-parallelization | design-4 | 56.426 |
| main | design-5 | 67.834 |
| extract-fa-parallelization | design-5 | 40.071 |
| main | design-6 | 32.499 |
| extract-fa-parallelization | design-6 | 22.378 |
| main | design-7 | 80.202 |
| extract-fa-parallelization | design-7 | 38.357 |
| main | design-8 | 533.619 |
| extract-fa-parallelization | design-8 | 256.580 |

ShinyKate assigned widlarizer and jix and unassigned widlarizer on Sep 22, 2025
```cpp
count_func2 = 0;
count_func3 = 0;
if (config.verbose)
	log("  checking %s\n", log_signal(it.first));
```
Member

Even read-only access to RTLIL data structures currently isn't thread safe. There are many more places where worker-thread RTLIL access happens, but I picked this as the most obvious one. Until we are able to change that, only the main thread may access RTLIL. To keep this maintainable, we also require that this is made obvious by not handing RTLIL references to code running on worker threads in the first place. See #5266 (comment) for a recent discussion of the requirements for adding multi-threaded code to Yosys, and the corresponding PR for an example of what is currently possible.

I also think if we are adding multi-threading, we should prefer using work queues to dynamically balance the workload instead of statically splitting it like this PR currently does. The PR I linked to introduces some primitives for this.
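As a rough illustration of the difference (this is not the primitives from the linked PR, just a generic sketch): with a work queue, each worker repeatedly claims the next unprocessed item from a shared atomic counter, so faster threads naturally pick up more items than slower ones, instead of each thread being bound to a fixed slice up front.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative dynamic load balancing: workers pull the next unclaimed
// index from a shared atomic counter. `fn` is applied to each item exactly
// once; it must not touch shared mutable state without its own locking.
template <typename Item, typename Fn>
void for_each_dynamic(const std::vector<Item> &items, int num_threads, Fn fn)
{
	std::atomic<size_t> next{0};
	std::vector<std::thread> workers;

	for (int t = 0; t < num_threads; t++) {
		workers.emplace_back([&]() {
			// Claim items one at a time until the queue is exhausted.
			for (size_t i = next.fetch_add(1); i < items.size(); i = next.fetch_add(1))
				fn(items[i]);
		});
	}
	for (auto &w : workers)
		w.join();
}
```

In practice one would claim small batches rather than single items to reduce contention on the counter, but the balancing principle is the same.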

```cpp
	pool<tuple<tuple<SigBit, SigBit, SigBit>, int, SigBit>> tl_func_3;
};

std::mutex consteval_mtx;
```
Member

The declaration of this mutex is far removed from the data that it actually protects. If raw mutexes are used at all, they should be declared right next to the data they guard.

When using a mutex like this, there is also nothing preventing or even hinting at an issue when introducing new ce accesses that are not protected by the appropriate lock guard. This makes it way too easy to introduce bugs that can be very hard to debug. For that reason I'm inclined to require use of higher level primitives within passes.

We could e.g. add our own Mutex<T> that combines a std::mutex mutex and a T value and only provides access via a lock method that passes out our own MutexGuard<T> that combines a std::lock_guard guard and a T &value ensuring that you only get to access the shared value while you hold the lock. (Unless you explicitly store a reference elsewhere, but that's always a hazard and not specific to multi-threading, making it somewhat easier to spot.) This is more or less the same API that Rust provides but of course there's nothing that stops this approach from being implemented in C++. The first example I found is folly's Synchronized, which also goes into a bit more detail motivating the use of this API over what std::mutex provides.
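A minimal sketch of the `Mutex<T>`/`MutexGuard<T>` idea described above (the names follow this comment's proposal; this is not existing Yosys code, and requires C++17 for the guaranteed copy elision when returning the guard):

```cpp
#include <mutex>
#include <utility>

template <typename T> class MutexGuard;

// Owns both the std::mutex and the protected value, so the value cannot
// be reached without going through lock().
template <typename T>
class Mutex {
public:
	template <typename... Args>
	explicit Mutex(Args &&...args) : value_(std::forward<Args>(args)...) {}

	MutexGuard<T> lock() { return MutexGuard<T>(*this); }

private:
	friend class MutexGuard<T>;
	std::mutex mutex_;
	T value_;
};

// Combines a std::lock_guard with a reference to the value: the shared
// value is only accessible while the guard (and thus the lock) is alive.
template <typename T>
class MutexGuard {
public:
	explicit MutexGuard(Mutex<T> &m) : guard_(m.mutex_), value_(m.value_) {}

	T &operator*() { return value_; }
	T *operator->() { return &value_; }

private:
	std::lock_guard<std::mutex> guard_;
	T &value_;
};
```

Usage would look like `auto guard = shared_state.lock(); guard->update();`, where the lock is released when `guard` goes out of scope; the unguarded-access hazard the raw `consteval_mtx` leaves open simply doesn't compile here.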
