Understanding C-Vise Performance with multiple cores and Comparison with C-Reduce #123
Hey!
Thanks! I appreciate any happy user of my tool. To be honest, since I left SUSE some time ago (where I used C-Vise on a daily basis as a GCC developer), my interest in the project is not as high as it used to be. But I'm happy to help you and hopefully improve your use-cases.
Well, the biggest difference, and the one that helped me a lot, is that C-Vise can run clang_delta itself in parallel (unlike C-Reduce), which can speed up reductions in cases where clang_delta takes a significant amount of time (modern C++ code snippets). Speaking about PR 92516, the time is dominated by the compiler: for the unreduced test-case the compiler takes ~7 s while clang_delta takes only about 1 s. Apart from that, I recently changed C-Vise so that the initial two passes flip between each other after N=30 successful transformations.
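Very roughly, the flip idea looks like the following. This is only a minimal Python sketch of the scheduling heuristic; the pass callables, the toy "interestingness" condition, and the demo are made up for illustration and are not C-Vise's actual implementation.

```python
# Minimal sketch of "flip the two initial passes after N successful
# transformations". Purely illustrative; not C-Vise's real code.

def reduce_with_flipping(test_case, pass_a, pass_b, flip_threshold=30):
    """Alternate between two reduction passes.

    Each pass is modelled as a callable that returns a smaller, still
    "interesting" test case, or None when it cannot improve the input.
    After `flip_threshold` consecutive successes control flips to the
    other pass so one pass does not monopolize the run.
    """
    current, other = pass_a, pass_b
    successes = 0
    stalled = 0  # how many passes in a row made no progress
    while stalled < 2:
        improved = current(test_case)
        if improved is not None:
            test_case = improved
            successes += 1
            stalled = 0
            if successes < flip_threshold:
                continue        # keep going with the current pass
        else:
            stalled += 1
        # Either the pass hit the success threshold or it is exhausted:
        # hand control to the other pass.
        current, other, successes = other, current, 0
    return test_case


if __name__ == "__main__":
    # Toy demo: "interesting" means the string still contains an 'x'.
    def drop_front(s):
        return s[1:] if len(s) > 1 and "x" in s[1:] else None

    def drop_back(s):
        return s[:-1] if len(s) > 1 and "x" in s[:-1] else None

    print(reduce_with_flipping("aaaax bbbb" * 10, drop_front, drop_back, flip_threshold=5))
```

In a real reduction the passes are of course far more involved, but the scheduling effect is the same: a pass that keeps producing small wins no longer blocks the other pass indefinitely.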
Looking again at the aforementioned test-case, it takes 2800 s with 12 threads, while the single-threaded mode takes 3800 s, so yes, the speed-up is small. I've made a quick analysis and I think it's caused by the algorithm itself: it finds the first successful state (a BinaryState in the case of the ClangDelta passes) that reduces the test-case, waits for all previously started states to finish, and only then commits the reduction based on that state (see the sketch below). However, the number of failing states in between these successful states is quite small (in the later passes), and they typically fail fast (a compilation error plus -Wfatal-errors), so there is little work to parallelize. The second problem (similar to the one I addressed with the flip after N successful transformations mentioned above): some passes only make small improvements and it would be better to move on to another pass (try the 'S' keystroke during a run to intervene manually).
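To make the first problem easier to see, here is a rough Python sketch of the parallel evaluation as I described it above. It is a hypothetical illustration only: `parallel_step`, the candidate list, and the toy interestingness test are stand-ins, not C-Vise's real internals.

```python
# Hypothetical illustration of the parallel state evaluation described above;
# not C-Vise's actual code.
from concurrent.futures import ProcessPoolExecutor


def parallel_step(candidates, is_interesting, jobs=12):
    """Evaluate candidate reductions in parallel, commit the earliest success.

    `candidates` are already-built variants of the test case (in C-Vise these
    come from successive states of a pass, e.g. a BinaryState for the
    clang_delta passes). Even if a later candidate finishes first, the result
    must wait for every job started before it, so that the outcome does not
    depend on scheduling.
    """
    with ProcessPoolExecutor(max_workers=jobs) as pool:
        futures = [pool.submit(is_interesting, c) for c in candidates]
        for candidate, future in zip(candidates, futures):
            # result() blocks, so earlier-submitted jobs are consumed first.
            # When the failing candidates in between are few and fail fast,
            # the workers sit idle most of the time and extra cores buy little.
            if future.result():
                return candidate
    return None


def still_reproduces(variant):
    # Toy interestingness test: the variant still contains the "bug".
    return "bug" in variant


if __name__ == "__main__":
    variants = ["int main() {}", "bug; int main() {}", "bug"]
    print(parallel_step(variants, still_reproduces, jobs=3))
```

The key point is the blocking `future.result()` loop: reductions are committed in submission order, so parallelism only helps when there are many slow, failing candidates between successes.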
To give a concrete example of the second problem: in one run on that test-case, a 0.5% improvement was achieved in 5 minutes, while the next pass then made a much bigger leap in a few seconds. One might also experiment with the pass order a bit; I can imagine LinesPass being run for a limited number of iterations in between the initial clang_delta passes (a rough sketch of the kind of scheduling I have in mind follows at the end of this comment). Anyway, are you willing to invest some time in the C-Vise project? You could experiment with ideas that can potentially improve your HPC test-cases. Any chance you can upload a real unreduced test-case I can play with? And how many test-cases do you reduce over a given period of time? Again, I appreciate your interest and the really valuable feedback for the project.
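To make that pass-order idea a bit more concrete, here is a purely hypothetical sketch of the kind of scheduling experiment I mean: abandon a pass whose recent improvement rate is too low, and interleave a bounded shot of a line-based pass between the clang_delta passes. The function names, thresholds, and the shape of the pass callables are all made up; this is not how C-Vise is structured internally.

```python
# Purely hypothetical scheduling sketch; pass names and thresholds are made up.
import time


def run_pass_with_bailout(pass_fn, test_case, min_rate=0.01, window=300.0):
    """Run `pass_fn` repeatedly, but give up when it shrinks the test case by
    less than `min_rate` (fractional size reduction) over `window` seconds.

    `pass_fn` returns a smaller interesting variant, or None when exhausted.
    """
    window_start = time.monotonic()
    size_at_window_start = len(test_case)
    while True:
        improved = pass_fn(test_case)
        if improved is None:
            return test_case
        test_case = improved
        if time.monotonic() - window_start >= window:
            rate = 1.0 - len(test_case) / max(size_at_window_start, 1)
            if rate < min_rate:
                # e.g. only 0.5% in 5 minutes: better to move to another pass.
                return test_case
            window_start = time.monotonic()
            size_at_window_start = len(test_case)


def interleaved_schedule(test_case, clang_delta_passes, lines_pass):
    """Alternate each clang_delta pass with a bounded shot of the line pass."""
    for cd_pass in clang_delta_passes:
        test_case = run_pass_with_bailout(cd_pass, test_case)
        test_case = run_pass_with_bailout(lines_pass, test_case, window=60.0)
    return test_case
```

Whether thresholds like these actually help on real HPC test-cases is exactly the kind of thing that needs experimentation.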
Log files for PR 92516:
@pramodk Just a friendly reminder. Since you've spent quite some time working with C-Vise and C-Reduce, I would like to leverage your very useful experience.
@marxin Like Pramod, I've been going over the performance characteristics of C-Vise & C-Reduce as well, but my use case is … Could you shed some light on how these tools perform in such cases? Should I expect C-Vise's parallelism to perform better?
Which languages are you talking about? Generally speaking, you can run just …
For an extra data point: running with -n166, cvise grinds to a halt, with htop showing nothing happening in any cvise process for long periods of time. Dropping it down to 6 threads drastically speeds things up. Very strange behavior.
Hello Martin,
Firstly, big thanks for putting together and maintaining this invaluable tool 🙌 -- it's been very helpful for us, and I'm sure for many others, when it comes to sorting out compiler bugs in our codebases!
This is not a bug report; I'm just reaching out to get a better handle on C-Vise and to make sure we're getting the most out of it.
Just a bit of background: we started with C-Reduce to tackle some compiler bugs in our project. As some reductions were taking many hours, we tried throwing multiple threads/cores at them to speed things up, but the execution time didn't improve much. Then we stumbled onto C-Vise, wondering if it would run faster. Unfortunately we saw similar behaviour: no big drop in execution time even with multiple cores.
I've read through the discussions in #41 and #114, as well as John Regehr's blog post about parallel test-case reduction in C-Reduce. I get the gist of the constraints imposed by the algorithm and parallelization strategy, but I thought I would still ask for more clarity on the points below:
Both C-Reduce and C-Vise use clang_delta, with C-Vise's top layer in Python and C-Reduce's in Perl. All good! I am curious how the parallelization scheme in C-Vise differs from C-Reduce's. What makes (or is supposed to make) C-Vise more efficient than C-Reduce, i.e. what is the "super-parallel" aspect of C-Vise?
To make sure my setup is right, I wanted to see if I could hit the performance baseline in the README (instead of using our internal test case).
So I ran the reduction from the README with C-Vise, and a similar run with C-Reduce, without passing the --n CLI option to either. In that case, C-Reduce chooses 4 cores whereas C-Vise chooses 16. Even so, in this default execution (without explicitly choosing the number of threads) I haven't seen better runtimes with C-Vise than with C-Reduce; in fact, C-Reduce with 4 cores is a bit faster than C-Vise with 16 cores. Also, with the CPU I have, I don't think there are any NUMA issues.
I'm not deep into these tools yet, especially the internal implementation details. I'm just posting my observations here, hoping you can tell me whether this is more or less expected or whether there's something off in my analysis.
Thanks again for all your work and support!