Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Understanding C-Vise Performance with multiple cores and Comparison with C-Reduce #123

Open
pramodk opened this issue Jan 7, 2024 · 7 comments

Comments

@pramodk
Copy link

pramodk commented Jan 7, 2024

Hello Martin,

Firstly, big thanks for putting together and keeping up with this invaluable tool 🙌 -- it's been very helpful for us, and I'm sure for many others, when it comes to sorting out compiler bugs in codebases!

This is not a bug/issue but just reaching out to get a better handle on C-Vise and make sure we're getting the most out of it.

Just a bit of background: we started with the C-Reduce to tackle some compiler bugs in our project. But, as some reductions were taking hours n hours, we tried throwing in multiple threads/cores to speed things up. But the execution time didn't show major improvements. Then we stumbled onto C-Vise, wondering if it would run faster. But nope, similar behaviour— no big drop in execution time even with multiple cores.

I've read through the discussions in #41 and #114, also details from John Regehr's blog post about Parallel Test-Case Reduction in C-Reduce. I get the gist of the constraints from the algorithm/parallelization strategy, but I thought I could still ask for more clarity on below points:

  1. Both C-Reduce and C-Vise use clang_delta, with C-Vise's top layer in Python and C-Reduce's in Perl. All good! I am curious how the parallelization scheme in C-Vise differs from C-Reduce. What makes (or is supposed to make) C-Vise more efficient than C-Reduce? (i.e. what is a Super-parallel aspect in C-Vise?)

  2. To make sure my setup is right, I wanted to see if I can hit the performance baseline in the README (instead of our internal test case):

    Test-case Size C-Vise Reduction C-Reduce Reduction Speed Up
    PR92516 6.5 MB 35m 77m 220%

So I did the following:

  • On Ubuntu 22.4 box, I installed C-Vise (2.4) and C-Reduce (2.10)
  • I installed GCC commit r278246 mentioned in the Bug 92516
  • The Ubuntu box has an Intel Core i9-12900K CPU and it’s mostly idle.
  • The interestingness test and the driver script looked like below:
$ cat bug.sh

#!/bin/bash

compile_output=$(~/install/gcc-r278246/bin/g++ -O3 bug567.cc -std=c++1z 2>&1)

if [[ $compile_output == *"internal compiler error: in vect_schedule_slp_instance"* ]]; then
    echo "Still fails for vectorisation with -O3!"
    exit 0
else
    exit 1
fi


$ cat driver.sh

for ncores in 16 12 8 6 4 2 1;
do
    start_time=$SECONDS
    taskset -c 0-$((ncores - 1)) cvise --timestamp --n $ncores bug.sh bug567.cc    # taskset or numactl to pin the threads 
    elapsed_time=$((SECONDS - start_time))
    echo "==>  --n $ncores took: $elapsed_time seconds"
done

and a similar one for C-Reduce.

  • The timing from the above execution of C-Vise and C-Reduce look like below:
#threads C-Reduce (seconds) C-Vise (seconds)
1 5161 3916
2 4680 4348
4 3557 3624
6 3387 3446
8 3834 3406
12 4798 3424
16 5785 3579

compare-creduce-cvise

i.e.

  • C-Vise runs more faster with ncores == 1, 12 and 16. But, if we look at the fastest execution then the C-Reduce and C-Vise are quite close. (I repeated the runs a few times)
  • Another aspect I noticed, in the absence of --n CLI option, the C-Reduce chooses the number of cores 4 whereas C-Vise chooses 16. This means in the default execution (without choosing an appropriate number of threads) I haven’t seen better runtimes using C-Vise than C-Reduce (i.e. C-Reduce with 4 cores is a bit better than C-Vise with 16 cores). And also, with the cpu I have, I don’t think there are any numa issues.

I'm not deep into these tools yet, especially internal implementation-specific details. Just posting my observations out here, hoping you can tell if this is more or less expected or if there's something off in my analysis.

Thanks again for all your work and support!

@pramodk
Copy link
Author

pramodk commented Jan 13, 2024

I was playing a bit more with another example from our internal project using the NVHPC compiler. I used similar benchmarking scripts mentioned above. Simply looking at the htop, all CPUs are used as expected:

image

And the the comparison look like below:

image

i.e.

  • For this specific example, both CVise and C-Reduce didn't improve reduction time with more cores
  • CVise is ~ 1.9x slower than C-Reduce when ncores == 1 (and ~1.3x with ncores == 16).

@marxin
Copy link
Owner

marxin commented Jan 13, 2024

Hey!

Firstly, big thanks for putting together and keeping up with this invaluable tool 🙌 -- it's been very helpful for us, and I'm sure for many others, when it comes to sorting out compiler bugs in codebases!

Thanks, I appreciate any happy user of my tool. To be honest, as I left SUSE some time ago (where I used C-Vise on daily basis as the GCC developer), my interest in the project is not so high. But I'm happy to help you and hopefully improve your use-cases.

What makes (or is supposed to make) C-Vise more efficient than C-Reduce? (i.e. what is a Super-parallel aspect in C-Vise?)

Well, the biggest difference that helped me a lot is that C-Vise can run clang_delta in parallel (compared to C-Reduce) which might lead to a reduction speed-up in cases where clang_delta takes a significant amount of time (modern C++ code snippets). Speaking about 92516, the time is dominated by the compiler, where for the unreduced test-case, it takes ~7s while the clang_delta is only 1s. Apart from that, recently I added to C-Vise that the initial 2 passes flip each other after N=30 successful transformations:

00:00:29 INFO ===< ClangBinarySearchPass::replace-function-def-with-decl (30 T) >===
00:00:33 INFO using C++ standard: c++2b with 11886 transformation opportunities
00:01:02 INFO (10.1%, 6126558 bytes, 140780 lines)
...
00:04:42 INFO skipping after 30 successful transformations

Looking again at the aforementioned test-case, it takes 2800s with 12 threads, while the single-threaded mode takes 3800s, so yes, the speed up is negligible.

I've made a quick analysis and I think it's caused by the algorithm itself: it finds a first successful state (a BinaryState is used in case of ClangDelta passes), that reduces the test-case. Then it waits for all previous states to finish and once done, it makes the reduction based on that state. However, it seems the number of failing states in between these succ. state is quite small (in later passes), and they typically fail fast (a compilation error + -Wfatal-errors).

The second problem (similar to one I addressed with N=3 succ. transforms): some passes make small improvements and it would be better to move to another pass (try 'S' keystroke during the run to make a human intervention):

00:13:33 INFO (70.3%, 2024657 bytes, 62746 lines)
00:13:35 INFO (70.3%, 2024240 bytes, 62741 lines)
00:13:36 INFO (70.3%, 2024004 bytes, 62736 lines)
...
00:18:45 INFO (70.8%, 1986991 bytes, 61990 lines)
00:18:46 INFO (70.8%, 1986960 bytes, 61989 lines)
00:18:48 INFO (70.8%, 1986934 bytes, 61988 lines)
00:18:49 INFO (70.8%, 1986864 bytes, 61986 lines)
00:18:49 INFO ===< LinesPass::0 >===
00:18:52 INFO (70.9%, 1979778 bytes, 3329 lines)
00:18:54 INFO (71.0%, 1975236 bytes, 3219 lines)
00:18:55 INFO (71.4%, 1946537 bytes, 3109 lines)
00:18:56 INFO (71.5%, 1941087 bytes, 2999 lines)
00:18:57 INFO (71.6%, 1936804 bytes, 2944 lines)
00:18:58 INFO (71.7%, 1930422 bytes, 2889 lines)
00:19:00 INFO (71.7%, 1924734 bytes, 2834 lines)
00:19:01 INFO (76.9%, 1570311 bytes, 2779 lines)
00:19:02 INFO (77.0%, 1567687 bytes, 2724 lines)
00:19:04 INFO (77.0%, 1566062 bytes, 2697 lines)
00:19:05 INFO (77.0%, 1565166 bytes, 2670 lines)
00:19:06 INFO (77.0%, 1564674 bytes, 2643 lines)

as seen, 0.5% improvement was achieved in 5 minutes, while the next pass then makes a much bigger leap in a few seconds.
One might implement a more clever algorithm that will switch to another pass after the improvement speed is low.

One might also experiment with the pass order a bit where I can imagine LinesPass being run for a limited number of iterations in between the initial clang_delta passes.

Anyway, are you willing to invest some time in the C-Vise project where you could experiment with the ideas that can potentially improve your HPC test-cases? Any chance you can upload a real unreduced test-case I can play with? And how many test-cases do you reduce in some period of time?

Again, appreciate your interest and really valuable feedback for the project.

@marxin
Copy link
Owner

marxin commented Jan 13, 2024

Log files for PR 92516:
threads-1.txt
threads-24.txt

@marxin
Copy link
Owner

marxin commented Feb 20, 2024

@pramodk Just a friendly reminder. I spent quite some time working with C-Vise and C-Reduce, thus I would like to leverage your very useful experience.

@dufferzafar
Copy link

dufferzafar commented Apr 1, 2024

@marxin Like Pramod, I've been going over the performance characteristics of CVise & CReduce as well.

But, my use case is --not-c so I'm trying it out with other languages & my own interesting.sh script.

Could you shed some light on how these tools perform in such cases? Should I expect CVise parallelism to perform better?

@marxin
Copy link
Owner

marxin commented Apr 2, 2024

other languages

What languages do you speak about?

Generally speaking, you can run just cvise-delta.py that performs line-by-line removal and it might be faster than what C-Reduce does. Please test it.

@lanza
Copy link

lanza commented May 11, 2024

For an extra data point -- running with -n166 cvise grinds to a halt. htop showing nothing happening in any cvise processes for long periods of time. Drop it down to 6 threads and it drastically speeds up. Very strange behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants