
Conversation

@alexandrebouchard
Member

No description provided.

@miguelbiron
Collaborator

I was thinking that, instead of / in addition to creating new artificial tests, we could just grab the timing information in the DefaultTestSet structure. We could either do this at the top level only or recursively for every single testset.
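
Something like this could do the recursive version, assuming Julia ≥ 1.8 where DefaultTestSet carries time_start/time_end fields; the helper name and exact field names are just a sketch and may need adjusting per Julia version:

```julia
using Test

# Recursively collect wall-clock time per (nested) testset from a finished
# DefaultTestSet. Relies on the `time_start`/`time_end` fields added in Julia 1.8.
function collect_testset_times!(out::Dict{String,Float64}, ts::Test.DefaultTestSet, prefix::String = "")
    name = isempty(prefix) ? ts.description : string(prefix, "/", ts.description)
    if ts.time_end !== nothing
        out[name] = ts.time_end - ts.time_start
    end
    for r in ts.results
        # nested testsets are stored among the results of their parent
        r isa Test.DefaultTestSet && collect_testset_times!(out, r, name)
    end
    return out
end

# Usage: `ts = @testset "Pigeons" begin ... end` returns the DefaultTestSet;
# `collect_testset_times!(Dict{String,Float64}(), ts)` then maps name => seconds.
```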

@trevorcampbell
Collaborator

@miguelbiron do you mean that we'd collect/report benchmarking results for the tests in our test suite? I'm not sure there will be a tonne of overlap between the things we want to benchmark and the things we want to use for unit testing.

@miguelbiron
Collaborator

That's true... Also there's no alloc data in the testset.

@trevorcampbell
Collaborator

trevorcampbell commented Mar 29, 2025

@miguelbiron @alexandrebouchard I've added the prototype code for the workflows. A brief explanation:

  • the current version of the benchmarking results for main is always kept up to date in test/benchmark.csv
  • any time there is a commit to main, benchmark_update.yml runs, refreshes the benchmarking results, and commits them back to main.
  • any time there is a PR targeting main, benchmark_compare.yml runs on the PR and compares the benchmarking results stored in the CSV file against those produced by the new PR code. The results are posted to the PR thread as a markdown table in a comment. Whenever the PR is updated, the comparison re-runs and the markdown table is updated.
  • both workflows have a 60 min timeout
  • test/benchmark.jl is responsible for running the benchmark suite (I made a minor modification to the format of the CSV)
  • test/compare_benchmarks.jl is responsible for comparing two benchmarking CSVs and outputting the difference as a markdown table (a minimal sketch of that step is shown after this list)
  • Right now this runs on a GitHub-hosted runner, but with a small (currently commented-out) change we can make it self-hosted once we are ready to spin up the runner
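
Not the actual compare_benchmarks.jl, just a minimal sketch of the comparison step it performs, assuming CSV.jl and DataFrames.jl and hypothetical column names test and time_s in both CSVs:

```julia
# Hedged sketch of the CSV comparison step; the column names `test` and `time_s`
# are assumptions, not necessarily what test/benchmark.jl writes.
using CSV, DataFrames

function benchmark_diff_markdown(base_csv::AbstractString, pr_csv::AbstractString)
    base = CSV.read(base_csv, DataFrame)
    pr   = CSV.read(pr_csv, DataFrame)
    # suffix the non-key columns so both timings survive the join
    joined = innerjoin(base, pr; on = :test, renamecols = "_base" => "_pr")
    lines = ["| test | main (s) | PR (s) | change |",
             "|------|----------|--------|--------|"]
    for row in eachrow(joined)
        rel = 100 * (row.time_s_pr - row.time_s_base) / row.time_s_base
        push!(lines, "| $(row.test) | $(round(row.time_s_base; digits = 3)) | " *
                     "$(round(row.time_s_pr; digits = 3)) | $(round(rel; digits = 1))% |")
    end
    return join(lines, "\n")
end

# e.g. println(benchmark_diff_markdown("benchmark_main.csv", "benchmark_pr.csv"))
```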

One issue we should fix before merging: both benchmark.jl and compare_benchmarks.jl trigger Julia to download/install/build all kinds of packages in the workflow. The benchmarking itself is almost instantaneous, but the package building/installation is super slow (especially for compare_benchmarks.jl, it's around 15-20 minutes).

I am not entirely sure how we do environment setup in the benchmark.jl script, but I just copied it for compare_benchmarks.jl -- I imagine we can probably be smarter about that somehow? Thoughts?

@trevorcampbell
Collaborator

I also had an idea for something neat we could do: include a plot of benchmarking results as a function of time in the Pigeons documentation.

This command:

git log --pretty=format:"%H" --follow -- test/benchmark.csv

will spit out a list of git commit hashes corresponding to commits where test/benchmark.csv was changed. When we build the docs, we can iterate through those, collect all the result CSVs, join them, create PlotlyJS plots for each test, and output an HTML file that is linked from the docs (rough sketch below).
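
Rough sketch of the collection step (assumes CSV.jl/DataFrames.jl; the PlotlyJS/HTML part is left out, and the error handling is just a guess at what we'd need for commits where the file isn't readable):

```julia
# Sketch: rebuild the benchmark history by reading test/benchmark.csv at every
# commit that touched it, tagging each row with its commit hash.
using CSV, DataFrames

csv_path = "test/benchmark.csv"
hashes = readlines(`git log --pretty=format:"%H" --follow -- $csv_path`)

history = DataFrame()
for h in hashes
    content = try
        read(`git show $h:$csv_path`, String)
    catch
        nothing   # file may be missing or unreadable at some commits
    end
    content === nothing && continue
    df = CSV.read(IOBuffer(content), DataFrame)
    df[!, :commit] .= h
    append!(history, df; cols = :union)
end
# `history` now has one row per (commit, benchmark); feed it to PlotlyJS to plot
# each benchmark's timing over time and write the HTML page linked from the docs.
```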

Thoughts?

@trevorcampbell
Collaborator

Ah, one more thing @alexandrebouchard: we should probably modify test/benchmark.jl to run more than one trial for each benchmark and report the median / 25th / 75th percentiles, or something like that.

(At least on mvn1000 it seems the results regularly vary by around 10% for a single trial.)
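
Something along these lines, maybe (Statistics is stdlib; run_target and trials = 5 are placeholders for whatever each benchmark actually times):

```julia
using Statistics

# Run one benchmark target several times and summarize with the median and
# 25th/75th percentiles; `run_target` is a stand-in for the actual work.
function benchmark_quantiles(run_target; trials::Int = 5)
    times = [@elapsed run_target() for _ in 1:trials]
    return (median = median(times),
            q25    = quantile(times, 0.25),
            q75    = quantile(times, 0.75))
end

# e.g., with `using Pigeons`:
# benchmark_quantiles(() -> pigeons(target = toy_mvn_target(1000)))
```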

@miguelbiron
Collaborator

Ditto the last comment; also, we probably need some basic metadata about the environment. At the very least, we should store the exact Julia version used.
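
For example, something like this could be written out with each run (stdlib only; the field names are illustrative, not a fixed schema):

```julia
using Dates, Pkg

# Basic environment metadata to store next to the benchmark results.
pigeons_version = only(d.version for d in values(Pkg.dependencies()) if d.name == "Pigeons")

metadata = (
    julia_version   = string(VERSION),
    pigeons_version = string(pigeons_version),
    os              = string(Sys.KERNEL),
    nthreads        = Threads.nthreads(),
    timestamp       = string(now(UTC)),
)
```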

@codecov

codecov bot commented Mar 30, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.82%. Comparing base (e492fa7) to head (c837962).
⚠️ Report is 34 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #333      +/-   ##
==========================================
+ Coverage   87.40%   87.82%   +0.42%     
==========================================
  Files         107      107              
  Lines        2660     2654       -6     
==========================================
+ Hits         2325     2331       +6     
+ Misses        335      323      -12     

☔ View full report in Codecov by Sentry.

@miguelbiron
Collaborator

@alexandrebouchard @trevorcampbell the failing docs build is due to the old non-reproducible errors that seem to have been fixed with the new DPPL version. We should just merge this PR so that we can then merge #328, which doesn't suffer from this.

@trevorcampbell
Collaborator

trevorcampbell commented Apr 18, 2025

OK I think I'm giving up on reducing precompilation time for now. Let's just push that off to a later PR and get this one sorted.

Seems like the last TODOs here are:

  • include basic metadata in the results CSV
  • run each benchmark more than once to average and get estimates of the standard error
  • store the CI console log and Manifest files somewhere

@trevorcampbell
Collaborator

OK @alexandrebouchard this should be all sorted now. Example PR thread message with diff below:

[screenshot: example PR comment with the benchmark diff table]

@trevorcampbell
Collaborator

trevorcampbell commented Aug 16, 2025

Ah, actually @alexandrebouchard, one thing I'd like your eye on before we merge is how packages are activated/installed/used in benchmark.jl and compare_benchmarks.jl.

You'll notice the `using Pigeons` call takes 25s (previously it was 2s) according to the benchmark, which seems off to me (maybe that includes some precompilation, not sure -- you have a better grasp of how Julia manages packages and the Pigeons ecosystem than I do).

But once that's sorted we can merge, and I'll get the runner set up properly for this repo.

@trevorcampbell
Collaborator

trevorcampbell commented Aug 18, 2025

Improved the look & feel of the diff table and included the metadata in the diff:

[screenshot: updated diff table including environment metadata]

@alexandrebouchard
Member Author

This looks really great and will be super useful. I'll add more targets very shortly; I'm building a collection.

@trevorcampbell
Collaborator

@alexandrebouchard FYI I replaced versioninfo(verbose=true) with just versioninfo() to keep the noisy/unreliable CPU measurements from polluting the diffs. It still has almost all the information we'd want (except memory, but c'est la vie).
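
(For reference, the kind of capture I mean, assuming we keep storing it as plain text in the metadata; InteractiveUtils is needed outside the REPL:)

```julia
using InteractiveUtils

# Capture the non-verbose versioninfo() output as a string so it can be stored
# with the benchmark results. versioninfo(verbose = true) additionally prints
# per-core CPU details and memory, which vary between CI runs and were
# polluting the diffs.
env_info = sprint(versioninfo)
```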
