---
title: "Rayon, performance without knowing"
slug: rayon-performance-without-knowing
---

---

## Why Performance Is Hard

I wanted to speed things up in Rust, and let's be honest: *threads are one of the best tools for improving performance in Rust*. But the Tokio crate can be quite unintuitive and difficult to use, with all those `await`s and feature flags...

Here's where [rayon](https://docs.rs/rayon/latest/rayon/index.html) comes in: it lets us parallelize tasks without having to think about threads. It's simple to use, fast, lightweight, and it just works.

---

### A simple definition

Rayon is a library that helps you run code in parallel, making it easy to turn slow, step-by-step computations into faster ones that use multiple CPU cores.

It's a small and easy-to-use tool for adding parallelism. It makes sure your code runs safely without data races, and it only uses parallelism when it makes sense, depending on the amount of work available at runtime.

For example, we can simply turn this line:

```rust
let results: Vec<_> = data.iter().map(|x| x.do_something()).collect();
```

into:

```rust
use rayon::prelude::*;

let results: Vec<_> = data.par_iter().map(|x| x.do_something()).collect();
```

Importing the [prelude](https://docs.rs/rayon/latest/rayon/prelude/index.html) is the easiest way to bring Rayon's parallel iterator traits into scope.

---

## Let's break down performance

### Without rayon

I ran the following code, which iterates from 1 up to 1,000,000 (exclusive), computes the cube (`x.pow(3)`) and the square (`x.pow(2)`) of each number, takes the remainder of both results modulo `97,531`, and then sums the two remainders. I ran it with `cargo run`, without any optimizations:

`Finished dev profile [unoptimized + debuginfo] target(s) in 0.86s`

`Running target\debug\ry.exe`

`[2, 12, 36, 80, 150, 252, 392, 576, 810, 1100]`

These are the CPU specs:

* CPU name: Intel(R) microarchitecture code named Alderlake-S
* Frequency: 2.5 GHz
* Logical CPU count: 12

```rust
fn main() {
    let data: Vec<u64> = (1..1_000_000).collect();
    let results: Vec<u64> = data.iter()
        .map(|x| x.pow(3) % 97_531 + x.pow(2) % 97_531)
        .collect();
    println!("{:?}", &results[..10]);
}
```

I measured performance with the Intel VTune Profiler. Without rayon, the program needs 0.041s, running on a single thread.



The function that needs the most time is `main`, because that's where we iterate, compute, and collect the results.

---

### With rayon

The computation is the same as before, but this time we use rayon:

```rust
use rayon::prelude::*;

fn main() {
    let data: Vec<u64> = (1..1_000_000).collect();
    let results: Vec<u64> = data.par_iter()
        .map(|x| x.pow(3) % 97_531 + x.pow(2) % 97_531)
        .collect();
    println!("{:?}", &results[..10]);
}
```

after adding `rayon = "1.10.0"` to the `[dependencies]` section of your `Cargo.toml`.

I compiled without optimizations:

`Finished dev profile [unoptimized + debuginfo] target(s) in 0.02s`

`Running target\debug\ry.exe`

`[2, 12, 36, 80, 150, 252, 392, 576, 810, 1100]`

Note that the `Finished ... in 0.02s` line is the (cached) compile time, not the program's runtime, so it isn't directly comparable to the 0.86s build from before. For the actual runtime, let's look at the profiler in detail:



* First, we can see it uses 8 threads instead of just one

* It took 0.029s instead of 0.041s

* The CPU utilization histogram now sits mostly in the Idle bucket, instead of Poor as it was before


As before, almost all the effective time is spent in a single function, the last one called.


---

## When (and When Not) to Use Rayon

The ideal use cases are *CPU-bound work, large datasets, pure functions, sorting, etc.*

For *small workloads, shared mutable state, or I/O-heavy tasks*, the [Tokio](https://docs.rs/tokio/latest/tokio/) runtime is the better fit, if you really need it. [Tokio's modules](https://docs.rs/tokio/latest/tokio/#modules) cover `fs`, `time`, command execution, `net`, and a lot more on top of multithreading, but that's another topic I'll write about...

---

### Other stuff Rayon does

Beyond `par_iter()` and `map()`, Rayon also offers parallel adapters like `filter()`, `reduce()`, and `for_each()`, `join()` for fork-join task parallelism, and `par_sort()` for parallel sorting.

---

## To sum up

`Rayon` isn't always the best choice. Still, it's a **smart and safe way** to add parallelism. It helps you **scale workloads with minimal code changes**, making it a solid choice for performance-critical applications.

💡 Got another crate in mind?

## ☕ Was this helpful?

Treat me to a coffee on Ko-fi: [https://ko-fi.com/riccardoadami](https://ko-fi.com/riccardoadami)