[WIP] Fast isValidCell #496

ajfriend · 2021-07-05T22:17:05Z

In working on a new compactCells implementation, isValidCell was showing up as a bottleneck in some of the benchmarks, so I started playing around with optimizing it.

This is still experimental (e.g., I'm still using 7 instead of the INVALID_DIGIT macro just because I find that easier for prototyping).

Putting this up now as a draft PR to get some early feedback, to take notes on what I've tried so far, and to discuss different implementation options.

Current benchmarks have this new implementation taking about 60% of the time of the current implementation. Note that the benchmarks can depend on the resolution of the cells; low resolution cells tend to go faster because the last INVALID_DIGIT check can avoid more looping (I think...).

coveralls · 2021-07-05T22:19:39Z

Coverage decreased (-2.5%) to 96.201% when pulling 5a70747 on ajfriend:fast_isValidCell into 42f56e3 on uber:master.

ajfriend · 2021-07-05T22:32:28Z

src/h3lib/lib/h3Index.c

+    // The 4 mode bits should be 0b0001 (H3_CELL_MODE)
+    // The 3 reserved bits should be 0b000
+    // In total, the top 8 bits should be 0b00001000
+    if (GT(h, 8) != 0b00001000) return 0;


Binary 0b literals might not be portable, in which case we can always switch to hex. Binary is a bit easier for prototyping, tho :)

Alternative implementation:
(also, could use #defines for the bit lengths here)

if (GT(h, 1) != 0) return 0;
h <<= 1;
if (GT(h, 4) != 1) return 0;
h <<= 4;
if (GT(h, 3) != 0) return 0;
h <<= 3;

This alternative might be clearer, and Clang is even able to compile each version to the same instructions. However, GCC doesn't seem to produce the same for the longer version: https://godbolt.org/z/sjKed8jn1

(Even though the last shift in each example is optimized out to a no-op, it still gives you the general idea.)
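For reference, the portable hex form of the combined check mentioned above could look roughly like this. It is only a sketch: the helper name `has_valid_header` is made up for illustration and is not part of the PR, and it assumes the layout described in the diff comments (1 high bit, 4 mode bits, 3 reserved bits).

```c
#include <stdint.h>

// Hypothetical helper (not from the PR): returns 1 when the top 8 bits
// are 0 (high bit), 0001 (cell mode), 000 (reserved), i.e. 0x08.
static int has_valid_header(uint64_t h) {
    // 0x08 is the hex spelling of 0b00001000, so no binary literal is needed.
    return (h >> 56) == 0x08;
}
```

Calling it on `0x08c2bae305336bffULL` should return 1, while flipping any of the top 8 bits should return 0.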

ajfriend · 2021-07-05T22:33:45Z

src/h3lib/lib/h3Index.c

+    // The first nonzero digit can't be a 1 (i.e., "deleted subsequence",
+    // PENTAGON_SKIPPED_DIGIT, or K_AXES_DIGIT).
+    // Test for pentagon base cell first to avoid this loop if possible.
+    if (isBaseCellPentagonArr[bc]) {


Using this lookup table seems to be a bit faster than calling _isBaseCellPentagon(bc).

Alternative:

int _isBaseCellPentagon(int baseCell) {
    switch (baseCell) {
        case 4: case 14: case 24: case 38:
        case 49: case 58: case 63: case 72:
        case 83: case 97: case 107: case 117:
            return 1;
        default:
            return 0;
    }
}
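A lookup table along the lines of `isBaseCellPentagonArr` might be written as below. This is a sketch rather than the PR's actual table: the names are illustrative, and the pentagon base cell numbers are simply the twelve listed in the switch statement above.

```c
#include <stdbool.h>

// Hypothetical table (illustrative name): true at the 12 pentagon base
// cell numbers, false everywhere else. Uses C99 designated initializers.
static const bool is_pentagon_base_cell[128] = {
    [4] = true,  [14] = true, [24] = true,  [38] = true,
    [49] = true, [58] = true, [63] = true,  [72] = true,
    [83] = true, [97] = true, [107] = true, [117] = true,
};

static int is_base_cell_pentagon(int base_cell) {
    // Masking to 7 bits keeps the index inside the 128-entry table.
    return is_pentagon_base_cell[base_cell & 0x7F];
}
```

Sizing the table to 128 entries means any 7-bit base cell field can index it directly, without a separate range check.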

ajfriend · 2021-07-05T22:54:33Z

src/h3lib/lib/h3Index.c

+    // Now check that all the unused digits after `res` are
+    // set to 7 (INVALID_DIGIT).
+    // Bit shift operations allow us to avoid looping through digits;
+    // this saves time in benchmarks.
+    int shift = (15 - res) * 3;
+    uint64_t m = 0;
+    m = ~m;
+    m >>= shift;
+    m = ~m;
+    if (h != m) return 0;


Alternatives, but the one above using bit shifting seemed to be the fastest:

Loop

(slowest of all 3)

for (; r <= 15; r++) {
    if (GT(h, 3) != 7) return 0;
    h <<= 3;
}

Lookup table

static const uint64_t m7s[16] = {
    0b1111111111111111111111111111111111111111111110000000000000000000,
    0b1111111111111111111111111111111111111111110000000000000000000000,
    0b1111111111111111111111111111111111111110000000000000000000000000,
    0b1111111111111111111111111111111111110000000000000000000000000000,
    0b1111111111111111111111111111111110000000000000000000000000000000,
    0b1111111111111111111111111111110000000000000000000000000000000000,
    0b1111111111111111111111111110000000000000000000000000000000000000,
    0b1111111111111111111111110000000000000000000000000000000000000000,
    0b1111111111111111111110000000000000000000000000000000000000000000,
    0b1111111111111111110000000000000000000000000000000000000000000000,
    0b1111111111111110000000000000000000000000000000000000000000000000,
    0b1111111111110000000000000000000000000000000000000000000000000000,
    0b1111111110000000000000000000000000000000000000000000000000000000,
    0b1111110000000000000000000000000000000000000000000000000000000000,
    0b1110000000000000000000000000000000000000000000000000000000000000,
    0b0};

if (h != m7s[res]) return 0;
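The same "trailing digits are all 7" idea can also be sketched on an unshifted index, which may be easier to test in isolation. Note the difference from the diff above: there, `h` has already been shifted left so the unused digits sit at the top of the word, whereas this hypothetical helper (the name is an assumption, not from the PR) reads the resolution from bits 52 to 55 and checks the low bits directly.

```c
#include <stdint.h>

// Hypothetical helper: checks, on an unshifted 64-bit index, that every
// digit after the cell's resolution is 7 (all three bits set).
static int unused_digits_are_sevens(uint64_t h) {
    int res = (int)((h >> 52) & 0xF);     // resolution field, 0..15
    int shift = (15 - res) * 3;           // number of unused digit bits, 0..45
    uint64_t mask = (1ULL << shift) - 1;  // low `shift` bits set
    return (h & mask) == mask;            // no loop over digits needed
}
```

For a resolution 15 cell the mask is 0, so the check passes trivially, matching the `0b0` entry at the end of the lookup table.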

ajfriend · 2021-07-05T23:14:28Z

Some benchmark runs. New algo seems to take roughly 65% or 53% of the time of the current, depending on the input cells.

new algo

build/bin/benchmarkIsValidCell
    -- pentagonChildren_8_14: 1141.561000 microseconds per iteration (1000 iterations)
    -- pentagonChildren_2_8: 785.043000 microseconds per iteration (1000 iterations)
build/bin/benchmarkIsValidCell
    -- pentagonChildren_8_14: 1138.550000 microseconds per iteration (1000 iterations)
    -- pentagonChildren_2_8: 783.216000 microseconds per iteration (1000 iterations)
build/bin/benchmarkIsValidCell
    -- pentagonChildren_8_14: 1141.815000 microseconds per iteration (1000 iterations)
    -- pentagonChildren_2_8: 782.925000 microseconds per iteration (1000 iterations)

current algo

build/bin/benchmarkIsValidCell
    -- pentagonChildren_8_14: 1788.558000 microseconds per iteration (1000 iterations)
    -- pentagonChildren_2_8: 1462.292000 microseconds per iteration (1000 iterations)
build/bin/benchmarkIsValidCell
    -- pentagonChildren_8_14: 1750.815000 microseconds per iteration (1000 iterations)
    -- pentagonChildren_2_8: 1502.663000 microseconds per iteration (1000 iterations)
build/bin/benchmarkIsValidCell
    -- pentagonChildren_8_14: 1773.345000 microseconds per iteration (1000 iterations)
    -- pentagonChildren_2_8: 1467.033000 microseconds per iteration (1000 iterations)

Here are the ratios:

new = [
    [1141.561000, 785.043000],
    [1138.550000, 783.216000],
    [1141.815000, 782.925000],
]
old = [
    [1788.558000, 1462.292000],
    [1750.815000, 1502.663000],
    [1773.345000, 1467.033000],
]

np.array(new)/np.array(old)

gives

array([[0.63825775, 0.53685789],
       [0.65029715, 0.52121866],
       [0.6438764 , 0.5336792 ]])

nrabinowitz · 2021-07-12

This is super fun :). The only optimization I have to add here is Duff's Device, not sure if it will gain you much but it could improve the remaining loops.

nrabinowitz · 2021-07-12T20:46:08Z

src/h3lib/lib/h3Index.c

+// Get Top t bits from h
+#define GT(h, t) ((h) >> (64 - (t)))
+
+static const bool isBaseCellPentagonArr[128] = {


Why is this an improvement over baseCellData[baseCellNum].isPentagon?

nrabinowitz · 2021-07-12T20:47:30Z

src/h3lib/lib/h3Index.c

+    // | ...        |    ... |
+    // | Digit 15   |      3 |
+    //
+    // Additionally, we try to group operations and void loops when possible.


Nit: avoid loops?

nrabinowitz · 2021-07-12T21:04:37Z

src/h3lib/lib/h3Index.c

+    // The first nonzero digit can't be a 1 (i.e., "deleted subsequence",
+    // PENTAGON_SKIPPED_DIGIT, or K_AXES_DIGIT).
+    // Test for pentagon base cell first to avoid this loop if possible.
+    if (isBaseCellPentagonArr[bc]) {


If you can use baseCellData[bc].isPentagon that should be equivalent? Not sure of the linking cost of going outside the file.

nrabinowitz · 2021-07-12T21:09:04Z

src/h3lib/lib/h3Index.c

@@ -78,49 +78,108 @@ void H3_EXPORT(h3ToString)(H3Index h, char *str, size_t sz) {
    sprintf(str, "%" PRIx64, h);
 }

+// Get Top t bits from h
+#define GT(h, t) ((h) >> (64 - (t)))


Nit: This looks like "greater than" to me - GET_BITS?

nrabinowitz · 2021-07-12T21:16:13Z

src/h3lib/lib/h3Index.c

-        if (digit != INVALID_DIGIT) return 0;
+    // After (possibly) taking care of pentagon logic, check that
+    // the remaining digits up to `res` are not 7 (INVALID_DIGIT).
+    // Don't see a way to avoid this loop :(


You could unroll it into groups of 4 or 8, possibly with a macro to simplify the code. This would only work for hex base cells though, or for pentagon base cells if you re-check all the digits you just checked. Might be worth trying.

slaperche-zenly · 2022-01-31

Not sure if this PR is still alive, but I've just stumbled on it.

Last week I implemented my own quick isValidCell (a simpler one though, because for my use case I don't need any specific check for pentagons) and I think this loop can be avoided.

The search for the pattern 0b111 may not be easy to implement quickly, but on the other hand we can apply a bitwise NOT and look for 0b000 which is easier 🙂.
Indeed, looking for a null triplet is similar to looking for a nul-byte, something we do a lot in C (think strlen).

By digging into the annals of the Old Gods, a.k.a. comp.lang.c, we can extract this little gem: Alan Mycroft's null-byte detection algorithm, posted in 1987.

Tweaking the bitmasks to work on 3-bit groups instead of 8-bit ones does the trick, and we end up with code that would roughly look like

#include <stdint.h>
#include <stdio.h>
#include <assert.h>

#define MAX_RESOLUTION 15
#define CELL_BITSIZE 3

int has_invalid_digit(uint64_t index) {
    static const uint64_t LO_MAGIC = 0x49249249249ULL;   // ...001 001 001...
    static const uint64_t HI_MAGIC = 0x124924924924ULL;  // ...100 100 100...

    const uint64_t resolution = (index >> 52 & 0xF);
    const uint64_t unused_count = MAX_RESOLUTION - resolution;
    const uint64_t unused_bitsize = unused_count * CELL_BITSIZE;
    const uint64_t digits_mask = (1ULL << (resolution * CELL_BITSIZE)) - 1;
    const uint64_t digits = (index >> unused_bitsize) & digits_mask;

    return ((~digits - LO_MAGIC) & (digits & HI_MAGIC)) != 0ULL;
}

int main(void) {
    // Valid H3 index
    // 0-0001-000-1100-0010101-110-101-110-001-100-000-101-001-100-110-110-101-111-111-111
    assert(!has_invalid_digit(0x08c2bae305336bffULL));

    // First digit invalid.
    // 0-0001-000-1100-0010101-111-101-110-001-100-000-101-001-100-110-110-101-111-111-111
    assert(has_invalid_digit(0x08c2bee305336bffULL));

    // Digit in the middle invalid.
    // 0-0001-000-1100-0010101-110-101-110-001-100-111-101-001-100-110-110-101-111-111-111
    assert(has_invalid_digit(0x08c2bae33d336bffULL));

    // Last digit invalid.
    // 0-0001-000-1100-0010101-110-101-110-001-100-000-101-001-100-110-110-111-111-111-111
    assert(has_invalid_digit(0x08c2bae305336fffULL));

    return 0;
}

So, in the end we can replace the loop by ((~digits - LO_MAGIC) & (digits & HI_MAGIC)) != 0ULL.

For sure, this would deserve some nice comments, because it's not obvious on a first read (I went heavy on the comments for my implementation xD).

If something is not clear I can give a more detailed explanation (but that basically boils down to how Alan Mycroft's null-byte detection works).
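For comparison, the classic 8-bit form of the zero-byte test that the 3-bit version is adapted from can be sketched as follows. The function name is illustrative. The idea: subtracting 1 from every byte sets a byte's high bit if that byte was zero (or already had its high bit set), and `& ~v` filters out the bytes whose high bit was set to begin with.

```c
#include <stdint.h>

// Sketch of the well-known zero-byte test on a 64-bit word:
// the result is nonzero exactly when at least one byte of v is 0x00.
static int has_zero_byte(uint64_t v) {
    const uint64_t LO = 0x0101010101010101ULL;  // 0x01 in every byte
    const uint64_t HI = 0x8080808080808080ULL;  // 0x80 in every byte
    return ((v - LO) & ~v & HI) != 0;
}
```

In the 3-bit adaptation above, `v` is `~digits`, so `~v` becomes `digits` and a zero triplet in `~digits` corresponds to a 7 (0b111) digit in the index.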

ajfriend · 2022-02-13

Wow, this is awesome! What a great find!

I spent a little time reworking this PR to add your logic. It definitely helps on the benchmarks.

It looks like you also came up with some improvements for the pentagon branch here: nmandery/h3ron#34

I'll have to parse that as well. Also happy if you want to make a pull against this PR!

dfellis · 2021-07-15T23:23:31Z

src/h3lib/lib/h3Index.c

+        if (GT(h, nBitsDigit) == 7) return 0;
+        h <<= nBitsDigit;


Assuming bitshifts are as expensive as they were when I was in college, it may make sense to just have an array of 16 masks and do if ((h & masks[r]) == masks[r]) return 0; as the only part of the inner loop.

ajfriend · 2022-02-17

I ran some toy timing tests and didn't see much difference between masking vs shifting: https://gist.github.com/ajfriend/32571f0af1f3ea3133b6836fc861c730

slaperche-zenly · 2022-02-17

I wouldn't be surprised if both versions generate the same (or almost the same) assembly.

Compilers are quite clever with this kind of simple expression nowadays; for example, you don't have to write n >> 1 instead of n / 2, and so on.
This is basic peephole optimisation.

ajfriend · 2021-07-15T23:45:54Z

todo: maybe add a benchmark that runs through all the cells in a resolution.

ajfriend · 2022-02-13T21:06:53Z

@slaperche-zenly It looks like your fast version of the pentagon branch would depend on a C equivalent of leading_zeros(): https://gist.github.com/slaperche-zenly/204e9b8e305cfb8ce2eec49fbe7b9396#file-lib-rs-L62

Anyone have an idea if there is such a C equivalent?

slaperche-zenly · 2022-02-13T21:16:05Z

> @slaperche-zenly It looks like your fast version of the pentagon branch would depend on a C equivalent of leading_zeros(): https://gist.github.com/slaperche-zenly/204e9b8e305cfb8ce2eec49fbe7b9396#file-lib-rs-L62
>
> Anyone have an idea if there is such a C equivalent?

C compilers usually expose this as a builtin or intrinsic (IIRC it's __builtin_clz for GCC), but naming may differ across compilers, so it would require some conditional compilation to select the right one, plus a fallback to a handwritten version for unsupported compilers/architectures (this page lists some ways to implement it; they call it "Find the log base 2 of an integer").
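A minimal sketch of that conditional-compilation approach, covering only GCC/Clang plus a plain-C fallback. The wrapper name `clz64` is hypothetical. Note that `__builtin_clz` takes an `unsigned int`; for a 64-bit index the `unsigned long long` variant `__builtin_clzll` is the one to use, and its result is undefined for an input of 0, hence the explicit guard. Other compilers (e.g. MSVC) have their own intrinsics and would need an extra branch not shown here.

```c
#include <stdint.h>

// Hypothetical wrapper: number of leading zero bits in a 64-bit value.
static int clz64(uint64_t x) {
    if (x == 0) return 64;  // the builtin's result is undefined for 0
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_clzll(x);
#else
    // Portable fallback: shift left until the top bit is set.
    int n = 0;
    while ((x & (1ULL << 63)) == 0) {
        x <<= 1;
        n++;
    }
    return n;
#endif
}
```

The fallback loop is slow compared to a single instruction, but it keeps the function correct on compilers the `#if` does not recognize.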

ajfriend · 2022-02-13T21:54:32Z

> C compilers usually expose this as a builtin or intrinsic (IIRC it's __builtin_clz for GCC), but naming may differ across compilers, so it would require some conditional compilation to select the right one, plus a fallback to a handwritten version for unsupported compilers/architectures (this page lists some ways to implement it; they call it "Find the log base 2 of an integer").

That page is cool! I was thinking of the trick converting the int to a float and extracting the exponent, but not sure how portable that code would be. @isaacbrodsky @nrabinowitz @dfellis, thoughts?

slaperche-zenly · 2022-02-13T23:10:50Z

> I was thinking of the trick converting the int to a float and extracting the exponent, but not sure how portable that code would be

As far as portability goes, you should be fine as long as you have IEEE-754 compliant floating point numbers, which should be everywhere (bar a few embedded or old/exotic systems).

I would be more worried about the type punning, which may or may not run afoul of the strict aliasing rules (IIRC, the standard is not super clear on what's allowed and what summons nasal daemons...).
Cf. here for a quick overview of the mess 😅

The approach based on DeBruijn sequence, adapted to 64-bit, sounds safer to me.

dfellis · 2022-02-15T15:39:00Z

src/h3lib/lib/h3Index.c

+        g <<= 19;
+        g >>= 19;  // at this point, g < 2^45 - 1


Why not replace this with a simple & mask that has 0s for the top 19 bits and 1s for the lower 45 bits? That should be much faster than shifting up and down 19 times each.

ajfriend · 2022-02-15

Yup, agreed!
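The two forms discussed in this thread can be put side by side as a quick sketch (function names are illustrative, not from the PR). Both keep the low 45 bits of `g` and clear the top 19.

```c
#include <stdint.h>

// Shift-based version, as in the diff: clear the top 19 bits.
static uint64_t low45_by_shifts(uint64_t g) {
    g <<= 19;
    g >>= 19;
    return g;
}

// Mask-based version suggested in review: same result with one AND.
static uint64_t low45_by_mask(uint64_t g) {
    return g & ((1ULL << 45) - 1);
}
```

Whether the mask is measurably faster would need a benchmark; a compiler may well emit similar code for both.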

isaacbrodsky · 2022-03-11T00:36:31Z

src/h3lib/lib/h3Index.c

+        // reinterpret the double bits as uint64_t
+        g = *(uint64_t *)&f;


I'm unclear if this is safe - it may be necessary to use a different implementation on a platform like ARM (we can probably test this on Raspberry Pi). We should check if UBSan complains about this.
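One way to sidestep the aliasing question is to copy the bytes with `memcpy`, which is well-defined in C and which compilers typically reduce to a register move. The sketch below is not the PR's code: the function names are hypothetical, and it assumes `double` is a 64-bit IEEE-754 binary64 value, as discussed earlier in the thread. It also shows the exponent-extraction trick for values that convert to `double` exactly (nonzero and below 2^53).

```c
#include <stdint.h>
#include <string.h>

// Hypothetical helper: reinterpret the bits of a double as uint64_t
// without pointer casts. Assumes sizeof(double) == sizeof(uint64_t).
static uint64_t double_bits(double f) {
    uint64_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Hypothetical helper: floor(log2(x)) via the double's exponent field.
// Assumes IEEE-754 binary64 and 0 < x < 2^53 so the conversion is exact.
static int floor_log2_via_double(uint64_t x) {
    uint64_t bits = double_bits((double)x);
    return (int)((bits >> 52) & 0x7FF) - 1023;  // remove the exponent bias
}
```

This does not settle the portability question for platforms without IEEE-754 doubles; there, a builtin or table-based approach would still be needed.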

ajfriend added 10 commits July 5, 2021 00:08

    196b271 initial commit
    5125e7b use NB_
    aa0e572 m7s
    744c023 benchmarks running
    ea83ff9 actual ones running
    1b63946 just do the pentagon
    0ca2c28 two tests
    e898d63 science!
    983cb3d docs
    5bcc2ee cleaning up

ajfriend requested review from isaacbrodsky, nrabinowitz and dfellis July 5, 2021 23:14

    3e912d8 fix benchmark bug

nrabinowitz reviewed Jul 12, 2021

dfellis reviewed Jul 15, 2021

This was referenced Jul 17, 2021

Update for 3.7.2 pocketken/H3.net#36

Closed

Release 3.7.2 pocketken/H3.net#37

Merged

ajfriend mentioned this pull request Aug 13, 2021

Performance benchmarking improvements #504

Open

ajfriend added 2 commits September 26, 2021 16:37

    bef505f three implementations
    63962ed comments for benchmarking

ajfriend mentioned this pull request Sep 27, 2021

Add benchmarkIsValidCell.c #518

Merged

ajfriend added 2 commits February 12, 2022 19:02

    0f18b18 Merge branch 'master' into fast_isValidCell_try_merge
    c7a32f2 moving stuff around

slaperche-zenly mentioned this pull request Feb 13, 2022

A faster h3IsValid nmandery/h3ron#34

Closed

ajfriend added 4 commits February 13, 2022 00:22

    c22fa9f checking that the end is all 7's
    47b6584 wicked fast method
    2afecd6 good times
    dde10bb shifting around shifts

ajfriend changed the title ~~[experimental] Fast isValidCell~~ [WIP] Fast isValidCell Feb 13, 2022

    9edb39a just make the 0 test explicit

ajfriend added 4 commits February 13, 2022 15:45

    5e18135 tables for the fancy bitwise ops
    5d68649 Adding some notes
    65a5e89 bad news bits
    5a70747 better tests, better macros, better comments

dfellis reviewed Feb 15, 2022

isaacbrodsky reviewed Mar 11, 2022
