Computing the number of digits - benchmarks

Ideally, there would be some very clever blogpost that the code in this repository would merely be accompanying. Unfortunately, I don't blog. On the other hand, I do code. Coding requires some documentation. Hence, this readme.

It's not that we lack online material about computing the number of digits fast. On Youtube, you can watch Andrei Alexandrescu discuss this subject, e.g., here. Daniel Lemire devoted this and then this blogpost to this topic. Sadly, there are no actual measurements. This code rectifies that.

Build

No Makefile is provided. Run the script build.sh to create the executables. To create the test data, run the script prepare_data.sh. The script run_tests.sh will perform the benchmark tests and write the output to the results directory.

Finally, the script sanity.sh will check that every method gives the correct result for all 32 bit unsigned integers; run this if you're paranoid.

Methods

Our baseline is the classical, naive implementation:

uint32_t num_digs(uint32_t n)
{
    uint32_t r = 1;
    for (;;) {
        if (n < 10) return r;
        n /= 10;
        r++;
    }
}

There are various approaches to improving this algorithm. The Lemire blogposts discuss table-base implementations, e.g.,

uint32_t num_digs(uint32_t n)
{
    static uint32_t table[] = {9, 99, 999, 9999, 99999,
        999999, 9999999, 99999999, 999999999};
        int y = (9 * (31 - __builtin_clz(x|1))) >> 5;
        y += x > table[y];
        return y + 1;
}

The Alexandrescu-method relies on the observation that actual data in practice do not follow a uniform distribution: most numbers we deal with are small. Let's call this method "bigdiv". This code illustrates the concept:

uint32_t num_digs(uint32_t n)
{
    uint32_t result = 1;
    for(;;) {
        if (n < 10) return result;
        if (n < 100) return result+1;
        if (n < 1000) return result+2;
        if (n < 10000) return result+3;
        n /= 10000u;
        result += 4;
    }
}

My two cents is based on the hope that instruction-level parallelism might turn out to be faster in practice than either of the two methods above. Let's call this approach the "parallel" method. Here's the most extreme implementation of this idea:

uint32_t num_digs(uint32_t n)
{
    return 1 +
        (n >= 10) +
        (n >= 100) +
        (n >= 1000) +
        (n >= 10000) +
        (n >= 100000) +
        (n >= 1000000) +
        (n >= 10000000) +
        (n >= 100000000) +
        (n >= 1000000000);
}

Of course, these methods can be combined in various ways, e.g., like this:

uint32_t num_digs(uint32_t n)
{
    uint32_t result = 1;
    for(;;) {
        if (n < 1000) {
            return result + (n >= 10) + (n >= 100);
        }
        n /= 1000u;
        result += 3;
    }
}

or this:

uint32_t num_digs(uint32_t n)
{
    static const uint64_t table[] = {
        4294967296,  8589934582,  8589934582,  8589934582,  12884901788,
        12884901788, 12884901788, 17179868184, 17179868184, 17179868184,
        21474826480, 21474826480, 21474826480, 21474826480, 25769703776,
        25769703776, 25769703776, 30063771072, 30063771072, 30063771072,
        34349738368, 34349738368, 34349738368, 34349738368, 38554705664,
        38554705664, 38554705664, 41949672960, 41949672960, 41949672960,
        42949672960, 42949672960};
    if (n < 10) {
        return 1;
    }
    return (n + table[31 - __builtin_clz(n)]) >> 32;
}

Data

Three datasets are used. Two of those are randomly generated with different distributions. The third dataset is extracted from two Github repositories (this and this). The idea was to have something that resembles "real" data; there's no telling if this attempt actually succeeded.

Simple microbenchmarks won't cut it

We can't just benchmark the num_digs variants in isolation, because runtimes can vary dramatically depending on how the return value is used in the calling site. This is demonstrated by the executables in directory simple. These programs compute the number of digits for all integers in a given range. The "add" versions add all the returned values; the "xor" versions xor them.

Do this after you've run build.sh:

cd simple && for i in *; do ./${i} 500000 1000000; done

On my machine, I get the following results:

Program	Running time (microseconds)
parallel-add	557
bigdiv5-parallel-add	675
bigdiv5-add	854
bigdiv5-xor	854
table2-add	974
table2-xor	974
bigdiv4-add	1155
bigdiv4-xor	1164
table-add	1252
table-xor	1252
bigdiv3-parallel-add	1384
bigdiv3-parallel-xor	1384
bigdiv4-parallel-add	1385
bigdiv4-parallel-xor	1385
bigdiv3-add	1432
bigdiv3-xor	1432
bigdiv5-parallel-xor	1568
bigdiv2-add	1683
bigdiv2-xor	1683
bigdiv2-parallel-add	1728
bigdiv2-parallel-xor	1728
multiply-add	1750
multiply-xor	1753
parallel-xor	1966
baseline-xor	2711
baseline-add	2718
table2-parallel0-xor	2997
table2-parallel1-add	2997
table2-parallel1-xor	2997
table2-parallel0-add	3001
table2-parallel2-add	3003
table2-parallel4-xor	3005
table2-parallel2-xor	3007
table2-parallel3-add	3007
table2-parallel3-xor	3007
table2-parallel4-add	3007

The "add" version of parallel beats all the competition, finishing in 557 microseconds. Its "xor" counterpart performs badly, finishing in about 2000 microseconds. Thus, the performance of a digit-counting function can massively depend on the context in which the function is called.

The actual benchmarks in the progs directory try to work around this problem by measuring the performance of the digit-counting function in a (hopefully) realistic, "itoa"-like context.

Results

I ran all tests in a Linux laptop (Ubuntu 20.04), with this cpu: 11th Gen Intel(R) Core(TM) i5-1145G7 @ 2.60GHz Versions: GCC 9.4.0, CLANG 12.0.1. Other CPUs and compilers will surely yield different results.

Here's a chart that gives a summary of the results; these numbers were obtained by averaging the results of multiple runs across the gcc and the clang versions and normalized against the "baseline" implementation. Greater numbers are better: e.g., 2.0 would mean that the given executable finished twice as fast as the baseline version on a dataset.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.gitignore		.gitignore
README.md		README.md
benchmark.cpp		benchmark.cpp
build.sh		build.sh
functions.cpp		functions.cpp
prepare_data.sh		prepare_data.sh
results.png		results.png
run_tests.sh		run_tests.sh
sanity.cpp		sanity.cpp
sanity.sh		sanity.sh
simple.cpp		simple.cpp

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Computing the number of digits - benchmarks

Build

Methods

Data

Simple microbenchmarks won't cut it

Results

About

Releases

Packages

Languages

olahg/num_digits_fast

Folders and files

Latest commit

History

Repository files navigation

Computing the number of digits - benchmarks

Build

Methods

Data

Simple microbenchmarks won't cut it

Results

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages