[libc++] Optimize ranges::copy for forward_iterator and segmented_iterator #120134

Open
wants to merge 5 commits into
base: main

Contributor

@winner245 winner245 commented Dec 16, 2024

This patch optimizes the performance of {std, ranges}::copy when copying from forward_iterator or segmented_iterator inputs to a vector<bool>::iterator output, yielding a performance improvement of up to 9x. The key optimizations include:

  1. forward_iterator-pair inputs: Instead of visiting each bit of the vector<bool> through a bit mask and writing bit by bit, the optimization first assembles the input data into whole words and then copies each whole word directly to the underlying storage of vector<bool>. This word-wise copying approach leads to a 9x performance improvement for {std, ranges}::copy.
  2. segmented_iterator-pair inputs: The optimization subdivides the input into segments, where each segment is a range representable by a forward_iterator pair. This reduces the problem to the forward_iterator-pair case, resulting in a similar 9x speed-up for segmented iterators such as deque iterators.
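
The word-assembly idea in item 1 can be sketched outside libc++ as follows. This is only an illustration under stated assumptions: `pack_bits` is a hypothetical helper, the word type is fixed to `std::uint64_t`, and the output is a fresh word buffer rather than the storage of a real `vector<bool>`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical helper (not libc++ code): packs the truth values of
// [first, last) into 64-bit words, least significant bit first.
template <class ForwardIt>
std::vector<std::uint64_t> pack_bits(ForwardIt first, ForwardIt last) {
  constexpr std::size_t bits_per_word = 64;
  const auto n = static_cast<std::size_t>(std::distance(first, last));
  std::vector<std::uint64_t> words((n + bits_per_word - 1) / bits_per_word, 0);

  for (std::size_t w = 0; w < words.size(); ++w) {
    // Assemble one word in a local variable, then store it exactly once.
    std::uint64_t word          = 0;
    const std::size_t remaining = n - w * bits_per_word;
    const std::size_t count     = remaining < bits_per_word ? remaining : bits_per_word;
    for (std::size_t i = 0; i < count; ++i, ++first)
      word |= static_cast<std::uint64_t>(static_cast<bool>(*first)) << i;
    words[w] = word;
  }
  assert(first == last); // every input element was consumed
  return words;
}
```

The inner loop touches memory once per word instead of once per bit, which is the effect the description attributes the speed-up to.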

As a byproduct of this work, all iterator-pair and range-based operations that internally call std::copy see a similar speed-up of up to 9x. The improved vector<bool> operations include:

  • range-ctor: vector(std::from_range_t, R&& rg, const Allocator& alloc)
  • range-assignment: assign_range(R&& rg)
  • range-insertion: insert_range(const_iterator pos, R&& rg)
  • range-append: append_range(R&& rg)
  • iterator-pair ctor: vector(InputIt first, InputIt last, const Allocator& alloc)
  • iterator-pair assignment: assign(InputIt first, InputIt last)
  • iterator-pair insert: insert(const_iterator pos, InputIt first, InputIt last)
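
For orientation, a minimal usage sketch of the iterator-pair operations listed above, driven by forward iterators. `build_from_forward_range` is a hypothetical name used only for illustration. The C++23 range forms (`assign_range`, `insert_range`, `append_range`) are left out so that the sketch compiles with older standard libraries.

```cpp
#include <cassert>
#include <forward_list>
#include <vector>

// Hypothetical illustration: exercises the iterator-pair constructor,
// assign, and insert of std::vector<bool> with forward iterators.
inline std::vector<bool> build_from_forward_range(const std::forward_list<int>& src) {
  std::vector<bool> v(src.begin(), src.end()); // iterator-pair ctor
  v.assign(src.begin(), src.end());            // iterator-pair assignment
  const std::vector<bool>::size_type before = v.size();
  v.insert(v.begin(), src.begin(), src.end()); // iterator-pair insert
  assert(v.size() == 2 * before);              // the insert doubled the contents
  return v;
}
```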

Benchmarks

Benchmarks are provided that integrate with the benchmark framework recently enhanced by @ldionne. The results below show substantial performance improvements for both {std, ranges}::copy and the vector<bool> operations.

{std, ranges}::copy
--------------------------------------------------------------------------------------------------
Benchmark                                                      Before           After      Speedup
--------------------------------------------------------------------------------------------------
std::copy(vector<int>, std::vector<bool>)/8                   8.20 ns         5.36 ns         1.5x
std::copy(vector<int>, std::vector<bool>)/64                  62.4 ns         11.5 ns         5.4x
std::copy(vector<int>, std::vector<bool>)/512                  598 ns         72.8 ns         8.2x
std::copy(vector<int>, std::vector<bool>)/4096                4806 ns          548 ns         8.8x
std::copy(vector<int>, std::vector<bool>)/32768              38917 ns         4343 ns         9.0x
std::copy(vector<int>, std::vector<bool>)/262144            310583 ns        35333 ns         8.8x
std::copy(vector<int>, std::vector<bool>)/1048576          1263629 ns       138828 ns         9.1x
std::copy(deque<int>, std::vector<bool>)/8                    9.75 ns         6.15 ns         1.6x
std::copy(deque<int>, std::vector<bool>)/64                   75.6 ns         12.5 ns         6.0x
std::copy(deque<int>, std::vector<bool>)/512                   605 ns         69.8 ns         8.7x
std::copy(deque<int>, std::vector<bool>)/4096                 4868 ns          541 ns         9.0x
std::copy(deque<int>, std::vector<bool>)/32768               38721 ns         4412 ns         8.8x
std::copy(deque<int>, std::vector<bool>)/262144             312165 ns        38059 ns         8.2x
std::copy(deque<int>, std::vector<bool>)/1048576           1241997 ns       150776 ns         8.2x
rng::copy(vector<int>, std::vector<bool>)/8                   6.62 ns         7.36 ns         0.9x
rng::copy(vector<int>, std::vector<bool>)/64                  56.7 ns         11.5 ns         4.9x
rng::copy(vector<int>, std::vector<bool>)/512                  606 ns         73.1 ns         8.3x
rng::copy(vector<int>, std::vector<bool>)/4096                4894 ns          565 ns         8.7x
rng::copy(vector<int>, std::vector<bool>)/32768              39207 ns         4579 ns         8.6x
rng::copy(vector<int>, std::vector<bool>)/262144            312108 ns        37471 ns         8.3x
rng::copy(vector<int>, std::vector<bool>)/1048576          1258726 ns       151054 ns         8.3x
rng::copy(deque<int>, std::vector<bool>)/8                    10.4 ns         6.71 ns         1.5x
rng::copy(deque<int>, std::vector<bool>)/64                   61.3 ns         13.4 ns         4.6x
rng::copy(deque<int>, std::vector<bool>)/512                   612 ns         77.1 ns         7.9x
rng::copy(deque<int>, std::vector<bool>)/4096                 4928 ns          601 ns         8.2x
rng::copy(deque<int>, std::vector<bool>)/32768               38558 ns         5006 ns         7.7x
rng::copy(deque<int>, std::vector<bool>)/262144             314093 ns        40211 ns         7.8x
rng::copy(deque<int>, std::vector<bool>)/1048576           1273987 ns       174036 ns         7.3x
std::copy(forward_iterator, vector<bool>)/64                  62.7 ns         13.2 ns         4.8x
std::copy(forward_iterator, vector<bool>)/512                  621 ns         77.1 ns         8.1x
std::copy(forward_iterator, vector<bool>)/4096                4924 ns          608 ns         8.1x
std::copy(forward_iterator, vector<bool>)/32768              39075 ns         4868 ns         8.0x
std::copy(forward_iterator, vector<bool>)/262144            339969 ns        39697 ns         8.6x
std::copy(forward_iterator, vector<bool>)/1048576          1262003 ns       159540 ns         7.9x
std::copy(random_access_iterator, vector<bool>)/64            64.7 ns         13.1 ns         4.9x
std::copy(random_access_iterator, vector<bool>)/512            614 ns         79.3 ns         7.7x
std::copy(random_access_iterator, vector<bool>)/4096          4922 ns          605 ns         8.1x
std::copy(random_access_iterator, vector<bool>)/32768        39038 ns         4918 ns         7.9x
std::copy(random_access_iterator, vector<bool>)/262144      317645 ns        39480 ns         8.0x
std::copy(random_access_iterator, vector<bool>)/1048576    1288111 ns       160619 ns         8.0x
rng::copy(forward_iterator, vector<bool>)/64                  61.5 ns         12.0 ns         5.1x
rng::copy(forward_iterator, vector<bool>)/512                  617 ns         70.5 ns         8.8x
rng::copy(forward_iterator, vector<bool>)/4096                4837 ns          559 ns         8.7x
rng::copy(forward_iterator, vector<bool>)/32768              38589 ns         4415 ns         8.7x
rng::copy(forward_iterator, vector<bool>)/262144            310598 ns        35696 ns         8.7x
rng::copy(forward_iterator, vector<bool>)/1048576          1238270 ns       144161 ns         8.6x
rng::copy(random_access_iterator, vector<bool>)/64            61.2 ns         11.5 ns         5.3x
rng::copy(random_access_iterator, vector<bool>)/512            610 ns         70.7 ns         8.6x
rng::copy(random_access_iterator, vector<bool>)/4096          4839 ns          540 ns         9.0x
rng::copy(random_access_iterator, vector<bool>)/32768        38503 ns         4364 ns         8.8x
rng::copy(random_access_iterator, vector<bool>)/262144      308859 ns        37045 ns         8.3x
rng::copy(random_access_iterator, vector<bool>)/1048576    1262466 ns       145308 ns         8.7x

vector<bool>
---------------------------------------------------------------------------------------------------------------------
Benchmark                                                                           Before            After   Speedup
---------------------------------------------------------------------------------------------------------------------
std::vector<bool>::ctor(fwd_iter, fwd_iter) (cheap elements)/1024                   1176 ns         159 ns       7.4x
std::vector<bool>::ctor(fwd_iter, fwd_iter) (cheap elements)/65536                 73382 ns        8985 ns       8.2x
std::vector<bool>::ctor(fwd_iter, fwd_iter) (cheap elements)/1048576             1209835 ns      141813 ns       8.5x
std::vector<bool>::ctor(ra_iter, ra_iter) (cheap elements)/1024                     1216 ns         156 ns       7.8x
std::vector<bool>::ctor(ra_iter, ra_iter) (cheap elements)/65536                   78756 ns        8861 ns       8.9x
std::vector<bool>::ctor(ra_iter, ra_iter) (cheap elements)/1048576               1371515 ns      145451 ns       9.4x
std::vector<bool>::assign(fwd_iter, fwd_iter) (cheap elements)/1024                 1234 ns         147 ns       8.4x
std::vector<bool>::assign(fwd_iter, fwd_iter) (cheap elements)/65536               76720 ns        9008 ns       8.5x
std::vector<bool>::assign(fwd_iter, fwd_iter) (cheap elements)/1048576           1294336 ns      157726 ns       8.2x
std::vector<bool>::assign(ra_iter, ra_iter) (cheap elements)/1024                   1208 ns         154 ns       7.8x
std::vector<bool>::assign(ra_iter, ra_iter) (cheap elements)/65536                 76906 ns        9336 ns       8.2x
std::vector<bool>::assign(ra_iter, ra_iter) (cheap elements)/1048576             1271409 ns      162803 ns       7.8x
std::vector<bool>::insert(begin, fwd_iter, fwd_iter) (cheap elements)/1024          1347 ns         157 ns       8.6x
std::vector<bool>::insert(begin, fwd_iter, fwd_iter) (cheap elements)/65536        85550 ns        9203 ns       9.3x
std::vector<bool>::insert(begin, fwd_iter, fwd_iter) (cheap elements)/1048576    1390490 ns      153613 ns       9.1x
std::vector<bool>::insert(begin, ra_iter, ra_iter) (cheap elements)/1024            1414 ns         157 ns       9.0x
std::vector<bool>::insert(begin, ra_iter, ra_iter) (cheap elements)/65536          89137 ns        9279 ns       9.6x
std::vector<bool>::insert(begin, ra_iter, ra_iter) (cheap elements)/1048576      1390876 ns      152021 ns       9.1x
std::vector<bool>::ctor(fwd_range) (cheap elements)/1024                            1222 ns         159 ns       7.7x
std::vector<bool>::ctor(fwd_range) (cheap elements)/65536                          78556 ns       10083 ns       7.8x
std::vector<bool>::ctor(fwd_range) (cheap elements)/1048576                      1277549 ns      151697 ns       8.4x
std::vector<bool>::ctor(ra_range) (cheap elements)/1024                             1253 ns         157 ns       8.0x
std::vector<bool>::ctor(ra_range) (cheap elements)/65536                           78893 ns        9202 ns       8.6x
std::vector<bool>::ctor(ra_range) (cheap elements)/1048576                       1267711 ns      149150 ns       8.5x
std::vector<bool>::assign_range(fwd_range) (cheap elements)/1024                    1251 ns         147 ns       8.5x
std::vector<bool>::assign_range(fwd_range) (cheap elements)/65536                  80422 ns        9197 ns       8.7x
std::vector<bool>::assign_range(fwd_range) (cheap elements)/1048576              1313784 ns      166100 ns       7.9x
std::vector<bool>::assign_range(ra_range) (cheap elements)/1024                     1250 ns         145 ns       8.6x
std::vector<bool>::assign_range(ra_range) (cheap elements)/65536                   78518 ns        9516 ns       8.3x
std::vector<bool>::assign_range(ra_range) (cheap elements)/1048576               1298629 ns      164614 ns       7.9x
std::vector<bool>::insert_range(fwd_range) (cheap elements)/1024                    1385 ns         158 ns       7.5x
std::vector<bool>::insert_range(fwd_range) (cheap elements)/65536                  85954 ns        9554 ns       9.0x
std::vector<bool>::insert_range(fwd_range) (cheap elements)/1048576              1440081 ns      156074 ns       9.2x
std::vector<bool>::insert_range(ra_range) (cheap elements)/1024                     1360 ns         158 ns       8.6x
std::vector<bool>::insert_range(ra_range) (cheap elements)/65536                   87742 ns        9546 ns       9.2x
std::vector<bool>::insert_range(ra_range) (cheap elements)/1048576               1401256 ns      155856 ns       9.0x
std::vector<bool>::append_range(fwd_range) (cheap elements)/1024                    1429 ns         153 ns       9.3x
std::vector<bool>::append_range(fwd_range) (cheap elements)/65536                  85063 ns        9135 ns       9.3x
std::vector<bool>::append_range(fwd_range) (cheap elements)/1048576              1361950 ns      152327 ns       8.9x
std::vector<bool>::append_range(ra_range) (cheap elements)/1024                     1341 ns         159 ns       8.4x
std::vector<bool>::append_range(ra_range) (cheap elements)/65536                   85127 ns        9265 ns       9.2x
std::vector<bool>::append_range(ra_range) (cheap elements)/1048576               1395319 ns      160286 ns       8.7x

@winner245 winner245 marked this pull request as ready for review December 16, 2024 20:19
@winner245 winner245 requested a review from a team as a code owner December 16, 2024 20:19
@llvmbot llvmbot added the libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. label Dec 16, 2024
Member

llvmbot commented Dec 16, 2024

@llvm/pr-subscribers-libcxx

Author: Peng Liu (winner245)

Changes

General description

This PR is part of a series aimed at significantly improving the performance of vector<bool>. Each PR focuses on a specific subset of operations, so that each is self-contained and easy to review. The main idea is to use a word-wise implementation together with bit-manipulation techniques, rather than the purely bit-wise operations of the previous implementation, resulting in substantial performance gains.

Current PR

This PR enhances the performance of all range-based operations in vector<bool> by at least 5x. The main idea is to provide a more efficient overload of std::__copy(_InIter __first, _InIter __last, __bit_iterator<_Cp, false> __result), which is used by various range-based operations in vector<bool>. With this overload, all range-based operations benefit from significant performance improvements, both the iterator-pair operations and C++23's range constructor and {assign, insert, append}_range functions:

  • range-ctor vector(InputIt first, InputIt last, const Allocator& alloc): 5.84x
  • C++23 range-ctor vector(std::from_range_t, R&& rg, const Allocator& alloc): 5.86x
  • range-assignment assign(InputIt first, InputIt last): 5.84x
  • C++23 assign_range(R&& rg): 5.9x
  • range-insert insert(const_iterator pos, InputIt first, InputIt last): 6.38x
  • C++23 insert_range(const_iterator pos, R&& rg): 6.45x
  • C++23 append_range(R&& rg): 5.5x

Before:

--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_ConstructIterIter/vector_bool/5140480      22432969 ns     22560977 ns           31
BM_ConstructFromRange/vector_bool/5140480     22499312 ns     22632239 ns           31
BM_Assign_IterIter/vector_bool/5140480        22542583 ns     22679677 ns           30
BM_Assign_Range/vector_bool/5140480           22739005 ns     22881371 ns           31
BM_Insert_Iter_IterIter/vector_bool/5140480   23249604 ns     23398233 ns           30
BM_Insert_Range/vector_bool/5140480           23031899 ns     23181587 ns           30
BM_Append_Range/vector_bool/5140480           23432886 ns     23586148 ns           29

After:

--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_ConstructIterIter/vector_bool/5140480       3836990 ns      3857075 ns          182
BM_ConstructFromRange/vector_bool/5140480      3838558 ns      3860015 ns          177
BM_Assign_IterIter/vector_bool/5140480         3856720 ns      3879212 ns          181
BM_Assign_Range/vector_bool/5140480            3849086 ns      3872665 ns          178
BM_Insert_Iter_IterIter/vector_bool/5140480    3639338 ns      3661651 ns          189
BM_Insert_Range/vector_bool/5140480            3569611 ns      3592612 ns          195
BM_Append_Range/vector_bool/5140480            4256268 ns      4284186 ns          168

Full diff: https://github.com/llvm/llvm-project/pull/120134.diff

4 Files Affected:

  • (modified) libcxx/include/__algorithm/copy.h (+50)
  • (modified) libcxx/include/__bit_reference (+3)
  • (modified) libcxx/test/benchmarks/containers/ContainerBenchmarks.h (+58)
  • (added) libcxx/test/benchmarks/containers/vector_bool_operations.bench.cpp (+37)
diff --git a/libcxx/include/__algorithm/copy.h b/libcxx/include/__algorithm/copy.h
index 4f30b2050abbaf..f737bc4e98e6d6 100644
--- a/libcxx/include/__algorithm/copy.h
+++ b/libcxx/include/__algorithm/copy.h
@@ -13,6 +13,8 @@
 #include <__algorithm/for_each_segment.h>
 #include <__algorithm/min.h>
 #include <__config>
+#include <__fwd/bit_reference.h>
+#include <__iterator/distance.h>
 #include <__iterator/iterator_traits.h>
 #include <__iterator/segmented_iterator.h>
 #include <__type_traits/common_type.h>
@@ -95,6 +97,54 @@ struct __copy_impl {
     }
   }
 
+  template <class _InIter, class _Cp, __enable_if_t<__has_forward_iterator_category<_InIter>::value, int> = 0>
+  _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 pair<_InIter, __bit_iterator<_Cp, false>>
+  operator()(_InIter __first, _InIter __last, __bit_iterator<_Cp, false> __result) {
+    using _It                      = __bit_iterator<_Cp, false>;
+    using __storage_type           = typename _It::__storage_type;
+    __storage_type __n             = static_cast<__storage_type>(std::distance(__first, __last));
+    const unsigned __bits_per_word = _It::__bits_per_word;
+
+    if (__n) {
+      // do first partial word, if present
+      if (__result.__ctz_ != 0) {
+        __storage_type __clz = static_cast<__storage_type>(__bits_per_word - __result.__ctz_);
+        __storage_type __dn  = std::min(__clz, __n);
+        __storage_type __w   = *__result.__seg_;
+        __storage_type __m   = (~__storage_type(0) << __result.__ctz_) & (~__storage_type(0) >> (__clz - __dn));
+        __w &= ~__m;
+        for (__storage_type __i = 0; __i < __dn; ++__i, ++__first)
+          __w |= static_cast<__storage_type>(*__first) << __result.__ctz_++;
+        *__result.__seg_ = __w;
+        if (__result.__ctz_ == __bits_per_word) {
+          __result.__ctz_ = 0;
+          ++__result.__seg_;
+        }
+        __n -= __dn;
+      }
+    }
+    // do middle whole words, if present
+    __storage_type __nw = __n / __bits_per_word;
+    __n -= __nw * __bits_per_word;
+    for (; __nw; --__nw) {
+      __storage_type __w = 0;
+      for (__storage_type __i = 0; __i < __bits_per_word; ++__i, ++__first)
+        __w |= static_cast<__storage_type>(*__first) << __i;
+      *__result.__seg_++ = __w;
+    }
+    // do last partial word, if present
+    if (__n) {
+      __storage_type __w = 0;
+      for (__storage_type __i = 0; __i < __n; ++__i, ++__first)
+        __w |= static_cast<__storage_type>(*__first) << __i;
+      __storage_type __m = ~__storage_type(0) >> (__bits_per_word - __n);
+      *__result.__seg_ &= ~__m;
+      *__result.__seg_ |= __w;
+      __result.__ctz_ = __n;
+    }
+    return std::make_pair(std::move(__first), std::move(__result));
+  }
+
   // At this point, the iterators have been unwrapped so any `contiguous_iterator` has been unwrapped to a pointer.
   template <class _In, class _Out, __enable_if_t<__can_lower_copy_assignment_to_memmove<_In, _Out>::value, int> = 0>
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX14 pair<_In*, _Out*>
diff --git a/libcxx/include/__bit_reference b/libcxx/include/__bit_reference
index 22637d43974123..e8cbb63988ba54 100644
--- a/libcxx/include/__bit_reference
+++ b/libcxx/include/__bit_reference
@@ -10,6 +10,7 @@
 #ifndef _LIBCPP___BIT_REFERENCE
 #define _LIBCPP___BIT_REFERENCE
 
+#include <__algorithm/copy.h>
 #include <__algorithm/copy_n.h>
 #include <__algorithm/fill_n.h>
 #include <__algorithm/min.h>
@@ -970,6 +971,8 @@ private:
   _LIBCPP_CONSTEXPR_SINCE_CXX20 friend void
   __fill_n_bool(__bit_iterator<_Dp, false> __first, typename _Dp::size_type __n);
 
+  friend struct __copy_impl;
+
   template <class _Dp, bool _IC>
   _LIBCPP_CONSTEXPR_SINCE_CXX20 friend __bit_iterator<_Dp, false> __copy_aligned(
       __bit_iterator<_Dp, _IC> __first, __bit_iterator<_Dp, _IC> __last, __bit_iterator<_Dp, false> __result);
diff --git a/libcxx/test/benchmarks/containers/ContainerBenchmarks.h b/libcxx/test/benchmarks/containers/ContainerBenchmarks.h
index 6d21e12896ec9e..123f7bc95d4745 100644
--- a/libcxx/test/benchmarks/containers/ContainerBenchmarks.h
+++ b/libcxx/test/benchmarks/containers/ContainerBenchmarks.h
@@ -51,6 +51,30 @@ void BM_Assignment(benchmark::State& st, Container) {
   }
 }
 
+template <class Container, class GenInputs>
+void BM_Assign_IterIter(benchmark::State& st, Container c, GenInputs gen) {
+  auto in  = gen(st.range(0));
+  auto beg = in.begin();
+  auto end = in.end();
+  for (auto _ : st) {
+    c.assign(beg, end);
+    DoNotOptimizeData(c);
+    DoNotOptimizeData(in);
+    benchmark::ClobberMemory();
+  }
+}
+
+template <std::size_t... sz, typename Container, typename GenInputs>
+void BM_Assign_Range(benchmark::State& st, Container c, GenInputs gen) {
+  auto in = gen(st.range(0));
+  for (auto _ : st) {
+    c.assign_range(in);
+    DoNotOptimizeData(c);
+    DoNotOptimizeData(in);
+    benchmark::ClobberMemory();
+  }
+}
+
 template <std::size_t... sz, typename Container, typename GenInputs>
 void BM_AssignInputIterIter(benchmark::State& st, Container c, GenInputs gen) {
   auto v = gen(1, sz...);
@@ -108,6 +132,40 @@ void BM_Pushback_no_grow(benchmark::State& state, Container c) {
   }
 }
 
+template <class Container, class GenInputs>
+void BM_Insert_Iter_IterIter(benchmark::State& st, Container c, GenInputs gen) {
+  auto in        = gen(st.range(0));
+  const auto beg = in.begin();
+  const auto end = in.end();
+  for (auto _ : st) {
+    c.resize(100);
+    c.insert(c.begin() + 50, beg, end);
+    DoNotOptimizeData(c);
+    benchmark::ClobberMemory();
+  }
+}
+
+template <class Container, class GenInputs>
+void BM_Insert_Range(benchmark::State& st, Container c, GenInputs gen) {
+  auto in = gen(st.range(0));
+  for (auto _ : st) {
+    c.resize(100);
+    c.insert_range(c.begin() + 50, in);
+    DoNotOptimizeData(c);
+    benchmark::ClobberMemory();
+  }
+}
+
+template <class Container, class GenInputs>
+void BM_Append_Range(benchmark::State& st, Container c, GenInputs gen) {
+  auto in = gen(st.range(0));
+  for (auto _ : st) {
+    c.append_range(in);
+    DoNotOptimizeData(c);
+    benchmark::ClobberMemory();
+  }
+}
+
 template <class Container, class GenInputs>
 void BM_InsertValue(benchmark::State& st, Container c, GenInputs gen) {
   auto in        = gen(st.range(0));
diff --git a/libcxx/test/benchmarks/containers/vector_bool_operations.bench.cpp b/libcxx/test/benchmarks/containers/vector_bool_operations.bench.cpp
new file mode 100644
index 00000000000000..2ce10cb6d3d1b6
--- /dev/null
+++ b/libcxx/test/benchmarks/containers/vector_bool_operations.bench.cpp
@@ -0,0 +1,37 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+// UNSUPPORTED: c++03, c++11, c++14, c++17, c++20
+
+#include <cstdint>
+#include <cstdlib>
+#include <cstring>
+#include <deque>
+#include <functional>
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "benchmark/benchmark.h"
+#include "ContainerBenchmarks.h"
+#include "../GenerateInput.h"
+
+using namespace ContainerBenchmarks;
+
+BENCHMARK_CAPTURE(BM_ConstructIterIter, vector_bool, std::vector<bool>{}, getRandomIntegerInputs<bool>)->Arg(5140480);
+BENCHMARK_CAPTURE(BM_ConstructFromRange, vector_bool, std::vector<bool>{}, getRandomIntegerInputs<bool>)->Arg(5140480);
+
+BENCHMARK_CAPTURE(BM_Assign_IterIter, vector_bool, std::vector<bool>{}, getRandomIntegerInputs<bool>)->Arg(5140480);
+BENCHMARK_CAPTURE(BM_Assign_Range, vector_bool, std::vector<bool>{}, getRandomIntegerInputs<bool>)->Arg(5140480);
+
+BENCHMARK_CAPTURE(BM_Insert_Iter_IterIter, vector_bool, std::vector<bool>{}, getRandomIntegerInputs<bool>)
+    ->Arg(5140480);
+BENCHMARK_CAPTURE(BM_Insert_Range, vector_bool, std::vector<bool>{}, getRandomIntegerInputs<bool>)->Arg(5140480);
+BENCHMARK_CAPTURE(BM_Append_Range, vector_bool, std::vector<bool>{}, getRandomIntegerInputs<bool>)->Arg(5140480);
+
+BENCHMARK_MAIN();
\ No newline at end of file
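
The "first partial word" step in the copy.h hunk above builds a mask that covers only the destination bits about to be overwritten. A standalone sketch of that mask arithmetic follows. It assumes a 64-bit word, and `store_partial` is a hypothetical helper written for illustration, not a function from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: overwrites `dn` bits of `word`, starting at bit
// position `ctz`, with the low `dn` bits of `bits`. All other bits of
// `word` are preserved.
inline std::uint64_t store_partial(std::uint64_t word, unsigned ctz, std::uint64_t bits, unsigned dn) {
  constexpr unsigned bits_per_word = 64;
  assert(ctz < bits_per_word && dn > 0 && dn <= bits_per_word - ctz);
  const unsigned clz = bits_per_word - ctz; // bits available from ctz to the top of the word
  // Same shape as the mask in the diff: ones in positions [ctz, ctz + dn).
  const std::uint64_t mask = (~std::uint64_t{0} << ctz) & (~std::uint64_t{0} >> (clz - dn));
  return (word & ~mask) | ((bits << ctz) & mask);
}
```

For example, with `ctz = 4` and `dn = 3` the mask is `0x70`, so writing `0b010` into `0xFF` yields `0xAF`: bits 4 to 6 are replaced and the rest of the word is untouched.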

@winner245 winner245 force-pushed the speed-up-range-function branch 8 times, most recently from b98a2ef to 4239066 Compare December 18, 2024 15:14

github-actions bot commented Dec 18, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@winner245 winner245 force-pushed the speed-up-range-function branch 7 times, most recently from 996d1fe to 81c929a Compare December 21, 2024 17:04
Contributor

@philnik777 philnik777 left a comment

Since you're optimizing an algorithm (and nothing specific to vector<bool> itself) we should just benchmark that instead. That's significantly less convoluted. I'd also like to see some additional tests, especially with iterators that don't return a bool. I'm pretty sure your current implementation is completely broken with that. Lastly, I think we should move this into __copy_impl, since we might be able to unwrap iterators to __bit_iterators. I don't think we do that currently, but I see no reason we couldn't in the future. It would also be nice to improve std::move in the same way (and hopefully share the code).
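
One plausible way the non-bool case could go wrong, assuming the cast used in the posted diff (`static_cast<__storage_type>(*__first) << i`), is that a truthy value other than 1 sets the wrong bit. The sketch below contrasts that cast with converting to `bool` first. `pack_raw` and `pack_as_bool` are hypothetical names, and this is an illustration of the concern rather than a confirmed diagnosis from the review.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical: mirrors the integer cast in the posted diff.
inline std::uint64_t pack_raw(const std::vector<int>& in) {
  assert(in.size() <= 64); // a single word is enough for this illustration
  std::uint64_t w = 0;
  for (std::size_t i = 0; i < in.size(); ++i)
    w |= static_cast<std::uint64_t>(in[i]) << i; // the value 2 lands one bit too high
  return w;
}

// Hypothetical alternative: normalize each element to 0 or 1 before shifting.
inline std::uint64_t pack_as_bool(const std::vector<int>& in) {
  assert(in.size() <= 64);
  std::uint64_t w = 0;
  for (std::size_t i = 0; i < in.size(); ++i)
    w |= static_cast<std::uint64_t>(static_cast<bool>(in[i])) << i;
  return w;
}
```

With the input `{2, 0, 0}`, `pack_raw` produces `2` (element 0 reads back as false and element 1 as true), while `pack_as_bool` produces `1`, which matches what element-wise assignment into a `vector<bool>` would store.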

Contributor Author

Thank you for your suggestion.

> Since you're optimizing an algorithm (and nothing specific to vector<bool> itself) we should just benchmark that instead. That's significantly less convoluted.

My original motivation for this series of work was to improve the performance of vector<bool>. However, I understand your point, and I can focus on optimizing the std::copy and std::move algorithms instead, and benchmark the performance for the algorithms themselves.

> I'd also like to see some additional tests, especially with iterators that don't return a bool. I'm pretty sure your current implementation is completely broken with that.

Since we are dealing with __bit_iterator, my current implementation only works for the bool return type. I plan to add template type constraints to _InIter to ensure it returns types that are either bool or convertible to bool. Do you think this approach meets your expectations?
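
A constraint of the kind proposed here could look like the sketch below. It is written with standard-library names rather than libc++ internals, and `yields_bool_convertible` and `first_as_bool` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical trait: true when the iterator's value_type converts to bool.
template <class It>
using yields_bool_convertible =
    std::is_convertible<typename std::iterator_traits<It>::value_type, bool>;

// Hypothetical constrained function: participates in overload resolution
// only for iterators whose elements convert to bool.
template <class It, typename std::enable_if<yields_bool_convertible<It>::value, int>::type = 0>
bool first_as_bool(It first, It last) {
  assert(first != last);            // precondition: non-empty range
  return static_cast<bool>(*first); // explicit conversion, so 2 becomes true
}
```

Note that the constraint alone only rejects unsuitable element types. The explicit conversion to `bool` is still needed for values such as 2.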

> Lastly, I think we should move this into __copy_impl, since we might be able to unwrap iterators to __bit_iterators. I don't think we do that currently, but I see no reason we couldn't in the future. It would also be nice to improve std::move in the same way (and hopefully share the code).

I agree with you and this was also what I planned to do next.

@winner245 winner245 force-pushed the speed-up-range-function branch 5 times, most recently from 05c18a3 to 7bef800 Compare January 21, 2025 04:01
@winner245 winner245 force-pushed the speed-up-range-function branch 2 times, most recently from 2f32561 to 948da90 Compare January 25, 2025 03:16
@winner245 winner245 force-pushed the speed-up-range-function branch 2 times, most recently from a4f4910 to ad33a01 Compare January 25, 2025 04:54
@winner245 winner245 changed the title from "[libc++] Speed-up vector<bool> range-based operations [3/3]" to "[libc++] Speed-up {random_access, forward}_range-based operations in vector<bool>[3/3]" Jan 25, 2025
Member

@ldionne ldionne left a comment

I think this patch is promising, but it should be rebased onto the latest main.

I would also like to encourage you to think about whether making __bit_iterator a segmented iterator would make sense and whether it might unlock various optimizations without needing such special cases just for __bit_iterator. This may end up being a misleading suggestion, but I'd like you to investigate it since it could pay off if that works.

is_convertible<typename iterator_traits<_InIter>::value_type, bool>::value,
int> = 0>
_LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX14 pair<_InIter, __bit_iterator<_Cp, false> >
operator()(_InIter __first, _Sent __last, __bit_iterator<_Cp, false> __result) const {
Member

Suggested change
- operator()(_InIter __first, _Sent __last, __bit_iterator<_Cp, false> __result) const {
+ operator()(_InIter __first, _Sent __last, __bit_iterator<_Cp, /* IsConst */false> __result) const {

Contributor Author

Thank you for the thought-provoking comment. As suggested, I've implemented segmented iterator inputs for std::copy. With segmented_iterator input, each input segment reduces to a forward_iterator-pair, which is the case this patch optimizes for. As a result, the performance improvements for forward_iterator-pair inputs also extend to segmented_iterator inputs, yielding a 9x speed-up in both cases. For a more detailed explanation, please refer to my updated PR description.
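
The reduction described in this reply can be sketched with a toy model: a segmented input is walked segment by segment, and each segment is handed to a routine that only needs a forward-iterator pair. Everything here is an assumption made for illustration: `BitSink`, `append_bits`, and `append_segmented` are hypothetical names, the segmented input is modeled as a container of containers, and the per-segment routine is kept bit-by-bit for brevity instead of assembling whole words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bit sink: a word buffer plus the number of bits written.
struct BitSink {
  std::vector<std::uint64_t> words;
  std::size_t size = 0;
};

// Forward-iterator-pair case: appends the truth values of [first, last).
template <class ForwardIt>
void append_bits(BitSink& out, ForwardIt first, ForwardIt last) {
  for (; first != last; ++first, ++out.size) {
    if (out.size % 64 == 0)
      out.words.push_back(0); // start a new word
    out.words.back() |= static_cast<std::uint64_t>(static_cast<bool>(*first)) << (out.size % 64);
  }
}

// Segmented case: each segment reduces to one forward-iterator-pair call.
template <class Segments>
void append_segmented(BitSink& out, const Segments& segments) {
  for (const auto& seg : segments)
    append_bits(out, seg.begin(), seg.end());
  assert(out.words.size() == (out.size + 63) / 64); // buffer matches bit count
}
```

Because the sink carries its own bit position, a segment boundary may fall anywhere inside a word, which is why the forward-iterator-pair routine must cope with an unaligned destination.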

@winner245 winner245 force-pushed the speed-up-range-function branch 2 times, most recently from 3dbf418 to 37d52d7 Compare March 23, 2025 02:32
@winner245 winner245 changed the title from "[libc++] Speed-up {random_access, forward}_range-based operations in vector<bool>[3/3]" to "[libc++] Optimize ranges::copy for forward_iterator and segmented_iterator" Mar 23, 2025
@winner245 winner245 force-pushed the speed-up-range-function branch from 37d52d7 to e5c04a1 Compare March 23, 2025 03:19
@winner245 winner245 force-pushed the speed-up-range-function branch from e5c04a1 to 50eb099 Compare March 23, 2025 14:55
@winner245 winner245 force-pushed the speed-up-range-function branch from 50eb099 to d6119cd Compare March 23, 2025 15:34
@winner245 winner245 force-pushed the speed-up-range-function branch from 2876f93 to f0eb051 Compare March 23, 2025 18:22
Labels
libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. performance
4 participants