@@ -969,12 +969,31 @@ one above is possible, where the CAS at `'L213` reads `top = 0` and then spuriou
## Comparison to Target-dependent Implementations

Alternatively, we can write a deque for each target architecture in order to achieve better
- performance. For example, [this paper][deque-bounded-tso] presents a variant of various deques in
- the "bounded TSO" x86 model, where you don't need to issue the expensive `mfence` barrier (think:
- seqcst-fence) in `pop()`. Also, [this paper][chase-lev-weak] presents a version of Chase-Lev deque
- for ARMv7 that doesn't issue `isync`-like fences, while the proposed implementation issues
- some. Probably `Consume` is relevant for the latter case. These further optimizations are left as
- future work.
+ performance.
+
+ We believe the proposed implementation is the most efficient in the x86-TSO model. That said,
+ [this paper][deque-bounded-tso] presents variants of several deques in the "bounded x86-TSO" model,
+ where you don't need to issue the expensive `mfence` barrier (think: seqcst-fence) in `pop()`.
+
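To make the cost discussed above concrete, here is a minimal sketch of the store-then-load step in the owner's `pop()` that needs the seqcst-fence. The `Indices` type and the `reserve_bottom` method are hypothetical names for illustration only, not the RFC's actual code, and the buffer and the CAS on `top` are omitted.

```rust
use std::sync::atomic::{fence, AtomicIsize, Ordering};

/// Hypothetical, simplified deque indices (no buffer, no resizing).
struct Indices {
    top: AtomicIsize,
    bottom: AtomicIsize,
}

impl Indices {
    /// Owner-side reservation step of `pop()`.
    /// Returns the reserved bottom index and the `top` value observed afterwards.
    fn reserve_bottom(&self) -> (isize, isize) {
        let b = self.bottom.load(Ordering::Relaxed) - 1;
        self.bottom.store(b, Ordering::Relaxed);
        // The store to `bottom` must be ordered before the load of `top`.
        // On x86 this store-load ordering is what typically costs an `mfence`
        // (or an equivalent locked instruction).
        fence(Ordering::SeqCst);
        let t = self.top.load(Ordering::Relaxed);
        (b, t)
    }
}

fn main() {
    let idx = Indices { top: AtomicIsize::new(0), bottom: AtomicIsize::new(3) };
    let (b, t) = idx.reserve_bottom();
    assert_eq!((b, t), (2, 0));
    println!("reserved index {b}, observed top {t}");
}
```

The sketch only shows where the fence sits; whether it can be dropped depends on the stronger "bounded x86-TSO" assumptions made in the cited paper.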
+ For ARM/POWER, you can further optimize the compilation result of the proposed implementation as
+ follows:
+
+ - `'L102` can be just a plain load: `'L109` is the only synchronization target, and the two have
+   an RW ctrl dependency.
+
+ - `'L408` can be just a plain load: `'L409` is the only synchronization target, and the two have
+   an RR addr dependency. In an ideal world, this synchronizing dependency should be expressible in
+   C11 using the `Consume` ordering.
+
+ - `'L404` can be just a plain load, but `isync/isb` should be inserted right before `'L408`: the
+   synchronization targets are `'L408`'s read, `'L409`'s read, `'L410`'s read/write, and the end
+   view of `steal()` in the successful case, and they have an RR/RW ctrl+`isync/isb` dependency.
+
+ We believe [this paper][chase-lev-weak] has a bug in its ARMv7 implementation of the Chase-Lev
+ deque. Roughly speaking, it uses a plain load for `'L404` and puts ctrl+`isync/isb` right after
+ `'L409`. But in that case, the reads at `'L408` and `'L409` can be reordered before `'L404`. See
+ §4.2 of [this tutorial][arm-power] on [the MP+dmb+ctrl litmus test][mp+dmb+ctrl] for more
+ details.
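For readers unfamiliar with the message-passing (MP) shape behind that litmus test, the following is a rough Rust sketch of it. The function name `message_passing` and the variables `data` and `flag` are illustrative assumptions and do not correspond to the deque's fields. The writer's `Release` store stands in for the writer-side `dmb`; the reader uses `Acquire`, which is the portable way to get the ordering that a bare control dependency does not provide for a later read on ARM/POWER.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical MP example: returns the value of `data` the reader observes.
fn message_passing() -> usize {
    let data = Arc::new(AtomicUsize::new(0));
    let flag = Arc::new(AtomicBool::new(false));

    let writer = {
        let (data, flag) = (Arc::clone(&data), Arc::clone(&flag));
        thread::spawn(move || {
            data.store(42, Ordering::Relaxed);
            // Orders the write to `data` before the write to `flag`.
            flag.store(true, Ordering::Release);
        })
    };

    // Reader: spin until the flag is set. If this load were `Relaxed`, the
    // branch on its result would only be a control dependency, which does
    // not order the subsequent read of `data` on ARM/POWER.
    while !flag.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    let seen = data.load(Ordering::Relaxed);

    writer.join().unwrap();
    seen
}

fn main() {
    assert_eq!(message_passing(), 42);
}
```

With the `Acquire` load, observing `flag == true` guarantees that the reader sees `data == 42`. The bug described above corresponds to a reader-side read being performed before the point at which the ordering is established.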
@@ -992,3 +1011,5 @@ future work.
[cppatomic]: http://en.cppreference.com/w/cpp/atomic/atomic
[n3710]: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3710.html
[c11]: www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf
+ [mp+dmb+ctrl]: https://www.cl.cam.ac.uk/~pes20/arm-supplemental/arm033.html
+ [arm-power]: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf