@@ -7,10 +7,10 @@ Crossbeam, and its correctness proof. this RFC has a [companion PR to `crossbeam
7
7
8
8
# Motivation
9
9
10
- We believed that the current implementation of deque is not fully optimized yet, leaving
11
- non-negligible room for performance improvement. However, it was unclear how to further optimize the
12
- current implementation while guaranteeing its correctness. This RFC resolves this issue by proving
13
- an optimized implementation.
10
+ The current implementation of deque is not fully optimized yet, leaving non-negligible room for
11
+ performance improvement. However, it is unclear how to further optimize the current implementation
12
+ while guaranteeing its correctness. This RFC resolves this issue by proving an optimized
13
+ implementation.
14
14
15
15
16
16
@@ -23,8 +23,8 @@ consistency model, and (2) it is functionally correct, i.e. it acts like a deque
23
23
## Semantics of C/C++ relaxed-memory concurrency
24
24
25
25
For the semantics of C/C++ relaxed-memory concurrency on which the deque implementation is based, we
26
- use the state-of-art [ promising semantics] [ promising ] . For a gentle introduction to the semantics,
27
- we refer the reader to [ this blog post] [ synch-patterns ] .
26
+ use the state-of-the- art [ promising semantics] [ promising ] . For a gentle introduction to the
27
+ semantics, we refer the reader to [ this blog post] [ synch-patterns ] .
28
28
29
29
One caveat of the original promising semantics is its lack of support for memory allocation. In this
30
30
RFC, we use the following semantics for memory allocation:
@@ -33,15 +33,15 @@ RFC, we use the following semantics for memory allocation:
33
33
Q+` . Now we define ` Timestamp = Option<Q+>` , where ` None` means the location is unallocated.
34
34
Suppose ` None < Some(t) ` for all ` t ` .
35
35
36
- - At the beginning, the thread's view is ` None ` for all the locations. When a thread allocates a
36
+ - At the beginning, a thread's view is ` None ` for all the locations. When a thread allocates a
37
37
location, its view on the location becomes ` 0 ` .
38
38
39
39
- If a thread is reading a location and it's view on the location is ` None ` , i.e. it is unallocated
40
- in the thread's point of view, the behavior is undefined.
40
+ in the thread's point of view, then the behavior is undefined.
41
41
42
- We believe this definition is compatible with all the results presented in the [ paper ] [ promising ] on
43
- the promising semantics . We hope [ the Coq formalization] [ promising-coq ] of promising semantics be
44
- updated accordingly soon.
42
+ We believe this definition is compatible with all the results presented in the [ the promising
43
+ semantics paper ] [ promising ] . We hope [ its Coq formalization] [ promising-coq ] be updated accordingly
44
+ soon.
45
45
46
46
47
47
## Proposed implementation
@@ -68,9 +68,9 @@ Chase and Lev proposed [an efficient implementation of deque][chase-lev] on sequ
68
68
memory model, and Lê et. al. proposed [ its variant for ARM and C/C++ relaxed-memory
69
69
concurrency] [ chase-lev-weak ] . This RFC presents an even more optimized variant for C/C++.
70
70
71
- A Chase-Lev deque consists of (1) the ` bottom ` index for the worker , (2) the ` top ` index for both
72
- the worker and the stealers, and (3) the underlying ` buffer ` that contains items. When the buffer is
73
- too small or too large, the worker may replace it with one with more appropriate size:
71
+ A Chase-Lev deque consists of (1) the ` bottom ` index for the owner , (2) the ` top ` index for both the
72
+ owner and the stealers, and (3) the underlying ` buffer ` that contains items. When the buffer is too
73
+ small or too large, the owner may replace it with one with more appropriate size:
74
74
75
75
``` rust
76
76
struct <T : Send > Inner <T > {
@@ -231,14 +231,15 @@ thereby modifying `self.bottom` and `self.buffer`.
231
231
232
232
After ` self.buffer ` is initialized, it is modified only by the owner in ` fn resize(): 'L307 ` . Since
233
233
the contents inside a buffer is always accessed modulo the buffer's capacity (` 'L109 ` , ` 'L211 ` ,
234
- ` 'L409 ` ) and the buffer's size is always nonzero, there is no buffer overrun .
234
+ ` 'L409 ` ) and the buffer's size is always nonzero, there are no buffer overruns .
235
235
236
236
Thus it remains to prove that the buffer is not used after freed. Thanks to Crossbeam, we don't need
237
237
to take care of all the details of memory reclamation; [ this RFC] [ rfc-relaxed-memory ] says that as a
238
238
user of Crossbeam, we only need to prove that:
239
239
240
- - When a buffer is deferred to be dropped (` 'L308 ` ), there is no longer a reference to the buffer in
241
- the shared memory, and no concurrent mutator introduces the reference to the memory again.
240
+ - When a buffer is deferred to be dropped (` 'L308 ` ), there are no longer references to the buffer in
241
+ the shared memory, and no concurrent mutators introduce a reference to the buffer in the memory
242
+ again.
242
243
243
244
You can easily check these properties because the owner basically "owns" the buffer: only the
244
245
owner introduces/eliminates references to a buffer to/from the shared memory. Indeed, before a
@@ -881,33 +882,33 @@ thread calls `steal()` twice. Consider the following execution:
881
882
882
883
``` rust
883
884
// owner calls pop()
884
- 'L01 : read (bottom , 1 , Relaxed )
885
- 'L02 : write (bottom , 0 , Relaxed )
886
- 'L03 : fence (SeqCst )
887
- 'L04 : read (top , 1 , Relaxed ) // from `'L17 `
888
- 'L05 : write (bottom , 1 , Relaxed )
885
+ 'L201 : read (bottom , 1 , Relaxed )
886
+ 'L202 : write (bottom , 0 , Relaxed )
887
+ 'L203 : fence (SeqCst )
888
+ 'L204 : read (top , 1 , Relaxed ) // value from `'L410 `
889
+ 'L207 : write (bottom , 1 , Relaxed )
889
890
// return `Empty`
890
891
891
892
||
892
893
893
894
// stealer calls steal()
894
- 'L11 : read (top , 0 , Relaxed )
895
- 'L12 : fence (SeqCst )
896
- 'L13 : read (bottom , 0 , Acquire ) // from `'L02 `
895
+ 'L401 : read (top , 0 , Relaxed )
896
+ 'L402 : fence (SeqCst )
897
+ 'L404 : read (bottom , 0 , Acquire ) // value from `'L202 `
897
898
// return `Empty`
898
899
899
900
// stealer calls steal()
900
- 'L14 : read (top , 0 , Relaxed )
901
- 'L15 : fence (SeqCst )
902
- 'L16 : read (bottom , 1 , Acquire ) // from `'L05 `
903
- 'L17 : CAS (top , 0 , 1 , Release ) // success
901
+ 'L401 : read (top , 0 , Relaxed )
902
+ 'L402 : fence (SeqCst )
903
+ 'L404 : read (bottom , 1 , Acquire ) // value from `'L207 `
904
+ 'L410 : CAS (top , 0 , 1 , Release ) // success
904
905
// return `42`
905
906
```
906
907
907
908
While this execution is allowed in C11, it is called out-of-thin-air because there is a genuine
908
- dependency cycle among ` 'L04 ` , ` 'L05 ` , ` 'L16 ` , and ` 'L17 ` . There is a [ proposal to ban
909
- out-of-thin-air behaviors] [ n3710 ] in the C11 standards, but the proposal only vaguely described what
910
- does "out-of-thin-air" mean.
909
+ dependency cycle among ` 'L204 ` , ` 'L207 ` , ` 'L404 ` in the second call to ` steal() ` , and ` 'L410 ` . There
910
+ is a [ proposal to ban out-of-thin-air behaviors] [ n3710 ] in the C11 standards, but the proposal only
911
+ vaguely described what does "out-of-thin-air" mean.
911
912
912
913
You can roughly regard the promising semantics as a precise definition of out-of-thin-air
913
914
behaviors. Indeed, the promising semantics forbids this execution. See the proof of
@@ -925,55 +926,55 @@ following execution is possible, but is not linearizable:
925
926
926
927
``` rust
927
928
// owner calls pop()
928
- 'L01 : read (bottom , 1 , Relaxed )
929
- 'L02 : write (bottom , 0 , Relaxed )
930
- 'L03 : fence (SeqCst )
931
- 'L04 : read (top , 0 , Relaxed )
932
- 'L05 : CAS (top , 0 , 1 , Release ) // fail; read from from `'L17 `
933
- 'L06 : write (bottom , 1 , Relaxed )
929
+ 'L201 : read (bottom , 1 , Relaxed )
930
+ 'L202 : write (bottom , 0 , Relaxed )
931
+ 'L203 : fence (SeqCst )
932
+ 'L204 : read (top , 0 , Relaxed )
933
+ 'L213 : CAS (top , 0 , 1 , Release ) // fail; value from `'L410 `
934
+ 'L216 : write (bottom , 1 , Relaxed )
934
935
// return `Empty`
935
936
936
937
||
937
938
938
939
// stealer calls steal()
939
- 'L11 : read (top , 0 , Relaxed )
940
- 'L12 : fence (SeqCst )
941
- 'L13 : read (bottom , 0 , Relaxed ) // from `'L02 `
940
+ 'L401 : read (top , 0 , Relaxed )
941
+ 'L402 : fence (SeqCst )
942
+ 'L404 : read (bottom , 0 , Relaxed ) // value from `'L202 `
942
943
// return `Empty`
943
944
944
945
// stealer calls steal()
945
- 'L14 : read (top , 0 , Relaxed )
946
- 'L15 : fence (SeqCst )
947
- 'L16 : read (bottom , 1 , Relaxed ) // from `'L06 `
948
- 'L17 : CAS (top , 0 , 1 , Release ) // success
946
+ 'L401 : read (top , 0 , Relaxed )
947
+ 'L402 : fence (SeqCst )
948
+ 'L404 : read (bottom , 1 , Relaxed ) // value from `'L216 `
949
+ 'L410 : CAS (top , 0 , 1 , Release ) // success
949
950
// return `42`
950
951
```
951
952
952
- Note that this example is similar to the one above, but there is no dependency from ` 'L05 ` to
953
- ` 'L06 ` . So in order to forbid this execution, we have to ensure the ordering of ` 'L05 ` and ` 'L06 ` ,
954
- e.g. by either (1) release-write at ` 'L216 ` , (2) requiring seqcst ordering for the success case for
955
- the CAS at ` 'L213 ` (as done in [ this paper] [ chase-lev-weak ] ), or (3) requiring acquire ordering for
956
- the failure case for the CAS at ` 'L213 ` . We chose (3), because we believe it incurs the least
957
- performance overhead.
953
+ Note that this example is similar to the one above, but there is no dependency from ` 'L213 ` to
954
+ ` 'L216 ` . So in order to forbid this execution, we have to ensure the ordering of ` 'L213 ` and
955
+ ` 'L216 ` , e.g. by either (1) release-write at ` 'L216 ` , (2) requiring seqcst ordering for the success
956
+ case for the CAS at ` 'L213 ` (as done in [ this paper] [ chase-lev-weak ] ), or (3) requiring acquire
957
+ ordering for the failure case for the CAS at ` 'L213 ` , or (4) issuing an acquire fence after the CAS
958
+ at ` 'L213 ` fails. We chose (3) in this RFC, because we believe it incurs the least performance
959
+ overhead.
958
960
959
961
Even though release/acquire orderings are enough for the success/failure cases of the CAS at
960
962
` 'L213 ` , C11 (and effectively also LLVM and Rust) requires that "The failure argument shall be no
961
- stronger than the success argument." ([ C11] [ c11 ] 7.17.7.4 paragraph 2). So instead, we used
962
- release/relaxed for success/failure cases, and issued an acquire fence for the failure case in the
963
- implementation. This C11 requirement may be fail-safe for most use cases, but can actually be
964
- slightly inefficient in this case.
963
+ stronger than the success argument." ([ C11] [ c11 ] 7.17.7.4 paragraph 2). So instead, we chose (4) in
964
+ the companion implementation. This C11 requirement may be fail-safe for most use cases, but can
965
+ actually be slightly inefficient in this case.
965
966
966
967
It is worth nothing that the CAS at ` 'L213 ` should be strong. Otherwise, a similar execution to the
967
- one above is possible, where the CAS at ` 'L05 ` reads ` top = 0 ` and then spuriously fails.
968
+ one above is possible, where the CAS at ` 'L213 ` reads ` top = 0 ` and then spuriously fails.
968
969
969
970
970
971
## Comparison to Target-dependent Implementations
971
972
972
973
Alternatively, we can write a deque for each target architecture in order to achieve better
973
- performance. For example, [ this paper] [ deque-bounded-tso ] presents a variant of Chase-Lev deque in
974
- the "bounded TSO" x86 model, where you don't need to issue the expensive ` MFENCE ` barrier (think:
974
+ performance. For example, [ this paper] [ deque-bounded-tso ] presents a variant of various deques in
975
+ the "bounded TSO" x86 model, where you don't need to issue the expensive ` mfence ` barrier (think:
975
976
seqcst-fence) in ` pop() ` . Also, [ this paper] [ chase-lev-weak ] presents a version of Chase-Lev deque
976
- for ARMv7 that doesn't issue an ` isync ` -like fence , while the proposed implementation issues
977
+ for ARMv7 that doesn't issue ` isync ` -like fences , while the proposed implementation issues
977
978
some. Probably ` Consume ` is relevant for the latter case. These further optimizations are left as
978
979
future work.
979
980
0 commit comments