Skip to content
Draft

Ra v3 #494

Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
86 commits
Select commit Hold shift + click to select a range
c61f8ba
Remove deprected APIs in ra module.
kjnilsson Jan 2, 2025
adb2db6
Design doc wip
kjnilsson Jan 2, 2025
00f66ed
WIP: Introduce Ra worker process - first draft
kjnilsson Jan 31, 2025
dc6334a
remove external readers
kjnilsson Feb 17, 2025
339d51d
Add ra_machine:live_indexes callback/API
kjnilsson Feb 25, 2025
9c36303
fix tests
kjnilsson Feb 25, 2025
0a2b635
WAL: remove write strategies.
kjnilsson Feb 25, 2025
2a70310
WAL: refactor config to reduce duplication.
kjnilsson Feb 25, 2025
e228fb5
Extend ra_log_snapshot_state to support live indexes.
kjnilsson Mar 3, 2025
2f270ca
WIP: WAL sparse writes
kjnilsson Mar 5, 2025
9def150
wip
kjnilsson Mar 17, 2025
eb5a8f2
Sparse writes
kjnilsson Mar 18, 2025
dfa3db8
docs
kjnilsson Mar 24, 2025
a8f610b
wip
kjnilsson Apr 8, 2025
f35e55c
wip
kjnilsson Apr 17, 2025
7f7088b
broken wip
kjnilsson Apr 17, 2025
b843a54
ra mt sparse compat
kjnilsson Apr 23, 2025
a7bafe7
wip
kjnilsson Apr 28, 2025
bfd5218
finished ra_seq refactoring
kjnilsson May 1, 2025
451019a
wip
kjnilsson May 1, 2025
084ae3b
wip
kjnilsson May 9, 2025
d7ad27d
first cut phase 1 compaction impl
kjnilsson May 9, 2025
01b1792
wip ra_kv
kjnilsson May 20, 2025
f9891e7
snapshot api changes
kjnilsson May 21, 2025
f765da5
darned dialyzer
kjnilsson May 22, 2025
b369c42
refine compaction design
kjnilsson May 27, 2025
594eb42
add segment copy op
kjnilsson May 28, 2025
b99a647
refactor log to use range
kjnilsson May 28, 2025
bc3aab6
add assertion on log state
kjnilsson May 29, 2025
bb6beaa
Delete overwritten segment files when detected.
kjnilsson Jun 2, 2025
1d7e0fb
ra_log_reader -> ra_log_segments
kjnilsson Jun 2, 2025
630d358
Change SegRef structure
kjnilsson Jun 2, 2025
c789756
wip
kjnilsson Jun 3, 2025
a062c3e
major compactions
kjnilsson Jun 6, 2025
6cc53c8
compaction counters
kjnilsson Jun 16, 2025
1acd702
only flush live indexes in seg writer
kjnilsson Jun 16, 2025
e3d2673
wip
kjnilsson Jun 18, 2025
6f2c3fa
bug fix
kjnilsson Jun 18, 2025
d2163a8
Take live data sizes into account when grouping for compaction
kjnilsson Jun 23, 2025
bd06298
Fix snapshot live indexes replication bug
kjnilsson Jun 24, 2025
c499800
fix msg
kjnilsson Jun 27, 2025
f15cece
Sparse write fixes
kjnilsson Jun 30, 2025
1fbe83b
Fix multiple issues around interrupted snapshot replication
kjnilsson Jul 1, 2025
84ec18d
fixes
kjnilsson Jul 2, 2025
7e9c5c0
suspension fixes
kjnilsson Jul 3, 2025
27a4989
compaction refactoring
kjnilsson Jul 4, 2025
edceffa
[skip ci] Tool for stress-testing Ra KV to test Ra v3
mkuratczyk Jul 7, 2025
fe2abcd
avoid concurrent compactions
kjnilsson Jul 7, 2025
738a1b3
mosty dialyzer fixes
kjnilsson Jul 7, 2025
e23fb6a
add ra_kv:remove_member/3
kjnilsson Jul 7, 2025
d831936
harness improvements
kjnilsson Jul 7, 2025
7348b99
fixes
kjnilsson Jul 7, 2025
4fdf866
tweaks
kjnilsson Jul 7, 2025
c6dee9f
Avoid complete segments re-init when releasing resources.
kjnilsson Jul 8, 2025
d5d7bea
ra_kv_harness: add kill_wal and kill_member ops
mkuratczyk Jul 8, 2025
0c826cb
Harness tweaks
mkuratczyk Jul 8, 2025
cbe94c8
Enable maybe_expre for OTP26 pipelines
mkuratczyk Jul 8, 2025
df8cfa9
fixes
kjnilsson Jul 8, 2025
dd13a19
fix badkey
kjnilsson Jul 9, 2025
3e3d113
handle missing segments event after follower recovery
kjnilsson Jul 9, 2025
cdec12d
Make snapshots with a pre phase restartable
kjnilsson Jul 11, 2025
a495e39
wip
kjnilsson Jul 16, 2025
016921a
Reset log after aborted snapshot installation.
kjnilsson Jul 17, 2025
8b26ed6
Remove dangling symlinks on log init.
kjnilsson Jul 17, 2025
5a34bec
ra_log:set_last_index/2 fix
kjnilsson Jul 18, 2025
fd0e2d7
fixes
kjnilsson Jul 21, 2025
992f301
Handle `undefined` from whereis/1
mkuratczyk Jul 21, 2025
561a303
Detect sparse log on init
kjnilsson Jul 22, 2025
b81cbdc
Preform ra_kv:get calls on a member
mkuratczyk Jul 22, 2025
f486b94
fixes
kjnilsson Jul 23, 2025
1a1333f
fixes
kjnilsson Jul 24, 2025
91c1087
log
kjnilsson Jul 24, 2025
592104e
Consistent Aux commands
kjnilsson Aug 28, 2025
61abc99
ra_kv:get fixes
kjnilsson Aug 28, 2025
7a7b08e
harness
kjnilsson Aug 29, 2025
44b9d4b
Various optimisations
kjnilsson Sep 4, 2025
f3481db
Fix regression around snapshot machine versions.
kjnilsson Sep 5, 2025
bc1c663
fix
kjnilsson Sep 15, 2025
a963d67
Support read_entries calls in more states
kjnilsson Oct 6, 2025
33ea677
Optimise snapshotting by moving the call to ra_machine:live_indexes
kjnilsson Oct 6, 2025
d4575bd
ra_kv_harness: add network partitions
mkuratczyk Oct 8, 2025
83966e1
at least it builds again
kjnilsson Oct 10, 2025
4cdc6d9
minor refactoring
kjnilsson Oct 10, 2025
02e75d0
remove ra_log_prop again after rebase
kjnilsson Oct 10, 2025
b1da8e5
dialyzer fixes
kjnilsson Oct 10, 2025
e3ea015
revert accidental change
kjnilsson Oct 10, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -30,3 +30,4 @@ config
doc/*

/.vscode/
.DS_store
2 changes: 1 addition & 1 deletion Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ dep_aten = hex 0.6.0
dep_seshat = hex 1.0.0
DEPS = aten gen_batch_server seshat

TEST_DEPS = proper meck eunit_formatters inet_tcp_proxy
TEST_DEPS = proper meck inet_tcp_proxy

BUILD_DEPS = elvis_mk

Expand Down
14 changes: 0 additions & 14 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -348,20 +348,6 @@ is available in a separate repository.
<td>Indicate whether the wal should compute and validate checksums. Default: `true`</td>
<td>Boolean</td>
</tr>
<tr>
<td>wal_write_strategy</td>
<td>
<ul>
<li>
<code>default</code>: used by default. <code>write(2)</code> system calls are delayed until a buffer is due to be flushed. Then it writes all the data in a single call then <code>fsync</code>s. Fastest option but incurs some additional memory use.
</li>
<li>
<code>o_sync</code>: Like default but will try to open the file with O_SYNC and thus won't need the additional <code>fsync(2)</code> system call. If it fails to open the file with this flag this mode falls back to default.
</li>
</ul>
</td>
<td>Enumeration: <code>default</code> | <code>o_sync</code></td>
</tr>
<tr>
<td>wal_sync_method</td>
<td>
Expand Down
271 changes: 271 additions & 0 deletions docs/internals/COMPACTION.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,271 @@
# Ra log compaction

This is a living document capturing current work on log compaction.

## Overview


Compaction in Ra is intrinsically linked to the snapshotting
feature. Standard Raft snapshotting removes all entries in the Ra log
below the snapshot index where the snapshot is a full representation of
the state machine state.

The high level idea of compacting in Ra is that instead of deleting all
segment data that precedes the snapshot index the snapshot data can emit a list
of live raft indexes which will be kept, either in their original segments
or written to new compacted segments. The data for these indexes can then
be omitted from the snapshot to reduce its size and the write amplification
incurred by writing the snapshot.

### Log sections

Two named sections of the log then emerge:

#### Normal log section

The normal log section is the contiguous log that follows the last snapshot.

#### Compacting log section

The compacting log section consists of all live raft indexes that are lower
than or equal to the last snapshot taken.

![compaction](compaction1.jpg)


## Compaction phases

### Phase 1

Delete whole segments. This is the easiest and most efficient form of "compaction"
and will run immediately after each snapshot is taken.

The run will start with the oldest segment and move towards the newest segment
in the compacting log section. Every segment that has no entries in the live
indexes list returned by the snapshot state will be deleted. Standard Raft
log truncation is achieved by returning an empty list of live indexes.

TODO: how to ensure segments containing overwritten entries only are cleaned
up.

### Compacted segments: naming (phase 3 compaction)

Segment files in a Ra log have numeric names incremented as they are written.
This is essential as the order is required to ensure log integrity.

Desired Properties of phase 3:

* Retain immutability, entries will never be deleted from a segment. Instead they
will be written to a new segment.
* lexicographic sorting of file names needs to be consistent with order of writes
* Compaction walks from the old segment to new
* Easy to recover after unclean shutdown

Segments will be compacted when 2 or more adjacent segments fit into a single
segment.

During compaction the target segment will have the naming format `001-002-003.compacting`
such that each segment (001, 002, 003) name is present in the compacting name.
An upper limit on the maximum number of source segments will have to be set to
ensure the compacting file name doesn't get ridiculously long. E.g. 8.

Once the compacting segment has been synced the lowest numbered segment will
be hard linked to the compacting segment. Each of the compacted
higher numbered segments (003, 004) will then have a symlink created (e.g. 003.link)
pointing to the lowest numbered segment (002)
then the link is renamed to the source file: `003.link -> 003` (NB not atomic).

`001-002-003.compacting` is then deleted (but 002 is still hard linked so the data
will remain).

This naming format means it is easy to identify partially compacted segments
after an unclean exit. All `*.compacting` files with a link count of 1 will
be deleted as it is not clear at what stage the unclean exit occurred.

If a compacting file has a link count of 2 (or more???) the compacting writes
completed and the lowest numbered segment was hard linked to the compacting
segment. We don't know if all symlinks were created correctly so we need to ensure
this during recovery.

Once we've ensured there are hard or symlinks for all the source files the compacting
file can be deleted.

The symlinks are there so that any pending read references to the old
segment name are still valid for some time after but the disk space for the
source segment will still be reclaimed when the links replace the files.

Some time later the symbolic links can be removed.

Single segment compaction would work the same as we can directly rename
e.g. the compacted segment `001.compacting` to `001.segment` without breaking
any references to the segment. Single segment compaction should only be triggered
when a certain limit has been reached, e.g. > 50% of indexes can be cleared up.

TODO: how to handle compaction of segments that have indexes that never were
committed, i.e. overwritten?


#### When does phase 3 compaction run?

Options:

* On a timer
* After phase 1 if needed based on a ratio of live to dead indexes in the compacting section
* After phase 1 if needed based on disk use / ratio of live data to dead.
* Explicitly through a new ra API.

![segments](compaction2.jpg)

### Phase 4 compaction (optional)

At some point the number of live indexes could become completely sparse (no
adjacent indexes) and large which is sub optimal memory wise.

At this point the state machine could implement a "rewrite" command (or we
provide one in Ra) to rewrite a subset or all of the indexes at the head of
the Ra log to "clump" their indexes better together.

This is ofc optional and has replication costs but could be a manually triggered
maintenance option perhaps.

### Ra Server log worker responsibilities

* Write checkpoints and snapshots
* Perform compaction runs
* Report segments to be deleted back to the ra server (NB: the worker does
not perform the segment deletion itself, it needs to report changes back to the
ra server first). The ra server log worker maintains its own list of segments
to avoid double processing


```mermaid
sequenceDiagram
participant segment-writer
participant ra-server
participant ra-server-log

segment-writer--)ra-server: new segments
ra-server-)+ra-server-log: new segments
ra-server-log->>ra-server-log: phase 1 compaction
ra-server-log-)-ra-server: segment changes (new, to be deleted)
ra-server-)+ra-server-log: new snapshot
ra-server-log->>ra-server-log: write snapshot
ra-server-log->>ra-server-log: phase 1 compaction
ra-server-log-)-ra-server: snapshot written, segment changes
```

#### Impact on segment writer process

The segment writer process as well as the WAL relies heavily on the
`ra_log_snapshot_state` table to avoid writing data that is no longer
needed. This table contains the latest snapshot index for every
ra server in the system. In order to do the same for a compacting state machine
the segment writer would need access to the list of live indexes when flushing
the WAL to segments.

Options:

* It could read the latest snapshot to find out the live indexes
* Live indexes are stored in the `ra_log_snapshot_state` table for easy access.

Snapshots can be taken ahead of the segment part of the log meaning that the
segment writer and log worker may modify the log at different times. To allow
this there needs to be an invariant that the log worker never marks the last
segment for deletion as it may have been appended to after or concurrently
to when the log worker evaluated it's state.

The segment writer can query the `ra_log_snapshot_table` to see if the server
is using compaction (i.e. have preceding live entries) and if so read the
live indexes from the snapshot directory (however it is stored). Then it
can proceed writing any live indexes in the compacting section as well as
contiguous entries in the normal log section.


Segment range: (1, 1000)

Memtable range: (1001, 2000)

Snapshot: 1500, live indexes `[1501, 1999]`,


Alt: if the log worker / Ra server is alive the segment writer could call into
the log worker and ask it to do the log flush and thus make easy use of the
live indexes list. If the Ra server is not running but is still registered
the segment writer will flush all entries (if compacting), even those preceding
last snapshot index. This option minimises changes to segment writer but could
delay flush _if_ log worker is busy (doing compaction perhaps) when
the flush request comes in.



### Snapshot replication

With the snapshot now defined as the snapshot state + live preceding raft indexes
the default snapshot replication approach will need to change.

The snapshot sender process (currently transient) first sends all live
entries for the given snapshot, then performs normal chunk based
snapshot replication.


#### Snapshot install procedure

* Sender begins with sending negotiating which live indexes are needed. It is
probably sufficient for the receiver to return it's `last_applied` index and the
sender will send all sparse entries after that index
* Then it proceeds to send the live indexes _before_ the snapshot (so in it's
natural log order).
* The receiving ra process then writes these commands to the WAL as normal but
using a special command / flag to avoid the WAL triggering its' gap detection.
Ideally the specialised command would include the previous idx so that we can
still do gap detection in the sparse sequence (should all sends include prior
sequence so this is the only mode?).
* The sparse writes are written to a new memtable using a new `ra_mt:sparse_write/2`
API that bypasses gap validation and stores a sparse sequence instead of range


#### How to work out which live indexes the follower needs

Follower term indexes:

`{{100, 1000}, 1}, {{1001, 1500}, 2}`

Incoming snapshot at 2000 in term 3

live indexes: `[100, 600, 1200, 1777]`

Follower needs all live indexes greater than it's `last_applied` index.
This is the only safe way to ensure that the live indexes are not stale entries.

If follower `last applied` is: 1500 then follower needs: `[1777]` only.
If follower `last_applied` is: 1100 then follower needs `[1200, 1777]`


#### How to store live indexes with snapshot

* Separate file (that can be rebuilt if needed from the snapshot).


### WAL impact

The WAL will use the `ra_log_snapshot_state` to avoid writing entries that are
lower than a server's last snapshot index. This is an important optimisation
that can help the WAL catch up in cases where it is running a longer mailbox
backlog.

`ra_log_snapshot_state` is going to have to be extended to not just store
the snapshot index but also the machine's smallest live raft index (at time of
snapshot) such that the WAL can use that to reduce write workload instead of
the snapshot index.

Ideally the WAL would only write indexes the precede the snapshot if they're in
the live indexes, however this would no doubt be performance impacting so in this
case it is probably better just to have a secondary lowest live index to use
for decision making.


WAL needs to accept sparse writes without a higher snapshot idx (snap install)
WAL needs to accept contiguous writes with a higher snap idx with and without live indexes
WAL will send ra_seq of entries written in a WAL
SegWriter needs to flush the live indexes preceeding the snapshot index which
_should_ be covered in the sparse sequence of written indexes.
Binary file added docs/internals/compaction1.jpg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/internals/compaction2.jpg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading