A fault-tolerant, linearizable key-value store built on a from-scratch Raft consensus implementation in Go. Long-term goal: a minimal etcd — the kind of thing Kubernetes uses for cluster state.
- Raft consensus: leader election, log replication, and term-based safety
- Log compaction: snapshotting with InstallSnapshot so logs don't grow unbounded
- Linearizable KV: versioned Put/Get operations; stale reads are rejected
- Partition tolerance: correctly handles split-brain, heals on reconnect, checks term + leadership before committing
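The term and leadership check in the last bullet corresponds to the commit rule in the Raft paper: a leader advances its commit index only to an entry that is stored on a majority of servers and was created in the leader's current term. The sketch below illustrates that rule only. The names `advanceCommit`, `logTerms`, and `matchIndex` are hypothetical and are not taken from this repository.

```go
package main

import "fmt"

// advanceCommit returns the commit index a server may move to.
// logTerms[i] is the term of the entry at log index i (index 0 is a sentinel).
// matchIndex[p] is the highest index known to be replicated on peer p,
// with the leader's own last index included as one of the peers.
// All names here are illustrative assumptions, not this repo's API.
func advanceCommit(isLeader bool, currentTerm, commitIndex int, logTerms, matchIndex []int) int {
	if !isLeader {
		return commitIndex // only a leader decides what is committed
	}
	for n := len(logTerms) - 1; n > commitIndex; n-- {
		if logTerms[n] != currentTerm {
			// Terms never decrease along the log, so every remaining
			// entry is from an older term and cannot be committed by
			// counting replicas.
			break
		}
		count := 0
		for _, m := range matchIndex {
			if m >= n {
				count++
			}
		}
		if count*2 > len(matchIndex) {
			return n // stored on a majority and from the current term
		}
	}
	return commitIndex
}

func main() {
	logTerms := []int{0, 1, 1, 2}
	// Index 3 (term 2) is on two of three servers, so it can commit.
	fmt.Println(advanceCommit(true, 2, 0, logTerms, []int{3, 3, 1}))
	// Index 2 is on a majority but is from term 1, so nothing commits.
	fmt.Println(advanceCommit(true, 2, 0, logTerms, []int{3, 2, 1}))
}
```

Older entries still become committed indirectly: once an entry from the current term commits, everything before it in the log is committed as well.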
flowchart TD
subgraph client["Client Layer"]
C["Clerk\nclient.go"]
end
subgraph kv["KV Layer — kvraft1/"]
KVS["KVServer\nserver.go"]
RSM["RSM reader loop\nrsm/rsm.go"]
Store["In-memory KV store\nversioned map[key]→{val, version}"]
end
subgraph raft["Raft Layer — raft1/"]
RL["Raft leader\nraft.go"]
RF1["Raft follower"]
RF2["Raft follower"]
P["Persister\nraft state + snapshot"]
end
C -->|"Get / Put RPC"| KVS
KVS -->|"Submit(op)\nblocks until committed"| RSM
RSM -->|"Start(cmd)\nstartCh → immediate AppendEntries"| RL
RL -->|"AppendEntries RPC"| RF1
RL -->|"AppendEntries RPC"| RF2
RL -.->|"InstallSnapshot RPC\nlagging follower"| RF1
RL -->|"applyCh ApplyMsg\ncommitted index + command"| RSM
RSM -->|"DoOp()"| Store
RSM -.->|"Snapshot()\nwhen log ≥ 90% maxraftstate"| RL
RL -->|"persist()"| P
Each write goes through the Raft log. Reads block until the server confirms it's still the leader (no stale reads). The RSM layer owns the applyCh drain loop — it executes committed ops, wakes blocked Submit() callers via a per-index channel, and triggers snapshotting when the log approaches maxraftstate. On log growth, the KV layer serializes its state and Raft discards the log prefix, shipping the snapshot to lagging followers via InstallSnapshot.
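The drain loop and per-index wake-up described above can be sketched roughly as follows. This is a simplified illustration rather than the code in rsm/rsm.go: the `ApplyMsg` fields, the `wait` helper, and the string result standing in for DoOp() are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// ApplyMsg is a simplified stand-in for what Raft delivers on applyCh.
type ApplyMsg struct {
	Index   int
	Command string
}

// RSM is a hypothetical, trimmed-down replicated state machine wrapper.
type RSM struct {
	mu      sync.Mutex
	waiters map[int]chan string // log index -> channel of the blocked caller
}

func NewRSM() *RSM {
	return &RSM{waiters: make(map[int]chan string)}
}

// wait registers interest in the entry at index and returns the channel
// on which its result will arrive.
func (r *RSM) wait(index int) <-chan string {
	ch := make(chan string, 1) // buffered so the reader loop never blocks
	r.mu.Lock()
	r.waiters[index] = ch
	r.mu.Unlock()
	return ch
}

// readerLoop drains applyCh, applies each committed command in order,
// and wakes the caller (if any) waiting on that log index.
func (r *RSM) readerLoop(applyCh <-chan ApplyMsg) {
	for msg := range applyCh {
		result := "applied:" + msg.Command // stand-in for DoOp()
		r.mu.Lock()
		ch, ok := r.waiters[msg.Index]
		delete(r.waiters, msg.Index)
		r.mu.Unlock()
		if ok {
			ch <- result
		}
	}
}

func main() {
	rsm := NewRSM()
	applyCh := make(chan ApplyMsg)
	go rsm.readerLoop(applyCh)

	done := rsm.wait(1) // what a Submit() caller would block on
	applyCh <- ApplyMsg{Index: 1, Command: "put x=1"}
	fmt.Println(<-done) // prints "applied:put x=1"
	close(applyCh)
}
```

A full implementation also has to confirm that the command committed at that index is the one the caller submitted, because a leadership change can place a different entry at the same index.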
Versioned puts over CAS: every key carries a monotonically increasing version. Put(key, val, version) is rejected if the supplied version doesn't match the key's current version, giving clients optimistic concurrency without distributed locks. This resembles etcd's revision model, where a transaction can compare a key's revision before writing.
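A minimal sketch of the version check, assuming one possible convention: a Put with version 0 creates a key that does not exist yet, and each successful Put increments the version by one. The `Store` type and the error names are illustrative and may differ from server.go.

```go
package main

import "fmt"

// Err is a simple status code; the names are assumptions for this sketch.
type Err string

const (
	OK         Err = "OK"
	ErrNoKey   Err = "ErrNoKey"
	ErrVersion Err = "ErrVersion"
)

type entry struct {
	val     string
	version uint64
}

// Store is a hypothetical in-memory versioned map. It has no locking
// because the RSM applies committed ops one at a time.
type Store struct {
	data map[string]entry
}

func NewStore() *Store {
	return &Store{data: make(map[string]entry)}
}

// Put writes val only if version matches the key's current version.
// Assumed convention: version 0 creates a key that does not exist yet.
func (s *Store) Put(key, val string, version uint64) Err {
	cur, ok := s.data[key]
	if !ok {
		if version != 0 {
			return ErrNoKey
		}
		s.data[key] = entry{val: val, version: 1}
		return OK
	}
	if cur.version != version {
		return ErrVersion // another writer got there first
	}
	s.data[key] = entry{val: val, version: cur.version + 1}
	return OK
}

// Get returns the value and the version needed for the next Put.
func (s *Store) Get(key string) (string, uint64, Err) {
	cur, ok := s.data[key]
	if !ok {
		return "", 0, ErrNoKey
	}
	return cur.val, cur.version, OK
}

func main() {
	s := NewStore()
	fmt.Println(s.Put("k", "a", 0)) // creates the key at version 1
	fmt.Println(s.Put("k", "b", 0)) // stale version, rejected
	fmt.Println(s.Put("k", "b", 1)) // matches, key moves to version 2
	fmt.Println(s.Get("k"))
}
```

A client that receives the version error re-reads the key with Get and retries with the fresh version, which is the usual optimistic concurrency loop.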
Snapshot-driven log truncation — when the KV layer signals a snapshot, Raft replaces the log prefix with (lastIncludedIndex, lastIncludedTerm) and ships the snapshot to lagging followers via InstallSnapshot. This is the same mechanism etcd uses to onboard new members without replaying the full history.
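The index bookkeeping behind prefix truncation is easy to get wrong, so here is a small sketch of it. The `Log` type and its methods are hypothetical; only lastIncludedIndex and lastIncludedTerm come from the description above.

```go
package main

import "fmt"

type LogEntry struct {
	Term    int
	Command string
}

// Log is a hypothetical compactable log. entries[i] holds the entry at
// log index lastIncludedIndex+1+i; everything at or below
// lastIncludedIndex lives only in the snapshot.
type Log struct {
	lastIncludedIndex int
	lastIncludedTerm  int
	entries           []LogEntry
}

func (l *Log) lastIndex() int {
	return l.lastIncludedIndex + len(l.entries)
}

// termAt returns the term at index, or false if the entry has been
// compacted away or does not exist yet.
func (l *Log) termAt(index int) (int, bool) {
	if index == l.lastIncludedIndex {
		return l.lastIncludedTerm, true
	}
	if index < l.lastIncludedIndex || index > l.lastIndex() {
		return 0, false
	}
	return l.entries[index-l.lastIncludedIndex-1].Term, true
}

// Compact drops every entry up to and including index, remembering only
// that entry's index and term. It reports whether anything changed.
func (l *Log) Compact(index int) bool {
	term, ok := l.termAt(index)
	if !ok || index <= l.lastIncludedIndex {
		return false
	}
	// Copy the suffix so the old backing array can be garbage collected.
	rest := append([]LogEntry(nil), l.entries[index-l.lastIncludedIndex:]...)
	l.lastIncludedIndex, l.lastIncludedTerm, l.entries = index, term, rest
	return true
}

func main() {
	l := &Log{entries: []LogEntry{{1, "a"}, {1, "b"}, {2, "c"}, {2, "d"}}}
	fmt.Println(l.Compact(2))                            // snapshot covers indexes 1-2
	fmt.Println(l.lastIncludedIndex, l.lastIncludedTerm) // 2 1
	fmt.Println(l.lastIndex(), len(l.entries))           // 4 2
}
```

Keeping the term of the last included entry lets the AppendEntries consistency check still succeed when prevLogIndex lands exactly on the snapshot boundary.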
Fast path on Start(): Start() signals a dedicated channel that immediately triggers AppendEntries to all peers, rather than waiting for the next heartbeat tick. This cuts commit latency under load, because a new entry no longer waits up to a full heartbeat interval before replication begins.
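One way to wire such a fast path is a capacity-1 channel with a non-blocking send, read by a replication goroutine that also listens to a heartbeat ticker. The sketch below assumes that design. The `Leader` type and the `rounds` channel exist only for the demonstration and are not the structures in raft.go.

```go
package main

import (
	"fmt"
	"time"
)

// Leader is a hypothetical stand-in for the Raft struct; only the
// pieces needed to show the fast path are included.
type Leader struct {
	startCh chan struct{} // capacity 1, so bursts of Start() calls coalesce
	rounds  chan string   // demo only: records what triggered each round
}

func NewLeader() *Leader {
	return &Leader{
		startCh: make(chan struct{}, 1),
		rounds:  make(chan string, 16),
	}
}

// Start would append cmd to the log (omitted here) and then nudge the
// replicator. The send never blocks: if a signal is already pending,
// the upcoming round will pick up this entry too.
func (l *Leader) Start(cmd string) {
	select {
	case l.startCh <- struct{}{}:
	default:
	}
}

// replicate runs one AppendEntries round per trigger. A real round
// would send RPCs to every peer; here it only records the trigger.
func (l *Leader) replicate(heartbeat time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(heartbeat)
	defer ticker.Stop()
	for {
		select {
		case <-l.startCh:
			l.rounds <- "start" // fast path: replicate right away
		case <-ticker.C:
			l.rounds <- "heartbeat" // idle path: periodic heartbeat
		case <-stop:
			return
		}
	}
}

func main() {
	l := NewLeader()
	stop := make(chan struct{})
	go l.replicate(time.Hour, stop) // long interval so only Start() triggers
	l.Start("put x=1")
	fmt.Println(<-l.rounds) // prints "start"
	close(stop)
}
```

The capacity-1 buffer means a burst of Start() calls produces a single pending signal, so concurrent writers share one replication round instead of each forcing its own.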
cd src/raft1
go test -run TestBasicAgree
go test -run TestSnapshotInstall
go test -run TestConcurrentClients
The immediate target is a standalone gRPC server that speaks a subset of the etcd v3 API, enough to swap in as a Kubernetes datastore for a small cluster. That requires:
- gRPC transport replacing the in-process RPC shim
- Persistent WAL on disk (currently in-memory via Persister)
- Watch streams (key-range notifications on commit)
- Membership changes (Raft joint consensus, §6 of the paper)
- Read index / lease-based reads for lower-latency queries
- In Search of an Understandable Consensus Algorithm — Ongaro & Ousterhout
- etcd internals — membership and storage design