henryqingmo/KubeKV

KubeKV

A fault-tolerant, linearizable key-value store built on a from-scratch Raft consensus implementation in Go. Long-term goal: a minimal etcd — the kind of thing Kubernetes uses for cluster state.

What it does

  • Raft consensus — leader election, log replication, and term-based safety
  • Log compaction — snapshotting with InstallSnapshot so logs don't grow unbounded
  • Linearizable KV — versioned Put/Get operations; stale reads are rejected
  • Partition tolerance — correctly handles split-brain, heals on reconnect, checks term + leadership before committing

Architecture

flowchart TD
    subgraph client["Client Layer"]
        C["Clerk\nclient.go"]
    end

    subgraph kv["KV Layer  — kvraft1/"]
        KVS["KVServer\nserver.go"]
        RSM["RSM  reader loop\nrsm/rsm.go"]
        Store["In-memory KV store\nversioned map[key]→{val, version}"]
    end

    subgraph raft["Raft Layer  — raft1/"]
        RL["Raft leader\nraft.go"]
        RF1["Raft follower"]
        RF2["Raft follower"]
        P["Persister\nraft state + snapshot"]
    end

    C -->|"Get / Put RPC"| KVS
    KVS -->|"Submit(op)\nblocks until committed"| RSM
    RSM -->|"Start(cmd)\nstartCh → immediate AppendEntries"| RL
    RL -->|"AppendEntries RPC"| RF1
    RL -->|"AppendEntries RPC"| RF2
    RL -.->|"InstallSnapshot RPC\nlagging follower"| RF1
    RL -->|"applyCh ApplyMsg\ncommitted index + command"| RSM
    RSM -->|"DoOp()"| Store
    RSM -.->|"Snapshot()\nwhen log ≥ 90% maxraftstate"| RL
    RL -->|"persist()"| P

Each write goes through the Raft log. Reads block until the server confirms it's still the leader (no stale reads). The RSM layer owns the applyCh drain loop — it executes committed ops, wakes blocked Submit() callers via a per-index channel, and triggers snapshotting when the log approaches maxraftstate. On log growth, the KV layer serializes its state and Raft discards the log prefix, shipping the snapshot to lagging followers via InstallSnapshot.
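The drain loop and per-index wake-up described above can be sketched roughly as follows. This is a simplified illustration, not the code in rsm/rsm.go: the `ApplyMsg` fields, the `Wait` helper, and the string result are all assumptions made for the example, and the real layer also has to handle leadership changes and snapshot triggers.

```go
package main

import (
	"fmt"
	"sync"
)

// ApplyMsg is a hypothetical stand-in for what Raft delivers on applyCh.
type ApplyMsg struct {
	Index   int
	Command string
}

// RSM is a simplified replicated-state-machine layer: one reader goroutine
// drains applyCh and wakes whoever is waiting on the applied log index.
type RSM struct {
	mu      sync.Mutex
	waiters map[int]chan string // log index -> channel for that op's result
	applied []string            // stands in for the KV store updated by DoOp()
}

// NewRSM starts the reader loop on the given apply channel.
func NewRSM(applyCh <-chan ApplyMsg) *RSM {
	r := &RSM{waiters: make(map[int]chan string)}
	go r.reader(applyCh)
	return r
}

// Wait registers interest in a log index and returns a channel that
// receives the result once that index has been applied.
func (r *RSM) Wait(index int) <-chan string {
	ch := make(chan string, 1) // buffered so the reader never blocks on a slow waiter
	r.mu.Lock()
	r.waiters[index] = ch
	r.mu.Unlock()
	return ch
}

// reader executes each committed command in log order and notifies
// the waiter for that index, if there is one.
func (r *RSM) reader(applyCh <-chan ApplyMsg) {
	for msg := range applyCh {
		r.mu.Lock()
		r.applied = append(r.applied, msg.Command)
		if ch, ok := r.waiters[msg.Index]; ok {
			ch <- "applied:" + msg.Command
			delete(r.waiters, msg.Index)
		}
		r.mu.Unlock()
	}
}

func main() {
	applyCh := make(chan ApplyMsg)
	rsm := NewRSM(applyCh)
	done := rsm.Wait(1)                               // a Submit() caller waits on index 1
	applyCh <- ApplyMsg{Index: 1, Command: "put x=1"} // Raft reports index 1 as committed
	fmt.Println(<-done)
}
```

A fuller version would also compare the applied command against the one the caller submitted, since a different entry can end up at the same index after a leader change.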

Key design decisions

Versioned puts over CAS — every key carries a monotonic version. Put(key, val, version) is rejected if the version doesn't match, giving clients optimistic concurrency without distributed locks. This maps cleanly to how etcd's watch + revision model works.
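As a rough illustration of the versioned-put rule, here is a minimal single-threaded sketch. The `Store` type, the error names, and the convention that version 0 means "create the key" are assumptions for the example and may differ from what server.go does.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error values for this sketch.
var (
	ErrNoKey   = errors.New("no such key")
	ErrVersion = errors.New("version mismatch")
)

type entry struct {
	val     string
	version uint64
}

// Store is an in-memory versioned map. It has no lock because this sketch
// assumes operations are applied one at a time, in log order.
type Store struct {
	data map[string]entry
}

func NewStore() *Store { return &Store{data: make(map[string]entry)} }

// Get returns the value and its current version.
func (s *Store) Get(key string) (string, uint64, error) {
	e, ok := s.data[key]
	if !ok {
		return "", 0, ErrNoKey
	}
	return e.val, e.version, nil
}

// Put writes only when version matches the key's current version.
// Assumption: version 0 creates a key that does not exist yet.
func (s *Store) Put(key, val string, version uint64) error {
	e, ok := s.data[key]
	if !ok {
		if version != 0 {
			return ErrNoKey
		}
		s.data[key] = entry{val: val, version: 1}
		return nil
	}
	if e.version != version {
		return ErrVersion
	}
	s.data[key] = entry{val: val, version: e.version + 1}
	return nil
}

func main() {
	s := NewStore()
	fmt.Println(s.Put("a", "1", 0)) // creates the key at version 1
	fmt.Println(s.Put("a", "2", 0)) // rejected: the key is now at version 1
	fmt.Println(s.Put("a", "2", 1)) // accepted: version matches
	fmt.Println(s.Get("a"))
}
```

A client that loses the race simply re-reads the key to learn the new version and retries, which is what makes the scheme lock-free from the client's point of view.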

Snapshot-driven log truncation — when the KV layer signals a snapshot, Raft replaces the log prefix with (lastIncludedIndex, lastIncludedTerm) and ships the snapshot to lagging followers via InstallSnapshot. This is the same mechanism etcd uses to onboard new members without replaying the full history.
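The index bookkeeping behind prefix truncation is easy to get wrong, so a small sketch may help. The `Log` type and its method names below are hypothetical; the only parts taken from the description above are the `lastIncludedIndex` and `lastIncludedTerm` markers.

```go
package main

import "fmt"

// Entry is a hypothetical log entry.
type Entry struct {
	Term    int
	Command string
}

// Log keeps only the entries after the snapshot point.
// entries[i] holds absolute log index lastIncludedIndex+1+i.
type Log struct {
	lastIncludedIndex int
	lastIncludedTerm  int
	entries           []Entry
}

// LastIndex returns the absolute index of the newest entry.
func (l *Log) LastIndex() int { return l.lastIncludedIndex + len(l.entries) }

// Term returns the term at an absolute index. ok is false when the index
// was already compacted away or lies beyond the end of the log.
func (l *Log) Term(index int) (term int, ok bool) {
	if index == l.lastIncludedIndex {
		return l.lastIncludedTerm, true
	}
	if index < l.lastIncludedIndex || index > l.LastIndex() {
		return 0, false
	}
	return l.entries[index-l.lastIncludedIndex-1].Term, true
}

// Compact discards every entry up to and including index, remembering
// only that entry's index and term.
func (l *Log) Compact(index int) {
	if index <= l.lastIncludedIndex || index > l.LastIndex() {
		return // nothing to do, or the entry is not in the log
	}
	term, _ := l.Term(index)
	kept := l.entries[index-l.lastIncludedIndex:]
	l.entries = append([]Entry(nil), kept...) // copy so the old prefix can be freed
	l.lastIncludedIndex = index
	l.lastIncludedTerm = term
}

func main() {
	l := &Log{entries: []Entry{{1, "a"}, {1, "b"}, {2, "c"}, {2, "d"}, {3, "e"}}}
	l.Compact(3)
	fmt.Println(l.LastIndex(), len(l.entries))
	fmt.Println(l.Term(3))
	fmt.Println(l.Term(2))
}
```

Keeping the term of the last compacted entry matters because the AppendEntries consistency check still needs it when the next entry to send sits right after the snapshot point.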

Fast path on Start(): Start() signals a dedicated channel that immediately triggers AppendEntries to all peers, rather than waiting for the next heartbeat tick. This cuts commit latency under load, because a new entry no longer waits up to one heartbeat interval before it is sent.
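One way to wire up such a fast path is a buffered channel of capacity one combined with a non-blocking send, sketched below. The function names `replicator` and `signal`, and the `send` callback standing in for the AppendEntries broadcast, are assumptions for the example rather than the names used in raft.go.

```go
package main

import (
	"fmt"
	"time"
)

// replicator models the leader's send loop: it broadcasts on every
// heartbeat tick and also as soon as startCh is signalled.
// send is a hypothetical stand-in for the AppendEntries broadcast.
func replicator(startCh <-chan struct{}, stop <-chan struct{}, heartbeat time.Duration, send func(reason string)) {
	ticker := time.NewTicker(heartbeat)
	defer ticker.Stop()
	for {
		select {
		case <-startCh:
			send("start") // fast path: a new entry was just appended
		case <-ticker.C:
			send("heartbeat") // regular path: keep followers from timing out
		case <-stop:
			return
		}
	}
}

// signal is what Start() would call after appending an entry. The send is
// non-blocking, so with a channel of capacity 1 a burst of Start() calls
// collapses into a single pending wake-up.
func signal(startCh chan<- struct{}) {
	select {
	case startCh <- struct{}{}:
	default: // a wake-up is already pending
	}
}

func main() {
	startCh := make(chan struct{}, 1)
	stop := make(chan struct{})
	sent := make(chan string, 1)

	// A one-hour heartbeat guarantees that only the fast path fires here.
	go replicator(startCh, stop, time.Hour, func(reason string) { sent <- reason })

	signal(startCh)
	fmt.Println(<-sent)
	close(stop)
}
```

The non-blocking send means Start() never stalls while holding the Raft lock, and one broadcast can carry every entry appended since the previous one.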

Running

cd src/raft1
go test -run TestBasicAgree
go test -run TestSnapshotInstall
go test -run TestConcurrentClients

Where this is headed

The immediate target is a standalone gRPC server that speaks a subset of the etcd v3 API — enough to swap in as a Kubernetes datastore for a small cluster. That requires:

  • gRPC transport replacing the in-process RPC shim
  • Persistent WAL on disk (currently in-memory via Persister)
  • Watch streams (key-range notifications on commit)
  • Membership changes (Raft joint consensus, §6 of the paper)
  • Read index / lease-based reads for lower-latency queries

About

Simplified etcd-style metadata store in Go, built on Raft with leader election, replicated logs, crash persistence, snapshots, and a linearizable key-value service.
