# Distributed File System using RAFT
This is a simple implementation of a distributed file system. It uses the RAFT consensus protocol to distribute the file system across a cluster of servers (5 in this case) and to ensure that all servers agree on the sequence of actions applied locally to their state machines. It consists of the following parts:
fs is a simple network file server, accessed through a telnet-compatible API. Each file has a version number, and the server keeps only the latest version. There are four commands: read, write, compare-and-swap (cas) and delete.
fs files have an optional expiry time attached to them. In combination with the `cas` command, this facility can be used as a coordination service, much like Zookeeper.

    > go run server.go &
    > telnet localhost 8080
    Connected to localhost.
    Escape character is '^]'
    read foo
    ERR_FILE_NOT_FOUND
    write foo 6
    abcdef
    OK 1
    read foo
    CONTENTS 1 6 0
    abcdef
    cas foo 1 7 0
    ghijklm
    OK 2
    read foo
    CONTENTS 2 7 0
    ghijklm

The format of each of the four commands is shown below:

Command | Success Response | Error Response |
---|---|---|
read filename \r\n | CONTENTS version numbytes exptime remaining\r\n content bytes\r\n | ERR_FILE_NOT_FOUND |
write filename numbytes [exptime]\r\n content bytes\r\n | OK version\r\n | |
cas filename version numbytes [exptime]\r\n content bytes\r\n | OK version\r\n | ERR_VERSION newversion |
delete filename \r\n | OK\r\n | ERR_FILE_NOT_FOUND |

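
As a rough illustration of the command formats above, the following Go sketch builds the bytes a client would send for each command. The helper names (`formatRead`, `formatWrite`, `formatCas`, `formatDelete`) are hypothetical and not part of this repository. The sketch also assumes that the optional exptime field may simply be left out when no expiry is wanted, and that no space is required before the trailing `\r\n`.

```go
package main

import (
	"bytes"
	"fmt"
)

// withBody writes the header line, then the content bytes on their own line.
// Assumption: exptime is omitted from the header when it is 0 (no expiry).
func withBody(header string, content []byte, exptime int) []byte {
	var buf bytes.Buffer
	buf.WriteString(header)
	if exptime > 0 {
		fmt.Fprintf(&buf, " %d", exptime)
	}
	buf.WriteString("\r\n")
	buf.Write(content)
	buf.WriteString("\r\n")
	return buf.Bytes()
}

// formatRead builds a read command for the given file.
func formatRead(filename string) []byte {
	return []byte("read " + filename + "\r\n")
}

// formatDelete builds a delete command for the given file.
func formatDelete(filename string) []byte {
	return []byte("delete " + filename + "\r\n")
}

// formatWrite builds a write command; numbytes is derived from the content.
func formatWrite(filename string, content []byte, exptime int) []byte {
	return withBody(fmt.Sprintf("write %s %d", filename, len(content)), content, exptime)
}

// formatCas builds a compare-and-swap command against the given version.
func formatCas(filename string, version int, content []byte, exptime int) []byte {
	return withBody(fmt.Sprintf("cas %s %d %d", filename, version, len(content)), content, exptime)
}

func main() {
	// %q shows the \r\n terminators explicitly.
	fmt.Printf("%q\n", formatWrite("foo", []byte("abcdef"), 0))
	fmt.Printf("%q\n", formatCas("foo", 1, []byte("ghijklm"), 30))
}
```

Deriving numbytes from `len(content)` keeps the header and the body from disagreeing, which would otherwise leave the server waiting for bytes that never arrive.
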
In addition to the semantic error responses in the table above, all commands can return two further errors: `ERR_CMD_ERR` is returned on a malformed command, and `ERR_INTERNAL` on internal errors. There is one more error, `ERR_REDIRECT <url>`, which the client receives when it sends a command to a node that is not the leader.
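
A client will usually want to tell these error responses apart. The Go sketch below is one possible way to classify a response line. The function `parseError` and the error types are hypothetical names chosen for illustration, and the sketch assumes that the argument of `ERR_VERSION` and `ERR_REDIRECT` is a single whitespace-delimited token.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Sentinel errors for responses that carry no argument.
var (
	ErrFileNotFound = errors.New("file not found")
	ErrCmd          = errors.New("malformed command")
	ErrInternal     = errors.New("internal server error")
)

// VersionError carries the version reported by ERR_VERSION.
type VersionError struct{ NewVersion string }

func (e *VersionError) Error() string { return "version mismatch, server has " + e.NewVersion }

// RedirectError carries the URL reported by ERR_REDIRECT.
type RedirectError struct{ URL string }

func (e *RedirectError) Error() string { return "not the leader, redirected to " + e.URL }

// parseError returns nil for a non-error line, otherwise a typed error.
func parseError(line string) error {
	fields := strings.Fields(line) // also strips the trailing \r\n
	if len(fields) == 0 || !strings.HasPrefix(fields[0], "ERR_") {
		return nil
	}
	arg := ""
	if len(fields) > 1 {
		arg = fields[1]
	}
	switch fields[0] {
	case "ERR_FILE_NOT_FOUND":
		return ErrFileNotFound
	case "ERR_CMD_ERR":
		return ErrCmd
	case "ERR_INTERNAL":
		return ErrInternal
	case "ERR_VERSION":
		return &VersionError{NewVersion: arg}
	case "ERR_REDIRECT":
		return &RedirectError{URL: arg}
	}
	return fmt.Errorf("unrecognized error response: %s", fields[0])
}

func main() {
	// The address below is a made-up example value.
	err := parseError("ERR_REDIRECT localhost:8081\r\n")
	var redirect *RedirectError
	if errors.As(err, &redirect) {
		fmt.Println("retry against", redirect.URL)
	}
}
```
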
For `write` and `cas`, and in the response to the `read` command, the content bytes are on a separate line. Their length is given by numbytes in the first line.
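
Because the content is delimited by its length rather than by a line terminator, it may itself contain `\r\n`. The Go sketch below shows one way a client might read a `CONTENTS` response under that rule. The names `contents`, `readContents` and `parseContents` are hypothetical, and the sketch assumes all three header fields are decimal integers.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// contents holds a parsed CONTENTS response.
type contents struct {
	Version          int
	ExptimeRemaining int
	Data             []byte
}

// readContents reads the header line, then exactly numbytes of content
// followed by the closing \r\n.
func readContents(r *bufio.Reader) (*contents, error) {
	header, err := r.ReadString('\n')
	if err != nil {
		return nil, err
	}
	fields := strings.Fields(header)
	if len(fields) != 4 || fields[0] != "CONTENTS" {
		return nil, fmt.Errorf("unexpected response header: %q", header)
	}
	nums := make([]int, 3) // version, numbytes, exptime remaining
	for i, f := range fields[1:] {
		if nums[i], err = strconv.Atoi(f); err != nil {
			return nil, fmt.Errorf("bad header field %q: %w", f, err)
		}
	}
	if nums[1] < 0 {
		return nil, fmt.Errorf("negative numbytes in header: %q", header)
	}
	body := make([]byte, nums[1]+2) // content plus trailing \r\n
	if _, err := io.ReadFull(r, body); err != nil {
		return nil, err
	}
	if string(body[nums[1]:]) != "\r\n" {
		return nil, fmt.Errorf("content not terminated by \\r\\n")
	}
	return &contents{Version: nums[0], ExptimeRemaining: nums[2], Data: body[:nums[1]]}, nil
}

// parseContents is a convenience wrapper for in-memory input.
func parseContents(s string) (*contents, error) {
	return readContents(bufio.NewReader(strings.NewReader(s)))
}

func main() {
	c, err := parseContents("CONTENTS 2 7 0\r\nghijklm\r\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("version=%d data=%q\n", c.Version, c.Data)
}
```

In a real client the `bufio.Reader` would wrap the TCP connection instead of a string.
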
Files can have an optional expiry time, exptime, expressed in seconds. A subsequent `cas` or `write` cancels an earlier expiry time and imposes the new one. By default, exptime is 0, which represents no expiry.
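
The expiry rule can be pictured with a small in-memory model. This is only a sketch under stated assumptions, not the server's actual code: the `store` type and its methods are hypothetical, time is passed in as plain seconds so the behaviour is easy to follow, and an expired file is assumed to behave as if it did not exist.

```go
package main

import "fmt"

// entry is a stored file; expiresAt == 0 means the file never expires.
type entry struct {
	content   []byte
	expiresAt int64
}

// store is a toy in-memory file table keyed by filename.
type store struct{ files map[string]*entry }

func newStore() *store { return &store{files: make(map[string]*entry)} }

// write replaces the content and resets the expiry: any earlier expiry is
// cancelled, and a new one is set only when exptime > 0.
func (s *store) write(name string, content []byte, exptime, now int64) {
	e := &entry{content: content}
	if exptime > 0 {
		e.expiresAt = now + exptime
	}
	s.files[name] = e
}

// read returns the content unless the file is missing or has expired.
func (s *store) read(name string, now int64) ([]byte, bool) {
	e, ok := s.files[name]
	if !ok || (e.expiresAt != 0 && now >= e.expiresAt) {
		return nil, false
	}
	return e.content, true
}

func main() {
	s := newStore()
	s.write("lock", []byte("owner-a"), 5, 100) // expires at t=105
	_, alive := s.read("lock", 104)
	fmt.Println("alive at t=104:", alive)
	s.write("lock", []byte("owner-a"), 0, 104) // cancels the expiry
	_, alive = s.read("lock", 1000)
	fmt.Println("alive at t=1000:", alive)
}
```
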
raft is the layer responsible for replicating commands across the file servers in the cluster and for keeping them consistent. A command is handed to the state machine of a file server instance only after it has been successfully replicated on a majority of the servers in the cluster.
This coordination is carried out by the leader, a node in the cluster that has been elected by a majority of the nodes.
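
The majority rule can be made concrete with a little arithmetic. The Go sketch below is illustrative only: `majority` and `commitIndex` are hypothetical helpers, `matchIndex` is assumed to hold the highest replicated log index for every server including the leader, and the additional term check that Raft applies before committing is left out for brevity.

```go
package main

import (
	"fmt"
	"sort"
)

// majority returns the smallest number of servers that forms a majority.
func majority(clusterSize int) int {
	return clusterSize/2 + 1
}

// commitIndex returns the highest log index that is present on a majority
// of servers, given each server's highest replicated index.
func commitIndex(matchIndex []int) int {
	if len(matchIndex) == 0 {
		return 0
	}
	sorted := append([]int(nil), matchIndex...) // copy, leave input untouched
	sort.Ints(sorted)
	// After an ascending sort, the element at this position is reached or
	// exceeded by exactly a majority of the entries.
	return sorted[len(sorted)-majority(len(sorted))]
}

func main() {
	fmt.Println("majority of 5:", majority(5))
	fmt.Println("commit index:", commitIndex([]int{5, 5, 2, 1, 0}))
}
```

With five servers, three acknowledgements are enough, so the cluster keeps making progress with up to two servers unavailable.
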
The format of each of the input events for the raft state machine is shown below:

Event | Success Response | Error Response |
---|---|---|
Append (data) | Commit (node id, index, term, data, nil) | Commit (node id, -1, 0, data, error message) |
Timeout () | Alarm (timeout) | |
AppendEntriesRequest (term, logterm, leaderId, prevLogIndex, prevLogTerm, data, leaderCommit) | AppendEntriesResponseEvent (id, index, term, data, true) | AppendEntriesResponseEvent (id, index, term, data, false) |
VoteRequestEvent (candidateId, term, logIndex, logTerm) | VoteResponseEvent (id, term, true) | VoteResponseEvent (id, term, false) |

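
To show how an input event maps to a response, here is a hedged Go sketch of vote handling. The field names follow the table above, but the struct layout, the `node` type and `handleVoteRequest` are assumptions made for illustration. The decision logic follows the usual Raft voting rules (reject stale terms, one vote per term, candidate log at least as up to date), which this project may implement differently.

```go
package main

import "fmt"

// VoteRequestEvent mirrors the fields listed in the events table.
type VoteRequestEvent struct {
	CandidateID, Term, LogIndex, LogTerm int
}

// VoteResponseEvent mirrors (id, term, granted).
type VoteResponseEvent struct {
	ID, Term int
	Granted  bool
}

// node holds only the state needed to decide a vote (hypothetical layout).
type node struct {
	id, curTerm  int
	votedFor     int // -1 means no vote cast in curTerm
	lastLogIndex int
	lastLogTerm  int
}

// handleVoteRequest applies the usual Raft voting rules.
func (n *node) handleVoteRequest(ev VoteRequestEvent) VoteResponseEvent {
	if ev.Term < n.curTerm {
		// Stale candidate: reject and report our newer term.
		return VoteResponseEvent{ID: n.id, Term: n.curTerm, Granted: false}
	}
	if ev.Term > n.curTerm {
		// Newer term: adopt it and forget any earlier vote.
		n.curTerm = ev.Term
		n.votedFor = -1
	}
	upToDate := ev.LogTerm > n.lastLogTerm ||
		(ev.LogTerm == n.lastLogTerm && ev.LogIndex >= n.lastLogIndex)
	if upToDate && (n.votedFor == -1 || n.votedFor == ev.CandidateID) {
		n.votedFor = ev.CandidateID
		return VoteResponseEvent{ID: n.id, Term: n.curTerm, Granted: true}
	}
	return VoteResponseEvent{ID: n.id, Term: n.curTerm, Granted: false}
}

func main() {
	n := &node{id: 1, curTerm: 2, votedFor: -1, lastLogIndex: 3, lastLogTerm: 2}
	resp := n.handleVoteRequest(VoteRequestEvent{CandidateID: 2, Term: 3, LogIndex: 3, LogTerm: 2})
	fmt.Printf("%+v\n", resp)
}
```

In the full state machine, the updated curTerm and votedFor would also be persisted before the response is sent.
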
The format of the output actions generated by the raft state machine is given below:

Action | Description |
---|---|
Send (peerId, event) | Sends the event, i.e. an AppendEntries Request/Response or a Vote Request/Response, to the given peerId |
Commit (index, data, err) | Sends index+data if the commit is successful, else sends data+err to report an error |
Alarm (time) | Sends a Timeout event after time milliseconds |
LogStore (index, term, data) | Indication to append data to the log at the given index |
SaveState (curTerm, votedFor, commitIndex) | Stores the given data, i.e. curTerm, votedFor and commitIndex, to persistent storage for recovery |

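
One way to picture how a surrounding driver could consume these actions is a Go type switch. The sketch below is an assumption about structure, not the repository's actual types: the struct definitions, the `Action` interface and `describe` are hypothetical, and a real driver would perform network or disk I/O where this one only returns a description.

```go
package main

import "fmt"

// Action is implemented by every output of the state machine.
type Action interface{ isAction() }

// The fields follow the table above; the concrete types are assumptions.
type Send struct {
	PeerID int
	Event  interface{}
}
type Commit struct {
	Index int
	Data  []byte
	Err   error
}
type Alarm struct{ Millis int }
type LogStore struct {
	Index, Term int
	Data        []byte
}
type SaveState struct{ CurTerm, VotedFor, CommitIndex int }

func (Send) isAction()      {}
func (Commit) isAction()    {}
func (Alarm) isAction()     {}
func (LogStore) isAction()  {}
func (SaveState) isAction() {}

// describe stands in for the driver's handling of each action.
func describe(a Action) string {
	switch v := a.(type) {
	case Send:
		return fmt.Sprintf("send event to peer %d", v.PeerID)
	case Commit:
		if v.Err != nil {
			return fmt.Sprintf("report error: %v", v.Err)
		}
		return fmt.Sprintf("commit index %d", v.Index)
	case Alarm:
		return fmt.Sprintf("deliver Timeout after %d ms", v.Millis)
	case LogStore:
		return fmt.Sprintf("store log entry %d (term %d)", v.Index, v.Term)
	case SaveState:
		return fmt.Sprintf("persist term=%d votedFor=%d commitIndex=%d", v.CurTerm, v.VotedFor, v.CommitIndex)
	}
	return "unknown action"
}

func main() {
	actions := []Action{
		Alarm{Millis: 150},
		LogStore{Index: 4, Term: 2, Data: []byte("write foo 6")},
		SaveState{CurTerm: 2, VotedFor: 1, CommitIndex: 3},
	}
	for _, a := range actions {
		fmt.Println(describe(a))
	}
}
```

Keeping the state machine free of I/O in this way means it can be tested by feeding it events and inspecting the returned actions.
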

To install the package and run its tests:

    go get github.com/SrishT/assignment4
    go test github.com/SrishT/assignment4/...