MDC4 Keyspace Sync Notes
Kresten Krab Thorup
The current status of AAE-related code in core, kv, repl, and yokozuna shows some traces of copy-and-paste code. For instance, the module riak_kv_index_hashtree exists in a number of variations, which makes it difficult to bring improvements forward because such improvements or API changes would need to be cross-ported to these various copies.
The exact reasons for, and variations in behavior of, these variants are not clear, and thus it is a significant undertaking to integrate and unify them.
The on-disk format for AAE trees (as implemented in riak_core/hashtree) uses more space than strictly necessary.
The current on-disk format is that each KV is stored in the tree file as
<<$t,TreeId:22/binary,$s,Segment:64/integer,Key/binary>>
TreeId is 22 bytes (20 bytes from the vnode id, and 2 bytes for the N value). The implementation in hashtree uses 1M segments, and the Segment is stored as a 64-bit unsigned integer. Finally, the Key is usually term_to_binary({Bucket,Key}).
See https://github.com/basho/riak_core/blob/develop/src/hashtree.erl#L651-L662
All in all, we can shave this down significantly, saving valuable I/O.
Imagine this:
FlatKey = <<BucketLen:16, Bucket/binary, Key/binary>>,
TreeId = <<Partition:24, N:8>>,
<<$t, TreeId/binary, $s, Segment:32, FlatKey/binary>>
This would shave 33 bytes off each such entry. For a system with 1B keys and N=3, that is an overall reduction in storage of about 100 GB. That's a lot of data that needs to be read to do a rehash.
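As a sanity check of the 33-byte figure, here is a small, purely illustrative Erlang sketch; the function name and the zeroed placeholder fields are assumptions made for the example, not the actual hashtree code:

```erlang
%% Illustrative size comparison of the two entry layouts described above.
entry_saving(Bucket, Key) when is_binary(Bucket), is_binary(Key) ->
    %% current layout: 22-byte TreeId, 64-bit Segment, term_to_binary({B,K})
    Old = <<$t, 0:176, $s, 0:64, (term_to_binary({Bucket, Key}))/binary>>,
    %% proposed layout: 4-byte TreeId (Partition:24, N:8), 32-bit Segment,
    %% and a length-prefixed flat key
    FlatKey = <<(byte_size(Bucket)):16, Bucket/binary, Key/binary>>,
    New = <<$t, 0:24, 3:8, $s, 0:32, FlatKey/binary>>,
    byte_size(Old) - byte_size(New).
```

For plain binary buckets this returns 33 regardless of bucket and key length: 22 bytes from the smaller prefix, and 11 bytes from replacing the term_to_binary tuple framing with a 2-byte length prefix.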
Distinct Q sync means syncing between clusters with different ring sizes.
Distinct N sync means syncing between clusters in which the same bucket is configured with distinct N values.
In the trifork/repl project, we started out trying to tackle only the former challenge. I believe that was a bad decision; if we have to go through the effort of implementing all this, we need a structure that can deal with both.
The way we would deal with distinct Q syncs was to use the maximum of the two clusters' Q values. This implies that one side may have to recompute its hash trees, which could be an unreasonably slow operation.
The difficult part when syncing between clusters in which a given bucket may have different N values is the following:
- either one has to do per-bucket sync, so that this information can be dealt with explicitly; or
- we have to sync only at the "primary responsible" vnode, and group objects regardless of N. With this strategy (assuming the same Q), each vnode would sync all the objects whose keys hash to exactly that vnode (the 1st replica in any preflist, regardless of the applied N value) to the same vnode in the other cluster (see the sketch below). For distinct Q, you would need to do the same, but for the subspaces.
The second approach has the major drawback that it needs all nodes to be up in order to do a full sync.
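As a rough illustration of that second option, here is a minimal sketch of the "primary responsible" rule, assuming a power-of-two ring size Q and SHA-1 hashing of {Bucket, Key} (the function name and the exact hash input are assumptions for the example):

```erlang
%% Sketch: the vnode that syncs an object is the one owning the first
%% position of its preflist, i.e. the partition its key hashes into,
%% regardless of the bucket's N value. Assumes Q is a power of two.
owning_partition({Bucket, Key}, Q) ->
    <<HashInt:160/integer>> = crypto:hash(sha, term_to_binary({Bucket, Key})),
    PartitionSize = (1 bsl 160) div Q,
    HashInt div PartitionSize.    %% 0..Q-1: index of the primary vnode
```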
Thus, to do a proper fullsync with distinct N within the current general architecture, I believe you'd need to group objects by bucket somehow.
I'd suggest creating a separate AAE infrastructure for replication.
This would assume that intra-cluster AAE stays the same, and that repl requirements can be handled separately. This has several advantages, because the design space for intra-cluster AAE has very different requirements and assumptions. One is the latency between nodes; another is that the availability and consistency requirements may well have different characteristics.
Assume we have a new riak_core app called riak_aae. This would be deployed with a large, fixed ring size (say, 1024) and N=2. The ring size of 1024 should be chosen so that it is always larger than any Q value in the set of clusters to be synced. N=2 is chosen so as to provide "fullsync availability". If properly implemented, even larger "aae ring" sizes should not be an issue.
The functionality of riak_aae would be similar to that of riak_kv, but with storage optimized to share leveldb instances across aae-vnodes, just like hashtree does today.
This app could be deployed on a separate set of clusters in the same data centers as the riak clusters to be synced, on the same hardware as riak itself, or adjusted to run inside riak via some indirect sync-vnode addressing scheme. For the present discussion, we will assume that it runs in separate erlang nodes.
For every put/delete in the real riak, a post_commit hook would forward a (Key, Hash) put to the riak_aae system. The riak_aae system uses the first 10 bits of SHA1(Key) to decide where to do the put.
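For illustration, a minimal sketch of that routing rule, assuming an aae ring size of 1024 and that the hashed key combines bucket and key (the function name and the exact key encoding are assumptions):

```erlang
%% Pick the riak_aae partition from the first 10 bits of SHA-1 of the key;
%% 10 bits because the aae ring size is assumed to be 1024.
aae_partition(Bucket, Key) ->
    <<Partition:10, _/bitstring>> = crypto:hash(sha, <<Bucket/binary, Key/binary>>),
    Partition.    %% 0..1023: the aae-vnode that receives the (Key, Hash) put
```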
Now, the high-level functionality of a riak_aae system is the compare operation, which points one riak_aae system at the address of another riak_aae system and responds with a stream of differences. The compare operation takes as an argument a range (From..To) where both From and To are fractions P/Q; here Q is the ring size of the real riak and must be less than the configured maximum of 1024.
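A hypothetical sketch of how such a From..To range could be resolved to aae partitions, assuming Q is a power of two (as riak ring sizes are) and therefore divides the 1024-partition aae ring; the function name is made up for the example:

```erlang
%% Map a compare range expressed as fractions P/Q of the real keyspace onto
%% the riak_aae partitions that cover it. Q must divide the aae ring size.
aae_partitions_for_range({FromP, Q}, {ToP, Q}) when 1024 rem Q =:= 0, ToP > FromP ->
    PerSlice = 1024 div Q,
    lists:seq(FromP * PerSlice, ToP * PerSlice - 1).
```

For example, the range 3/64..4/64 would cover aae partitions 48..63.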
Periodically (perhaps when also rebuilding AAE trees for intra-cluster exchange), primary responsible vnodes will issue a rebuild of the sync dataset for their responsible range. These rebuilds only occur for preflists for which the given vnode is primary responsible, as such vnodes necessarily hold the entire keyset (for all values of N).
Running a compare between two riak_aae clusters is similar to how riak currently runs AAE-based sync (perhaps with a smaller fan-out for its hashtrees, or otherwise optimized for higher-latency links).
- Because they always share ring sizes (1024), there is no need to handshake and rebuild aae trees.
- The riak_aae clusters are also oblivious to the N values of the two riak clusters; this subsystem always has a fixed N=2 for fullsync availability.
As hinted above, such a setup sets the stage for a better set of design decisions that are not tied to the intra-cluster AAE requirements, and it also decouples many things.
The nodes that deal with inter-cluster replication can be configured with other operating requirements, network and disk setups, etc. Also, from a business perspective, repl suddenly becomes a separate service that could potentially also be used to sync other data sources, not just riak.
With riak_aae as envisioned above, there would also be opportunities to reduce storage requirements even further, both on disk and for network traffic.
**Unresolved Issue: Bucket-level Sync**
One challenge not covered in the above is that users may want to replicate certain buckets to only some peer clusters.
In general, it is painful, as far as I can see, if bucket properties need to be recovered dynamically whenever sync trees are scanned / recreated.
Perhaps this could be dealt with by introducing a notion of "sync sets": a tag that can be added to a bucket's configuration, and of which there presumably exists only a limited set. In a sense, this would just be another high-level scoping for buckets; two data centers syncing would need to agree on the sync_set-to-bucket mapping.
Then, when syncing, you'd want to say "I want to sync sync sets x, y and z with cluster B." The riak_aae application would then do 3 fullsyncs with cluster B, one for each sync set.
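As a sketch of how the bucket selection could look (sync_set is an assumed bucket property here, not one that exists today, and the function name is made up):

```erlang
%% Select the buckets that participate in a fullsync of the given sync sets.
%% BucketProps is assumed to be a list of {Bucket, Props} pairs.
buckets_for_sync_sets(BucketProps, WantedSets) ->
    [Bucket || {Bucket, Props} <- BucketProps,
               lists:member(proplists:get_value(sync_set, Props), WantedSets)].
```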
**Unresolved Issue: Integration with Real-Time Repl**
If fullsync is moved into its own subsystem, the integration and reliability of information regarding realtime and fullsync ("how in sync are we?" kinds of questions) can perhaps be tricky to establish.
Yes, seems like we could save a lot of pain improving that.
- I thought the original plan was to keep trees for max(Q) across clusters, which were called subspaces in the original writeup: https://github.com/basho/riak_repl/wiki/MDC4-Independent-Fullsync-RFC
- Modify AAE to compute trees based on a tag, determined on demand from segments and cached in memory. I thought we would probably have to cache the bucket properties to work out whether buckets should be replicated, added to search, etc.
- To sync the two clusters, you have to sync each subspace and active N value
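If I read that right, the implied exchange schedule is one sync per (subspace, active N) pair; a purely illustrative sketch, with a made-up function name:

```erlang
%% One exchange per (subspace, active N) pair, where subspaces are the
%% max(Q) keyspace slices shared by the two clusters.
exchanges(MaxQ, ActiveNs) ->
    [{Subspace, N} || Subspace <- lists:seq(0, MaxQ - 1), N <- ActiveNs].
```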
I remember you worked out there were flaws in the proposal, but I can't remember what they were or if you wrote them down.
Paraphrase: Establish sync service with high ring size, N = 2
Questions:
- What writes to the replicas? KV-style FSMs?
- Would you use fallbacks, or keep a persistent queue back to the primaries?
- For N=1 buckets, it would double the overhead; for N=3, the space overhead would be 2/3, with even better savings for the higher N values used by strong consistency.
- What mechanism synchronizes the N=2 AAE data replicas?
- If puts/deletes are forwarded to the riak_aae system by a postcommit hook on the put coordinator FSM:
  - what happens on a lost message from the coordinator? Is it discovered during vnode syncs?
  - how do you handle a dropped delete message to riak_aae, where the KV vnodes have all removed the object but it is still present in riak_aae? Do you remove it, or is it on a fallback somewhere?
  - what happens on a partition? How do you combine hashes?
I definitely agree we need to
- Get control of the AAE implementations and get back to a single implementation.
- Reduce the impact of calculating them
We did consider (back in 2011) something similar to what you're describing for doing fullsync in MDC, but the work got abandoned due to resource issues. We've been talking internally about how to synchronize with other data stores and I think we hit a similar family of problems.
My thoughts on how to take AAE forward are around building many smaller trees and using commutative hashes to combine them: keeping the AAE metadata closer to the buckets that use them, then combining them together to do the syncs. I've got some ideas, based on our discussions about object epochs, for calculating how recently replicas were synced, as well as improving the distributed tombstone garbage collection story.
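For illustration only, one simple (if collision-prone) way to combine per-tree digests commutatively is to XOR fixed-width hashes, so the result does not depend on the order in which the smaller trees are folded together; this is just a sketch of the idea, not the mechanism referred to above:

```erlang
%% XOR 160-bit digests; XOR is commutative and associative, so the combined
%% value is independent of fold order.
combine_hashes(Hashes) ->
    lists:foldl(fun(<<H:160>>, Acc) -> H bxor Acc end, 0, Hashes).
```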
Now that we're out of annual roadmapping, I'll try to find some time to explore them and write them up.