-
Notifications
You must be signed in to change notification settings - Fork 113
Description
Maps have a union_with_key method, like union but with a callback to decide what to do when a key exists on both sides:
pub fn union_with_key<F>(self, other: Self, mut f: F) -> Self
where
F: FnMut(&K, V, V) -> VBoth HashMap and OrdMap are internally immutable trees with sharable copy-on-write nodes. Rc::make_mut is used to mutate a node, cloning it if it was previously shared. When many possibly-large maps exist that are created by cloning a previous map and making a possibly-small number of changes, this sharing can be very significant and minimizing the number of make_mut calls can be impactful.
Would you be open to adding a new API for this?
Stage 1
The implementation of union_with_key consumes and iterates one the input map, while it mutates then returns the other one. Because the signature of the callback involves owned V values however, it needs to always call remove then insert. This leads to more make_mut calls than necessary when the value that was already in the map being mutated ends up being used.
Adding a new method where the callback takes borrowed values would help in that case:
enum MergeResult<V> {
UseLeftValue,
UseRightValue,
UseNewValue(V),
} pub fn union_with_merge<F>(self, other: Self, mut merge: F) -> Self
where
F: FnMut(&K, /* left: */ &V, /* right: */ &V) -> MergeResult<V>
{
for (key, right_value) in other {
match self.get(&key) { // get() does not use make_mut() where remove() does
None => {
self.insert(key, right_value);
}
Some(left_value) => {
match merge(&key, left_value, &right_value) {
MergeResult::UseLeftValue => {}, // No insert() here
MergeResult::UseRightValue => {
self.insert(key, right_value);
},
MergeResult::UseNewValue(new_value) => {
self.insert(key, new_value);
},
}
}
}
}
self
}(Omitted: swapping the two maps to mutate the larger one: #163. That swap is where MergeResult::UseRightValue becomes useful.)
Stage 2
This is a bit fuzzy. More of an intuition that something might be possible, than a fully-formed plan.
Writing this issue is motivated by an optimization effort of an algorithm in Mercurial (copy tracing). Currently that algorithm does the equivalent of Stage 1 above when one of the two maps to merge is much (2×) larger than the other. When they are of similar size it goes even further to maximize re-use: call OrdMap::diff and collect all respective key-value pairs that we’d need to insert to transform each of the two maps into a merged map. Only then we pick the side with the smaller set of changes.
Since #113 diff short-circuits shared internal subtrees based on ptr_eq. So more node sharing make diff faster and avoids redundant merge computations, which in turn increases sharing in a virtuous cycle.
Still, collecting two sets of changes before applying one of them feels like more work than necessary.
A union_with_merge method in the im crate could potentially walk internal trees directly instead of using ConsumingIter, merge one sub-tree at a time, short-circuit shared sub-trees based on ptr_eq, and… handwave… have some better algorithm than the one based of OrdMap::diff.
Walking internal trees while mutating them to insert merged value and maintaining all invariants would likely be tricky (diff is comparatively easier since read-only). But does this sound possible at all?
Activity