Skip to content

Reuse-optimized variant of union_with_key? #166

@SimonSapin

Description

@SimonSapin
Contributor

Maps have a union_with_key method, like union but with a callback to decide what to do when a key exists on both sides:

    pub fn union_with_key<F>(self, other: Self, mut f: F) -> Self
    where
        F: FnMut(&K, V, V) -> V

Both HashMap and OrdMap are internally immutable trees with sharable copy-on-write nodes. Rc::make_mut is used to mutate a node, cloning it if it was previously shared. When many possibly-large maps exist that are created by cloning a previous map and making a possibly-small number of changes, this sharing can be very significant and minimizing the number of make_mut calls can be impactful.

Would you be open to adding a new API for this?

Stage 1

The implementation of union_with_key consumes and iterates one the input map, while it mutates then returns the other one. Because the signature of the callback involves owned V values however, it needs to always call remove then insert. This leads to more make_mut calls than necessary when the value that was already in the map being mutated ends up being used.

Adding a new method where the callback takes borrowed values would help in that case:

enum MergeResult<V> {
    UseLeftValue,
    UseRightValue,
    UseNewValue(V),
}
    pub fn union_with_merge<F>(self, other: Self, mut merge: F) -> Self
    where
        F: FnMut(&K, /* left: */ &V, /* right: */ &V) -> MergeResult<V>
    {
        for (key, right_value) in other {
            match self.get(&key) {  // get() does not use make_mut() where remove() does
                None => {
                    self.insert(key, right_value);
                }
                Some(left_value) => {
                    match merge(&key, left_value, &right_value) {
                        MergeResult::UseLeftValue => {}, // No insert() here
                        MergeResult::UseRightValue => {
                            self.insert(key, right_value);
                        },
                        MergeResult::UseNewValue(new_value) => {
                            self.insert(key, new_value);
                        },
                    }
                }
            }
        }
        self
    }

(Omitted: swapping the two maps to mutate the larger one: #163. That swap is where MergeResult::UseRightValue becomes useful.)

Stage 2

This is a bit fuzzy. More of an intuition that something might be possible, than a fully-formed plan.

Writing this issue is motivated by an optimization effort of an algorithm in Mercurial (copy tracing). Currently that algorithm does the equivalent of Stage 1 above when one of the two maps to merge is much (2×) larger than the other. When they are of similar size it goes even further to maximize re-use: call OrdMap::diff and collect all respective key-value pairs that we’d need to insert to transform each of the two maps into a merged map. Only then we pick the side with the smaller set of changes.

Since #113 diff short-circuits shared internal subtrees based on ptr_eq. So more node sharing make diff faster and avoids redundant merge computations, which in turn increases sharing in a virtuous cycle.

Still, collecting two sets of changes before applying one of them feels like more work than necessary.

A union_with_merge method in the im crate could potentially walk internal trees directly instead of using ConsumingIter, merge one sub-tree at a time, short-circuit shared sub-trees based on ptr_eq, and… handwave… have some better algorithm than the one based of OrdMap::diff.

Walking internal trees while mutating them to insert merged value and maintaining all invariants would likely be tricky (diff is comparatively easier since read-only). But does this sound possible at all?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @SimonSapin

        Issue actions

          Reuse-optimized variant of union_with_key? · Issue #166 · bodil/im-rs