Skip to content

Hash consing

Koz Ross edited this page Jan 30, 2025 · 1 revision

Introduction

To ensure the most compact code, as well as the easiest analysis, Covenant is represented as a directed acyclic graph, rather than a tree-like structure as is more often the case. As we provide a means of building Covenant programmatically in Haskell in this library, we want to ensure that this happens by construction and cannot go wrong. While we could do something like this using mutable references, this is undesirable in a language like Haskell where not absolutely necessary.

Instead, we will use hash consing. This is a pure technique which allows us to maintain a directed acyclic graph structure dynamically, without relying on mutable references. We can also ensure that we maintain as much computational sharing as possible in the program represented by a directed acyclic graph 'automagically' using this technique.

In this writeup, we will describe the problem this technique seeks to address, as well as how we solve it. We will provide simplified examples in Haskell, but these will help those who want to understand how we implemented types like ASGNode and ASGBuilder see the bigger picture of the techniques we use.

Terminology

We use DAG to mean directed acyclic graph. We use ASG, standing for 'abstract syntax graph', analogously to how 'AST' is used to mean 'abstract syntax tree'. An ASG is similar to an AST, except that it is allowed to be a DAG, rather than a tree specifically.

The problem

Consider the following data type:

data AST = 
    Lit Natural | 
    Add AST AST |
    Mul AST AST
    deriving stock (Eq, Show)

This represents computations in the semiring of the natural numbers. As we defined it as an ADT, we naturally get a tree structure. Furthermore, we can use this structure to define other operations. For example, we could define a (static, and inefficient) exponentiation operation:

powOf :: AST -> Natural -> AST
powOf x = \case
    0 -> Lit 1
    1 -> x
    n -> Mul x . powOf x $ n - 1

However, even accounting for the inefficiency of our approach, we create a far larger computation than we need. If we represent the result of powOf x 10 for some x as a tree, we see the issue:

graph TD
  a1[x]
  a2[x]
  a3[x]
  a4[x]
  a5[x]
  a6[x]
  a7[x]
  a8[x]
  a9[x]
  a10[x]
  mul1[Mul]
  mul2[Mul]
  mul3[Mul]
  mul4[Mul]
  mul5[Mul]
  mul6[Mul]
  mul7[Mul]
  mul8[Mul]
  mul9[Mul]
  mul1 --> a1
  mul1 --> mul2
  mul2 --> a2
  mul2 --> mul3
  mul3 --> a3
  mul3 --> mul4
  mul4 --> a4
  mul4 --> mul5
  mul5 --> a5
  mul5 --> mul6
  mul6 --> a6
  mul6 --> mul7
  mul7 --> a7
  mul7 --> mul8
  mul8 --> a8
  mul8 --> mul9
  mul9 --> a9
  mul9 --> a10
Loading

As we duplicate x ten times, this means that if x was a large sub-expression, we would grow the AST by a lot. Furthermore, this obscures the intent: x doesn't need to be recomputed ten times here, even though the AST suggests this. While we could avoid these issues in several ways, it would be easier if instead, we could produce a structure like so:

graph TD
  x
  mul1[Mul]
  mul2[Mul]
  mul3[Mul]
  mul4[Mul]
  mul5[Mul]
  mul6[Mul]
  mul7[Mul]
  mul8[Mul]
  mul9[Mul]
  mul1 --> x
  mul1 --> mul2
  mul2 --> x
  mul2 --> mul3
  mul3 --> x
  mul3 --> mul4
  mul4 --> x
  mul4 --> mul5
  mul5 --> x
  mul5 --> mul6
  mul6 --> x
  mul6 --> mul7
  mul7 --> x
  mul7 --> mul8
  mul8 --> x
  mul8 --> mul9
  mul9 --> x
  mul9 --> x
Loading

Not only is this a smaller representation, it also conveys our intent much better: x now only occurs once. However, this structure is no longer a tree, but a DAG, and thus, AST (or indeed, a single ADT in general) can't represent it.

There is a range of solutions to this problem in various languages (including Haskell), mostly involving mutable references. While this works, it's not ideal: mutability is mostly a performance improvement, and it is easy to produce nonsensical data, such as references that go nowhere, cycles in (supposedly) a DAG, etc. We would like a solution that preserves immutability, doesn't produce situations that make no sense (that is, 'correct by construction') and also enables as much sharing as possible.

The solution

The approach we will use is hash consing. Essentially, this simulates mutable references by replacing self-recursive ADT references in structures similar to AST with an explicit identifier (or reference) value. We then track unique pairings of references and nodes in an auxilliary structure, usually called a bimap. Lastly, we provide a controlled interface to ensure that issues of unrestrained mutable references cannot arise.

More precisely, we need to do several things. Firstly, we need to provide a type representing references. The constructor of this type should not be exposed to our users, to ensure that the invariants we are about to establish cannot be broken. We will also need it to be an instance of Ord, though this is for implementation reasons.

-- Constructor is not exposed to users
newtype Id = Id Word64
    deriving (Eq, Ord) via Word64
    deriving stock Show

What exactly we use for identifiers doesn't matter, as long as they can be compared for equality, have an ordering, and that their representation has sufficient values for us not to 'run out'.

Next, we need a type of ASG node, which is a 'flattened' form of our AST type from before. More precisely, we duplicate the same sum type 'arms', but every self-reference to the type becomes an Id instead. As previously, we do not expose its constructors, and provide it an instance of Ord.

-- Constructors are not exposed to users
data ASGNode = 
    ASGLiteral Natural |
    ASGPlus Id Id |
    ASGTimes Id Id
    deriving stock (Eq, Ord, Show)

We note here that, unlike previously, an ASGNode is not aware of what, if anything, its Ids 'point to'. Whereas in the AST, we had a clear subcomputation structure, in ASGNode, this structure is implicit, and requires an environment specifying what Ids refer to.

To provide such an environment, we need a way of tracking two-way pairings. Something like a regular associative array is not sufficient here, as these only ensure that two different values do not share a key, while we need to ensure that any pair of key and value is unique. For such a structure, we will use Bimap Id ASGNode from Data.Bimap, but there are several ways we could implement such a thing. This structure becomes the environment that specifies the ASG's connections, as well as telling us what any given Id represents.

Lastly, we need to provide an interface for building up an ASG that maintains this environment, while also ensuring that we don't do anything we shouldn't. We do this by wrapping up a State monad, to produce a 'builder' or 'code generator' monad of our own:

-- Constructor is not exposed to users
newtype ASGBuilder (a :: Type) = ASGBuilder (State (Bimap Id ASGNode) a)
    deriving (Functor, Applicative, Monad) via (State (Bimap Id ASGNode))

Note that we do not derive MonadState for this type, as this would potentially allow modifications that don't make sense.

To tie this all together, we provide the following operation. We do not expose this operation directly to users, instead using it to define other functionality which will be:

-- We assume Data.Bimap is imported qualified as Bimap here

idOf :: ASGNode -> ASGBuilder Id
idOf node = ASGBuilder $ do
    binds <- get
    -- Check if this node has an Id assigned already
    case Bimap.lookupR node binds of
        -- We've had a node like this before, so provide the Id it has already
        been assigned
        Just nodeId -> pure nodeId
        -- This node is one we _haven't_ seen before, so we need to find the
        next unused Id, cache this assignment, and return that Id
        Nothing -> do
            let newId = if Bimap.null binds
                        then Id 0
                        -- This is safe, as we know there's at least one entry
                        else let Id highest = fst . Bimap.findMax $ binds
                                in Id (highest + 1)
            newId <$ put (Bimap.insert newId e binds)

This operation acts as a lookup for a given ASGNode's Id, together with a caching capability to ensure that identical nodes are given identical, unique Ids. Thus, no matter how many times we re-create a given ASGNode, its Id will not change, and can be used as a stable reference to it. This ensures that we do not duplicate any computations in our ASG: any given ASGNode can occur once, and only once. This is also why we needed an Ord instance for both Id and ASGNode: Bimap requires these to work.

We can now provide an interface to users that is completely safe:

lit :: Natural -> ASGBuilder Id
lit = idOf . ASGLiteral

add :: Id -> Id -> ASGBuilder Id
add x y = idOf (ASGPlus x y)

mul :: Id -> Id -> ASGBuilder Id
mul x y = idOf (ASGTimes x y)

It is worth noting why this interface is entirely safe. Because Id is not available for us to construct directly, the only way we can receive one to use is by way of one of these three operations. Therefore, Ids are always under the control of the ASGBuilder, which, by way of idOf, ensures that they are always uniquely associated with some ASGNode. Furthermore, to construct 'complex' expressions that have subsexpressions, we must use Ids to refer to subcomputations, which means that the system is 'closed'. Furthermore, because the ASGNode is 'flat', comparisons on it are cheap, allowing this system to work efficiently.

This system is is no way less capable than AST. We can define powOf as before:

newPowOf :: Id -> Natural -> ASGBuilder Id
newPowOf ref = \case
    0 -> lit 1
    1 -> pure ref
    n -> mul ref <$> newPowOf ref $ n - 1

The translation from powOf to newPowOf is straightforward: we replace all ASTs with Id, have the result return in ASGBuilder, and use the provided helper functions to build computations instead of data constructors. In all other regards, the definition is the same, but we benefit from the sharing enabled by the ASG.

ASGBuilder is a code generation monad, rather than an ASG. If we want to 'execute' an ASGBuilder to make an ASG, the easiest thing we can do is return a combination of the Bimap (representing the structure of the graph) and an Id (corresponding to the 'topmost' computation):

runASGBuilder :: ASGBuilder Id -> (Id, Bimap Id ASGNode)
runASGBuilder (ASGBuilder comp) = runState comp Bimap.empty

This is not the most useful interface however, as ASGNode is an opaque type. However, it can be used as a basis for a more useful type, or converted to something else more suited to use as a DAG, without too much effort.

Conclusion

By using hash consing, we solve a problem that ADT-style representations of computations suffer from: a lack of sufficient sharing. Our definition allows us to represent the maximum possible amount of sharing, while also being immutable and safe. Indeed, the solution Covenant itself uses is based on exactly the definition given above.

At the same time, this solution is simplistic, as it doesn't take into account lambdas. This is something Covenant solves in a slightly different way, but is outside the scope of this article.

Clone this wiki locally