-
Notifications
You must be signed in to change notification settings - Fork 1
Hash consing
To ensure the most compact code, as well as the easiest analysis, Covenant is represented as a directed acyclic graph, rather than a tree-like structure as is more often the case. As we provide a means of building Covenant programmatically in Haskell in this library, we want to ensure that this happens by construction and cannot go wrong. While we could do something like this using mutable references, this is undesirable in a language like Haskell where not absolutely necessary.
Instead, we will use hash consing. This is a pure technique which allows us to maintain a directed acyclic graph structure dynamically, without relying on mutable references. We can also ensure that we maintain as much computational sharing as possible in the program represented by a directed acyclic graph 'automagically' using this technique.
In this writeup, we will describe the problem this technique seeks to address,
as well as how we solve it. We will provide simplified examples in Haskell, but
these will help those who want to understand how we implemented types like
ASGNode
and ASGBuilder
see the bigger picture of the techniques we use.
We use DAG to mean directed acyclic graph. We use ASG, standing for 'abstract syntax graph', analogously to how 'AST' is used to mean 'abstract syntax tree'. An ASG is similar to an AST, except that it is allowed to be a DAG, rather than a tree specifically.
Consider the following data type:
data AST =
Lit Natural |
Add AST AST |
Mul AST AST
deriving stock (Eq, Show)
This represents computations in the semiring of the natural numbers. As we defined it as an ADT, we naturally get a tree structure. Furthermore, we can use this structure to define other operations. For example, we could define a (static, and inefficient) exponentiation operation:
powOf :: AST -> Natural -> AST
powOf x = \case
0 -> Lit 1
1 -> x
n -> Mul x . powOf x $ n - 1
However, even accounting for the inefficiency of our approach, we create a far
larger computation than we need. If we represent the result of powOf x 10
for
some x
as a tree, we see the issue:
graph TD
a1[x]
a2[x]
a3[x]
a4[x]
a5[x]
a6[x]
a7[x]
a8[x]
a9[x]
a10[x]
mul1[Mul]
mul2[Mul]
mul3[Mul]
mul4[Mul]
mul5[Mul]
mul6[Mul]
mul7[Mul]
mul8[Mul]
mul9[Mul]
mul1 --> a1
mul1 --> mul2
mul2 --> a2
mul2 --> mul3
mul3 --> a3
mul3 --> mul4
mul4 --> a4
mul4 --> mul5
mul5 --> a5
mul5 --> mul6
mul6 --> a6
mul6 --> mul7
mul7 --> a7
mul7 --> mul8
mul8 --> a8
mul8 --> mul9
mul9 --> a9
mul9 --> a10
As we duplicate x
ten times, this means that if x
was a large
sub-expression, we would grow the AST by a lot. Furthermore, this obscures the
intent: x
doesn't need to be recomputed ten times here, even though the AST
suggests this. While we could avoid these issues in several ways, it would be
easier if instead, we could produce a structure like so:
graph TD
x
mul1[Mul]
mul2[Mul]
mul3[Mul]
mul4[Mul]
mul5[Mul]
mul6[Mul]
mul7[Mul]
mul8[Mul]
mul9[Mul]
mul1 --> x
mul1 --> mul2
mul2 --> x
mul2 --> mul3
mul3 --> x
mul3 --> mul4
mul4 --> x
mul4 --> mul5
mul5 --> x
mul5 --> mul6
mul6 --> x
mul6 --> mul7
mul7 --> x
mul7 --> mul8
mul8 --> x
mul8 --> mul9
mul9 --> x
mul9 --> x
Not only is this a smaller representation, it also conveys our intent much
better: x
now only occurs once. However, this structure is no longer a tree,
but a DAG, and thus, AST
(or indeed, a single ADT in general) can't represent
it.
There is a range of solutions to this problem in various languages (including Haskell), mostly involving mutable references. While this works, it's not ideal: mutability is mostly a performance improvement, and it is easy to produce nonsensical data, such as references that go nowhere, cycles in (supposedly) a DAG, etc. We would like a solution that preserves immutability, doesn't produce situations that make no sense (that is, 'correct by construction') and also enables as much sharing as possible.
The approach we will use is hash consing. Essentially, this simulates
mutable references by replacing self-recursive ADT references in structures
similar to AST
with an explicit identifier (or reference) value. We then track
unique pairings of references and nodes in an auxilliary structure, usually
called a bimap. Lastly, we provide a controlled interface to ensure that
issues of unrestrained mutable references cannot arise.
More precisely, we need to do several things. Firstly, we need to provide a type
representing references. The constructor of this type should not be exposed to
our users, to ensure that the invariants we are about to establish cannot be
broken. We will also need it to be an instance of Ord
, though this is for
implementation reasons.
-- Constructor is not exposed to users
newtype Id = Id Word64
deriving (Eq, Ord) via Word64
deriving stock Show
What exactly we use for identifiers doesn't matter, as long as they can be compared for equality, have an ordering, and that their representation has sufficient values for us not to 'run out'.
Next, we need a type of ASG node, which is a 'flattened' form of our AST
type
from before. More precisely, we duplicate the same sum type 'arms', but every
self-reference to the type becomes an Id
instead. As previously, we do not
expose its constructors, and provide it an instance of Ord
.
-- Constructors are not exposed to users
data ASGNode =
ASGLiteral Natural |
ASGPlus Id Id |
ASGTimes Id Id
deriving stock (Eq, Ord, Show)
We note here that, unlike previously, an ASGNode
is not aware of what, if
anything, its Id
s 'point to'. Whereas in the AST
, we had a clear
subcomputation structure, in ASGNode
, this structure is implicit, and requires
an environment specifying what Id
s refer to.
To provide such an environment, we need a way of tracking two-way pairings.
Something like a regular associative array is not sufficient here, as these only
ensure that two different values do not share a key, while we need to ensure
that any pair of key and value is unique. For such a structure, we will use
Bimap Id ASGNode
from Data.Bimap
, but there are several ways we could
implement such a thing. This structure becomes the environment that specifies
the ASG's connections, as well as telling us what any given Id
represents.
Lastly, we need to provide an interface for building up an ASG that maintains
this environment, while also ensuring that we don't do anything we shouldn't. We
do this by wrapping up a State
monad, to produce a 'builder' or 'code
generator' monad of our own:
-- Constructor is not exposed to users
newtype ASGBuilder (a :: Type) = ASGBuilder (State (Bimap Id ASGNode) a)
deriving (Functor, Applicative, Monad) via (State (Bimap Id ASGNode))
Note that we do not derive MonadState
for this type, as this would potentially
allow modifications that don't make sense.
To tie this all together, we provide the following operation. We do not expose this operation directly to users, instead using it to define other functionality which will be:
-- We assume Data.Bimap is imported qualified as Bimap here
idOf :: ASGNode -> ASGBuilder Id
idOf node = ASGBuilder $ do
binds <- get
-- Check if this node has an Id assigned already
case Bimap.lookupR node binds of
-- We've had a node like this before, so provide the Id it has already
been assigned
Just nodeId -> pure nodeId
-- This node is one we _haven't_ seen before, so we need to find the
next unused Id, cache this assignment, and return that Id
Nothing -> do
let newId = if Bimap.null binds
then Id 0
-- This is safe, as we know there's at least one entry
else let Id highest = fst . Bimap.findMax $ binds
in Id (highest + 1)
newId <$ put (Bimap.insert newId e binds)
This operation acts as a lookup for a given ASGNode
's Id
, together with a
caching capability to ensure that identical nodes are given identical, unique
Id
s. Thus, no matter how many times we re-create a given ASGNode
, its Id
will not change, and can be used as a stable reference to it. This ensures that
we do not duplicate any computations in our ASG: any given ASGNode
can occur
once, and only once. This is also why we needed an
Ord
instance for both Id
and ASGNode
: Bimap
requires these to work.
We can now provide an interface to users that is completely safe:
lit :: Natural -> ASGBuilder Id
lit = idOf . ASGLiteral
add :: Id -> Id -> ASGBuilder Id
add x y = idOf (ASGPlus x y)
mul :: Id -> Id -> ASGBuilder Id
mul x y = idOf (ASGTimes x y)
It is worth noting why this interface is entirely safe. Because Id
is not
available for us to construct directly, the only way we can receive one to use
is by way of one of these three operations. Therefore, Id
s are always under
the control of the ASGBuilder
, which, by way of idOf
, ensures that they are
always uniquely associated with some ASGNode
. Furthermore, to construct
'complex' expressions that have subsexpressions, we must use Id
s to refer to
subcomputations, which means that the system is 'closed'. Furthermore, because
the ASGNode
is 'flat', comparisons on it are cheap, allowing this system to
work efficiently.
This system is is no way less capable than AST
. We can define powOf
as
before:
newPowOf :: Id -> Natural -> ASGBuilder Id
newPowOf ref = \case
0 -> lit 1
1 -> pure ref
n -> mul ref <$> newPowOf ref $ n - 1
The translation from powOf
to newPowOf
is straightforward: we replace all
AST
s with Id
, have the result return in ASGBuilder
, and use the provided
helper functions to build computations instead of data constructors. In all
other regards, the definition is the same, but we benefit from the sharing
enabled by the ASG.
ASGBuilder
is a code generation monad, rather than an ASG. If we want to
'execute' an ASGBuilder
to make an ASG, the easiest thing we can do is return
a combination of the Bimap
(representing the structure of the graph) and an
Id
(corresponding to the 'topmost' computation):
runASGBuilder :: ASGBuilder Id -> (Id, Bimap Id ASGNode)
runASGBuilder (ASGBuilder comp) = runState comp Bimap.empty
This is not the most useful interface however, as ASGNode
is an opaque type.
However, it can be used as a basis for a more useful type, or converted to
something else more suited to use as a DAG, without too much effort.
By using hash consing, we solve a problem that ADT-style representations of computations suffer from: a lack of sufficient sharing. Our definition allows us to represent the maximum possible amount of sharing, while also being immutable and safe. Indeed, the solution Covenant itself uses is based on exactly the definition given above.
At the same time, this solution is simplistic, as it doesn't take into account lambdas. This is something Covenant solves in a slightly different way, but is outside the scope of this article.