rfds: Add signature RFD #120

230 changes: 230 additions & 0 deletions rfds/0001-signing-commits/text.md
@@ -0,0 +1,230 @@
# Signatures in Keyhive/Beelay

Keyhive gives us an access control primitive which allows us to determine which devices - identified by Ed25519 public keys - are allowed to write to which parts of the history of a document. Specifically, the content of a document is a commit graph and the keyhive auth graph tells us which ranges of the commit graph are writable by particular keys.

This is great, but to be useful we have to actually know who authored a particular commit. Clearly we want some way of signing commits, but this is a little more fiddly than it might first appear. There are two high-level problems:

1. We have to handle the situation where there are multiple signatures for the same commit
Review comment from @expede (Member), Apr 14, 2025:

Some of these thoughts are easier just to put in one place rather than interspersed throughout the original proposal.


I would argue that we consider them to be distinct commits. The question is less about "which signature" and more about "which author" (which the signature is a proof of).

Consider the following:

add(char: "a", after: vec![0x1234], by: alex.id)
add(char: "a", after: vec![0x1234], by: brooke.id)

I would expect these to be semantically distinct. Vector clocks suggest these semantics more directly:

// Alex's commit
add(char: "a", clock: {alex: 2, brooke: 1, pvh: 99})

// Brooke's commit
add(char: "a", clock: {alex: 1, brooke: 2, pvh: 99})

flowchart TD
    prior["prior commit 0x1234"]
    alex["alex commits 'a'"] --> prior
    brooke["brooke commits 'a'"] --> prior

By analogy, this is related to equality-by-value or equality-by-identity. We can always translate them into an unambiguous form by expanding them out to a form with an author field (or being able to infer authors another way).
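
As a rough illustration of that expansion, here is a minimal sketch in which the author is part of the hashed payload, so the two ops above get distinct identifiers. The `AddOp` shape and `op_id` are hypothetical names, and `DefaultHasher` is only a dependency-free stand-in for a real cryptographic hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical op shape: the author is part of the hashed payload.
#[derive(Hash)]
struct AddOp<'a> {
    ch: char,
    after: &'a [u8],
    by: &'a str, // stand-in for an Ed25519 public key
}

/// Stand-in content address. A real system would use a cryptographic hash;
/// DefaultHasher is used here only to keep the sketch dependency-free.
fn op_id(op: &AddOp) -> u64 {
    let mut hasher = DefaultHasher::new();
    op.hash(&mut hasher);
    hasher.finish()
}
```

With this shape, the same character inserted at the same position by two different authors yields two different identifiers, while re-creating an op with the same author yields the same one.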

I'm not so certain about multiple signatures per commit, but will try to justify why on that line.

2. We need to find a way to compress signatures when building chunks in sedimentree
Review comment from @expede (Member), Apr 14, 2025:

Obviously signatures themselves don't compress, but we can do the following:

  1. Drop all but the last signature before switching authors
  2. Table signatures {usize => Sig} (but probably as a Vec)

There's a variant of 2 where we only need to keep one signature per author in a chunk, but the bookkeeping is a bit more complex. Essentially, have agents sign the last commit in their op stream, and require every op to include a ref to that author's prior commit (no gaps). IMO this is way more fiddly, so I'd rather take the minor hit on a couple of signatures.

Reference Scenario

flowchart TD
    op1
    op2
    op3
    op4
    op5
    op6

    op4 ==> op2 ==> op1
    op4 ==> op3 ==> op1
    op6 ==> op5 ==> op3
    op6 ==> op4


    sig1 -.-> op1
    sig2 -.-> op2
    sig3 -.-> op3
    sig4 -.-> op4
    sig5 -.-> op5
    sig6 -.-> op6

    subgraph sigs
      sig1
      sig2
      sig3
      sig4
      sig5
      sig6
    end

The naive mechanism for this is to include the signature on each op and include it in the hash, but then validation needs all of those intermediate signatures:

flowchart TD
    op1
    op2 ==> sig1
    op3
    op4
    op5 ==> sig4
    op6

    op4 ==> sig2
    op3 ==> sig1
    op4 ==> sig3
    op6 ==> sig5
    op6 ==> sig4


    sig1 -.-> op1
    sig2 -.-> op2
    sig3 -.-> op3
    sig4 -.-> op4
    sig5 -.-> op5
    sig6 -.-> op6

This has a lot of redundant information, especially given the insight that what we really care about is the author ID, which is WAY more compressible.

Signature GC

Signatures over hash linked data can use one signature to cryptographically attest to the entire graph. This is how signing over Merkle trees works, for example.

Mainly we want to make sure that we don't take accountability for data that we didn't generate by dropping ALL inner signatures. We only need to keep the last signature before switching authors.

flowchart TD
    op1
    op2
    op3
    op4
    op5
    op6

    op4 ==> op2 ==> op1
    op4 ==> op3 ==> op1
    op6 ==> op5 ==> op3
    op6 ==> op4


    sig3 -.-> op3
    sig4 -.-> op4
    sig6 -.-> op6

    subgraph sigs

      sig3
      sig4
      sig6
    end

    subgraph alex
      op1
      op2
      op4
    end

    subgraph brooke
      op3
      op5
      op6
    end

There is one challenge here depending on the exact encoding of e.g. RGA. Generally you directly fork from the point that you want to insert characters, but we may have thrown out intermediate signatures. There are a couple of ways to handle this:

1. Merkle Proofs

Point at the signature that contains the op that you want to retain the attestation for, and either include the explicit Merkle proof (heavy) or let the validator check if the op in question was attested by the correct author (this strategy is extremely memoizable).

2. Author Streams

Again, closer to a vector clock: only maintain the last signature of each author, and now we can attest to any op by checking if it's in the author's set. This is a 1:1 mapping if the ID contributes to the op's hash.
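
A minimal sketch of the "author's set" check, assuming each op carries a back-reference to the same author's previous op (the "no gaps" rule mentioned earlier). `Op`, `prev_by_author`, and `attested_set` are hypothetical names, and checking the signature on the head itself is left out.

```rust
use std::collections::{HashMap, HashSet};

type OpId = u32;
type AuthorId = &'static str; // stand-in for an Ed25519 public key

struct Op {
    author: AuthorId,
    /// Previous op by the same author (the assumed "no gaps" back-reference).
    prev_by_author: Option<OpId>,
}

/// Collect every op covered by an author's latest signed op by walking the
/// author's own back-references. The walk stops at an unknown op, an op by
/// a different author, or a repeated op.
fn attested_set(ops: &HashMap<OpId, Op>, author: AuthorId, signed_head: OpId) -> HashSet<OpId> {
    let mut covered = HashSet::new();
    let mut cursor = Some(signed_head);
    while let Some(id) = cursor {
        match ops.get(&id) {
            Some(op) if op.author == author && covered.insert(id) => {
                cursor = op.prev_by_author;
            }
            _ => break,
        }
    }
    covered
}
```

In the reference scenario, walking back from Alex's signed `op4` covers `op1`, `op2`, and `op4`, and walking back from Brooke's signed `op6` covers `op3`, `op5`, and `op6`.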

We actually have a choice for the specific layout.

This variant is similar to the diagram above. My initial gut reaction with this kind of design is "oh no, signatures aren't in the hash graph!" but as noted above we can include the author ID in the hash graph to get uniqueness, and keep signatures off to the side to prove them. My guess is that this version would require fewer changes to Automerge, plus it's actually cleaner in the details versus the alternative (see next option).

flowchart TD
    op1
    op2
    op3
    op4
    op5
    op6

    op4 ==> op2 ==> op1
    op4 ==> op3 ==> op1
    op6 ==> op5 ==> op3
    op6 ==> op4

    sig4 -.-> op4
    sig6 -.-> op6

    subgraph sigs
      sig4
      sig6
    end

    subgraph alex
      op1
      op2
      op4
    end

    subgraph brooke
      op3
      op5
      op6
    end

This one initially feels more satisfying since it's a reduction of the naive version, but you can run into problems missing relevant intermediate signatures that have been GCed in cases like RGA. Reference pointers are also either to a signature or to the op itself, which is kind of annoying. I don't love this version.

flowchart TD
    op1
    op2
    op3
    op4
    op5
    op6

    op4 ==> op2 ==> op1
    op4 ==> sig3
    op3 ==> sig2
    op6 ==> op5 ==> op3
    op6 ==> op4

    sig2 -.-> op2

    sig3 -.-> op3
    sig4 -.-> op4
    sig6 -.-> op6


    subgraph alex
      op1
      sig2
      op2
      sig4
      op4
    end

    subgraph brooke
      sig3
      op3

      op5
      
      sig6
      op6
    end

Tabling

I think this one is pretty obvious. If many authors branch off of the same point in history, we don't want to repeat the 64 bytes over and over. We can maintain a table that references signatures as a usize during compression.
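
A minimal sketch of such a table, assuming signatures are plain 64-byte arrays and that ops store the returned index instead of the signature. `SigTable` and `intern` are hypothetical names.

```rust
use std::collections::HashMap;

type Sig = [u8; 64]; // raw Ed25519 signature bytes

/// Signatures are stored once in `sigs`; ops refer to them by index.
#[derive(Default)]
struct SigTable {
    sigs: Vec<Sig>,
    index: HashMap<Sig, usize>,
}

impl SigTable {
    /// Return the index for `sig`, adding it to the table if it is new.
    fn intern(&mut self, sig: Sig) -> usize {
        if let Some(&i) = self.index.get(&sig) {
            return i;
        }
        let i = self.sigs.len();
        self.sigs.push(sig);
        self.index.insert(sig, i);
        i
    }

    /// Look a signature up again during decompression.
    fn get(&self, i: usize) -> Option<&Sig> {
        self.sigs.get(i)
    }
}
```

The `index` map is only needed while compressing; the serialized form would just be the `Vec`.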


The combination of these problems is tricky and I think suggests that we need to choose one of these two designs:

* Allow there to be multiple signatures per commit which will result in additional complexity in the keyhive/beelay codebase in return for looser requirements on the document types and a more flexible concept of attestation

Review comment (Member):

Just to clarify: is this basically 1. "let more than one author attest to an op" and use that information to infer separate ops, or 2. have multiple authors for any given op?

I don't love 2. I think it's easier to do the other suggestion, since we can either bake that info into the hash-relevant metadata on the op, or do integrity checking at a level before, then strip that info out and hand it off to Automerge. I'm generally in favour of a hash-relevant author field because it RLE compresses really well and gives us pretty great properties (though it would make it harder to access control data that doesn't have a concept of an author or other equivalent info like a vector clock).

* Require that the commits in a keyhive document (i.e. Automerge commits) have some concept of an author which can be related to keyhive identities. This would be an additional requirement we would need to add to Automerge and any other document type which wants to be used with Keyhive

## Multiple Signatures?

The only thing we currently require of a commit in keyhive is that it be some sequence of bytes with some identifier and the identifiers of its parents. For an Automerge document the sequence of bytes is the encoded commit, and the identifiers of the commit and its parents are the hash of the commit and the hashes of its parents. A commit then looks something like this:

```rust
struct Commit<Id> {
    hash: Id,
    parents: Vec<Id>,
    contents: Vec<u8>,
}
```

A straightforward way to add a signature to this would be something like:

```rust
struct Commit<Id> {
    hash: Id,
    parents: Vec<Id>,
    contents: Vec<u8>,
    // Signature over `hash`, `parents`, and `contents`
    signature: ed25519_dalek::Signature,
}
```

Here the signature is over the `hash`, `parents`, and `contents`. The difficulty with this is that there is nothing stopping multiple devices from producing the same commit. What should we do when we receive two commits with the same contents but different signatures? Who should we say is the author of this commit? A scenario which I think is very likely to lead to this is keyhive nodes importing changes into a keyhive document from outside of a keyhive context (i.e. changes produced with vanilla Automerge somewhere, like a legacy app): multiple nodes might import such changes concurrently.

We have to arrive at the same state on every node which has received the same messages, so I think we have these options:

* Choose the author with the lexicographically smallest signature (or some other arbitrary but deterministic rule)

* Ignore commits with multiple signatures
* Allow commits to have multiple authors

The first two seem quite undesirable to me: in the first case the author of a commit would change for no obvious reason, and in the second the document would appear to lose data. This entire scenario also suggests that "author" is not really the correct framing for this kind of signature and that we should instead think of it as "attestation". Accepting multiple signatures on a change means that if at least one signer has write permission then the change should be accepted in the document. This model is much more flexible and would allow workflows where a user is revoked but some other user who has access to the document can still attest to changes the revoked user made, if those changes are deemed good.
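
The acceptance rule for the attestation model can be sketched as follows. This is only an illustration: `PeerId` is a string stand-in for a public key, `is_accepted` is a hypothetical name, and the set of writers is assumed to have already been derived from the keyhive auth graph for the relevant part of the history.

```rust
use std::collections::HashSet;

type PeerId = &'static str; // stand-in for an Ed25519 public key

/// A commit is accepted if at least one of its attestors has write access.
fn is_accepted(attestors: &[PeerId], writers: &HashSet<PeerId>) -> bool {
    attestors.iter().any(|peer| writers.contains(peer))
}
```

A commit attested only by a revoked peer is rejected, but it becomes acceptable again as soon as any peer with write access adds its own attestation.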

### Compressing signed commits

Another problem with signing commits is that signatures do not compress well. We have done a great deal of work to compress away the hashes on each commit in an Automerge document, and retaining the signature for each commit would undo all that work. We still need to know who is attesting to each commit though, so what can we do?

Well, we can perform the same kind of trick that we do with the change hashes. The insight is that in linear runs of commits we only need the signature of the end of the run. For example, take this graph, where hexagons are signed by Bob, rhombuses by Alice, and circles by Charlie:

```mermaid
graph TD
A{{A}}
B{{B}}
C((C))
D((D))
E{E}
F{F}
A --> B
B --> C
C --> D
A --> E
E --> F
```

Conceptually we only need to store the last signature in each run, plus a counter saying how many changes back the run goes. So we would store something like:

| Head of run | Count |
| ----------- | ----- |
| D | 2 |
| F | 2 |
| B | 2 |
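
A minimal sketch of that run-length idea over one linear chain of commits (oldest first). `Run` and `compress_runs` are hypothetical names, and the signature that would accompany each run head is omitted.

```rust
#[derive(Debug, PartialEq)]
struct Run {
    head: char,           // last commit in the run, the only one whose signature is kept
    author: &'static str, // who signed the head
    count: usize,         // how many commits back the run goes
}

/// Collapse a linear chain of (commit, author) pairs, oldest first, into runs.
fn compress_runs(chain: &[(char, &'static str)]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for &(commit, author) in chain {
        let extends_last = runs.last().map_or(false, |run| run.author == author);
        if extends_last {
            // Same author as the previous commit: move the head forward.
            let run = runs.last_mut().expect("checked above");
            run.head = commit;
            run.count += 1;
        } else {
            runs.push(Run { head: commit, author, count: 1 });
        }
    }
    runs
}
```

Applied to the chain A, B, C, D from the graph above, this yields the B and D rows of the table; the E, F branch yields the F row.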

All well and good, but what are we actually signing at the end of each run? If the signature is just over the commit (its hash and its parent hashes) then when we receive a run of commits we have no way of verifying them. Say we had this run of commits originally:

```mermaid
graph LR
A["`A
signed by bob`"]
B["`B
signed by charlie`"]
C["`C
signed by bob`"]
A --> B
B --> C
```

But if Bob sends you a compressed chunk containing A, B, and C, he can just send you the signature on C and tell you the run was three commits long, and you will infer that Bob was the author of B!

The problem is the same as the problem of multiple signatures on a commit: we have no way of binding the contents of a commit to an author. In this case we can solve the problem by requiring that signatures not be over the commit directly, but over a wrapper we will call an attestation:

```rust
struct AttestationPayload<CommitId> {
    commit: CommitId,
    parent_attestation: AttestationHash,
    author: ed25519_dalek::VerifyingKey,
}

struct Attestation<CommitId> {
    payload: AttestationPayload<CommitId>,
    // Signature over the payload
    signature: ed25519_dalek::Signature,
}

// The hash of an `AttestationPayload`
struct AttestationHash(blake3::Hash);
```

Now, in order to verify the run, we start at the head of the run and walk backwards `count` steps, constructing the `AttestationPayload` for each commit in the run. We then check that the hash of the penultimate `AttestationPayload` matches the `parent_attestation` in the `Attestation` for the head of the run before validating the signature.
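
The verification walk might be sketched like this. To stay dependency-free it makes several assumptions: `DefaultHasher` stands in for blake3, authors are strings rather than `VerifyingKey`s, the root has no parent attestation (`None`), and the Ed25519 check is abstracted as a `signature_covers` closure. Rather than comparing the penultimate hash separately, it rebuilds the head payload (which embeds that hash) and checks the signature against the result. `rebuild_head` and `verify_run` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

type CommitId = u64;
type Author = &'static str; // stand-in for ed25519_dalek::VerifyingKey

#[derive(Hash, Clone, Copy, PartialEq, Eq, Debug)]
struct AttestationHash(u64); // stand-in for a blake3 hash

#[derive(Hash)]
struct AttestationPayload {
    commit: CommitId,
    parent_attestation: Option<AttestationHash>, // None at the root (an assumption)
    author: Author,
}

fn payload_hash(payload: &AttestationPayload) -> AttestationHash {
    let mut hasher = DefaultHasher::new();
    payload.hash(&mut hasher);
    AttestationHash(hasher.finish())
}

/// Rebuild the attestation chain for a run (commits oldest first), assuming
/// every commit in it has the claimed author, and return the head's hash.
/// `base` is the attestation hash immediately before the run.
fn rebuild_head(
    commits: &[CommitId],
    author: Author,
    base: Option<AttestationHash>,
) -> Option<AttestationHash> {
    commits.iter().fold(base, |parent, &commit| {
        Some(payload_hash(&AttestationPayload {
            commit,
            parent_attestation: parent,
            author,
        }))
    })
}

/// `signature_covers` stands in for verifying the head signature against the
/// rebuilt head payload.
fn verify_run(
    commits: &[CommitId],
    author: Author,
    base: Option<AttestationHash>,
    signature_covers: impl Fn(AttestationHash) -> bool,
) -> bool {
    !commits.is_empty() && rebuild_head(commits, author, base).map_or(false, signature_covers)
}
```

In the A, B, C example, the signature Bob made on C covers a payload whose `parent_attestation` names Charlie's attestation of B, so a claimed three-commit run by Bob rebuilds to a different head hash and that signature does not verify.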

## Attestations
Review comment (Member):

I think the version in my earlier comment can be seen as a simplification of this approach. Lemme know if I'm missing something!


Putting all this together, we end up with a design where signatures are not made over commits directly, but instead over an envelope which includes the commit ID and a link to a similar envelope around the parent commits. That is, we have our original commit graph, and then a mirrored graph of attestations.

![](./attestationgraph.png)

This is a very flexible structure which allows us to have multiple signatures per commit. We can compress these attestations into attestation runs when compressing a chunk, and we can union the attestation runs for two chunks when compressing those chunks into a larger chunk. The tricky part is storing and synchronising attestations.


### Synchronising attestations


Attestations are not part of a commit or a chunk. That's the whole point of this setup: it decouples the signature from the commit. This means that we will need to store attestations alongside the commits that they refer to. Likewise, we will have to store attestation runs separately from the chunks which they are part of. Most upsettingly, we will have to synchronise them separately. Now, we already have two phases when synchronising a document: one for the CGKA ops and one for the sedimentree. CGKA ops should change relatively infrequently so this isn't too big a deal, but attestations will likely change as frequently as the document does.


The document content is synchronised via sedimentree sync. What that means is that we download the minimal set of chunk boundaries which the other end says is needed to cover the whole document content. Because exponentially larger chunks are created as the commit graph gets larger, we end up needing to download around $\log_{10}(n)$ hashes (where $n$ is the number of commits) in order to figure out what we are missing. Including all the attestations in this summary would enormously increase the amount of information to download. Instead, I think we should put on each chunk a hash of all the attestations we are aware of for that chunk. Then if one end discovers that the other end has a different attestation hash for some chunk, it can re-download all the attestations for that chunk. This should be a relatively uncommon operation and should not use too much bandwidth anyway.
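
A minimal sketch of the per-chunk attestation hash and the comparison step. It assumes attestations can be identified by a numeric ID and that the digest should not depend on the order in which attestations were received; `DefaultHasher` is a stand-in for a real cryptographic hash, and `attestation_digest` and `chunks_to_refetch` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

type ChunkId = u32;
type AttestationId = u64; // stand-in for an attestation hash

/// Order-independent digest of the attestations known for one chunk.
fn attestation_digest(attestations: &[AttestationId]) -> u64 {
    let mut sorted = attestations.to_vec();
    sorted.sort_unstable();
    sorted.dedup();
    let mut hasher = DefaultHasher::new();
    sorted.hash(&mut hasher);
    hasher.finish()
}

/// Chunks for which the remote digest is missing locally or differs from ours,
/// i.e. the chunks whose attestations should be downloaded again.
fn chunks_to_refetch(
    local: &BTreeMap<ChunkId, u64>,
    remote: &BTreeMap<ChunkId, u64>,
) -> Vec<ChunkId> {
    remote
        .iter()
        .filter(|(chunk, digest)| local.get(*chunk) != Some(*digest))
        .map(|(chunk, _)| *chunk)
        .collect()
}
```

Only one extra digest per chunk travels with the sedimentree summary; the attestations themselves are fetched only for chunks that show up in `chunks_to_refetch`.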


### API

At a high level the advantage of the attestation approach is that Beelay can manage attribution for the application. This means that Beelay can ensure that new commits added by the application are attested to by the active peer, and can perform the logic of validating attestations received over the network. This in turn means that the application (Automerge) needs no modifications in order to attest to commits - just call `Beelay.addCommits`.
Review comment (Member):

I wouldn't do most of this at the Beelay level with the active peer. Beelay does need to check that it's syncing with an authorised peer (on connection), but they may not have performed any writes whatsoever (maybe they're another sync server and can't even read the content!), and are just themselves relaying data from other authors that were e.g. sneakernetted. We do need to check that an encrypted chunk envelope has at least one signed head (all of the heads should be signed) by someone that was authorised at some stage.

Reply (Member Author):

What I mean is that with this API Automerge doesn't need to know how to do signing when creating new commits: it just passes commits to Beelay via `addCommits` and Beelay handles signing the commit, much like we already do with the encryption logic.


It's slightly more complex to load documents because attestations are separate from the commits and chunks which make up a document, which means the API has to return them separately. More irritatingly, the application has to be able to accept attestations appearing later. Thus the API ends up looking a bit like this:

```typescript
// DocumentId, Commit and CommitHash are assumed to be defined elsewhere.
declare class Beelay {
  // Adding commits doesn't require any signing logic
  addCommits(doc: DocumentId, commits: Commit[]): void
  // Load a document, returning the current commits and attestations
  loadDocument(): Loaded | null
  // Listen for changes to a document after it is loaded.
  // As you will see from the DocEvent type below, an event is fired when
  // the content changes, or when the attestations change
  on(event: "doc_event", callback: (event: DocEvent) => void): void
}

type Loaded = {
  // The actual content of the document
  commits: CommitOrChunk[]
  // The ranges of the commit graph which have known attestors
  attestations: Attestation[]
  // Ranges of the commit graph each attestor has write access to
  ranges: WriteRange[]
}

type Attestation = {
  start: CommitHash,
  end: CommitHash,
  author: PeerId,
}

type WriteRange = {
  attestor: PeerId,
  start: CommitHash[],
  end: CommitHash[],
}

type DocEvent =
  | { type: "content", data: CommitOrChunk }
  | { type: "attestation", attestation: Attestation }

type CommitOrChunk =
  | { type: "commit", data: Uint8Array }
  | { type: "chunk", data: Uint8Array }

// A hex encoded ed25519 public key
type PeerId = string
```

This seems awkward because Automerge now has to know how to handle commit attestations changing after having loaded them, which means that at any time the visibility of commits in the document could just change. However, this is something Automerge has to be able to handle anyway in order to handle revocation events which will change the `WriteRange` which a particular attestor has access to.


### Decompressing Chunks

Beelay has no understanding of how chunks are compressed, but in order to validate the attestation runs it needs to have access to the commit graph of a chunk. This means that in order for Beelay to own the attestation validation work we will need to add an extra API for Beelay to request that Automerge decompress a chunk into a commit graph. This might look something like this:

* Beelay emits an event saying "I need the commit graph for this compressed chunk"
* Automerge picks up the chunk, decompresses the commit graph, and passes it back to Beelay
* Beelay walks the commit graph along with any attestations it has available and then emits a "doc_event" event with any new attestations it discovered

Beelay can cache validated attestations, so we would only need to do this when new attestations are received for a chunk.

## Or just do it in Automerge

All of this attestation malarkey is downstream of the fact that there is no way to bind the content of a commit to the public key which authored it. The mapping from public key to authorised range of the commit graph which Keyhive gives us is not much use to Automerge without some way of knowing the author of a commit. However, instead of making this Keyhive's problem we could make it Automerge's problem.

Concretely, what this would look like is that we would add a signature field to Automerge commits. We could then perform a similar compression trick to the attestation chains, but within Automerge. For simplicity we should probably only support Ed25519 signatures. This would mean that Automerge would natively know who the author of a commit was, and there would be no need to handle multiple signatures for the same commit or for Beelay to know about the signatures at all.

This would have some consequences for Automerge:

* When importing changes into a document (via `merge` or `applyChanges` or `loadIncremental`) we would have to decide what to do with changes which _don't_ have a signature. In a Beelay context we probably want to either:
    * Throw an error
    * Rewrite the commits with the importing device as the author. If multiple devices perform such an import concurrently then we will end up with duplicated ops - until we have cherry-pick at which point we would implement this kind of import as a cherry-pick.
* When creating changes we now need to provide a signature for the change. We want to support using WebCrypto or other asynchronous APIs for signing, which means that we would need to introduce some new API for creating changes - maybe `Automerge.change` would become asynchronous, maybe we have some kind of queue within a document which doesn't "publish" changes until we've received a signature for them.
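
The queue variant could be sketched roughly as below. This is an assumption-heavy illustration rather than a proposed Automerge API: `SigningQueue` and its methods are hypothetical, changes are opaque byte vectors, signatures are raw 64-byte arrays, and the sketch additionally assumes changes must be published in creation order.

```rust
use std::collections::VecDeque;

type Signature = [u8; 64]; // raw Ed25519 signature bytes

/// Holds locally created changes until their signatures arrive.
#[derive(Default)]
struct SigningQueue {
    first_seq: usize, // sequence number of pending[0]
    pending: VecDeque<(Vec<u8>, Option<Signature>)>,
    published: Vec<(Vec<u8>, Signature)>,
}

impl SigningQueue {
    /// Queue a new change and return its sequence number.
    fn enqueue(&mut self, change: Vec<u8>) -> usize {
        self.pending.push_back((change, None));
        self.first_seq + self.pending.len() - 1
    }

    /// Record a signature produced asynchronously (e.g. by WebCrypto), then
    /// publish changes in creation order up to the first unsigned one.
    fn attach_signature(&mut self, seq: usize, signature: Signature) {
        if seq >= self.first_seq {
            if let Some(slot) = self.pending.get_mut(seq - self.first_seq) {
                slot.1 = Some(signature);
            }
        }
        while matches!(self.pending.front(), Some((_, Some(_)))) {
            if let Some((change, Some(sig))) = self.pending.pop_front() {
                self.published.push((change, sig));
                self.first_seq += 1;
            }
        }
    }
}
```

A signature that arrives early for a later change is held back until every earlier change has been signed, so `published` never contains an unsigned change or skips one.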

## Pros and Cons

### Attestations

#### Pros
* Much more general, we can support workflows where changes from a revoked peer are approved by some other peer
* Document type generic, the underlying document only has to provide a commit DAG, no concept of authorship
* In particular, the underlying document type does not need to implement asynchronous signing mechanisms, those are handled by Beelay

#### Cons
* More complex to implement and a more complex API
* We have to synchronise another data type and a bit more data
* Attestations can arrive out of sync with commits, which means they might not arrive at all

### Automerge native commits

#### Pros
* Much simpler model, all commits have a single author
* Keeps complexity out of Beelay
* The authorship information can be used to provide better change history APIs in Automerge - e.g. "give me patches grouped by author"
* Not really more data to sync
* No strange scenarios where the document content has synchronised but not the signatures

#### Cons
* Automerge APIs for creating changes now become more complex due to asynchronous signature requirements
* Requires cherry-pick to properly handle importing changes from unsigned documents
* Less general - we require any document type which wants to sync via Beelay to implement its own signature layer.