
rfds: Add signature RFD #120

Status: Open · wants to merge 1 commit into base: main

Conversation

@alexjg (Member) commented Apr 11, 2025

Here's a rendered version of this RFD.

Inspired by Oxide's RFD process, I'm adding this document to discuss how we should sign commits in Beelay. The core problem is that Automerge has no concept of a commit author at the moment, which means

a) We can receive multiple versions of the same commit from different peers
b) We need to figure out how to handle compressing signatures when we're compressing chunks of commits

Lots more detail in the doc.

@alexjg alexjg requested review from expede, ept and pvh as code owners April 11, 2025 13:51
@expede (Member) commented Apr 14, 2025

@alexjg do you think it's worth moving the existing (WIP) design docs into this format too?

@alexjg (Member, Author) commented Apr 14, 2025

Yeah, might be a good idea. We could also consider adding some more metadata like Oxide does (e.g. the authors, status, and date) to make it easier to put the documents in context.


> This is great, to be useful though we have to actually know who authored a particular commit. Clearly we want some way of signing commits, but this is a little more fiddly than it might first appear. There are two high level problems:
>
> 1. We have to handle the situation where there are multiple signatures for the same commit
@expede (Member) commented Apr 14, 2025

Some of these thoughts are easier just to put in one place rather than interspersed throughout the original proposal.


I would argue that we consider them to be distinct commits. The question is less about "which signature" and more about "which author" (which the signature is proof of).

Consider the following:

```rust
add(char: "a", after: vec![0x1234], by: alex.id)
add(char: "a", after: vec![0x1234], by: brooke.id)
```

I would expect these to be semantically distinct. Vector clocks suggest these semantics more directly:

```rust
// Alex's commit
add(char: "a", clock: {alex: 2, brooke: 1, pvh: 99})

// Brooke's commit
add(char: "a", clock: {alex: 1, brooke: 2, pvh: 99})
```
```mermaid
flowchart TD
    prior["prior commit 0x1234"]
    alex["alex commits 'a'"] --> prior
    brooke["brooke commits 'a'"] --> prior
```

By analogy, this is related to equality-by-value versus equality-by-identity. We can always translate them into an unambiguous form by expanding them out to a form with an author field (or being able to infer authors another way).

I'm not so certain about multiple signatures per commit, but will try to justify why on that line.
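The point about an author field making these distinct commits can be seen in a quick sketch (illustrative Python; the real hash would be over Automerge's binary change encoding, not JSON):

```python
import hashlib
import json

def commit_hash(op: dict) -> str:
    """Hash an op; the author field participates in the hash."""
    return hashlib.sha256(json.dumps(op, sort_keys=True).encode()).hexdigest()

alex_op = {"add": "a", "after": "0x1234", "by": "alex"}
brooke_op = {"add": "a", "after": "0x1234", "by": "brooke"}

# Same payload, different authors => semantically distinct commits
# with distinct hashes.
assert commit_hash(alex_op) != commit_hash(brooke_op)
```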

> This is great, to be useful though we have to actually know who authored a particular commit. Clearly we want some way of signing commits, but this is a little more fiddly than it might first appear. There are two high level problems:
>
> 1. We have to handle the situation where there are multiple signatures for the same commit
> 2. We need to find a way to compress signatures when building chunks in sedimentree
@expede (Member) commented Apr 14, 2025

Obviously signatures themselves don't compress, but we can do the following:

  1. Drop all but the last signature before switching authors
  2. Table signatures {usize => Sig} (but probably as a Vec)

There's a variant of 2 where we only need to keep one signature per author in a chunk, but it's a bit more complex to bookkeep. Essentially, have agents sign the last commit in their op stream, and all ops have to include refs to that author's prior commit (no gaps). IMO this is way more fiddly, so I'd rather take the minor hit on a couple of signatures.
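Strategy 1 can be sketched concretely over a linearised run (illustrative Python; real signatures and the DAG ordering are elided, and the signatures here are opaque placeholders):

```python
def gc_signatures(ops):
    """ops: list of (op_id, author, sig) tuples in a linear run.
    Keep only the last signature of each maximal same-author run,
    i.e. drop all but the last signature before switching authors."""
    kept = []
    for i, (op_id, author, sig) in enumerate(ops):
        last_of_run = i + 1 == len(ops) or ops[i + 1][1] != author
        if last_of_run:
            kept.append((op_id, sig))
    return kept

run = [("op1", "alex", "s1"), ("op2", "alex", "s2"),
       ("op3", "brooke", "s3"), ("op4", "alex", "s4")]
# op1's signature is dropped: alex's signature on op2 covers it.
assert gc_signatures(run) == [("op2", "s2"), ("op3", "s3"), ("op4", "s4")]
```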

Reference Scenario

```mermaid
flowchart TD
    op1
    op2
    op3
    op4
    op5
    op6

    op4 ==> op2 ==> op1
    op4 ==> op3 ==> op1
    op6 ==> op5 ==> op3
    op6 ==> op4

    sig1 -.-> op1
    sig2 -.-> op2
    sig3 -.-> op3
    sig4 -.-> op4
    sig5 -.-> op5
    sig6 -.-> op6

    subgraph sigs
      sig1
      sig2
      sig3
      sig4
      sig5
      sig6
    end
```

The naive mechanism for this is to include the signature on each op and include it in the hash, but then validation needs all of those intermediate signatures:

```mermaid
flowchart TD
    op1
    op2 ==> sig1
    op3
    op4
    op5 ==> sig4
    op6

    op4 ==> sig2
    op3 ==> sig1
    op4 ==> sig3
    op6 ==> sig5
    op6 ==> sig4

    sig1 -.-> op1
    sig2 -.-> op2
    sig3 -.-> op3
    sig4 -.-> op4
    sig5 -.-> op5
    sig6 -.-> op6
```

This has a lot of redundant information, especially with the insight that what we really care about is the author ID, which is WAY more compressible.

Signature GC

Signatures over hash linked data can use one signature to cryptographically attest to the entire graph. This is how signing over Merkle trees works, for example.

Mainly we want to make sure that we don't take accountability for data that we didn't generate, which is what would happen if we dropped ALL inner signatures. We only need to keep the last signature before switching authors.

```mermaid
flowchart TD
    op1
    op2
    op3
    op4
    op5
    op6

    op4 ==> op2 ==> op1
    op4 ==> op3 ==> op1
    op6 ==> op5 ==> op3
    op6 ==> op4

    sig3 -.-> op3
    sig4 -.-> op4
    sig6 -.-> op6

    subgraph sigs
      sig3
      sig4
      sig6
    end

    subgraph alex
      op1
      op2
      op4
    end

    subgraph brooke
      op3
      op5
      op6
    end
```

There is one challenge here depending on the exact encoding of e.g. RGA. Generally you directly fork from the point where you want to insert characters, but we may have thrown out the intermediate signatures. There are a couple of ways to handle this:

1. Merkle Proofs

Point at the signature that contains the op that you want to retain the attestation for, and either include the explicit merkle proof (heavy) or let the validator check whether the op in question was attested by the correct author (this strategy is extremely memoizable).

2. Author Streams

Again, closer to a vector clock: only maintain the last signature of each author, and now we can attest to any op by checking if it's in the author's set. This is a 1:1 mapping if the ID contributes to the op's hash.

We actually have a choice for the specific layout.

This variant is similar to the diagram above. My initial gut reaction with this kind of design is "oh no, signatures aren't in the hash graph!" but as noted above we can include the author ID in the hash graph to get uniqueness, with signatures off to the side to prove them. My guess is that this version would require fewer changes to Automerge, plus it's actually cleaner in the details versus the alternative (see next option).

```mermaid
flowchart TD
    op1
    op2
    op3
    op4
    op5
    op6

    op4 ==> op2 ==> op1
    op4 ==> op3 ==> op1
    op6 ==> op5 ==> op3
    op6 ==> op4

    sig4 -.-> op4
    sig6 -.-> op6

    subgraph sigs
      sig4
      sig6
    end

    subgraph alex
      op1
      op2
      op4
    end

    subgraph brooke
      op3
      op5
      op6
    end
```

This one initially feels more satisfying since it's a reduction of the naive version, but you can run into problems with missing relevant intermediate signatures that have been GCed, in cases like RGA. Reference pointers are also either to a signature or to the op itself, which is kind of annoying. I don't love this version.

```mermaid
flowchart TD
    op1
    op2
    op3
    op4
    op5
    op6

    op4 ==> op2 ==> op1
    op4 ==> sig3
    op3 ==> sig2
    op6 ==> op5 ==> op3
    op6 ==> op4

    sig2 -.-> op2
    sig3 -.-> op3
    sig4 -.-> op4
    sig6 -.-> op6

    subgraph alex
      op1
      sig2
      op2
      sig4
      op4
    end

    subgraph brooke
      sig3
      op3
      op5
      sig6
      op6
    end
```

Tabling

I think this one is pretty obvious. If many authors branch off of the same point in history, we don't want to repeat the 64-byte signature over and over. We can maintain a table that references signatures by a usize during compression.
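A minimal sketch of the tabling idea (illustrative Python; real signatures are 64-byte values, and the table would live inside the chunk encoding):

```python
def table_signatures(commits):
    """Replace repeated signatures with indices into a table,
    so each distinct signature is stored once."""
    table, index, encoded = [], {}, []
    for commit_id, sig in commits:
        if sig not in index:
            index[sig] = len(table)
            table.append(sig)
        encoded.append((commit_id, index[sig]))
    return table, encoded

commits = [("c1", b"SIG_A"), ("c2", b"SIG_A"),
           ("c3", b"SIG_B"), ("c4", b"SIG_A")]
table, encoded = table_signatures(commits)
assert table == [b"SIG_A", b"SIG_B"]
assert encoded == [("c1", 0), ("c2", 0), ("c3", 1), ("c4", 0)]
```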


> The combination of these problems is tricky and I think suggests that we need to choose one of these two designs:
>
> * Allow there to be multiple signatures per commit which will result in additional complexity in the keyhive/beelay codebase in return for looser requirements on the document types and a more flexible concept of attestation

Just to clarify: is this basically 1. "let more than one author attest to an op" and use that information to infer separate ops, or 2. have multiple authors for any given op?

I don't love 2. I think it's easier to do the other suggestion, since we can either bake that info into the hash-relevant metadata on the op, or do integrity checking at a level above, stripping that info out and handing it off to Automerge. I'm generally in favour of a hash-relevant author field because it RLE compresses really well and gives us pretty great properties (though it would make it harder to access control data that doesn't have a concept of an author or other equivalent info like a vector clock).
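The RLE claim is easy to see in a sketch (illustrative Python; a real encoder would operate on Automerge's columnar change encoding):

```python
def rle(column):
    """Run-length encode a column of author IDs: since the author
    is more or less constant per device, long runs collapse to
    (value, count) pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(v, n) for v, n in runs]

authors = ["alex"] * 100 + ["brooke"] * 50 + ["alex"] * 25
# 175 entries compress to 3 runs.
assert rle(authors) == [("alex", 100), ("brooke", 50), ("alex", 25)]
```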


> ### API
>
> At a high level the advantage of the attestation approach is that Beelay can manage attribution for the application. This means that Beelay can ensure that new commits added by the application are attested to by the active peer, and can perform the logic of validating attestations received over the network. This in turn means that the application (Automerge) needs no modifications in order to attest to commits - just call `Beelay.addCommits`.
I wouldn't do most of this at the Beelay level with the active peer. Beelay does need to check that it's syncing with an authorised peer (on connection), but they may not have performed any writes whatsoever (maybe they're another sync server and can't even read the content!), and are just themselves relaying data from other authors that were e.g. sneakernetted. We do need to check that an encrypted chunk envelope has at least one signed head (all of the heads should be signed) by someone that was authorised at some stage.

@alexjg (Member, Author) replied:

What I mean is that with this API Automerge doesn't need to know how to do signing when creating new commits, it just passes commits to beelay via addCommits and Beelay handles signing the commit, much like we already do with the encryption logic.


> Now, in order to verify the run we can start at head of the run and walk backwards `count` steps, then we can construct the `AttestationPayload` for each commit in the run, we can check that the hash of the penultimate `AttestationPayload` matches the `parent_attestation` in the `Attestation` for the head of the run before validating the signature.

> ## Attestations

I think the version in my earlier comment can be seen as a simplification of this approach. Lemme know if I'm missing something!
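For reference, the run verification quoted above can be sketched as a hash chain (illustrative Python; `AttestationPayload` and `parent_attestation` are names from the RFD, but the exact payload encoding, the chaining direction, and the run-start sentinel here are assumptions, and the actual signature check is elided):

```python
import hashlib

# Hypothetical sentinel marking the start of a run (an assumption).
RUN_START = b"\x00" * 32

def attestation_payload_hash(commit_hash: bytes, parent: bytes) -> bytes:
    """Hash of a hypothetical AttestationPayload: the commit hash
    chained to the hash of the previous payload in the run."""
    return hashlib.sha256(parent + commit_hash).digest()

def verify_run(commit_hashes: list, head_parent_attestation: bytes) -> bool:
    """Reconstruct each AttestationPayload hash along the run; the
    hash of the penultimate payload must match the parent_attestation
    recorded in the head commit's Attestation."""
    acc = RUN_START
    for h in commit_hashes[:-1]:  # every commit except the head
        acc = attestation_payload_hash(h, acc)
    return acc == head_parent_attestation

run = [hashlib.sha256(bytes([i])).digest() for i in range(4)]
parent = RUN_START
for h in run[:-1]:
    parent = attestation_payload_hash(h, parent)
assert verify_run(run, parent)
assert not verify_run(run, RUN_START)
```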

@ept (Member) commented Apr 20, 2025

I think the nicest solution here is to include an author ID field in every Automerge change, which contains a public key, but which can be opaque to Automerge. The author ID is included in the hash of the change; since it's more or less constant per-device it will compress well. Apps that don't use Keyhive but want some other form of author attribution might set this field to another value, such as a human-readable username or a numeric user ID from a centralised auth system. The signature should not be included in the hash, since otherwise we would have to store signatures forever in order to reconstruct the hash graph.

When we have the signature by some pubkey for some change, we can drop signatures by the same pubkey on earlier changes it depends on. It doesn't matter if changes by multiple devices are interleaved. By signing a change X, a device attests that all changes causally prior to X that have the same author ID field were in fact authored by that device.
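That pruning rule can be sketched over a toy change graph (illustrative Python; `deps` stands in for the dependency hashes in Automerge changes, and signatures are opaque placeholders):

```python
def prune_signatures(changes, signatures):
    """changes: change_id -> (author, deps); signatures: change_id -> sig.
    Drop a signature if a later signed change by the same author
    causally depends on it: signing X attests all causally prior
    changes carrying the same author ID."""
    def ancestors(cid):
        seen, stack = set(), [cid]
        while stack:
            for dep in changes[stack.pop()][1]:
                if dep not in seen:
                    seen.add(dep)
                    stack.append(dep)
        return seen

    kept = dict(signatures)
    for cid in signatures:
        author = changes[cid][0]
        for anc in ancestors(cid):
            if anc in kept and changes[anc][0] == author:
                del kept[anc]
    return kept

changes = {
    "x1": ("A", []),
    "y1": ("B", ["x1"]),      # B's change interleaved between A's
    "x2": ("A", ["y1"]),      # A signing x2 attests x1 as well
}
sigs = {"x1": "sigA1", "y1": "sigB1", "x2": "sigA2"}
assert prune_signatures(changes, sigs) == {"y1": "sigB1", "x2": "sigA2"}
```

The interleaving with B's change doesn't matter: A's later signature still covers A's earlier change through the causal order.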

Not sure how to best sync the signatures. Having a whole separate attestation layer seems like a lot of extra complexity, but it would mean Automerge wouldn't have to know about signatures. Maybe it would be simpler to just make Automerge aware of the signatures as well, but to make the signatures optional for apps that want only Automerge but not Keyhive.

I just realised there's another edge case with discarding old signatures. Say we have a sequence of changes signed by pubkey A, and we keep only the most recent of A's signatures. However, now imagine A is revoked halfway through that sequence, so some of the changes remain authorised, but the most recent changes become unauthorised because they are concurrent with the revocation. In that case we have to roll back the unauthorised changes. But we've already discarded the signature on the last change by A that preceded the revocation, because we didn't know at the time that we would need it.

In this case I think we need to keep the unauthorised changes, and send them to other peers along with the most recent signature by A, since otherwise the peer won't be able to verify that the changes prior to the revocation were indeed performed by A. It's a bit weird to have to store changes that we know are unauthorised, but the alternative would be having to store every signature on the off chance that later changes by the same device later turn out to be unauthorised, which would take a lot more space. Also, different peers may have chosen to discard different signatures (based on what they had synced at the time when they learnt about the revocation). I don't think that breaks anything, it will just require some care to handle correctly.

@ept (Member) commented Apr 20, 2025

Another little detail: it might be better to sign the hash of a change, not the change itself, so that a signature can be validated given only the hash (without having to reconstruct the actual change from its compressed representation). This shouldn't make a difference security-wise, as we're already assuming the hash function is collision-resistant.

@expede (Member) commented Apr 22, 2025

> it might be better to sign the hash of a change

Hmm yeah that makes sense.

Wacky idea: hashing as part of signing/verifying often already happens under the hood in most signature schemes, including Ed25519. I wonder if we can get away with validating the hash directly. ed25519_dalek (Rust) has options for this (signing & verifying), but that's not guaranteed to be in every possible Ed25519 library. Given recent discussions of hashing overhead, using the existing hash is perhaps advantageous since it cuts hashing in half versus secondary hashing.

That said, we're already generating these hashes as part of hash linking, so maybe not worth binding ourselves to the internal hashing of the signature scheme.

@ept (Member) commented Apr 23, 2025

> ed25519_dalek (Rust) has options for this

My instinct is that it's probably better to stick with the off-the-shelf algorithm, as then we could use non-extractable keys if the underlying system (say WebCrypto, or the secure element on a phone) supports Ed25519. We already need to compute the hash of a change anyway as it becomes part of the heads. If we only need to verify the most recent signature per author ID, and not a signature on every single change, then the overhead of the additional hashing inside the signature verification should be negligible.
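The sign-the-hash pattern under discussion can be sketched as follows, using HMAC purely as a stand-in signature so the example runs without an Ed25519 dependency (a real implementation would use an off-the-shelf Ed25519 sign/verify pair, per the comment above):

```python
import hashlib
import hmac

# HMAC stands in for a real signature scheme here; KEY plays the
# role of the device's signing key (both are assumptions for the sketch).
KEY = b"device-secret"

def sign(message: bytes) -> bytes:
    return hmac.new(KEY, message, hashlib.sha256).digest()

change = b"...compressed change bytes..."
change_hash = hashlib.sha256(change).digest()

# Sign the 32-byte hash rather than the change itself...
sig = sign(change_hash)

# ...so a peer holding only the hash (already computed as part of
# the heads) can validate the signature without reconstructing the
# change from its compressed representation.
assert hmac.compare_digest(sig, sign(change_hash))
```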

3 participants