From 3a73a50d25108216945856393b73c20225da4b1e Mon Sep 17 00:00:00 2001
From: Martin Hutchinson
Date: Tue, 15 Jul 2025 14:28:38 +0000
Subject: [PATCH 1/5] [Docs] Outline the problem statement for VIndex

---
 vindex/README.md | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/vindex/README.md b/vindex/README.md
index 5b40ebd..e7c5326 100644
--- a/vindex/README.md
+++ b/vindex/README.md
@@ -9,9 +9,28 @@ Discussions are welcome, please join us on [Transparency-Dev Slack](https://tran
 
 ## Overview
 
-The core idea is basically to construct an index like you would find in the back of a book, i.e. search terms are mapped to a _pointer_ to where the data can be found.
+### Problem
+
+Logs support verifiable lookup of leaves by index, but there is no support for _verifiably_ returning leaves matching any other criteria.
+This is important because in many cases, logs contain data where only a small subset of the data is relevant to a particular actor.
+For example:
+ - CT: a domain owner is only interested in the small set of certs for domains they operate. Even for a large domain owner such as Google, this will be under 1% of all data in a log.
+ - Package Repository: a package owner is only interested in the packages they maintain, but this will be dwarved by entries for packages they don't maintain.
+
+Without a Verifiable Index, these actors must choose one of 2 approaches to discover entries in a log that relate to them:
+ 1. Download every entry from the log in order to perform the filtering locally
+ 2. Rely on a non-verifiable index, e.g. [CT Monitors](https://certificate.transparency.dev/monitors/).
+
+In the first case, they need to download a large amount of irrelevant data in order to stay secure.
+In the second case, they need to rely on a service that breaks the chain of verifiability; a non-verifiable index may not return the full set of leaves to the requester, whether intentionally or accidentally.
+
+### Solution
+
+The core idea is to construct an index, similar to a [back-of-the-book index](https://en.wikipedia.org/wiki/Index_(publishing)), i.e. search terms are mapped to a _pointer_ to where the data can be found.
 A verifiable index represents an efficient data structure to allow point lookups to common queries over a single log.
-For example, a verifiable index over a module/package repository could be constructed to allow efficient lookup of all modules/packages with a given name.
+Examples:
+ - CT: a verifiable index over a CT log would allow certs to be efficiently and verifiably searched by domain name.
+ - Package Repository: a verifiable index over a module/package repository would allow lookup of all modules/packages with a given name.
 
 The result of looking up a key in a verifiable index is a list of uint64 pointers to the origin log, i.e. a list of indices in the origin log where the leaf data matches the index function.
 The index has a checkpoint that commits to its state at any particular log size.

From 4a9d2edfb7409994ffc64a2e44e6c1e9176a4da5 Mon Sep 17 00:00:00 2001
From: Martin Hutchinson
Date: Tue, 15 Jul 2025 14:43:46 +0000
Subject: [PATCH 2/5] Rewritten overview to have a stronger hook

---
 vindex/README.md | 38 +++++++++++++++++---------------------
 1 file changed, 17 insertions(+), 21 deletions(-)

diff --git a/vindex/README.md b/vindex/README.md
index e7c5326..aa67ad6 100644
--- a/vindex/README.md
+++ b/vindex/README.md
@@ -9,33 +9,29 @@ Discussions are welcome, please join us on [Transparency-Dev Slack](https://tran
 
 ## Overview
 
-### Problem
+### The Problem: Verifiability vs. Efficiency
 
-Logs support verifiable lookup of leaves by index, but there is no support for _verifiably_ returning leaves matching any other criteria.
-This is important because in many cases, logs contain data where only a small subset of the data is relevant to a particular actor.
-For example:
- - CT: a domain owner is only interested in the small set of certs for domains they operate. Even for a large domain owner such as Google, this will be under 1% of all data in a log.
- - Package Repository: a package owner is only interested in the packages they maintain, but this will be dwarved by entries for packages they don't maintain.
+Logs, such as those used in Certificate Transparency or Software Supply Chains, provide a strong foundation for verifiability. You can prove that an entry exists in a log. However, they lack a critical feature: the ability to _verifiably_ query for entries based on their content.
 
-Without a Verifiable Index, these actors must choose one of 2 approaches to discover entries in a log that relate to them:
- 1. Download every entry from the log in order to perform the filtering locally
- 2. Rely on a non-verifiable index, e.g. [CT Monitors](https://certificate.transparency.dev/monitors/).
+This forces users who need to find specific data, like a domain owner finding their certificates, or a developer finding their software packages, into a painful choice:
 
-In the first case, they need to download a large amount of irrelevant data in order to stay secure.
-In the second case, they need to rely on a service that breaks the chain of verifiability; a non-verifiable index may not return the full set of leaves to the requester, whether intentionally or accidentally.
+1. **Massive Inefficiency**: Download and process the _entire_ log, which can be terabytes of mostly irrelevant data, just to find the few entries that matter to you.
+2. **Broken Trust**: Rely on a third-party service to index the data. This breaks the chain of verifiability, as the index operator could, by accident or design, fail to show you all the results. You are forced to trust them.
 
-### Solution
+Neither option is acceptable. Users should not have to sacrifice efficiency for security, or security for efficiency.
 
-The core idea is to construct an index, similar to a [back-of-the-book index](https://en.wikipedia.org/wiki/Index_(publishing)), i.e. search terms are mapped to a _pointer_ to where the data can be found.
-A verifiable index represents an efficient data structure to allow point lookups to common queries over a single log.
-Examples:
- - CT: a verifiable index over a CT log would allow certs to be efficiently and verifiably searched by domain name.
- - Package Repository: a verifiable index over a module/package repository would allow lookup of all modules/packages with a given name.
+### The Solution: A Verifiable "Back-of-the-Book" Index
 
-The result of looking up a key in a verifiable index is a list of uint64 pointers to the origin log, i.e. a list of indices in the origin log where the leaf data matches the index function.
-The index has a checkpoint that commits to its state at any particular log size.
-Every point lookup (i.e. query) in the map is verifiable, as is the construction of the index itself.
-The verifiable index commits to all evolutions of its state by committing to all published index roots in a witnessed output log.
+A Verifiable Index resolves this conflict by providing a third option: an efficient, cryptographically verifiable way to query log data without compromise.
+
+At its core, it works like a familiar back-of-the-book index. It maps search terms (like a domain or package name) to the exact locations (pointers) in the main log where that data can be found.
+
+This provides two key guarantees:
+
+- **Efficiency**: Users can look up data by a meaningful key and receive a small, targeted list of pointers back, avoiding the need to download the entire log.
+- **Verifiability**: Every query comes with a cryptographic proof. This proof guarantees that the list of results is complete and that the index operator has not omitted any entries for your query.
+
+The result is a system that extends the verifiability of the underlying log to its queries, preserving the end-to-end chain of trust while providing the efficiency modern systems require. The index's own state is committed to a witnessed "Output Log", ensuring its entire history is also verifiable.
 
 ## Applications
 

From 13ab1a5a8985c4a85c0fc5b02cf5ee3acd468a62 Mon Sep 17 00:00:00 2001
From: Martin Hutchinson
Date: Wed, 16 Jul 2025 14:54:25 +0000
Subject: [PATCH 3/5] Refresh the milestones while I'm here

---
 vindex/README.md | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/vindex/README.md b/vindex/README.md
index aa67ad6..0e81457 100644
--- a/vindex/README.md
+++ b/vindex/README.md
@@ -201,14 +201,17 @@ You will also have a WAL file at `~/sumdb.wal`, which will make future boots fas
 | # | Step | Status |
 | :-: | --------------------------------------------------------- | :----: |
 | 1 | Public code base and documentation for prototype | ✅ |
-| 2 | Implementation of Merkle Radix Tree | ✅ |
+| 2 | Implementation of in-memory Merkle Radix Tree | ✅ |
 | 3 | Incremental update | ✅ |
 | 4 | Example written for mapping SumDB | ✅ |
-| 5 | Example written for mapping CT | ⚠️ |
+| 5 | Proofs served on Lookup | ❌ |
 | 6 | Output log | ❌ |
-| 7 | Proofs served on Lookup | ❌ |
-| 8 | MapFn defined in WASM | ❌ |
-| 9 | Proper repository for this code to live long-term | ❌ |
-| 10 | Support reading directly from Input Log instead of Clone | ❌ |
+| 7 | Storage backed verifiable-map | ❌ |
+| 8 | Example written for mapping CT | ⚠️ |
+| 9 | MapFn defined in WASM | ❌ |
+| 10 | Proper repository for this code to live long-term | ❌ |
+| 11 | Support reading directly from Input Log instead of Clone | ❌ |
 | N | Production ready | ❌ |
+
+Note that a storage-backed map needs to be implemented before this can be applied to larger logs, e.g. CT.
 

From 129624a4bbd972fae9cf54f7d39d0159927e8fb4 Mon Sep 17 00:00:00 2001
From: Martin Hutchinson
Date: Thu, 17 Jul 2025 08:55:59 +0000
Subject: [PATCH 4/5] Expanded back-of-the-book

Feedback was that it wasn't clear as a standalone term
---
 vindex/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vindex/README.md b/vindex/README.md
index 0e81457..39b9afd 100644
--- a/vindex/README.md
+++ b/vindex/README.md
@@ -24,7 +24,7 @@ Neither option is acceptable. Users should not have to sacrifice efficiency for
 
 A Verifiable Index resolves this conflict by providing a third option: an efficient, cryptographically verifiable way to query log data without compromise.
 
-At its core, it works like a familiar back-of-the-book index. It maps search terms (like a domain or package name) to the exact locations (pointers) in the main log where that data can be found.
+At its core it works like a familiar index, much like one would find in the back of a book. It maps search terms (like a domain or package name) to the exact locations (pointers) in the main log where that data can be found.
 
 This provides two key guarantees:
 

From 2feb75d9e85919629530d5ddbf77d11cb7a459f3 Mon Sep 17 00:00:00 2001
From: Martin Hutchinson
Date: Thu, 17 Jul 2025 11:16:29 +0000
Subject: [PATCH 5/5] Review comments

---
 vindex/README.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/vindex/README.md b/vindex/README.md
index 39b9afd..1b5e263 100644
--- a/vindex/README.md
+++ b/vindex/README.md
@@ -11,27 +11,27 @@ Discussions are welcome, please join us on [Transparency-Dev Slack](https://tran
 
 ### The Problem: Verifiability vs. Efficiency
 
-Logs, such as those used in Certificate Transparency or Software Supply Chains, provide a strong foundation for verifiability. You can prove that an entry exists in a log. However, they lack a critical feature: the ability to _verifiably_ query for entries based on their content.
+Logs, such as those used in Certificate Transparency or Software Supply Chains, provide a strong foundation for discoverability. You can prove that an entry exists in a log. However, they lack a critical feature: the ability to _verifiably_ query for entries based on their content.
 
 This forces users who need to find specific data, like a domain owner finding their certificates, or a developer finding their software packages, into a painful choice:
 
 1. **Massive Inefficiency**: Download and process the _entire_ log, which can be terabytes of mostly irrelevant data, just to find the few entries that matter to you.
-2. **Broken Trust**: Rely on a third-party service to index the data. This breaks the chain of verifiability, as the index operator could, by accident or design, fail to show you all the results. You are forced to trust them.
+2. **Losing Verifiability**: Rely on a third-party service to index the data. This breaks the chain of verifiability, as the index operator could, by accident or design, fail to show you all the results. You are forced to trust them.
 
 Neither option is acceptable. Users should not have to sacrifice efficiency for security, or security for efficiency.
 
 ### The Solution: A Verifiable "Back-of-the-Book" Index
 
-A Verifiable Index resolves this conflict by providing a third option: an efficient, cryptographically verifiable way to query log data without compromise.
+A Verifiable Index resolves this conflict by providing a third option: an efficient, cryptographically verifiable way to query log data.
 
 At its core it works like a familiar index, much like one would find in the back of a book. It maps search terms (like a domain or package name) to the exact locations (pointers) in the main log where that data can be found.
 
 This provides two key guarantees:
 
 - **Efficiency**: Users can look up data by a meaningful key and receive a small, targeted list of pointers back, avoiding the need to download the entire log.
-- **Verifiability**: Every query comes with a cryptographic proof. This proof guarantees that the list of results is complete and that the index operator has not omitted any entries for your query.
+- **Verifiability**: Every lookup response comes with a cryptographic proof. This proof guarantees that the list of results is complete and that the index operator has not omitted any entries for your query.
 
-The result is a system that extends the verifiability of the underlying log to its queries, preserving the end-to-end chain of trust while providing the efficiency modern systems require. The index's own state is committed to a witnessed "Output Log", ensuring its entire history is also verifiable.
+The result is a system that extends the verifiability of the underlying log to its queries, preserving the end-to-end chain of trust while providing the efficiency modern systems require.
 
 ## Applications
 