docs: ADR for re-processing of documents #1913
base: main
Conversation
Reviewer's Guide

This PR introduces an architecture decision record for document re-processing and implements a full data migration subsystem that allows re-processing stored SBOM and advisory documents during schema changes. It adds concrete migrations for extracting SBOM properties and advisory vulnerability scores, refactors migration orchestration and storage initialization, integrates new CLI commands for data migrations, centralizes storage backend setup and enforces async trait requirements, and updates dependencies across multiple crates.

ER diagram for new and updated tables: SBOM properties and advisory vulnerability scores

erDiagram
SBOM {
UUID sbom_id PK
JSON properties
}
ADVISORY_VULNERABILITY_SCORE {
UUID id PK
UUID advisory_id FK
STRING vulnerability_id FK
ENUM type
STRING vector
FLOAT score
ENUM severity
}
ADVISORY_VULNERABILITY {
UUID advisory_id PK
STRING vulnerability_id PK
}
ADVISORY_VULNERABILITY_SCORE ||--|{ ADVISORY_VULNERABILITY : "advisory_id, vulnerability_id"
ADVISORY_VULNERABILITY_SCORE }o--|| SBOM : "(none, but SBOM now has properties)"
ADVISORY_VULNERABILITY_SCORE }o--|| ADVISORY : "advisory_id"
ADVISORY_VULNERABILITY_SCORE }o--|| VULNERABILITY : "vulnerability_id"
Class diagram for the new data migration subsystem and document handlers

classDiagram
class Runner {
+String database_url
+Option<String> database_schema
+DispatchBackend storage
+Direction direction
+Vec<String> migrations
+Options options
+run<M: MigratorWithData>()
}
class MigratorWithData {
+data_migrations() Vec<Box<MigrationTraitWithData>>
}
class MigrationTraitWithData {
+up(manager: SchemaDataManager)
+down(manager: SchemaDataManager)
}
class SchemaDataManager {
+SchemaManager manager
+DispatchBackend storage
+Options options
+process<D, N>(name, f)
}
class Handler~D~ {
+call(document: D, model: D::Model, tx)
}
class Document {
+all(tx)
+source(model, storage, tx)
}
class Sbom {
+CycloneDx
+Spdx
+Other
}
class Advisory {
+Cve
+Csaf
+Osv
+Other
}
Runner --> MigratorWithData
MigratorWithData --> MigrationTraitWithData
MigrationTraitWithData --> SchemaDataManager
SchemaDataManager --> Handler
Handler --> Document
Document <|-- Sbom
Document <|-- Advisory
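To make the diagram more concrete, here is a minimal, self-contained Rust sketch of what a data migration with document access might look like. It is not the PR's actual API: `DataMigration`, `SchemaDataManager`, `MigrationError`, and `ExtractSbomProperties` are hypothetical stand-ins for the `MigrationTraitWithData` and `SchemaDataManager` shapes shown above.

```rust
use async_trait::async_trait;

// Hypothetical stand-in: the real type would wrap the SchemaManager, the
// DispatchBackend storage, and the migration Options.
pub struct SchemaDataManager;

#[derive(Debug)]
pub struct MigrationError(pub String);

// Hypothetical counterpart of `MigrationTraitWithData`: a migration that gets
// access to stored documents in addition to the schema manager.
#[async_trait]
pub trait DataMigration: Send + Sync {
    async fn up(&self, manager: &SchemaDataManager) -> Result<(), MigrationError>;
    async fn down(&self, manager: &SchemaDataManager) -> Result<(), MigrationError>;
}

// Illustrative migration: extract additional properties from every stored SBOM.
pub struct ExtractSbomProperties;

#[async_trait]
impl DataMigration for ExtractSbomProperties {
    async fn up(&self, _manager: &SchemaDataManager) -> Result<(), MigrationError> {
        // 1. apply the schema change (e.g. add the new `properties` column)
        // 2. stream each stored SBOM document from the storage backend
        // 3. parse it and write the extracted properties back to the database
        Ok(())
    }

    async fn down(&self, _manager: &SchemaDataManager) -> Result<(), MigrationError> {
        // revert the schema change; the stored documents themselves stay untouched
        Ok(())
    }
}
```

The `Runner` in the diagram would then drive such migrations in order, handing each one a manager that combines database and storage access.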
Class diagram for the new advisory vulnerability score entity

classDiagram
class AdvisoryVulnerabilityScore {
+UUID id
+UUID advisory_id
+String vulnerability_id
+ScoreType type
+String vector
+f64 score
+Severity severity
}
class ScoreType {
+V2_0
+V3_0
+V3_1
+V4_0
}
class Severity {
+None
+Low
+Medium
+High
+Critical
}
AdvisoryVulnerabilityScore --> ScoreType
AdvisoryVulnerabilityScore --> Severity
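For reference, a plain-Rust sketch of the shapes in this diagram. The actual entity in the PR is presumably a SeaORM model, so the types below are illustrative only.

```rust
use uuid::Uuid;

#[derive(Debug, Clone)]
pub struct AdvisoryVulnerabilityScore {
    pub id: Uuid,
    pub advisory_id: Uuid,
    pub vulnerability_id: String,
    // `type` is a keyword in Rust, hence the raw identifier.
    pub r#type: ScoreType,
    pub vector: String,
    pub score: f64,
    pub severity: Severity,
}

// CVSS versions supported by the score type, per the diagram above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScoreType {
    V2_0,
    V3_0,
    V3_1,
    V4_0,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity {
    None,
    Low,
    Medium,
    High,
    Critical,
}
```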
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
## main #1913 +/- ##
==========================================
+ Coverage 68.02% 68.19% +0.17%
==========================================
Files 367 382 +15
Lines 20590 21320 +730
Branches 20590 21320 +730
==========================================
+ Hits 14006 14539 +533
- Misses 5747 5918 +171
- Partials 837 863 +26
@sourcery-ai review
Hey @ctron - I've reviewed your changes and found some issues that need to be addressed.
Blocking issues:
- Invalid trait bound `+ use<>` in return type. (link)
- Invalid trait bound `+ use<>` in return type. (link)

General comments:
- The example migration currently hardcodes `FileSystemBackend::for_test`, so make the storage backend configurable (e.g., allow S3) to support real-world migrations instead of only test scenarios.
- The `retrieve` method signature was updated with a `use<Self>` bound, which looks like a typo—verify that you intended a lifetime bound (for example `+ 'a + Send`) and correct the signature accordingly.
- In `DocumentProcessor::process`, wrapping all errors as `DbErr::Migration(format!(…))` loses structured error context; consider using a custom error type or propagating the original error to make debugging easier.
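As an illustration of the last point, a minimal sketch of a custom error type (using `thiserror`) that keeps the underlying error as a `source` instead of flattening everything into `DbErr::Migration(format!(…))`. The variant names and source error types are assumptions, not the PR's actual code.

```rust
use thiserror::Error;

// Hypothetical error type for a document processor; the variants mirror the
// typical failure points (storage access, parsing, database writes).
#[derive(Debug, Error)]
pub enum ProcessError {
    #[error("failed to read document from storage")]
    Storage(#[source] std::io::Error),

    #[error("failed to parse document")]
    Parse(#[source] serde_json::Error),

    #[error("database error: {0}")]
    Database(String),
}
```

At the SeaORM boundary such an error can still be converted into `DbErr::Migration` once, while logs and callers retain the full chain via `std::error::Error::source()`.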
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The example migration currently hardcodes `FileSystemBackend::for_test`, so make the storage backend configurable (e.g., allow S3) to support real‐world migrations instead of only test scenarios.
- The `retrieve` method signature was updated with a `use<Self>` bound, which looks like a typo—verify that you intended a lifetime bound (for example `+ 'a + Send`) and correct the signature accordingly.
- In `DocumentProcessor::process`, wrapping all errors as `DbErr::Migration(format!(…))` loses structured error context; consider using a custom error type or propagating the original error to make debugging easier.
## Individual Comments
### Comment 1
<location> `modules/storage/src/service/s3.rs` </location>
<code_context>
- async fn retrieve<'a>(
+ async fn retrieve(
&self,
StorageKey(key): StorageKey,
- ) -> Result<Option<impl Stream<Item = Result<Bytes, Self::Error>> + 'a>, Self::Error> {
</code_context>
<issue_to_address>
Invalid trait bound `+ use<>` in return type.
`+ use<>` is invalid Rust syntax and will not compile. Please correct this trait bound.
</issue_to_address>
### Comment 2
<location> `modules/storage/src/service/fs.rs` </location>
<code_context>
- async fn retrieve<'a>(
+ async fn retrieve(
&self,
key: StorageKey,
- ) -> Result<Option<impl Stream<Item = Result<Bytes, Self::Error>> + 'a>, Self::Error> {
</code_context>
<issue_to_address>
Invalid trait bound `+ use<>` in return type.
This trait bound will cause a compilation error; please update it to valid Rust syntax.
</issue_to_address>
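For illustration only, a self-contained sketch of the lifetime-bound style the comments suggest, tying the returned stream to `&self` with an explicit `+ 'a` bound. `InMemoryBackend` and `StorageKey` here are hypothetical types, not the PR's storage service.

```rust
use bytes::Bytes;
use futures::stream::{self, Stream};
use std::convert::Infallible;

pub struct StorageKey(pub String);

// Hypothetical in-memory backend holding a single blob.
pub struct InMemoryBackend {
    blob: Vec<u8>,
}

impl InMemoryBackend {
    // The returned stream borrows `self.blob`, so its lifetime is spelled out
    // with an explicit `+ 'a` bound on the `impl Stream` return type.
    pub async fn retrieve<'a>(
        &'a self,
        _key: StorageKey,
    ) -> Result<Option<impl Stream<Item = Result<Bytes, Infallible>> + 'a>, Infallible> {
        let chunks = self
            .blob
            .chunks(1024)
            .map(|chunk| Ok::<_, Infallible>(Bytes::copy_from_slice(chunk)));
        Ok(Some(stream::iter(chunks)))
    }
}
```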
### Comment 3
<location> `docs/adrs/00008-re-process-documents.md:17` </location>
<code_context>
+When making changes to the database structure, we also have a migration process, which takes care of upgrading the
+database structures during an upgrade.
+
+However, in some cases, changing the database structure actually means to extract more information from documents and is
+currently stored in the database. Or information is extracted in a different way. This requires a re-processing of
+all documents affected by this change.
+
</code_context>
<issue_to_address>
Grammatical error: 'means to extract more information from documents and is currently stored' should be 'means extracting more information from documents than is currently stored'.
It should be: 'changing the database structure actually means extracting more information from documents than is currently stored in the database.'
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
However, in some cases, changing the database structure actually means to extract more information from documents and is
currently stored in the database. Or information is extracted in a different way. This requires a re-processing of
all documents affected by this change.
=======
However, in some cases, changing the database structure actually means extracting more information from documents than is
currently stored in the database. Or information is extracted in a different way. This requires a re-processing of
all documents affected by this change.
>>>>>>> REPLACE
</suggested_fix>
Just some thoughts:
> enabled). The process will migrate schema and data. This might block the startup for a bit. But would be fast and
> simple for small systems.
>
> ### Approach 2
There may be a third way, e.g. only using `default_transaction_read_only`, which will switch the db to read-only mode ... note this is session based, so maybe this is just a matter of keeping the original conn for normal operations and generating a new conn (without setting `default_transaction_read_only`) to do any mutations. I think I would also use `default_transaction_read_only` for blue/green as well, but thought I would mention this.
I think working with read-only transactions makes sense in general. Some operations just being read-only by nature.
We could ask the user to re-configure Trustify for read-only transactions, and then run the migrations. But that wouldn't be much different from the blue/green approach? We'd just need to ensure we can enable the user to do so.
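A minimal sketch (assuming `sqlx` with the Postgres driver and a database named `trustify`, both hypothetical here) of how the session-based `default_transaction_read_only` idea above might look: the default flips to read-only for new regular connections, while the migration opts out on its own session.

```rust
use sqlx::{Connection, Executor, PgConnection, PgPool};

async fn migrate_with_readonly_default(
    pool: &PgPool,
    migration_url: &str,
) -> Result<(), sqlx::Error> {
    // New sessions now default to read-only transactions; mutable API calls fail.
    // (Requires database owner privileges; existing sessions keep their setting.)
    pool.execute("ALTER DATABASE trustify SET default_transaction_read_only = on")
        .await?;

    // The migration opens its own connection and opts out for this session only.
    let mut conn = PgConnection::connect(migration_url).await?;
    conn.execute("SET default_transaction_read_only = off").await?;
    // ... run the schema and data migration over `conn` here ...

    // Restore the previous default once re-processing has finished.
    pool.execute("ALTER DATABASE trustify RESET default_transaction_read_only")
        .await?;

    Ok(())
}
```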
> ## Consequences
>
> * The migration will block the upgrade process until it is finished
Failure during these kinds of migrations will need extra testing ... and maybe specific TX handling (transactional DDL is your friend).
Yes, that's why I started adding tests for migrations (with data) as well.
> * 👍 Upgrade process is faster and less complex
> * 👎 Requires some coordination between instances (only one processor at a time, maybe one after the other)
>
> ### Option 3
There may be other options - there are a lot of PG extensions for this sort of thing - e.g. https://github.com/xataio/pgroll
Definitely worth checking out. However, it seems to add a lot of complexity which would need to be handled by us then. And I'm not sure it is worth the effort.
> ### Approach 2
>
> The user uses a green/blue deployment. Switching the application to use green and run migrations against blue. Once
Worth mentioning that Amazon has blue/green PG deployments as part of their service, though I do not have much experience with them.
Yes, that was my thought here. I'd expect other PG services to have similar setups available. However, it's up to the user to provide a database. We'd just try to ensure we can work with a model like this.
I don't think there would be a way of doing this automatically. Assuming the Sea ORM migrations drive this, it would be sequential. As part of the data migration most likely there would also be a schema migration. However, if we allow the data migrations to be run up-front, we could bundle such processing. What definitely would work is to run multiple processors (in a single process/pod) in parallel. We could improve on that, but it would add much more complexity. And I'd like to avoid that in the beginning.
Very good idea. However, that also conflicts with the idea of running multiple steps/migrations in parallel. So each document must be upgraded sequentially. However that process could run in parallel.
I thought about that too. However, I believe that the current preferred idea of having A/B (green/blue) deployments would put that information not on the UI, but on some other process, detached from the UI. Assuming a user is running blue/green, the upgrade would run on blue, but the UI would serve (read-only) from green. So it would not see the state of blue until it's finished.
> This would also require to prevent users from creating new documents during that time. Otherwise, we would need to
> re-process documents ingested during the migration time. A way of doing this could be to leverage PostgreSQL's ability
> to switch into read-only mode. Having mutable operations fail with a 503 (Service Unavailable) error. This would also
Just to be clear: users would be unable to ingest new data while an update/migration is in progress? And that could potentially take more than a day? I think that's a bad look.
I believe it's worth the effort to design a system that allows the re-processing of docs ingested during migration. We should aim for "zero downtime".
Maybe you can add another option to the document describing your idea.
Once the code and db schema have been updated, why can't new docs (post update) be ingested while old docs (pre update) are being re-ingested?
Assuming you're adding a new, mandatory field. How would that be populated during schema migration? How would new code, relying on that data, work when that data is not present? Why should docs be re-ingested, and not just missing data be amended?
I'm not suggesting we add a new field. I'm asking why new docs can't be ingested into the new schema with the new code while the old docs are being re-ingested into the new schema with the new code. "Old" docs could be identified by their ingestion date.
That is a use case we need to cover. Adding a new field.
New security issues found
Signed-off-by: Dejan Bosanac <[email protected]>
Signed-off-by: Dejan Bosanac <[email protected]>
Update csaf-rs dependency to use the "cvss" library where CVSS scores are stored as raw JSON values instead of pre-parsed objects. Assisted-By: Claude Signed-off-by: Dejan Bosanac <[email protected]>
> * 👎 Can't fully migrate database (new mandatory field won't work)
> * 👍 Upgrade process is faster and less complex
> * 👎 Original sources might no longer have the documents
> * 👎 Won't work for manual (API) uploads
I wonder if manual API uploads could be attestations of assertions, which could also be treated as source-of-truth document blobs.
Manual uploads end up in the S3 store in the same way.
> ### Option 3
>
> We change ingestion in a way that it is possible to just re-ingest every document. Meaning, we re-ingest from the
This is something that we do with our internal SCILo system (mentioned in our KubeCon talk): we run a backfill job based on our source of truth, keeping the source of truth to one table and a blob store. For our internal infrastructure it's not too bad to run it over the entire dataset (in TBs).
We then read in all the inputs and do a replacement of all the entries in question. We discussed having a table of SoT -> lastProcessedVersion as a way to keep a list of things that need to be processed. This would prevent long-lasting periods of unavailability as the dataset grows.
In that way we can do this in a "live" state.
The core idea of this alternative was to re-ingest from the original sources (e.g. CSAF trusted provider, OSV git repo). However, that idea is impractical due to the fact that original sources might not hold that content anymore, and that it wouldn't be available for API uploads in the first place (see below).
We already have an S3 store holding the original content. So it seems like a reasonable idea to actually use it.
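A hypothetical sketch of the "SoT -> lastProcessedVersion" idea mentioned above: each source-of-truth entry records the extraction version it was last processed with, and a backfill job re-processes only the entries that are behind. All names and the versioning scheme are illustrative, not SCILo's or Trustify's actual implementation.

```rust
// Version of the current extraction/processing logic, bumped whenever a
// migration requires documents to be re-processed.
const CURRENT_VERSION: u32 = 2;

// One row of a hypothetical source-of-truth tracking table.
struct SourceOfTruthEntry {
    blob_key: String,            // key of the original document in the blob store
    last_processed_version: u32, // extraction version used when it was last ingested
}

fn needs_reprocessing(entry: &SourceOfTruthEntry) -> bool {
    entry.last_processed_version < CURRENT_VERSION
}

fn backfill(entries: &mut [SourceOfTruthEntry]) {
    for entry in entries.iter_mut().filter(|e| needs_reprocessing(e)) {
        // fetch the blob by `entry.blob_key`, re-run extraction against the new
        // schema, then record the new version so the entry is skipped next time
        entry.last_processed_version = CURRENT_VERSION;
    }
}
```

Documents ingested while the backfill runs would simply start out at the current version, so they would not need to be re-processed.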
Preview: https://github.com/ctron/trustify/blob/feature/adr_rescan_1/docs/adrs/00008-re-process-documents.md
Summary by Sourcery
Add an architecture decision record outlining strategies for re-processing ingested documents after schema changes.
Summary by Sourcery
Enable re-processing of stored documents during schema upgrades by introducing an ADR, a data migration framework, new migrations for SBOM properties and advisory scores, CLI commands for data migrations, and enhancements to storage and migrator modules.