
Conversation

@ctron
Contributor

@ctron ctron commented Aug 5, 2025

Preview: https://github.com/ctron/trustify/blob/feature/adr_rescan_1/docs/adrs/00008-re-process-documents.md


Summary by Sourcery

Add an architecture decision record outlining strategies for re-processing ingested documents after schema changes.

Documentation:

  • Introduce ADR 00008 describing the context and assumptions for re-processing documents stored in the system
  • Evaluate three migration options (in-place re-processing during DB migration, separate processor module, and full re-ingestion) with pros and cons
  • List open items, alternative approaches, and consequences for the chosen strategy

Summary by Sourcery

Enable re-processing of stored documents during schema upgrades by introducing an ADR, a data migration framework, new migrations for SBOM properties and advisory scores, CLI commands for data migrations, and enhancements to storage and migrator modules.

New Features:

  • Add ADR documenting strategies for re-processing ingested documents after schema changes
  • Implement data migration framework to re-process stored documents as part of database migrations
  • Introduce CLI commands and options to run individual data migrations (db data)

Enhancements:

  • Extend migrator to support combined schema and data migrations with configurable storage backends and partitioned concurrency
  • Add SBOM properties and advisory vulnerability scores to the data model and ingest pipelines
  • Refactor storage configuration to unify filesystem and S3 backends with async support

Documentation:

  • Add ADR 00008/00009 describing context, assumptions, and alternatives for document re-processing

Tests:

  • Add integration tests for running data migrations and verifying SBOM and advisory score migrations

@ctron ctron added the ADR label Aug 5, 2025
@sourcery-ai
Contributor

sourcery-ai bot commented Aug 5, 2025

Reviewer's Guide

This PR introduces an architecture decision record for document re-processing and implements a full data migration subsystem that allows re-processing stored SBOM and advisory documents during schema changes. It adds concrete migrations for extracting SBOM properties and advisory vulnerability scores, refactors migration orchestration and storage initialization, integrates new CLI commands for data migrations, centralizes storage backend setup and enforces async trait requirements, and updates dependencies across multiple crates.

ER diagram for new and updated tables: SBOM properties and advisory vulnerability scores

erDiagram
    SBOM {
        UUID sbom_id PK
        JSON properties
    }
    ADVISORY_VULNERABILITY_SCORE {
        UUID id PK
        UUID advisory_id FK
        STRING vulnerability_id FK
        ENUM type
        STRING vector
        FLOAT score
        ENUM severity
    }
    ADVISORY_VULNERABILITY {
        UUID advisory_id PK
        STRING vulnerability_id PK
    }
    ADVISORY_VULNERABILITY_SCORE ||--|{ ADVISORY_VULNERABILITY : "advisory_id, vulnerability_id"
    ADVISORY_VULNERABILITY_SCORE }o--|| ADVISORY : "advisory_id"
    ADVISORY_VULNERABILITY_SCORE }o--|| VULNERABILITY : "vulnerability_id"
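
For orientation, a schema migration creating the score table sketched above could look roughly like the following. This is a hedged illustration derived from the ER diagram, not the PR's actual migration/src/m0002010_add_advisory_scores.rs; enum columns and foreign keys are simplified to plain columns.

```rust
// Illustrative sketch based on the ER diagram above, not the PR's real migration.
use sea_orm_migration::prelude::*;

#[derive(DeriveMigrationName)]
pub struct Migration;

#[async_trait::async_trait]
impl MigrationTrait for Migration {
    async fn up(&self, manager: &SchemaManager) -> Result<(), DbErr> {
        manager
            .create_table(
                Table::create()
                    .table(AdvisoryVulnerabilityScore::Table)
                    .col(
                        ColumnDef::new(AdvisoryVulnerabilityScore::Id)
                            .uuid()
                            .not_null()
                            .primary_key(),
                    )
                    .col(ColumnDef::new(AdvisoryVulnerabilityScore::AdvisoryId).uuid().not_null())
                    .col(ColumnDef::new(AdvisoryVulnerabilityScore::VulnerabilityId).string().not_null())
                    // `type` and `severity` are modelled as plain strings here;
                    // the real migration likely uses database enums.
                    .col(ColumnDef::new(AdvisoryVulnerabilityScore::Type).string().not_null())
                    .col(ColumnDef::new(AdvisoryVulnerabilityScore::Vector).string().not_null())
                    .col(ColumnDef::new(AdvisoryVulnerabilityScore::Score).double().not_null())
                    .col(ColumnDef::new(AdvisoryVulnerabilityScore::Severity).string().not_null())
                    .to_owned(),
            )
            .await
    }

    async fn down(&self, manager: &SchemaManager) -> Result<(), DbErr> {
        manager
            .drop_table(Table::drop().table(AdvisoryVulnerabilityScore::Table).to_owned())
            .await
    }
}

#[derive(DeriveIden)]
enum AdvisoryVulnerabilityScore {
    Table,
    Id,
    AdvisoryId,
    VulnerabilityId,
    Type,
    Vector,
    Score,
    Severity,
}
```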

Class diagram for the new data migration subsystem and document handlers

classDiagram
    class Runner {
        +String database_url
        +Option<String> database_schema
        +DispatchBackend storage
        +Direction direction
        +Vec<String> migrations
        +Options options
        +run<M: MigratorWithData>()
    }
    class MigratorWithData {
        +data_migrations() Vec<Box<MigrationTraitWithData>>
    }
    class MigrationTraitWithData {
        +up(manager: SchemaDataManager)
        +down(manager: SchemaDataManager)
    }
    class SchemaDataManager {
        +SchemaManager manager
        +DispatchBackend storage
        +Options options
        +process<D, N>(name, f)
    }
    class Handler~D~ {
        +call(document: D, model: D::Model, tx)
    }
    class Document {
        +all(tx)
        +source(model, storage, tx)
    }
    class Sbom {
        +CycloneDx
        +Spdx
        +Other
    }
    class Advisory {
        +Cve
        +Csaf
        +Osv
        +Other
    }
    Runner --> MigratorWithData
    MigratorWithData --> MigrationTraitWithData
    MigrationTraitWithData --> SchemaDataManager
    SchemaDataManager --> Handler
    Handler --> Document
    Document <|-- Sbom
    Document <|-- Advisory
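
To show how these pieces are meant to fit together, here is a rough sketch of a data migration written against the shapes in the diagram. The trait and method signatures are taken from the class diagram and are illustrative only; the PR's actual definitions in migration/src/data may differ.

```rust
// Illustrative only: signatures follow the class diagram above, not
// necessarily the PR's real API in migration/src/data.
pub struct AddSbomProperties;

#[async_trait::async_trait]
impl MigrationTraitWithData for AddSbomProperties {
    async fn up(&self, manager: &SchemaDataManager) -> Result<(), DbErr> {
        // `process` walks all stored documents of the requested kind, loads the
        // original source from the configured storage backend (DispatchBackend),
        // and hands the parsed document plus its database model to the handler.
        manager
            .process("sbom-properties", |sbom: Sbom, model, tx| async move {
                // Re-extract the new `properties` value from the original
                // document and persist it on the existing SBOM row.
                let _ = (sbom, model, tx);
                Ok(())
            })
            .await
    }

    async fn down(&self, _manager: &SchemaDataManager) -> Result<(), DbErr> {
        // Reverting the derived data is typically left to the schema migration
        // that drops the column again.
        Ok(())
    }
}
```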

Class diagram for the new advisory vulnerability score entity

classDiagram
    class AdvisoryVulnerabilityScore {
        +UUID id
        +UUID advisory_id
        +String vulnerability_id
        +ScoreType type
        +String vector
        +f64 score
        +Severity severity
    }
    class ScoreType {
        +V2_0
        +V3_0
        +V3_1
        +V4_0
    }
    class Severity {
        +None
        +Low
        +Medium
        +High
        +Critical
    }
    AdvisoryVulnerabilityScore --> ScoreType
    AdvisoryVulnerabilityScore --> Severity

File-Level Changes

Change Details Files
Implement data migration subsystem for document re-processing
  • Define MigrationTraitWithData, MigratorExt, MigratorWithData and Runner
  • Implement document partitioning, handlers and sbom/advisory macros
  • Add migration/src/data module and CLI in migration/src/bin/data.rs
  • Integrate db data subcommand in trustd and common/db data_migrate
migration/src/lib.rs
migration/src/data/*
migration/src/bin/data.rs
trustd/src/db.rs
common/db/src/lib.rs
Add advisory vulnerability scoring pipeline
  • Implement ScoreCreator and ScoreInformation types
  • Extract scores in osv, cve and csaf modules
  • Update loaders to create and persist scores
  • Add advisory_vulnerability_score entity and migration
modules/ingestor/src/service/advisory/osv/mod.rs
modules/ingestor/src/service/advisory/cve/mod.rs
modules/ingestor/src/service/advisory/csaf/mod.rs
modules/ingestor/src/graph/cvss.rs
entity/src/advisory_vulnerability_score.rs
migration/src/m0002010_add_advisory_scores.rs
Introduce SBOM properties extraction and storage
  • Add JSON properties column to SBOM entity
  • Implement extract_properties_json in common
  • Extend SBOM Graph models to include properties
  • Create data migration for SBOM properties
entity/src/sbom.rs
common/src/advisory/cyclonedx.rs
modules/ingestor/src/graph/sbom/*
migration/src/m0002000_add_sbom_properties.rs
Centralize storage backend initialization and enforce Send on async storage traits
  • Implement StorageConfig.into_storage(initializer); a rough sketch follows after this file listing
  • Replace inline FS/S3 init in server and importer
  • Add Send bounds to StorageBackend methods
  • Remove duplicated storage setup code
modules/storage/src/config.rs
modules/storage/src/service/mod.rs
modules/storage/src/service/dispatch.rs
modules/storage/src/service/s3.rs
modules/storage/src/service/fs.rs
server/src/profile/api.rs
server/src/profile/importer.rs
trustd/src/db.rs
Refactor migration registration and orchestration
  • Introduce build_migrations and into_migrations builder pattern
  • Combine schema and data migrations in MigratorExt
  • Simplify MigratorTrait::migrations implementation
migration/src/lib.rs
Add ADR for re-processing of documents
  • Document the context, options and consequences for re-processing during migrations
docs/adrs/00009-re-process-documents.md
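
Returning to the storage change listed above: the following is a rough idea of what the unified StorageConfig::into_storage setup could look like. The enum variants and constructor calls are placeholders rather than the real signatures in modules/storage/src/config.rs, and the initializer parameter is omitted.

```rust
// Sketch only: variant and constructor names are placeholders for the real
// types in modules/storage; the actual method also takes an initializer.
impl StorageConfig {
    pub async fn into_storage(self) -> anyhow::Result<DispatchBackend> {
        Ok(match self {
            // Local filesystem backend, e.g. for single-node or test setups.
            StorageConfig::FileSystem(config) => {
                DispatchBackend::Filesystem(FileSystemBackend::new(config).await?)
            }
            // S3-compatible object store for clustered deployments.
            StorageConfig::S3(config) => DispatchBackend::S3(S3Backend::new(config).await?),
        })
    }
}
```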

Possibly linked issues

  • #Unify how document re-ingestion behaves for all document types: The PR provides a unified framework for re-processing documents during database migrations, directly addressing the issue's goal of unifying re-ingestion behavior for all document types to add new data.
  • #TC-2731: The PR adds a framework for re-processing ingested documents during database schema migrations, addressing the issue's need to refresh database information.


@codecov

codecov bot commented Aug 5, 2025

Codecov Report

❌ Patch coverage is 72.54658% with 221 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.19%. Comparing base (390833b) to head (b6fc4ab).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| entity/src/advisory_vulnerability_score.rs | 21.05% | 45 Missing ⚠️ |
| migration/src/bin/data.rs | 0.00% | 37 Missing ⚠️ |
| migration/src/data/mod.rs | 75.65% | 22 Missing and 6 partials ⚠️ |
| modules/ingestor/src/graph/cvss.rs | 81.35% | 20 Missing and 2 partials ⚠️ |
| trustd/src/db.rs | 0.00% | 22 Missing ⚠️ |
| migration/src/data/run.rs | 58.82% | 12 Missing and 2 partials ⚠️ |
| migration/src/data/partition.rs | 64.00% | 9 Missing ⚠️ |
| common/src/advisory/cyclonedx.rs | 57.89% | 6 Missing and 2 partials ⚠️ |
| modules/ingestor/src/service/advisory/osv/mod.rs | 87.93% | 7 Missing ⚠️ |
| migration/src/data/document/mod.rs | 76.00% | 1 Missing and 5 partials ⚠️ |
... and 9 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1913      +/-   ##
==========================================
+ Coverage   68.02%   68.19%   +0.17%     
==========================================
  Files         367      382      +15     
  Lines       20590    21320     +730     
  Branches    20590    21320     +730     
==========================================
+ Hits        14006    14539     +533     
- Misses       5747     5918     +171     
- Partials      837      863      +26     


@ctron
Contributor Author

ctron commented Aug 7, 2025

@sourcery-ai review

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey @ctron - I've reviewed your changes and found some issues that need to be addressed.

Blocking issues:

  • Invalid trait bound + use<> in return type. (link)
  • Invalid trait bound + use<> in return type. (link)

General comments:

  • The example migration currently hardcodes FileSystemBackend::for_test, so make the storage backend configurable (e.g., allow S3) to support real‐world migrations instead of only test scenarios.
  • The retrieve method signature was updated with a use<Self> bound, which looks like a typo—verify that you intended a lifetime bound (for example + 'a + Send) and correct the signature accordingly.
  • In DocumentProcessor::process, wrapping all errors as DbErr::Migration(format!(…)) loses structured error context; consider using a custom error type or propagating the original error to make debugging easier.
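
On that last point, a minimal sketch of what a dedicated error type could look like; the names here are illustrative and not part of the PR:

```rust
// Illustrative only: keeps structured error variants instead of flattening
// everything into DbErr::Migration(String) at the point of failure.
use sea_orm::DbErr;

#[derive(Debug, thiserror::Error)]
enum ProcessError {
    #[error("storage error: {0}")]
    Storage(#[from] std::io::Error),
    #[error("database error: {0}")]
    Database(#[from] DbErr),
    #[error("failed to parse document: {0}")]
    Parse(String),
}

impl From<ProcessError> for DbErr {
    fn from(err: ProcessError) -> Self {
        // Convert only at the migration boundary, so callers inside the
        // processor can still match on the concrete failure.
        DbErr::Migration(err.to_string())
    }
}
```
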
## Individual Comments

### Comment 1
<location> `modules/storage/src/service/s3.rs` </location>
<code_context>
-    async fn retrieve<'a>(
+    async fn retrieve(
         &self,
         StorageKey(key): StorageKey,
-    ) -> Result<Option<impl Stream<Item = Result<Bytes, Self::Error>> + 'a>, Self::Error> {
</code_context>

<issue_to_address>
Invalid trait bound `+ use<>` in return type.

`+ use<>` is invalid Rust syntax and will not compile. Please correct this trait bound.
</issue_to_address>

### Comment 2
<location> `modules/storage/src/service/fs.rs` </location>
<code_context>
-    async fn retrieve<'a>(
+    async fn retrieve(
         &self,
         key: StorageKey,
-    ) -> Result<Option<impl Stream<Item = Result<Bytes, Self::Error>> + 'a>, Self::Error> {
</code_context>

<issue_to_address>
Invalid trait bound `+ use<>` in return type.

This trait bound will cause a compilation error; please update it to valid Rust syntax.
</issue_to_address>

### Comment 3
<location> `docs/adrs/00008-re-process-documents.md:17` </location>
<code_context>
+When making changes to the database structure, we also have a migration process, which takes care of upgrading the
+database structures during an upgrade.
+
+However, in some cases, changing the database structure actually means to extract more information from documents and is
+currently stored in the database. Or information is extracted in a different way. This requires a re-processing of
+all documents affected by this change.
+
</code_context>

<issue_to_address>
Grammatical error: 'means to extract more information from documents and is currently stored' should be 'means extracting more information from documents than is currently stored'.

It should be: 'changing the database structure actually means extracting more information from documents than is currently stored in the database.'
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
However, in some cases, changing the database structure actually means to extract more information from documents and is
currently stored in the database. Or information is extracted in a different way. This requires a re-processing of
all documents affected by this change.
=======
However, in some cases, changing the database structure actually means extracting more information from documents than is
currently stored in the database. Or information is extracted in a different way. This requires a re-processing of
all documents affected by this change.
>>>>>>> REPLACE

</suggested_fix>


@ctron ctron force-pushed the feature/adr_rescan_1 branch 5 times, most recently from 55a099b to b0866ec on August 7, 2025 12:16
@carlosthe19916
Contributor

Just some thoughts:

  • Would it be possible to trigger the "re-processing" in parallel? E.g. multiple pods. What would be the technical challenges?
  • Would it be helpful to mark the documents (e.g. SBOMs) with a version number to indicate which is its current "re-ingest" version? Similar to database migration controls.
  • Just like the importers, I wonder if it would be worth sharing "how many documents were processed, how long we should wait for the process to finish, etc. (estimations)"

enabled). The process will migrate schema and data. This might block the startup for a bit. But would be fast and
simple for small systems.

### Approach 2
Contributor

There may be a third way, e.g. only using `default_transaction_read_only`, which will switch the db to read-only mode ... note this is session based, so maybe this is just a matter of keeping the original conn for normal operations and generating a new conn (without setting `default_transaction_read_only`) to do any mutations. I think I would also use `default_transaction_read_only` for blue/green as well, but thought I would mention this.

Contributor Author

I think working with read-only transactions makes sense in general. Some operations are just read-only by nature.

We could ask the user to reconfigure Trustify for read-only transactions and then run the migrations. But that wouldn't be much different from the blue/green approach? We'd just need to ensure we enable the user to do so.
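
A minimal sketch of that idea, assuming SeaORM and a single dedicated connection (with a connection pool, the setting would have to be applied per session, e.g. via a connect hook); the connection URL is a placeholder:

```rust
// Sketch only: open the application's connection in read-only mode while the
// migration uses a separate, writable connection.
use sea_orm::{ConnectionTrait, Database, DatabaseConnection, DbErr};

async fn connect(read_only: bool) -> Result<DatabaseConnection, DbErr> {
    let db = Database::connect("postgres://trustify@localhost/trustify").await?;
    if read_only {
        // `default_transaction_read_only` is session-scoped: it only affects
        // transactions opened on this connection, not the whole database.
        db.execute_unprepared("SET default_transaction_read_only = on")
            .await?;
    }
    Ok(db)
}
```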


## Consequences

* The migration will block the upgrade process until it is finished
Contributor

@JimFuller-RedHat JimFuller-RedHat Aug 7, 2025

Failure during these kinds of migrations will need extra testing ... and maybe specific TX handling (transactional DDL is your friend).

Contributor Author

Yes, that's why I started adding tests for migrations (with data) as well.

* 👍 Upgrade process is faster and less complex
* 👎 Requires some coordination between instances (only one processor at a time, maybe one after the other)

### Option 3
Contributor

There may be other options - there are a lot of PG extensions for this sort of thing, e.g. https://github.com/xataio/pgroll

Contributor Author

Definitely worth checking out. However, it seems to add a lot of complexity which would then need to be handled by us. And I'm not sure it is worth the effort.


### Approach 2

The user uses a green/blue deployment. Switching the application to use green and run migrations against blue. Once
Contributor

Worth mentioning that Amazon has blue/green PG deployments as part of their service, though I do not have much experience with them.

Contributor Author

Yes, that was my thought here. I'd expect other PG services to have similar setups available. However, it's up to the user to provide a database. We'd just try to ensure we can work with a model like this.

@ctron
Contributor Author

ctron commented Aug 11, 2025

Just some thoughts:

  • Would it be possible to trigger the "re-processing" in parallel? E.g. multiple pods. What would be the technical challenges?

I don't think there would be a way of doing this automatically. Assuming the Sea ORM migrations drive this, it would be sequential. As part of the data migration most likely there would also be a schema migration.

However, if we allow the data migrations to be run up-front, we could bundle such processing. What definitely would work is to run multiple processors (in a single process/pod) in parallel.

We could improve on that, but it would add much more complexity. And I'd like to avoid that in the beginning.

  • Would it be helpful to mark the documents (e.g. SBOMs) with a version number to indicate which is its current "re-ingest" version? Similar to database migration controls.

Very good idea. However, that also conflicts with the idea of running multiple steps/migrations in parallel. So each document must be upgraded step by step, sequentially. However, that process could run in parallel across documents.

  • Just like the importers, I wonder if it would be worth sharing "how many documents were processed, how long we should wait for the process to finish, etc. (estimations)"

I thought about that too. However, I believe that with the currently preferred idea of A/B (green/blue) deployments, that information would live not in the UI, but with some other process, detached from the UI.

Assuming a user is running blue/green. The upgrade would run on blue, but the UI would serve (read-only) from green. So it would not see the state of blue until it's finished.
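
As a rough sketch of how the two ideas above could combine (a per-document re-ingest version plus parallelism across documents), with all table, column and helper names being hypothetical:

```rust
// Hypothetical sketch: documents below the target re-ingest version are
// selected and processed concurrently, while the steps for any single
// document run in order. None of these names exist in trustify today.
use futures::{stream, StreamExt, TryStreamExt};

const TARGET_VERSION: i32 = 3;

async fn reprocess_pending(ids: Vec<uuid::Uuid>, concurrency: usize) -> anyhow::Result<()> {
    stream::iter(ids)
        .map(|id| async move {
            // Apply all missing migration steps for this document, one after
            // the other, then bump its stored re-ingest version.
            upgrade_document(id, TARGET_VERSION).await
        })
        .buffer_unordered(concurrency)
        .try_for_each(|()| async { Ok(()) })
        .await
}

async fn upgrade_document(_id: uuid::Uuid, _target: i32) -> anyhow::Result<()> {
    // Placeholder: load the stored source document, re-run the relevant
    // extraction steps and update the derived rows in the database.
    Ok(())
}
```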

Comment on lines +43 to +46
This would also require to prevent users from creating new documents during that time. Otherwise, we would need to
re-process documents ingested during the migration time. A way of doing this could be to leverage PostgreSQL's ability
to switch into read-only mode. Having mutable operations fail with a 503 (Service Unavailable) error. This would also
Contributor

Just to be clear: users would be unable to ingest new data while an update/migration is in progress? And that could potentially take more than a day? I think that's a bad look.

I believe it's worth the effort to design a system that allows the re-processing of docs ingested during migration. We should aim for "zero downtime".

Contributor Author

Maybe you can add another option to the document describing your idea.

Contributor

Once the code and db schema have been updated, why can't new docs (post update) be ingested while old docs (pre update) are being re-ingested?

Contributor Author

Assuming you're adding a new, mandatory field. How would that be populated during schema migration? How would new code, relying on that data, work when that data is not present? Why should docs be re-ingested, and not just missing data be amended?

Contributor

I'm not suggesting we add a new field. I'm asking why new docs can't be ingested into the new schema with the new code while the old docs are being re-ingested into the new schema with the new code. "Old" docs could be identified by their ingestion date.

Contributor Author

That is a use case we need to cover. Adding a new field.

@ctron ctron force-pushed the feature/adr_rescan_1 branch from dbf13e3 to 0a96d93 on August 21, 2025 08:37
@ctron ctron force-pushed the feature/adr_rescan_1 branch 2 times, most recently from 391b790 to ef04e5d on September 10, 2025 11:03
@ctron ctron force-pushed the feature/adr_rescan_1 branch 3 times, most recently from bf41417 to ad42a6c on September 22, 2025 06:49
@ctron ctron mentioned this pull request Oct 6, 2025
@ctron ctron force-pushed the feature/adr_rescan_1 branch from 20e763c to c6bcac7 on October 7, 2025 07:14
sourcery-ai[bot]
sourcery-ai bot previously requested changes Oct 7, 2025
Contributor

@sourcery-ai sourcery-ai bot left a comment

New security issues found

@ctron ctron force-pushed the feature/adr_rescan_1 branch from c6bcac7 to 012d4d4 on October 13, 2025 12:45
ctron and others added 28 commits November 3, 2025 09:36
Signed-off-by: Dejan Bosanac <[email protected]>
Update csaf-rs dependency to use the "cvss" library where CVSS scores are stored as raw JSON values instead of pre-parsed objects.

Assisted-By: Claude
Signed-off-by: Dejan Bosanac <[email protected]>
@ctron ctron force-pushed the feature/adr_rescan_1 branch from 0771fba to b6fc4ab on November 3, 2025 08:36
* 👎 Can't fully migrate database (new mandatory field won't work)
* 👍 Upgrade process is faster and less complex
* 👎 Original sources might no longer have the documents
* 👎 Won't work for manual (API) uploads

I wonder if manual API uploads could be attestations of assertions, which could also be treated as source-of-truth document blobs.

Contributor Author

Manual uploads end up in the S3 store in the same way.


### Option 3

We change ingestion in a way that it is possible to just re-ingest every document. Meaning, we re-ingest from the

This is something that we do with our internal SCILo system (mentioned in our KubeCon talk): we run a backfill job based on our source of truth, keeping the source of truth in one table and a blob store. For our internal infrastructure it's not too bad to run it over the entire dataset (in TBs).

We then read in all the inputs and do a replacement of all the entries in question. We discussed having a table of SoT -> lastProcessedVersion as a way to keep a list of things that need to be processed. This would prevent long-lasting periods of unavailability as the dataset grows.

In that way we can do this in a "live" state.

Contributor Author

The core idea of this alternative was to re-ingest from the original sources (e.g. CSAF trusted provider, OSV git repo). However, that idea is impractical due to the fact that the original sources might not hold that content anymore, and that it wouldn't be available for API uploads in the first place (see below).

We already have an S3 store holding the original content. So it seems like a reasonable idea to actually use it.
