Conversation

@ctron ctron commented Oct 21, 2025

This creates a DB and storage dump by ingesting a specific dataset that is checked in to the repository.

This results in a "most recent" DB format dump, which PRs can then use to test whether the DB migration also runs against existing data (a non-empty DB). It can also be used to test data migrations.

This PR is a preparation step towards that: it only covers the creation of the dumps and does not yet use them. Consuming them will be part of a follow-up PR.

See: #2040

Summary by Sourcery

Add database and storage dump generation to the xtask CLI for migration testing and automate their creation and upload via a new GitHub Actions workflow.

New Features:

  • Enable optional storage dump generation in xtask, producing a tarball of the storage backend
  • Support ingestion of file or directory datasets in xtask through a new paths configuration field
  • Add GitHub Actions workflow that runs generate-dump, compresses the SQL dump, and uploads dumps and checksums to S3

Enhancements:

  • Refactor xtask config loading to include working directory context and support the new paths field
  • Add FileSystemBackend::for_test_with to create temporary storage with specified compression
  • Update dataset YAML schema to use mapping-style importer definitions

Build:

  • Add xtask dependencies: bytes, tar, walkdir, walker-common, and trustify-module-ingestor

CI:

  • Introduce migration-upload workflow to automate dump generation and S3 upload on pushes to main and release branches

@ctron ctron requested a review from dejanb October 21, 2025 09:01

sourcery-ai bot commented Oct 21, 2025

Reviewer's Guide

This PR enhances the xtask generate-dump command to produce filesystem storage dumps alongside SQL dumps, adds configurable ingestion of file paths, lets tests pick the storage backend's compression, updates the dataset schema and configs for the new options, introduces the required dependencies, and adds a GitHub Actions workflow to automate dump generation and upload to S3.

ER diagram for updated dataset config schema

```mermaid
erDiagram
    INSTRUCTIONS {
        Vec import
        Vec paths
    }
    IMPORTER_CONFIGURATION {
        type
        config
    }
    INSTRUCTIONS ||--o{ IMPORTER_CONFIGURATION : contains
    IMPORTER_CONFIGURATION {
        sbom
        cve
        osv
        csaf
    }
```
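
To make the mapping concrete, here is a minimal, self-contained sketch of how such a config could deserialize with serde. The shape follows the diagram above; the actual types in trustify differ, and the variant payloads and YAML values here are illustrative assumptions:

```rust
use serde::Deserialize;
use std::path::PathBuf;

// Hypothetical mirror of the dataset config schema sketched above.
#[derive(Debug, Deserialize)]
struct Instructions {
    #[serde(default)]
    import: Vec<ImporterConfiguration>,
    // New in this PR: files or directories to ingest directly.
    #[serde(default)]
    paths: Vec<PathBuf>,
}

// Externally tagged enum: each import entry is a single-key mapping
// such as `sbom: {...}` or `cve: {...}` (the new mapping syntax).
#[derive(Debug, Deserialize)]
#[serde(rename_all = "lowercase")]
enum ImporterConfiguration {
    Sbom(serde_yaml::Value),
    Cve(serde_yaml::Value),
    Osv(serde_yaml::Value),
    Csaf(serde_yaml::Value),
}

fn main() -> anyhow::Result<()> {
    // Mapping-style import entries plus the new `paths` field.
    let yaml = r#"
import:
  - sbom:
      source: https://example.com/sboms
  - cve:
      source: https://example.com/cves
paths:
  - datasets/extra
"#;
    let instructions: Instructions = serde_yaml::from_str(yaml)?;
    println!("{instructions:?}");
    Ok(())
}
```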

Class diagram for updated GenerateDump and Instructions structures

```mermaid
classDiagram
    class GenerateDump {
        +PathBuf output
        +Option<PathBuf> storage_output
        +Option<PathBuf> input
        +Option<PathBuf> working_dir
        +fn load_config(&self) -> anyhow::Result<(PathBuf, Instructions)>
        +async fn ingest(&self, runner: ImportRunner) -> anyhow::Result<()>
    }
    class Instructions {
        +Vec<ImporterConfiguration> import
        +Vec<PathBuf> paths
    }
    GenerateDump --> Instructions : uses
    Instructions o-- ImporterConfiguration
    ImporterConfiguration <|-- CveImporter
    ImporterConfiguration <|-- SbomImporter
    ImporterConfiguration <|-- OsvImporter
    ImporterConfiguration <|-- CsafImporter
```
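
Read as a CLI definition, the GenerateDump struct above maps onto a clap derive roughly like this; a sketch only, where the flag names and doc comments are assumptions, not copied from the PR:

```rust
use clap::Parser;
use std::path::PathBuf;

#[derive(Debug, Parser)]
struct GenerateDump {
    /// Where to write the SQL dump.
    #[arg(long)]
    output: PathBuf,
    /// Optional tarball of the storage backend (new in this PR).
    #[arg(long)]
    storage_output: Option<PathBuf>,
    /// Optional dataset config; a default is used when omitted.
    #[arg(long)]
    input: Option<PathBuf>,
    /// Base directory for resolving relative `paths` entries.
    #[arg(long)]
    working_dir: Option<PathBuf>,
}
```

Calling `GenerateDump::parse()` in a main function would then yield the four fields shown in the diagram.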

Class diagram for FileSystemBackend changes

```mermaid
classDiagram
    class FileSystemBackend {
        +async fn for_test_with(compression: Compression) -> anyhow::Result<(Self, TempDir)>
        +async fn new(path: Path, compression: Compression) -> anyhow::Result<Self>
    }
    FileSystemBackend --> Compression
    FileSystemBackend --> TempDir
```
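
Given the signatures in the diagram, a plausible body for for_test_with is small. This sketch assumes the tempfile crate and simply defers to new; the real implementation may differ:

```rust
impl FileSystemBackend {
    /// Create a throwaway backend in a fresh temporary directory,
    /// using the requested compression. The TempDir is handed back
    /// so the caller keeps the directory alive for the test.
    pub async fn for_test_with(
        compression: Compression,
    ) -> anyhow::Result<(Self, tempfile::TempDir)> {
        let dir = tempfile::tempdir()?;
        let backend = Self::new(dir.path(), compression).await?;
        Ok((backend, dir))
    }
}
```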

File-Level Changes

Support storage dump creation and file path ingestion in xtask
Files: xtask/src/dataset.rs
  • Add a storage_output CLI option to GenerateDump
  • Extend Instructions with a new paths field
  • Modify load_config to return a (working directory, config) tuple
  • Implement tar archive creation of storage_path when storage_output is set (see the sketch below)
  • Add logic to walk directories, then decompress and ingest files via IngestorService
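
The tar step mentioned above is essentially a directory-to-tarball copy. A minimal sketch with the tar crate, assuming storage_path points at the filesystem storage root; the names come from the description, the body is an assumption:

```rust
use std::{fs::File, path::Path};

// Pack the storage directory into a single uncompressed tarball.
fn dump_storage(storage_path: &Path, storage_output: &Path) -> anyhow::Result<()> {
    let file = File::create(storage_output)?;
    let mut builder = tar::Builder::new(file);
    // Archive the directory contents under "." so unpacking
    // recreates the same layout in place.
    builder.append_dir_all(".", storage_path)?;
    builder.finish()?;
    Ok(())
}
```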

Introduce configurable test storage backend
Files: modules/storage/src/service/fs.rs, xtask/src/dataset.rs
  • Add a for_test_with method accepting a Compression parameter
  • Switch GenerateDump to use for_test_with in test mode

Update dataset schema and examples for mapping syntax and paths
Files: xtask/schema/generate-dump.json, etc/datasets/ds4.yaml, .github/scripts/migration-dump/config.yaml
  • Add paths support to the generate-dump JSON schema
  • Convert YAML import entries to the new mapping syntax
  • Provide a default migration-dump config, including paths, under .github/scripts

Add CI workflow for dump generation and S3 upload
Files: .github/workflows/migration-upload.yaml
  • Create the migration-upload GitHub Actions workflow
  • Invoke xtask generate-dump with the config and a storage output
  • Compress the SQL dump with xz and upload both dumps, plus their checksums, to S3 in a branch/commit directory structure

Add new dependencies for storage and ingestion support
Files: xtask/Cargo.toml
  • Add the bytes, tar, walkdir, walker-common, and trustify-module-ingestor crates to xtask's Cargo.toml


@ctron ctron requested a review from mrizzi October 21, 2025 09:01

ctron commented Oct 21, 2025

@mrrajan maybe you're interested in that too?

sourcery-ai bot previously requested changes Oct 21, 2025

Hey there - I've reviewed your changes and they look great!

Blocking issues:

  • An action sourced from a third-party repository on GitHub is not pinned to a full-length commit SHA. Pinning an action to a full-length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `xtask/src/dataset.rs:192-195` </location>
<code_context>
+            IngestorService::new(Graph::new(runner.db.clone()), runner.storage.clone(), None);
+        for path in config.paths {
+            log::info!("Ingesting: {}", path.display());
+            let path = wd.join(path).canonicalize()?;
+            log::info!(" Resolved: {}", path.display());
+
</code_context>

<issue_to_address>
**suggestion:** Canonicalizing a path that does not exist will return an error; consider a fallback or clearer error message.

Consider handling the error from canonicalize to provide a clearer message or fallback when the path does not exist.

```suggestion
        for path in config.paths {
            log::info!("Ingesting: {}", path.display());
            let resolved_path = match wd.join(&path).canonicalize() {
                Ok(p) => p,
                Err(e) => {
                    log::error!(
                        "Failed to resolve path '{}': {}. Skipping ingestion for this path.",
                        path.display(),
                        e
                    );
                    continue;
                }
            };
            log::info!(" Resolved: {}", resolved_path.display());
```
</issue_to_address>

### Comment 2
<location> `xtask/src/dataset.rs:226-235` </location>
<code_context>
+                };
+                let data = detector.decompress(data).map_err(|err| anyhow!("{err}"))?;
+
+                let result = service
+                    .ingest(&data, Format::Unknown, (), None, Cache::Skip)
+                    .await?;
+                log::info!("  id: {}", result.id);
+                if !result.warnings.is_empty() {
</code_context>

<issue_to_address>
**suggestion:** Consider logging ingestion failures for individual files to aid debugging.

Logging the file name and error on ingestion failure will make it easier to pinpoint and resolve issues with specific files.

```suggestion
                match service
                    .ingest(&data, Format::Unknown, (), None, Cache::Skip)
                    .await
                {
                    Ok(result) => {
                        log::info!("  id: {}", result.id);
                        if !result.warnings.is_empty() {
                            log::warn!("  warnings:");
                            for warning in result.warnings {
                                log::warn!("    - {}", warning);
                            }
                        }
                    }
                    Err(err) => {
                        log::error!("Failed to ingest file '{}': {err}", name);
                    }
                }
```
</issue_to_address>

### Comment 3
<location> `.github/workflows/migration-upload.yaml:22` </location>
<code_context>
      - uses: Swatinem/rust-cache@v2
</code_context>

<issue_to_address>
**security (yaml.github-actions.security.third-party-action-not-pinned-to-commit-sha):** An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.

*Source: opengrep*
</issue_to_address>



ctron commented Oct 21, 2025

@sourcery-ai review

sourcery-ai bot previously requested changes Oct 21, 2025

Hey there - I've reviewed your changes and they look great!

Blocking issues:

  • An action sourced from a third-party repository on GitHub is not pinned to a full-length commit SHA. Pinning an action to a full-length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `xtask/src/dataset.rs:192-61` </location>
<code_context>
+
+        let service =
+            IngestorService::new(Graph::new(runner.db.clone()), runner.storage.clone(), None);
+        for path in config.paths {
+            log::info!("Ingesting: {}", path.display());
+            let path = wd.join(path);
+            let path = path
+                .canonicalize()
+                .with_context(|| format!("failed to canonicalize '{}'", path.display()))?;
+            log::info!(" Resolved: {}", path.display());
+
+            let mut files = vec![];
+
+            if path.is_dir() {
+                for entry in walkdir::WalkDir::new(path).follow_links(true) {
+                    let entry = entry?;
+                    if !entry.file_type().is_file() {
+                        continue;
+                    }
+                    if entry.file_name().to_string_lossy().starts_with(".") {
+                        continue;
</code_context>

<issue_to_address>
**suggestion (performance):** Consider limiting the number of files ingested to avoid resource exhaustion.

A configurable file count limit or warning for large directories would help prevent excessive memory or IO usage.

Suggested implementation:

```rust
            let mut files = vec![];

            // Set a configurable file count limit (default: 1000)
            let max_files = config.max_files.unwrap_or(1000);

            if path.is_dir() {
                let mut file_count = 0;
                for entry in walkdir::WalkDir::new(&path).follow_links(true) {
                    let entry = entry?;
                    if !entry.file_type().is_file() {
                        continue;
                    }
                    if entry.file_name().to_string_lossy().starts_with(".") {
                        continue;
                    }
                    if file_count >= max_files {
                        log::warn!(
                            "File count limit ({}) reached for '{}'. Only ingesting the first {} files.",
                            max_files,
                            path.display(),
                            max_files
                        );
                        break;
                    }
                    files.push(entry.into_path());
                    file_count += 1;
                }
            } else {
                files.push(path);
            }

```

- Ensure that `config.max_files` exists and is an `Option<usize>`. If not, add it to your config struct and make it configurable (e.g., via CLI or config file).
- You may want to document the new `max_files` option for users.
</issue_to_address>

### Comment 2
<location> `xtask/src/dataset.rs:221-223` </location>
<code_context>
+                let name = file.as_os_str().to_string_lossy().to_string();
+
+                log::info!("Loading: {name}");
+                let data: Bytes = fs::read(file).await?.into();
+
+                let detector = Detector {
</code_context>

<issue_to_address>
**suggestion (performance):** Large file ingestion may block or exhaust memory; consider streaming or chunking.

Streaming or reading files in chunks will help prevent memory issues and improve scalability for large files.

```suggestion
                use tokio::io::AsyncReadExt;
                use tokio::fs::File;

                // Read the file in chunks; note the chunks are still
                // accumulated into a single in-memory buffer below
                let mut f = File::open(&file).await?;
                let mut buffer = Vec::new();
                let mut chunk = [0u8; 8 * 1024]; // 8KB chunk size

                loop {
                    let n = f.read(&mut chunk).await?;
                    if n == 0 {
                        break;
                    }
                    buffer.extend_from_slice(&chunk[..n]);
                }
                let data: Bytes = buffer.into();

                let detector = Detector {
```
</issue_to_address>

### Comment 3
<location> `xtask/src/dataset.rs:227` </location>
<code_context>
+                log::info!("Loading: {name}");
+                let data: Bytes = fs::read(file).await?.into();
+
+                let detector = Detector {
+                    file_name: Some(name.as_str()),
+                    ..Default::default()
+                };
+                let data = detector.decompress(data).map_err(|err| anyhow!("{err}"))?;
+
+                let result = service
</code_context>

<issue_to_address>
**suggestion:** Consider logging decompression errors with file context.

Including the file name in error logs will make it easier to identify which file caused the decompression issue.

```suggestion
                let data = match detector.decompress(data) {
                    Ok(data) => data,
                    Err(err) => {
                        log::error!("Failed to decompress file '{}': {err}", name);
                        return Err(anyhow!("Failed to decompress file '{}': {err}", name));
                    }
                };
```
</issue_to_address>

### Comment 4
<location> `.github/workflows/migration-upload.yaml:22` </location>
<code_context>
      - uses: Swatinem/rust-cache@v2
</code_context>

<issue_to_address>
**security (yaml.github-actions.security.third-party-action-not-pinned-to-commit-sha):** An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.

*Source: opengrep*
</issue_to_address>


@ctron ctron force-pushed the feature/gen_migration_dump_1 branch from 3d5a625 to 9a8dbd0 on October 21, 2025 09:20

codecov bot commented Oct 21, 2025

Codecov Report

❌ Patch coverage is 0% with 76 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.40%. Comparing base (943e52e) to head (9a8dbd0).
⚠️ Report is 5 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| xtask/src/dataset.rs | 0.00% | 70 Missing ⚠️ |
| modules/storage/src/service/fs.rs | 0.00% | 6 Missing ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #2050      +/-   ##
==========================================
- Coverage   68.64%   68.40%   -0.25%
==========================================
  Files         362      362
  Lines       20240    20307      +67
  Branches    20240    20307      +67
==========================================
- Hits        13893    13890       -3
- Misses       5557     5626      +69
- Partials      790      791       +1
```


@dejanb dejanb left a comment

Looks good

@ctron ctron enabled auto-merge October 21, 2025 11:33
@ctron
Copy link
Contributor Author

ctron commented Oct 21, 2025

@sourcery-ai dismiss

@sourcery-ai sourcery-ai bot dismissed stale reviews from themself October 21, 2025 11:35

Automated Sourcery review dismissed.

@ctron ctron added this pull request to the merge queue Oct 21, 2025
Merged via the queue into guacsec:main with commit 0d1884c Oct 21, 2025
3 of 6 checks passed
@ctron ctron deleted the feature/gen_migration_dump_1 branch October 21, 2025 11:57