Conversation

@ctron ctron commented Oct 21, 2025

This creates a DB and storage dump by ingesting a specific dataset that is checked in to the repository.

This results in a "most recent" DB format dump, which PRs can then use to test whether the DB migration also runs against existing data (a non-empty DB). It can also be used to test data migrations.

This PR is a preparation step towards that: it only covers the creation of the dumps and does not yet use them. Consuming them will be part of a follow-up PR.

See: #2040

Summary by Sourcery

Add database and storage dump generation to the xtask CLI for migration testing and automate their creation and upload via a new GitHub Actions workflow.

New Features:

  • Enable optional storage dump generation in xtask, producing a tarball of the storage backend
  • Support ingestion of file or directory datasets in xtask through a new paths configuration field
  • Add GitHub Actions workflow that runs generate-dump, compresses the SQL dump, and uploads dumps and checksums to S3

Enhancements:

  • Refactor xtask config loading to include working directory context and support the new paths field
  • Add FileSystemBackend::for_test_with to create temporary storage with specified compression
  • Update dataset YAML schema to use mapping-style importer definitions

Build:

  • Add xtask dependencies: bytes, tar, walkdir, walker-common, and trustify-module-ingestor

CI:

  • Introduce migration-upload workflow to automate dump generation and S3 upload on pushes to main and release branches

@ctron ctron requested a review from dejanb October 21, 2025 09:01

sourcery-ai bot commented Oct 21, 2025

Reviewer's Guide

This PR enhances the xtask generate-dump command to produce filesystem storage dumps alongside SQL dumps, adds configurable ingestion of file paths, lets tests pick the storage backend's compression, updates the dataset schema and configs for the new options, introduces the required dependencies, and adds a GitHub Actions workflow to automate dump generation and upload to S3.

ER diagram for updated dataset config schema

```mermaid
erDiagram
    INSTRUCTIONS {
        Vec import
        Vec paths
    }
    IMPORTER_CONFIGURATION {
        type
        config
    }
    INSTRUCTIONS ||--o{ IMPORTER_CONFIGURATION : contains
    IMPORTER_CONFIGURATION {
        sbom
        cve
        osv
        csaf
    }
```
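
To make the mapping concrete, here is a minimal, self-contained sketch of how such a config could deserialize with serde. The shape follows the diagram above; the actual types in trustify differ, and the variant payloads and YAML values here are illustrative assumptions:

```rust
use serde::Deserialize;
use std::path::PathBuf;

// Hypothetical mirror of the dataset config schema sketched above.
#[derive(Debug, Deserialize)]
struct Instructions {
    #[serde(default)]
    import: Vec<ImporterConfiguration>,
    // New in this PR: files or directories to ingest directly.
    #[serde(default)]
    paths: Vec<PathBuf>,
}

// Externally tagged enum: each import entry is a single-key mapping
// such as `sbom: {...}` or `cve: {...}` (the new mapping syntax).
#[derive(Debug, Deserialize)]
#[serde(rename_all = "lowercase")]
enum ImporterConfiguration {
    Sbom(serde_yaml::Value),
    Cve(serde_yaml::Value),
    Osv(serde_yaml::Value),
    Csaf(serde_yaml::Value),
}

fn main() -> anyhow::Result<()> {
    // Mapping-style import entries plus the new `paths` field.
    let yaml = r#"
import:
  - sbom:
      source: https://example.com/sboms
  - cve:
      source: https://example.com/cves
paths:
  - datasets/extra
"#;
    let instructions: Instructions = serde_yaml::from_str(yaml)?;
    println!("{instructions:?}");
    Ok(())
}
```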

Class diagram for updated GenerateDump and Instructions structures

```mermaid
classDiagram
    class GenerateDump {
        +PathBuf output
        +Option<PathBuf> storage_output
        +Option<PathBuf> input
        +Option<PathBuf> working_dir
        +fn load_config(&self) -> anyhow::Result<(PathBuf, Instructions)>
        +async fn ingest(&self, runner: ImportRunner) -> anyhow::Result<()>
    }
    class Instructions {
        +Vec<ImporterConfiguration> import
        +Vec<PathBuf> paths
    }
    GenerateDump --> Instructions : uses
    Instructions o-- ImporterConfiguration
    ImporterConfiguration <|-- CveImporter
    ImporterConfiguration <|-- SbomImporter
    ImporterConfiguration <|-- OsvImporter
    ImporterConfiguration <|-- CsafImporter
```
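
Read as a CLI definition, the GenerateDump struct above maps onto a clap derive roughly like this; a sketch only, where the flag names and doc comments are assumptions, not copied from the PR:

```rust
use clap::Parser;
use std::path::PathBuf;

#[derive(Debug, Parser)]
struct GenerateDump {
    /// Where to write the SQL dump.
    #[arg(long)]
    output: PathBuf,
    /// Optional tarball of the storage backend (new in this PR).
    #[arg(long)]
    storage_output: Option<PathBuf>,
    /// Optional dataset config; a default is used when omitted.
    #[arg(long)]
    input: Option<PathBuf>,
    /// Base directory for resolving relative `paths` entries.
    #[arg(long)]
    working_dir: Option<PathBuf>,
}
```

Calling `GenerateDump::parse()` in a main function would then yield the four fields shown in the diagram.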

Class diagram for FileSystemBackend changes

```mermaid
classDiagram
    class FileSystemBackend {
        +async fn for_test_with(compression: Compression) -> anyhow::Result<(Self, TempDir)>
        +async fn new(path: Path, compression: Compression) -> anyhow::Result<Self>
    }
    FileSystemBackend --> Compression
    FileSystemBackend --> TempDir
```
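
Given the signatures in the diagram, a plausible body for for_test_with is small. This sketch assumes the tempfile crate and simply defers to new; the real implementation may differ:

```rust
impl FileSystemBackend {
    /// Create a throwaway backend in a fresh temporary directory,
    /// using the requested compression. The TempDir is handed back
    /// so the caller keeps the directory alive for the test.
    pub async fn for_test_with(
        compression: Compression,
    ) -> anyhow::Result<(Self, tempfile::TempDir)> {
        let dir = tempfile::tempdir()?;
        let backend = Self::new(dir.path(), compression).await?;
        Ok((backend, dir))
    }
}
```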

File-Level Changes

Support storage dump creation and file path ingestion in xtask
Files: xtask/src/dataset.rs
  • Add a storage_output CLI option to GenerateDump
  • Extend Instructions with a new paths field
  • Modify load_config to return a (working directory, config) tuple
  • Implement tar archive creation of storage_path when storage_output is set (see the sketch below)
  • Add logic to walk directories, then decompress and ingest files via IngestorService
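
The tar step mentioned above is essentially a directory-to-tarball copy. A minimal sketch with the tar crate, assuming storage_path points at the filesystem storage root; the names come from the description, the body is an assumption:

```rust
use std::{fs::File, path::Path};

// Pack the storage directory into a single uncompressed tarball.
fn dump_storage(storage_path: &Path, storage_output: &Path) -> anyhow::Result<()> {
    let file = File::create(storage_output)?;
    let mut builder = tar::Builder::new(file);
    // Archive the directory contents under "." so unpacking
    // recreates the same layout in place.
    builder.append_dir_all(".", storage_path)?;
    builder.finish()?;
    Ok(())
}
```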

Introduce configurable test storage backend
Files: modules/storage/src/service/fs.rs, xtask/src/dataset.rs
  • Add a for_test_with method accepting a Compression parameter
  • Switch GenerateDump to use for_test_with in test mode

Update dataset schema and examples for mapping syntax and paths
Files: xtask/schema/generate-dump.json, etc/datasets/ds4.yaml, .github/scripts/migration-dump/config.yaml
  • Add paths support to the generate-dump JSON schema
  • Convert YAML import entries to the new mapping syntax
  • Provide a default migration-dump config, including paths, under .github/scripts

Add CI workflow for dump generation and S3 upload
Files: .github/workflows/migration-upload.yaml
  • Create the migration-upload GitHub Actions workflow
  • Invoke xtask generate-dump with the config and a storage output
  • Compress the SQL dump with xz and upload both dumps, plus their checksums, to S3 in a branch/commit directory structure

Add new dependencies for storage and ingestion support
Files: xtask/Cargo.toml
  • Add the bytes, tar, walkdir, walker-common, and trustify-module-ingestor crates to xtask's Cargo.toml


@ctron ctron requested a review from mrizzi October 21, 2025 09:01

ctron commented Oct 21, 2025

@mrrajan maybe you're interested in that too?

sourcery-ai bot previously requested changes Oct 21, 2025

Hey there - I've reviewed your changes and they look great!

Blocking issues:

  • An action sourced from a third-party repository on GitHub is not pinned to a full-length commit SHA. Pinning an action to a full-length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `xtask/src/dataset.rs:192-195` </location>
<code_context>
+            IngestorService::new(Graph::new(runner.db.clone()), runner.storage.clone(), None);
+        for path in config.paths {
+            log::info!("Ingesting: {}", path.display());
+            let path = wd.join(path).canonicalize()?;
+            log::info!(" Resolved: {}", path.display());
+
</code_context>

<issue_to_address>
**suggestion:** Canonicalizing a path that does not exist will return an error; consider a fallback or clearer error message.

Consider handling the error from canonicalize to provide a clearer message or fallback when the path does not exist.

```suggestion
        for path in config.paths {
            log::info!("Ingesting: {}", path.display());
            let resolved_path = match wd.join(&path).canonicalize() {
                Ok(p) => p,
                Err(e) => {
                    log::error!(
                        "Failed to resolve path '{}': {}. Skipping ingestion for this path.",
                        path.display(),
                        e
                    );
                    continue;
                }
            };
            log::info!(" Resolved: {}", resolved_path.display());
```
</issue_to_address>

### Comment 2
<location> `xtask/src/dataset.rs:226-235` </location>
<code_context>
+                };
+                let data = detector.decompress(data).map_err(|err| anyhow!("{err}"))?;
+
+                let result = service
+                    .ingest(&data, Format::Unknown, (), None, Cache::Skip)
+                    .await?;
+                log::info!("  id: {}", result.id);
+                if !result.warnings.is_empty() {
</code_context>

<issue_to_address>
**suggestion:** Consider logging ingestion failures for individual files to aid debugging.

Logging the file name and error on ingestion failure will make it easier to pinpoint and resolve issues with specific files.

```suggestion
                match service
                    .ingest(&data, Format::Unknown, (), None, Cache::Skip)
                    .await
                {
                    Ok(result) => {
                        log::info!("  id: {}", result.id);
                        if !result.warnings.is_empty() {
                            log::warn!("  warnings:");
                            for warning in result.warnings {
                                log::warn!("    - {}", warning);
                            }
                        }
                    }
                    Err(err) => {
                        log::error!("Failed to ingest file '{}': {err}", name);
                    }
                }
```
</issue_to_address>

### Comment 3
<location> `.github/workflows/migration-upload.yaml:22` </location>
<code_context>
      - uses: Swatinem/rust-cache@v2
</code_context>

<issue_to_address>
**security (yaml.github-actions.security.third-party-action-not-pinned-to-commit-sha):** An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.

*Source: opengrep*
</issue_to_address>



ctron commented Oct 21, 2025

@sourcery-ai review

sourcery-ai bot previously requested changes Oct 21, 2025

Hey there - I've reviewed your changes and they look great!

Blocking issues:

  • An action sourced from a third-party repository on GitHub is not pinned to a full-length commit SHA. Pinning an action to a full-length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `xtask/src/dataset.rs:192-61` </location>
<code_context>
+
+        let service =
+            IngestorService::new(Graph::new(runner.db.clone()), runner.storage.clone(), None);
+        for path in config.paths {
+            log::info!("Ingesting: {}", path.display());
+            let path = wd.join(path);
+            let path = path
+                .canonicalize()
+                .with_context(|| format!("failed to canonicalize '{}'", path.display()))?;
+            log::info!(" Resolved: {}", path.display());
+
+            let mut files = vec![];
+
+            if path.is_dir() {
+                for entry in walkdir::WalkDir::new(path).follow_links(true) {
+                    let entry = entry?;
+                    if !entry.file_type().is_file() {
+                        continue;
+                    }
+                    if entry.file_name().to_string_lossy().starts_with(".") {
+                        continue;
</code_context>

<issue_to_address>
**suggestion (performance):** Consider limiting the number of files ingested to avoid resource exhaustion.

A configurable file count limit or warning for large directories would help prevent excessive memory or IO usage.

Suggested implementation:

```rust
            let mut files = vec![];

            // Set a configurable file count limit (default: 1000)
            let max_files = config.max_files.unwrap_or(1000);

            if path.is_dir() {
                let mut file_count = 0;
                for entry in walkdir::WalkDir::new(&path).follow_links(true) {
                    let entry = entry?;
                    if !entry.file_type().is_file() {
                        continue;
                    }
                    if entry.file_name().to_string_lossy().starts_with(".") {
                        continue;
                    }
                    if file_count >= max_files {
                        log::warn!(
                            "File count limit ({}) reached for '{}'. Only ingesting the first {} files.",
                            max_files,
                            path.display(),
                            max_files
                        );
                        break;
                    }
                    files.push(entry.into_path());
                    file_count += 1;
                }
            } else {
                files.push(path);
            }

```

- Ensure that `config.max_files` exists and is an `Option<usize>`. If not, add it to your config struct and make it configurable (e.g., via CLI or config file).
- You may want to document the new `max_files` option for users.
</issue_to_address>

### Comment 2
<location> `xtask/src/dataset.rs:221-223` </location>
<code_context>
+                let name = file.as_os_str().to_string_lossy().to_string();
+
+                log::info!("Loading: {name}");
+                let data: Bytes = fs::read(file).await?.into();
+
+                let detector = Detector {
</code_context>

<issue_to_address>
**suggestion (performance):** Large file ingestion may block or exhaust memory; consider streaming or chunking.

Streaming or reading files in chunks will help prevent memory issues and improve scalability for large files.

```suggestion
                use tokio::io::AsyncReadExt;
                use tokio::fs::File;

                // Read the file in chunks; note the chunks are still
                // accumulated into a single in-memory buffer below
                let mut f = File::open(&file).await?;
                let mut buffer = Vec::new();
                let mut chunk = [0u8; 8 * 1024]; // 8KB chunk size

                loop {
                    let n = f.read(&mut chunk).await?;
                    if n == 0 {
                        break;
                    }
                    buffer.extend_from_slice(&chunk[..n]);
                }
                let data: Bytes = buffer.into();

                let detector = Detector {
```
</issue_to_address>

### Comment 3
<location> `xtask/src/dataset.rs:227` </location>
<code_context>
+                log::info!("Loading: {name}");
+                let data: Bytes = fs::read(file).await?.into();
+
+                let detector = Detector {
+                    file_name: Some(name.as_str()),
+                    ..Default::default()
+                };
+                let data = detector.decompress(data).map_err(|err| anyhow!("{err}"))?;
+
+                let result = service
</code_context>

<issue_to_address>
**suggestion:** Consider logging decompression errors with file context.

Including the file name in error logs will make it easier to identify which file caused the decompression issue.

```suggestion
                let data = match detector.decompress(data) {
                    Ok(data) => data,
                    Err(err) => {
                        log::error!("Failed to decompress file '{}': {err}", name);
                        return Err(anyhow!("Failed to decompress file '{}': {err}", name));
                    }
                };
```
</issue_to_address>

### Comment 4
<location> `.github/workflows/migration-upload.yaml:22` </location>
<code_context>
      - uses: Swatinem/rust-cache@v2
</code_context>

<issue_to_address>
**security (yaml.github-actions.security.third-party-action-not-pinned-to-commit-sha):** An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.

*Source: opengrep*
</issue_to_address>


@ctron ctron force-pushed the feature/gen_migration_dump_1 branch from 3d5a625 to 9a8dbd0 on October 21, 2025 09:20

codecov bot commented Oct 21, 2025

Codecov Report

❌ Patch coverage is 0% with 76 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.40%. Comparing base (943e52e) to head (9a8dbd0).
⚠️ Report is 5 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| xtask/src/dataset.rs | 0.00% | 70 Missing ⚠️ |
| modules/storage/src/service/fs.rs | 0.00% | 6 Missing ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #2050      +/-   ##
==========================================
- Coverage   68.64%   68.40%   -0.25%
==========================================
  Files         362      362
  Lines       20240    20307      +67
  Branches    20240    20307      +67
==========================================
- Hits        13893    13890       -3
- Misses       5557     5626      +69
- Partials      790      791       +1
```


@dejanb dejanb left a comment

Looks good

@ctron ctron enabled auto-merge October 21, 2025 11:33
@ctron
Copy link
Contributor Author

ctron commented Oct 21, 2025

@sourcery-ai dismiss

@sourcery-ai sourcery-ai bot dismissed stale reviews from themself October 21, 2025 11:35

Automated Sourcery review dismissed.

@ctron ctron added this pull request to the merge queue Oct 21, 2025
Merged via the queue into guacsec:main with commit 0d1884c Oct 21, 2025
3 of 6 checks passed
@ctron ctron deleted the feature/gen_migration_dump_1 branch October 21, 2025 11:57