Create DB and storage dumps with each commit for testing migration #2050
Conversation
Reviewer's Guide

This PR enhances the xtask generate-dump command to produce filesystem storage dumps alongside SQL dumps, adds configurable ingestion of file paths, extends the storage backend for test compression, updates the dataset schema and configs for the new options, introduces the required dependencies, and adds a GitHub Actions workflow to automate dump generation and upload to S3.

ER diagram for updated dataset config schema

erDiagram
INSTRUCTIONS {
Vec import
Vec paths
}
IMPORTER_CONFIGURATION {
type
config
}
INSTRUCTIONS ||--o{ IMPORTER_CONFIGURATION : contains
IMPORTER_CONFIGURATION {
sbom
cve
osv
csaf
}
Class diagram for updated GenerateDump and Instructions structures

classDiagram
class GenerateDump {
+PathBuf output
+Option<PathBuf> storage_output
+Option<PathBuf> input
+Option<PathBuf> working_dir
+fn load_config(&self) -> anyhow::Result<(PathBuf, Instructions)>
+async fn ingest(&self, runner: ImportRunner) -> anyhow::Result<()>
}
class Instructions {
+Vec<ImporterConfiguration> import
+Vec<PathBuf> paths
}
GenerateDump --> Instructions : uses
Instructions o-- ImporterConfiguration
ImporterConfiguration <|-- CveImporter
ImporterConfiguration <|-- SbomImporter
ImporterConfiguration <|-- OsvImporter
ImporterConfiguration <|-- CsafImporter
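To make the shape above concrete, here is a minimal Rust sketch of how Instructions might be declared with serde; the ImporterConfiguration variants and their payloads are assumptions for illustration, not the crate's actual definitions:

```rust
use std::path::PathBuf;

use serde::Deserialize;

// Hypothetical stand-in for the real importer configuration enum; the actual
// type lives in the importer module and carries richer per-importer settings.
#[derive(Debug, Deserialize)]
#[serde(rename_all = "lowercase")]
enum ImporterConfiguration {
    Sbom { source: String },
    Cve { source: String },
    Osv { source: String },
    Csaf { source: String },
}

// Mirrors the diagram: importer runs to execute, plus local paths to ingest
// directly into the dump.
#[derive(Debug, Deserialize)]
struct Instructions {
    #[serde(default)]
    import: Vec<ImporterConfiguration>,
    #[serde(default)]
    paths: Vec<PathBuf>,
}
```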
Class diagram for FileSystemBackend changes

classDiagram
class FileSystemBackend {
+async fn for_test_with(compression: Compression) -> anyhow::Result<(Self, TempDir)>
+async fn new(path: Path, compression: Compression) -> anyhow::Result<Self>
}
FileSystemBackend --> Compression
FileSystemBackend --> TempDir
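As a usage illustration, a hypothetical test built on the new for_test_with constructor; only the signature comes from the diagram, while the import path and the Compression::None variant are assumed:

```rust
// Assumed import paths; adjust to the storage crate's actual layout.
// use trustify_module_storage::service::{fs::FileSystemBackend, Compression};

#[tokio::test]
async fn backend_for_test_with_compression() -> anyhow::Result<()> {
    // The returned TempDir must stay alive, or the backing directory is
    // deleted while the test is still using the backend.
    let (backend, _tempdir) = FileSystemBackend::for_test_with(Compression::None).await?;
    // Exercise `backend` here as in the existing storage tests.
    let _ = backend;
    Ok(())
}
```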
@mrrajan maybe you're interested in that too?
Hey there - I've reviewed your changes and they look great!
Blocking issues:
- An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload. (link)
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `xtask/src/dataset.rs:192-195` </location>
<code_context>
+ IngestorService::new(Graph::new(runner.db.clone()), runner.storage.clone(), None);
+ for path in config.paths {
+ log::info!("Ingesting: {}", path.display());
+ let path = wd.join(path).canonicalize()?;
+ log::info!(" Resolved: {}", path.display());
+
</code_context>
<issue_to_address>
**suggestion:** Canonicalizing a path that does not exist will return an error; consider a fallback or clearer error message.
Consider handling the error from canonicalize to provide a clearer message or fallback when the path does not exist.
```suggestion
for path in config.paths {
log::info!("Ingesting: {}", path.display());
            let resolved_path = match wd.join(&path).canonicalize() {
Ok(p) => p,
Err(e) => {
log::error!(
"Failed to resolve path '{}': {}. Skipping ingestion for this path.",
path.display(),
e
);
continue;
}
};
log::info!(" Resolved: {}", resolved_path.display());
```
</issue_to_address>
### Comment 2
<location> `xtask/src/dataset.rs:226-235` </location>
<code_context>
+ };
+ let data = detector.decompress(data).map_err(|err| anyhow!("{err}"))?;
+
+ let result = service
+ .ingest(&data, Format::Unknown, (), None, Cache::Skip)
+ .await?;
+ log::info!(" id: {}", result.id);
+ if !result.warnings.is_empty() {
</code_context>
<issue_to_address>
**suggestion:** Consider logging ingestion failures for individual files to aid debugging.
Logging the file name and error on ingestion failure will make it easier to pinpoint and resolve issues with specific files.
```suggestion
match service
.ingest(&data, Format::Unknown, (), None, Cache::Skip)
.await
{
Ok(result) => {
log::info!(" id: {}", result.id);
if !result.warnings.is_empty() {
log::warn!(" warnings:");
for warning in result.warnings {
log::warn!(" - {}", warning);
}
}
}
Err(err) => {
log::error!("Failed to ingest file '{}': {err}", name);
}
}
```
</issue_to_address>
### Comment 3
<location> `.github/workflows/migration-upload.yaml:22` </location>
<code_context>
- uses: Swatinem/rust-cache@v2
</code_context>
<issue_to_address>
**security (yaml.github-actions.security.third-party-action-not-pinned-to-commit-sha):** An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.
*Source: opengrep*
</issue_to_address>
@sourcery-ai review
Hey there - I've reviewed your changes and they look great!
Blocking issues:
- An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload. (link)
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `xtask/src/dataset.rs:192-61` </location>
<code_context>
+
+ let service =
+ IngestorService::new(Graph::new(runner.db.clone()), runner.storage.clone(), None);
+ for path in config.paths {
+ log::info!("Ingesting: {}", path.display());
+ let path = wd.join(path);
+ let path = path
+ .canonicalize()
+ .with_context(|| format!("failed to canonicalize '{}'", path.display()))?;
+ log::info!(" Resolved: {}", path.display());
+
+ let mut files = vec![];
+
+ if path.is_dir() {
+ for entry in walkdir::WalkDir::new(path).follow_links(true) {
+ let entry = entry?;
+ if !entry.file_type().is_file() {
+ continue;
+ }
+ if entry.file_name().to_string_lossy().starts_with(".") {
+ continue;
</code_context>
<issue_to_address>
**suggestion (performance):** Consider limiting the number of files ingested to avoid resource exhaustion.
A configurable file count limit or warning for large directories would help prevent excessive memory or IO usage.
Suggested implementation:
```rust
let mut files = vec![];
// Set a configurable file count limit (default: 1000)
let max_files = config.max_files.unwrap_or(1000);
if path.is_dir() {
let mut file_count = 0;
        for entry in walkdir::WalkDir::new(&path).follow_links(true) {
let entry = entry?;
if !entry.file_type().is_file() {
continue;
}
if entry.file_name().to_string_lossy().starts_with(".") {
continue;
}
if file_count >= max_files {
log::warn!(
"File count limit ({}) reached for '{}'. Only ingesting the first {} files.",
max_files,
path.display(),
max_files
);
break;
}
files.push(entry.into_path());
file_count += 1;
}
} else {
files.push(path);
}
```
- Ensure that `config.max_files` exists and is an `Option<usize>`. If not, add it to your config struct and make it configurable (e.g., via CLI or config file).
- You may want to document the new `max_files` option for users.
</issue_to_address>
### Comment 2
<location> `xtask/src/dataset.rs:221-223` </location>
<code_context>
+ let name = file.as_os_str().to_string_lossy().to_string();
+
+ log::info!("Loading: {name}");
+ let data: Bytes = fs::read(file).await?.into();
+
+ let detector = Detector {
</code_context>
<issue_to_address>
**suggestion (performance):** Large file ingestion may block or exhaust memory; consider streaming or chunking.
Streaming or reading files in chunks will help prevent memory issues and improve scalability for large files.
```suggestion
use tokio::io::AsyncReadExt;
use tokio::fs::File;
// Stream file in chunks to avoid loading entire file into memory
let mut f = File::open(&file).await?;
let mut buffer = Vec::new();
let mut chunk = [0u8; 8 * 1024]; // 8KB chunk size
loop {
let n = f.read(&mut chunk).await?;
if n == 0 {
break;
}
buffer.extend_from_slice(&chunk[..n]);
}
let data: Bytes = buffer.into();
let detector = Detector {
```
</issue_to_address>
### Comment 3
<location> `xtask/src/dataset.rs:227` </location>
<code_context>
+ log::info!("Loading: {name}");
+ let data: Bytes = fs::read(file).await?.into();
+
+ let detector = Detector {
+ file_name: Some(name.as_str()),
+ ..Default::default()
+ };
+ let data = detector.decompress(data).map_err(|err| anyhow!("{err}"))?;
+
+ let result = service
</code_context>
<issue_to_address>
**suggestion:** Consider logging decompression errors with file context.
Including the file name in error logs will make it easier to identify which file caused the decompression issue.
```suggestion
let data = match detector.decompress(data) {
Ok(data) => data,
Err(err) => {
log::error!("Failed to decompress file '{}': {err}", name);
return Err(anyhow!("Failed to decompress file '{}': {err}", name));
}
};
```
</issue_to_address>
### Comment 4
<location> `.github/workflows/migration-upload.yaml:22` </location>
<code_context>
- uses: Swatinem/rust-cache@v2
</code_context>
<issue_to_address>
**security (yaml.github-actions.security.third-party-action-not-pinned-to-commit-sha):** An action sourced from a third-party repository on GitHub is not pinned to a full length commit SHA. Pinning an action to a full length commit SHA is currently the only way to use an action as an immutable release. Pinning to a particular SHA helps mitigate the risk of a bad actor adding a backdoor to the action's repository, as they would need to generate a SHA-1 collision for a valid Git object payload.
*Source: opengrep*
</issue_to_address>
Force-pushed 3d5a625 to 9a8dbd0 (compare)
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
## main #2050 +/- ##
==========================================
- Coverage 68.64% 68.40% -0.25%
==========================================
Files 362 362
Lines 20240 20307 +67
Branches 20240 20307 +67
==========================================
- Hits 13893 13890 -3
- Misses 5557 5626 +69
- Partials 790 791 +1

☔ View full report in Codecov by Sentry.
Looks good
@sourcery-ai dismiss
Automated Sourcery review dismissed.
This creates a DB and storage dump by ingesting a specific dataset that is checked in to the repository.
This results in a "most recent" DB format dump, which PRs can then use to test whether the DB migration also runs against existing data (a non-empty DB). It can also be used to test data migrations.
This PR is a preparation step towards that: it only covers the creation of the dumps and does not yet use them. Using them will be part of a follow-up PR.
See: #2040
Summary by Sourcery
Add database and storage dump generation to the xtask CLI for migration testing and automate their creation and upload via a new GitHub Actions workflow.