-
Notifications
You must be signed in to change notification settings - Fork 10
feat: add dry run functionality to autopipeline and add sensitive dicom tags to metadata #430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…om tags to metadata
|
Important Review skippedReview was skipped due to path filters ⛔ Files ignored due to path filters (4)
CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including You can disable this status message by setting the
📝 WalkthroughWalkthroughAdds a CLI-accessible dry-run mode propagated through Autopipeline, SampleOutput, and writers to skip actual filesystem writes while still resolving paths and updating indexes; expands DICOM metadata extraction to include seven additional patient-related tags. process_one_sample only errors on missing outputs when not in dry-run. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Suggested labels
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #430 +/- ##
==========================================
- Coverage 54.60% 54.58% -0.02%
==========================================
Files 66 66
Lines 4318 4325 +7
==========================================
+ Hits 2358 2361 +3
- Misses 1960 1964 +4 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/imgtools/io/writers/numpy_writer.py (1)
74-79: dry_run currently masks invalid data types in NumPyWriter.saveWith the new
if not self.dry_run:guard, theelsebranch that raisesNumpyWriterValidationErrorfor unsupporteddatatypes is never executed in dry‑run mode. That means invalid inputs will “succeed” in dry‑run (path + index entry) but later fail only when running for real, which makes misconfiguration harder to catch.You can keep dry_run semantics while still validating input types by moving the type check outside the dry_run gate:
def save( self, data: np.ndarray | sitk.Image | dict[str, np.ndarray | sitk.Image], **kwargs: object, @@ - out_path = self.resolve_path(**kwargs) - - if not self.dry_run: - if isinstance(data, (np.ndarray, sitk.Image)): - # Single image or array - array, metadata = self._to_numpy(data) - np.savez_compressed(out_path, image_array=array, **metadata) - elif isinstance(data, dict): - # Multiple images or arrays - arrays = {} - metadata = {} - for key, value in data.items(): - array, meta = self._to_numpy(value) - arrays[key] = array - for meta_key, meta_value in meta.items(): - metadata[f"{key}_{meta_key}"] = meta_value - if self.compressed: - np.savez_compressed( - out_path, allow_pickle=False, **arrays, **metadata - ) - else: - np.savez(out_path, allow_pickle=False, **arrays, **metadata) - else: - raise NumpyWriterValidationError( - "Data must be a NumPy array, SimpleITK image, or a dictionary of these types." - ) + out_path = self.resolve_path(**kwargs) + + # Always validate input type, even in dry_run mode. + if not isinstance(data, (np.ndarray, sitk.Image, dict)): + raise NumpyWriterValidationError( + "Data must be a NumPy array, SimpleITK image, or a dictionary of these types." + ) + + if not self.dry_run: + if isinstance(data, (np.ndarray, sitk.Image)): + # Single image or array + array, metadata = self._to_numpy(data) + np.savez_compressed(out_path, image_array=array, **metadata) + elif isinstance(data, dict): + # Multiple images or arrays + arrays: dict[str, np.ndarray] = {} + metadata: dict[str, object] = {} + for key, value in data.items(): + array, meta = self._to_numpy(value) + arrays[key] = array + for meta_key, meta_value in meta.items(): + metadata[f"{key}_{meta_key}"] = meta_value + if self.compressed: + np.savez_compressed( + out_path, allow_pickle=False, **arrays, **metadata + ) + else: + np.savez(out_path, allow_pickle=False, **arrays, **metadata) @@ - self.add_to_index( + self.add_to_index( out_path, include_all_context=True, filepath_column="path", replace_existing=True, )(Separately, the TRY003 hint about the long error message is purely stylistic; you can leave it as‑is or move the message string into the exception class/docstring if you prefer.)
Also applies to: 98-125
🧹 Nitpick comments (4)
src/imgtools/dicom/dicom_metadata/extractor_base.py (1)
105-159: New PHI tags will now flow through all metadata pathsAdding
PatientSex,PatientBirthDate,PatientAge,EthnicGroup,PatientWeight,PatientSize, andAdditionalPatientHistorytobase_tagsmeans these PHI fields will be extracted for every modality and can end up in indexes, CSVs, and logs. Please double‑check that all downstream consumers (index writing, report generation, logging, export) are intended and configured to handle PHI safely (e.g., anonymization, access controls, redaction when exporting).src/imgtools/io/writers/abstract_base_writer.py (1)
624-655: ExampleWriter.save respects dry_run but still resolves paths and updates indexThe added
if not self.dry_run:around the file write aligns with the base class contract: dry‑run skips actual writes while still resolving the path and updating the index. Just be aware that directories and the index file are still created via__post_init__andadd_to_index, so this is a “no data files written” dry run, not a completely side‑effect‑free mode.src/imgtools/io/sample_output.py (1)
134-138: Dry‑run wiring is correct; consider whether root dir creation is desiredPassing
dry_run=self.dry_runandcreate_dirs=not self.dry_runintoNIFTIWriterensures that, in dry‑run mode, the writer doesn’t create per‑file directories or touch data files, while still updating the index and returning resolved paths.Note that the
directoryvalidator still callsvalidate_directory(..., create=True), so the top‑level output directory will be created even in dry‑run mode. If you want a completely non‑creating dry run, you may eventually want to make the validator conditional ondry_runas well, or document that dry_run only applies to individual file creation, not the root directory or index.Also applies to: 141-150
src/imgtools/cli/autopipeline.py (1)
168-175: CLI dry‑run flag is wired correctly; consider clarifying its semanticsThe
--dry-run / -dflag is correctly threaded through theautopipelinefunction intoAutopipeline(dry_run=...), so end‑to‑end behavior will track the flag as intended.If you want to avoid confusion, you might tweak the option help text to reflect the actual behavior (e.g., “Run the pipeline without writing output data files; still generates index and summary reports”) so users don’t expect a completely side‑effect‑free run.
Also applies to: 179-197, 255-275
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
pixi.lockis excluded by!**/*.lockand included by none
📒 Files selected for processing (7)
src/imgtools/autopipeline.py(4 hunks)src/imgtools/cli/autopipeline.py(3 hunks)src/imgtools/dicom/dicom_metadata/extractor_base.py(1 hunks)src/imgtools/io/sample_output.py(2 hunks)src/imgtools/io/writers/abstract_base_writer.py(5 hunks)src/imgtools/io/writers/nifti_writer.py(1 hunks)src/imgtools/io/writers/numpy_writer.py(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.py
⚙️ CodeRabbit configuration file
Review the Python code for compliance with PEP 8 and PEP 257 (docstring conventions). Ensure the following: - Variables and functions follow meaningful naming conventions. - Docstrings are present, accurate, and align with the implementation. - Code is efficient and avoids redundancy while adhering to DRY principles. - Consider suggestions to enhance readability and maintainability. - Highlight any potential performance issues, edge cases, or logical errors. - Ensure all imported libraries are used and necessary.
Files:
src/imgtools/cli/autopipeline.pysrc/imgtools/io/writers/numpy_writer.pysrc/imgtools/io/sample_output.pysrc/imgtools/dicom/dicom_metadata/extractor_base.pysrc/imgtools/autopipeline.pysrc/imgtools/io/writers/abstract_base_writer.pysrc/imgtools/io/writers/nifti_writer.py
🧬 Code graph analysis (2)
src/imgtools/io/sample_output.py (1)
src/imgtools/io/sample_input.py (1)
default(228-230)
src/imgtools/io/writers/nifti_writer.py (1)
src/imgtools/io/writers/abstract_base_writer.py (1)
ExistingFileMode(57-79)
🪛 Ruff (0.14.5)
src/imgtools/io/writers/numpy_writer.py
121-123: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: Unit-Tests (windows-latest, py311, public)
- GitHub Check: Unit-Tests (macos-latest, py312, public)
- GitHub Check: Unit-Tests (windows-latest, py310, public)
- GitHub Check: Integration-Tests (ubuntu-latest, py313, public)
- GitHub Check: Integration-Tests (windows-latest, py310, public)
- GitHub Check: Integration-Tests (windows-latest, py313, public)
- GitHub Check: Integration-Tests (ubuntu-latest, py310, public)
- GitHub Check: Integration-Tests (macos-latest, py313, public)
- GitHub Check: Integration-Tests (macos-latest, py310, public)
🔇 Additional comments (2)
src/imgtools/io/writers/nifti_writer.py (1)
135-225: Dry‑run gate around NIfTI I/O is well‑placedThe
if not self.dry_run:block cleanly isolates the existence check andsitk.WriteImagecall from the rest of the method, while still resolving the path and updating the index. OnceAbstractBaseWriter.resolve_pathis updated to avoid unlinking files in dry‑run mode, this writer’s behavior will match the documented semantics.src/imgtools/autopipeline.py (1)
259-269: dry_run propagation in Autopipeline is consistent and avoids false failuresForwarding
dry_runintoSampleOutputand relaxing the “No output files were saved” check whensample_output.dry_runisTruegives a sensible behavior: dry‑run still executes the full pipeline, populates the index, and generates reports, but doesn’t treat “no outputs written” as a hard error.Combined with the writer‑level dry_run gates (once
resolve_pathis fixed to avoid unlinking in dry_run), this offers a coherent end‑to‑end dry‑run mode without affecting existing non‑dry‑run behavior.Also applies to: 308-309, 346-347, 360-366, 465-472
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/imgtools/io/writers/abstract_base_writer.py (1)
356-368: dry_run still allows destructive operations viapreview_path(and directory manipulation)The early
if self.dry_run: return out_pathinresolve_pathnicely fixes the earlier issue whereOVERWRITEcould unlink files during a dry‑run. However:
preview_pathstill callsout_path.unlink()inExistingFileMode.OVERWRITEregardless ofself.dry_run, so a caller usingdry_run=True+OVERWRITE+preview_path()can still delete existing outputs.- Additionally,
resolve_pathwill still create parent directories for new paths (and__exit__can delete an empty root directory) even whendry_runis True, which partially contradicts the “skip actual file I/O operations” wording.If the intent is “no filesystem mutations at all in dry‑run,” you may want to:
- Guard the
out_path.unlink()inpreview_pathbehindnot self.dry_run.- Consider skipping new directory creation in
resolve_pathand the empty‑directory cleanup in__exit__whendry_runis True.If instead the goal is only “no data‑file writes, but directory housekeeping is OK,” it would still be worth updating the class‑level docs for
dry_runto call that out, and to mention thatpreview_pathremains destructive inOVERWRITEmode. This will make the contract much clearer to future users of the API.Also applies to: 439-460
🧹 Nitpick comments (2)
src/imgtools/io/writers/abstract_base_writer.py (2)
129-132: Aligndry_rundocumentation withresolve_pathbehaviour (especiallyFAILmode)The new
dry_runattribute and the abstractsavedoc clearly state that dry‑run “skips actual file I/O operations but still write[s] to the index and return[s] the resolved path,” which is good from a readability standpoint. However, there’s now a subtle mismatch between theresolve_pathdocstring and implementation:
- The
resolve_pathdocstring still says it “only raisesFileExistsErrorif the file already exists and the mode is set toFAIL.”- With the new
if self.dry_run: return out_pathbranch, aFileExistsErroris no longer raised inFAILmode whendry_runisTrue; the call quietly succeeds and returns the path.For maintainability, it would be good to make this explicit one way or the other:
- Either keep the old semantics and let
FAILstill raise even in dry‑run (so dry‑run simulates failures as well as writes), or- Intentionally document that in dry‑run,
resolve_pathnever raises on existing files regardless ofexisting_file_mode, because it’s meant as a read‑only preview.Right now, a reader relying on the docstring would expect
FileExistsErrorin dry‑run +FAIL, but won’t get it. Clarifying this in the docs (or adjusting the branch) will avoid surprises for users of the abstract base.Also applies to: 159-159, 220-235, 346-368
633-657: ExampleWriter.save dry_run handling is clear; minor doc/style nitsThe
ExampleWriter.saveimplementation looks good:
- It now respects
self.dry_runby skipping the actual file write while still callingadd_to_index, which matches the abstractsavecontract.- Using
replace_existing=output_path.exists()continues to behave sensibly in both normal and dry‑run modes.Two minor readability / style tweaks to consider:
- Line 653: PEP 8 prefers two spaces before an inline comment, e.g.
if self.dry_run: # ..., to keep comments visually separated from code.- Since this class is the “demo” implementation, you might add a brief line to the docstring noting that
dry_run=Truewill skip writing the file but still update the index. That would make the example self‑documenting for downstream users reading this concrete writer.Neither change affects behaviour, but they would make the example slightly clearer and more consistent.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/imgtools/io/writers/abstract_base_writer.py(6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.py
⚙️ CodeRabbit configuration file
Review the Python code for compliance with PEP 8 and PEP 257 (docstring conventions). Ensure the following: - Variables and functions follow meaningful naming conventions. - Docstrings are present, accurate, and align with the implementation. - Code is efficient and avoids redundancy while adhering to DRY principles. - Consider suggestions to enhance readability and maintainability. - Highlight any potential performance issues, edge cases, or logical errors. - Ensure all imported libraries are used and necessary.
Files:
src/imgtools/io/writers/abstract_base_writer.py
🧬 Code graph analysis (1)
src/imgtools/io/writers/abstract_base_writer.py (1)
src/imgtools/io/sample_output.py (1)
default(159-166)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Integration-Tests (windows-latest, py313, public)
- GitHub Check: Integration-Tests (macos-latest, py313, public)
- GitHub Check: Integration-Tests (windows-latest, py310, public)
- GitHub Check: Integration-Tests (ubuntu-latest, py310, public)
- GitHub Check: Integration-Tests (macos-latest, py310, public)
- GitHub Check: Integration-Tests (ubuntu-latest, py313, public)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/imgtools/io/sample_output.py (1)
134-150: Surfacedry_runclearly in docs and consider aligning directory creation with dry-run intentThe
dry_runflag being threaded intoNIFTIWriterandcreate_dirs=not self.dry_runlooks good and keeps the behavior localized.Two small maintainability/readability suggestions:
- The
SampleOutputclass docstring’s Attributes section doesn’t listdry_run. Adding it there will keep the public configuration surface self-documenting for users reading just this model.- Even when
dry_run=True,validate_directoryis still called withcreate=True, so instantiatingSampleOutputmay create the root output directory. If the intended contract of dry-run is “no filesystem changes at all,” consider gating this as well (e.g.,create=not dry_run) or at least documenting that only subdirectory creation is suppressed viacreate_dirs. This will make the behavior less surprising for callers relying on dry-run.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/imgtools/io/sample_output.py(2 hunks)src/imgtools/io/writers/abstract_base_writer.py(6 hunks)src/imgtools/io/writers/numpy_writer.py(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/imgtools/io/writers/abstract_base_writer.py
🧰 Additional context used
📓 Path-based instructions (1)
src/**/*.py
⚙️ CodeRabbit configuration file
Review the Python code for compliance with PEP 8 and PEP 257 (docstring conventions). Ensure the following: - Variables and functions follow meaningful naming conventions. - Docstrings are present, accurate, and align with the implementation. - Code is efficient and avoids redundancy while adhering to DRY principles. - Consider suggestions to enhance readability and maintainability. - Highlight any potential performance issues, edge cases, or logical errors. - Ensure all imported libraries are used and necessary.
Files:
src/imgtools/io/sample_output.pysrc/imgtools/io/writers/numpy_writer.py
🧬 Code graph analysis (2)
src/imgtools/io/sample_output.py (1)
src/imgtools/io/sample_input.py (1)
default(228-230)
src/imgtools/io/writers/numpy_writer.py (1)
src/imgtools/vizualize/visualizer2.py (1)
array(176-177)
🪛 Ruff (0.14.5)
src/imgtools/io/writers/numpy_writer.py
123-125: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: Integration-Tests (windows-latest, py313, public)
- GitHub Check: Integration-Tests (macos-latest, py310, public)
- GitHub Check: Integration-Tests (macos-latest, py313, public)
- GitHub Check: Integration-Tests (windows-latest, py310, public)
- GitHub Check: Integration-Tests (ubuntu-latest, py313, public)
- GitHub Check: Linting
- GitHub Check: Integration-Tests (ubuntu-latest, py310, public)
🔇 Additional comments (1)
src/imgtools/io/writers/numpy_writer.py (1)
85-87: Docstring update for**kwargsimproves clarityDocumenting
**kwargsas additional context for filename generation is helpful and keeps thesaveAPI self-explanatory for callers.
| if not self.dry_run: | ||
| if isinstance(data, (np.ndarray, sitk.Image)): | ||
| # Single image or array | ||
| array, metadata = self._to_numpy(data) | ||
| np.savez_compressed(out_path, image_array=array, **metadata) | ||
| elif isinstance(data, dict): | ||
| # Multiple images or arrays | ||
| arrays = {} | ||
| metadata = {} | ||
| for key, value in data.items(): | ||
| array, meta = self._to_numpy(value) | ||
| arrays[key] = array | ||
| for meta_key, meta_value in meta.items(): | ||
| metadata[f"{key}_{meta_key}"] = meta_value | ||
| if self.compressed: | ||
| np.savez_compressed( | ||
| out_path, allow_pickle=False, **arrays, **metadata | ||
| ) | ||
| else: | ||
| np.savez( | ||
| out_path, allow_pickle=False, **arrays, **metadata | ||
| ) | ||
| else: | ||
| np.savez(out_path, allow_pickle=False, **arrays, **metadata) | ||
| else: | ||
| raise NumpyWriterValidationError( | ||
| "Data must be a NumPy array, SimpleITK image, or a dictionary of these types." | ||
| ) | ||
| raise NumpyWriterValidationError( | ||
| "Data must be a NumPy array, SimpleITK image, or a dictionary of these types." | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preserve data validation behavior under dry_run and optionally centralize error message
The dry-run guard cleanly skips actual filesystem writes, but it currently also skips all type validation:
- The
isinstancechecks and_to_numpycalls only run whennot self.dry_run. - If
dry_run=Trueanddatais an unsupported type, noNumpyWriterValidationErroris raised; instead, the method quietly indexes a path and returns it. That makes it easy for a dry-run to “succeed” while a real run would fail at write time.
To keep dry-run useful for catching pipeline issues while still avoiding I/O, consider validating data independently of the write, e.g.:
- out_path = self.resolve_path(**kwargs)
-
- if not self.dry_run:
- if isinstance(data, (np.ndarray, sitk.Image)):
+ out_path = self.resolve_path(**kwargs)
+
+ # Validate input type regardless of dry-run to surface misuses early.
+ if isinstance(data, (np.ndarray, sitk.Image)):
+ to_save = ("single", data)
+ elif isinstance(data, dict):
+ to_save = ("dict", data)
+ else:
+ raise NumpyWriterValidationError(
+ "Data must be a NumPy array, SimpleITK image, or a dictionary of these types."
+ )
+
+ if not self.dry_run:
+ kind, payload = to_save
+ if kind == "single":
+ data_single = payload
# Single image or array
- array, metadata = self._to_numpy(data)
+ array, metadata = self._to_numpy(data_single)
np.savez_compressed(out_path, image_array=array, **metadata)
- elif isinstance(data, dict):
+ else: # dict
# Multiple images or arrays
- arrays = {}
- metadata = {}
- for key, value in data.items():
+ arrays: dict[str, np.ndarray] = {}
+ metadata: dict[str, object] = {}
+ for key, value in payload.items():
...
- else:
- raise NumpyWriterValidationError(
- "Data must be a NumPy array, SimpleITK image, or a dictionary of these types."
- )This keeps validation semantics identical between normal and dry-run modes while still avoiding writes.
Optionally, to satisfy TRY003 and improve reuse, you could move the long error message into a class-level constant on NumpyWriterValidationError or NumPyWriter and reference it from both _to_numpy and save, but that’s more of a style/readability tweak than a functional requirement.
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.14.5)
123-125: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In src/imgtools/io/writers/numpy_writer.py around lines 100 to 125, the current
dry-run branch skips all type validation so unsupported input types never raise
NumpyWriterValidationError during dry runs; change the flow to always validate
input (perform isinstance checks and call self._to_numpy or a dedicated
validation helper) before checking self.dry_run, then only skip the actual
np.savez/np.savez_compressed when dry_run is True; optionally extract the long
error message into a class-level constant on NumpyWriter or
NumpyWriterValidationError and reuse it in both _to_numpy and save to avoid
duplication.
Co-authored-by: Katy Scott <[email protected]>
skim2257
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great. just would love clarification on one CLI output
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"no output files, check the directory" sounds a bit confusing? what are they checking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you send the full output
Summary by CodeRabbit
New Features
Behavior Changes
✏️ Tip: You can customize this high-level summary in your review settings.