
Conversation

SFJohnson24
Collaborator

@SFJohnson24 SFJohnson24 commented Oct 2, 2025

This pull request introduces two major new configuration options for validation and reporting: limiting the number of rows per Excel report file and capping the number of errors reported per rule. These options can be set via CLI flags or environment variables, and the code now supports splitting Excel reports into multiple files when data exceeds the configured row limit. The changes are reflected in the CLI, reporting logic, and rule validation flow.

Reporting enhancements

  • Added support for splitting Excel reports into multiple files when the number of rows exceeds a configurable maximum (max_report_rows). This can be set via the CLI (-mr) or the MAX_REPORT_ROWS environment variable, and is documented in README.md and env.example. The logic uses the larger of the CLI and environment values, and setting the value to 0 disables the limit (see the sketch after this list).
  • Refactored ExcelReport to handle multiple output files, updating file naming and metadata to reflect report parts when splitting occurs.
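A minimal sketch of the row-based splitting, assuming hypothetical chunk_rows and part_file_names helpers and a "-partN" naming scheme; none of these names are taken from the actual ExcelReport implementation:

```python
from typing import Iterator


def chunk_rows(rows: list[dict], max_rows: int) -> Iterator[list[dict]]:
    """Yield successive slices of at most max_rows rows; 0 disables splitting."""
    if max_rows <= 0:
        yield rows
        return
    for start in range(0, len(rows), max_rows):
        yield rows[start:start + max_rows]


def part_file_names(base_name: str, total_parts: int) -> list[str]:
    """Suffix the report name with a part index only when splitting occurs."""
    if total_parts <= 1:
        return [f"{base_name}.xlsx"]
    return [f"{base_name}-part{i}.xlsx" for i in range(1, total_parts + 1)]
```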

Validation logic improvements

  • Introduced a configurable soft maximum for errors reported per rule (max_errors_per_rule), settable via the CLI (-me) or the MAX_ERRORS_PER_RULE environment variable, with logic to halt validation of a rule once the limit is reached. Defaults to 1000 if unspecified; setting it to 0 disables the limit.
  • Added a helper function, set_max_errors_per_rule, to centralize the max-error logic, and integrated it into both the CLI and script-based validation flows (see the sketch after this list).
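A minimal sketch of what set_max_errors_per_rule could look like; the resolution order (explicit CLI value, then the environment variable, then the 1000 default) and the signature are assumptions for illustration, not the engine's actual code:

```python
import os

DEFAULT_MAX_ERRORS_PER_RULE = 1000


def set_max_errors_per_rule(cli_value: int | None) -> int:
    """Resolve the per-rule error cap; a value of 0 disables the limit."""
    if cli_value is not None:
        return cli_value
    env_value = os.environ.get("MAX_ERRORS_PER_RULE")
    if env_value is not None:
        return int(env_value)
    return DEFAULT_MAX_ERRORS_PER_RULE
```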

CLI and documentation updates

  • Updated the CLI (core.py) to accept new flags for controlling the report row and error limits, and ensured environment variables are loaded for these options. Also updated test and validation argument handling to support the new parameters (see the sketch after this list).
  • Expanded README.md and env.example to document the new options and their behaviors.
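A minimal sketch of how the flags might be wired up, assuming a click-based CLI; only the -mr and -me short flags come from this PR, while the long option names and the validate command here are illustrative:

```python
import click


@click.command()
@click.option("-mr", "--max-report-rows", type=int, default=None,
              help="Max rows per Excel report file; 0 disables splitting.")
@click.option("-me", "--max-errors-per-rule", type=int, default=None,
              help="Soft cap on errors reported per rule; 0 disables the limit.")
def validate(max_report_rows, max_errors_per_rule):
    # Echo the resolved values; the real command would pass them into the
    # reporting and validation layers.
    click.echo(f"max_report_rows={max_report_rows}, "
               f"max_errors_per_rule={max_errors_per_rule}")


if __name__ == "__main__":
    validate()
```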

Collaborator

@RamilCDISC RamilCDISC left a comment


Could you please see the comment below. Also, will the rule editor have a way to set the size of records in the report?

Create multiple workbooks when data exceeds Excel's row limit.
"""
workbooks = []
detailed_chunks = self._chunk_data(detailed_data, self.MAX_ROWS_PER_SHEET)
Collaborator


Could you lowercase MAX_ROWS_PER_SHEET here? It is defined as max_rows_per_sheet in the constructor.

Collaborator Author


The rule editor will not have a way to set the size--this is purely a feature for the CLI and the executable. The editor uses Excel data and should not be getting a dataset with enough issues to reach Excel's row limit.

Collaborator Author


Could you lowercase MAX_ROWS_PER_SHEET here? It is defined as max_rows_per_sheet in the constructor.

This is done.

Collaborator

@RamilCDISC RamilCDISC left a comment


  1. I ran multiple validations with different test cases. I found that the issue summary page is being left empty from part 2 onwards. Only the first part shows content on the issue summary page.
  2. Could we make the env variable name and the CLI flag the same?
  3. Is the rules report sheet omitted from the max number of rows check?

@SFJohnson24
Collaborator Author

SFJohnson24 commented Oct 6, 2025

  1. I ran multiple validations with different test cases. I found that the issue summary page is being left empty from part 2 onwards. Only the first part shows content on the issue summary page.

I changed this so that if the summary is shorter than the row limit, it is printed in full on every workbook. If it is longer than the limit, each workbook is populated with as much as it can hold, which may leave blanks.
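A rough sketch of that behavior; the function name and slicing are illustrative only, not the actual report code:

```python
def summary_rows_for_part(summary_rows: list[dict], part_index: int, max_rows: int) -> list[dict]:
    """Pick the Issue Summary rows for one workbook part (part_index is 0-based)."""
    # If the summary fits within the row limit, repeat it in full on every part;
    # otherwise give each part its own slice, which can leave later parts blank.
    if max_rows <= 0 or len(summary_rows) <= max_rows:
        return summary_rows
    start = part_index * max_rows
    return summary_rows[start:start + max_rows]
```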

2. Could we make the env variable name and the CLI flag the same?

done

3. Is the rules report sheet omitted from the max number of rows check?

Yes--I could add this if we think it is necessary? The number of issues and the issue reports seemed like the logical things to split, as those could get quite long. I didn't think the rules report would grow to the point that it needs to be split across multiple workbooks.

@SFJohnson24 SFJohnson24 requested a review from RamilCDISC October 6, 2025 20:39
Collaborator

@RamilCDISC RamilCDISC left a comment


I ran a validation using the test_dataset.xlsx file in the test/resources folder. I set the standard to sdtmig, the version to 3.4, and -me 1. The Issue Details tab of the generated report shows the record for CORE-000206 80 times for the same dataset, suppec.xpt. What I understood from the README is that when the max error limit is met or exceeded, the engine will cease execution of that rule on that dataset. This violates the expected behavior.

The report file is attached below

CORE-Report-2025-10-08T17-10-39.xlsx

@SFJohnson24
Collaborator Author

SFJohnson24 commented Oct 9, 2025

@RamilCDISC -- the limit was met or exceeded for a single dataset. For suppec, it produced 80 errors, which is > 1, so it did not validate any of the other datasets against that rule. If you look at the results, it only validates one dataset per rule, because once the first dataset is validated with issues, it exceeds the limit and validation of that rule ceases for the remaining datasets. This is the expected behavior based on what you just pasted.

Setting a limit and stopping validation mid-dataset is much more difficult than stopping after each dataset runs--which is why it is "met or exceeded". The engine runs a dataset against a rule, checks the errors returned, and sees if it is at or over the limit.
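A rough sketch of that control flow; the function and parameter names are illustrative, not the engine's actual code:

```python
from typing import Callable


def validate_rule(
    rule: str,
    datasets: list[str],
    run_rule_on_dataset: Callable[[str, str], list[dict]],
    max_errors_per_rule: int,
) -> list[dict]:
    """Run one rule across datasets, stopping once the error cap is met or exceeded."""
    errors: list[dict] = []
    for dataset in datasets:
        errors.extend(run_rule_on_dataset(rule, dataset))
        # The cap is a soft limit checked only after a dataset completes, so a
        # single dataset can push the total well past the limit (e.g. 80 errors
        # with -me 1) before the remaining datasets are skipped for this rule.
        if max_errors_per_rule > 0 and len(errors) >= max_errors_per_rule:
            break
    return errors
```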

@SFJohnson24 SFJohnson24 requested a review from RamilCDISC October 9, 2025 13:24
Collaborator

@RamilCDISC RamilCDISC left a comment


I read your last comment; thank you for making it clearer.
Could you please see the comment below. It's the final change, and then we are good to merge.

DATASET_SIZE_THRESHOLD = size_in_bytes to force dask implementation
MAX_REPORT_ROWS = maximum number of issues per excel sheet (plus headers) in result report
MAX_ERRORS_PER_RULE = maximum number of errors to report per rule during a validation run.
Collaborator


Here the env variable is MAX_ERRORS_PER_RULE, but the README file says MAX_REPORT_ERRORS. Could you please update the README?

Collaborator Author


It was incorrect in the README. MAX_ERRORS_PER_RULE is correct--I believe that is more descriptive.

Collaborator

@RamilCDISC RamilCDISC left a comment


The PR adds two new features to the engine. Users can now set the maximum number of rows in the reports and also configure the maximum number of errors for one rule. The validation was done by:

  1. Reviewing the code for updated logic and correctness.
  2. Validating the code to ensure no unwanted code or comments remain.
  3. Validating that all unit tests, regression tests, and other automated tests pass.
  4. Running validation manually for -mr flag with a high value.
  5. Running validation manually for -mr flag with a low value.
  6. Running validation manually for -mr flag set to 0.
  7. Running validation manually for -me flag with a high value.
  8. Running validation manually for -me flag with a low value.
  9. Running validation manually for -me flag set to 0.
  10. Running validation using both the -mr and -me flags to ensure they integrate properly together.

@RamilCDISC RamilCDISC merged commit a0ffebe into main Oct 12, 2025
6 checks passed
@RamilCDISC RamilCDISC deleted the limit branch October 12, 2025 20:58
