update #1359
Conversation
Could you please see the comment below. Also, will the rule editor have a way to set the number of records in the report?
```python
        Create multiple workbooks when data exceeds Excel's row limit.
        """
        workbooks = []
        detailed_chunks = self._chunk_data(detailed_data, self.MAX_ROWS_PER_SHEET)
```
Could you lowercase `max_rows_per_sheet` here, as it is defined in the constructor.
The rule editor will not have a way to set the size--this is purely a feature for the CLI and the executable. The editor uses Excel data and should not receive a dataset with enough issues to reach Excel's row limit.
> Could you lowercase `max_rows_per_sheet` here, as it is defined in the constructor.

This is done.
- I ran multiple validations with different test cases. I found that the issue summary page is left empty from part 2 onwards; only the first part shows content on the issue summary page.
- Could we make the env variable name and the CLI flag the same?
- Is the rules report sheet omitted from the max-rows check?
I changed this so that if the summary is shorter than the row limit, it is printed in every workbook. If it is longer than the limit, each workbook is populated as much as possible but may leave blanks.

Done.

Yes--I could add this if we think it is necessary. The number of issues and the issue reports seemed like the logical things to split, as those can get quite long. I didn't think the rules report would grow to the point of needing to be split across multiple workbooks.
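For context, the splitting behavior described in this thread could be sketched roughly as below. This is a hypothetical `chunk_data` helper, not the engine's actual `_chunk_data`; a summary shorter than the row limit can then simply be written into every workbook, while a longer one is itself chunked and may leave later parts partially blank.

```python
def chunk_data(rows, max_rows_per_sheet):
    """Split issue rows into chunks that each fit on one sheet.

    A limit of 0 (or None) disables splitting and yields a single chunk.
    One workbook is then produced per chunk.
    """
    if not max_rows_per_sheet:
        return [rows]
    return [
        rows[i:i + max_rows_per_sheet]
        for i in range(0, len(rows), max_rows_per_sheet)
    ]
```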
I ran a validation using the test_dataset.xslx file in the test/resources folder. I set the standard to sdtmig, the version to 3.4, and -me to 1. The Issue Details tab of the generated report shows the record for CORE-000206 80 times for the same dataset, suppec.xpt. What I understood from the README is that when the max error limit is met or exceeded, the engine will cease execution of that rule on that dataset. This violates the expected behavior.

The report file is attached below.
@RamilCDISC -- the limit was met or exceeded for a single dataset. For suppec, it produced 80 errors, which is > 1, so it did not validate any of the other datasets against that rule. If you look at the results, only one dataset per rule is validated: once the first dataset is validated with issues, it exceeds the limit, and validation for that rule ceases for the remaining datasets. This is the expected behavior based on what you just pasted. Setting a limit and stopping validation mid-dataset is much more difficult than stopping after each dataset runs--which is why it is 'met or exceeded'. The engine runs a dataset against a rule, checks the errors returned, and sees if the count is at or over the limit.
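The "met or exceeded" semantics described above — the error count is only checked after each dataset finishes — can be sketched like this (hypothetical function and data shapes, not the engine's actual code):

```python
def validate_rule(rule, datasets, max_errors_per_rule=1000):
    """Run one rule against each dataset in turn.

    The error count is checked only *after* a dataset completes, so a
    single dataset can overshoot the limit (e.g. 80 errors with -me 1)
    before the remaining datasets are skipped. A limit of 0 disables
    the cap entirely.
    """
    results = []
    for dataset in datasets:
        errors = rule(dataset)       # a full-dataset run is not interrupted
        results.extend(errors)
        if max_errors_per_rule and len(results) >= max_errors_per_rule:
            break                    # remaining datasets are skipped
    return results
```

With a limit of 1, a first dataset producing 80 errors still reports all 80, but no further datasets are validated against that rule.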
I read your last comment. Thank you for making it clearer.

Could you please see the comment below. It's the final change, and then we are good to merge.
```
DATASET_SIZE_THRESHOLD = size_in_bytes to force dask implementation
MAX_REPORT_ROWS = maximum number of issues per excel sheet (plus headers) in result report
MAX_ERRORS_PER_RULE = maximum number of errors to report per rule during a validation run.
```
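A minimal sketch of how such environment variables could be read follows. The `read_limit` helper and the default values are assumptions for illustration; the engine's actual loading code may differ.

```python
import os

def read_limit(name, default):
    """Read an integer limit from the environment.

    A value of 0 means 'disabled' and is returned as None, so callers
    can treat None as 'no limit'.
    """
    raw = os.environ.get(name)
    value = int(raw) if raw is not None else default
    return value or None

# Names taken from the env.example snippet above; defaults are assumed.
max_report_rows = read_limit("MAX_REPORT_ROWS", 1_000_000)
max_errors_per_rule = read_limit("MAX_ERRORS_PER_RULE", 1000)
```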
Here the env variable is MAX_ERRORS_PER_RULE, but the README says MAX_REPORT_ERRORS. Could you please update the README?
It was incorrect in the README. MAX_ERRORS_PER_RULE is correct--I believe that name is more descriptive.
The PR adds two new features to the engine. Users can now set the max number of rows in the reports and also configure the max number of errors for one rule. The validation was done by:
- Reviewing the code for updated logic and correctness.
- Validating the code to ensure no unwanted code or comments remain.
- Validating that all unit tests, regression tests, and other automated tests pass.
- Running validation manually with the -mr flag set to a high value.
- Running validation manually with the -mr flag set to a low value.
- Running validation manually with the -mr flag set to 0.
- Running validation manually with the -me flag set to a high value.
- Running validation manually with the -me flag set to a low value.
- Running validation manually with the -me flag set to 0.
- Running validation using both the -mr and -me flags to ensure they integrate properly together.
This pull request introduces two major new configuration options for validation and reporting: limiting the number of rows per Excel report file and capping the number of errors reported per rule. These options can be set via CLI flags or environment variables, and the code now supports splitting Excel reports into multiple files when data exceeds the configured row limit. The changes are reflected in the CLI, reporting logic, and rule validation flow.

Reporting enhancements
- Added a configurable maximum number of rows per Excel report file (`max_report_rows`). This can be set via the CLI (`-mr`) or the `MAX_REPORT_ROWS` environment variable, and is documented in `README.md` and `env.example`. The logic ensures the larger value between CLI and env is used, and disables the limit if set to 0. [1] [2] [3] [4] [5] [6]
- Updated `ExcelReport` to handle multiple output files, updating file naming and metadata to reflect report parts when splitting occurs. [1] [2]

Validation logic improvements
- Added a configurable maximum number of errors per rule (`max_errors_per_rule`), settable via the CLI (`-me`) or the `MAX_ERRORS_PER_RULE` env variable, with logic to halt validation for a rule once the limit is reached. Defaults to 1000 if unspecified; disables the limit if set to 0. [1] [2] [3] [4] [5] [6]
- Added `set_max_errors_per_rule` to centralize the max-error logic, and integrated it into both CLI and script-based validation flows. [1] [2] [3] [4]

CLI and documentation updates
- Updated the CLI (`core.py`) to accept new flags for controlling report row and error limits, and ensured environment variables are loaded for these options. Also updated test and validation argument handling to support the new parameters. [1] [2] [3] [4] [5] [6] [7] [8]
- Updated `README.md` and `env.example` to document the new options and their behaviors. [1] [2]
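The "larger value between CLI and env wins, 0 disables" rule could be sketched as below. The `resolve_limit` helper is hypothetical, and exactly how a 0 on one side interacts with a value on the other is an assumption here:

```python
def resolve_limit(cli_value, env_value, default):
    """Pick the effective limit from a CLI flag and an env variable.

    Per the summary above: the larger of the two provided values wins,
    and 0 from either source disables the limit (returned as None).
    """
    if cli_value == 0 or env_value == 0:
        return None
    provided = [v for v in (cli_value, env_value) if v is not None]
    return max(provided) if provided else default
```

For `max_errors_per_rule` this would be called with `default=1000`, matching the "defaults to 1000 if unspecified" behavior described above.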