
Conversation

RamilCDISC
Collaborator

This pull request adds a performance test to the `tests` folder. The performance test can be run with the following command:
python tests/PerformanceTest.py -dd path/to/folder/with/datasets -rd path/to/folder/with/rule/files -total_calls 1
The total_calls argument defines how many times each rule is executed. The report will be saved in the directory from which the test file was executed.

Collaborator


Can you add the new CLI instructions to the README?

README.md Outdated
To execute the performance test, navigate to the root directory of the project and run the following command:

```sh
python tests/PerformanceTest.py -dd <DATASET_DIRECTORY> -rd <RULES_DIRECTORY> -total_calls <NUMBER_OF_CALLS> -od <OUTPUT_DIRECTORY>
```
Collaborator


If this is the same as the -d, --data TEXT and -lr, --local_rules TEXT args in the validate command, it would be good to use the same arg names.
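
A minimal sketch of what the aligned argument names could look like, assuming the script parses its flags with argparse (the parser setup below is illustrative, not the script's actual code):

```python
import argparse

# Sketch only: reuse the validate command's flag names instead of -dd / -rd.
parser = argparse.ArgumentParser(description="Performance test runner")
parser.add_argument("-d", "--data", required=True,
                    help="Directory containing the dataset files (.json or .xpt)")
parser.add_argument("-lr", "--local_rules", required=True,
                    help="Directory containing the rule files")
parser.add_argument("-total_calls", type=int, default=1,
                    help="Number of times each rule is executed")
args = parser.parse_args()
```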

README.md Outdated

This repository includes a performance testing script located in the `tests` folder under the filename `PerformanceTest.py`. The script is designed to evaluate the execution time of rules against datasets by running multiple test iterations.

### Running the Performance Test
Collaborator


This and the next header should be nested under the Performance Testing header. IOW, use 4 #

README.md Outdated
### Performance Test Command-Line Flags

- **`-dd` (Dataset Directory)**: The directory containing the dataset files in `.json` or `.xpt` format.
Collaborator


Format the documentation similarly to the existing documentation. For example, use a code block for the args.

@gerrycampion gerrycampion added this to the v0.10.0 milestone Feb 19, 2025
@gerrycampion gerrycampion added large datasets testing Unit, regression, performance, QA, test automation labels Feb 20, 2025
@gerrycampion gerrycampion linked an issue Feb 21, 2025 that may be closed by this pull request
@RamilCDISC
Copy link
Collaborator Author

Since the engine has been updated to remove the test column, I will go back over my code and update it to match the updated engine. Putting this back to in progress for now. @SFJohnson24 @gerrycampion

@nickdedonder nickdedonder modified the milestones: v0.10.0, V0.11.0 Apr 28, 2025
Collaborator


One of the unit tests is failing

"core.py",
"test",
"-s",
"sdtmig",
Collaborator


This should be a CLI option.

"-s",
"sdtmig",
"-v",
"3.4",
Collaborator


This should be a CLI option.

for num_call in range(total_calls):
rule_path = os.path.join(rule_dir, rule)
command = [
"python",
Collaborator


This should use sys.executable so that it uses the same Python executable and environment as the caller.

command = [
"python",
"core.py",
"test",
Collaborator


This should be `validate` now.
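
Combining the two suggestions above, the command construction could look roughly like this (a sketch only; the standard and version values are hardcoded here just to keep it self-contained, since the earlier comments ask for them to become CLI options):

```python
import sys

# Values assumed to come from the performance test's own CLI options
# (hardcoded here only to keep the sketch self-contained).
standard = "sdtmig"
version = "3.4"

command = [
    sys.executable,   # same Python executable and environment as the caller
    "core.py",
    "validate",       # the "test" subcommand has been replaced by "validate"
    "-s", standard,
    "-v", version,
]
```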


for num_call in range(total_calls):
rule_path = os.path.join(rule_dir, rule)
command = [
Collaborator


There should be an ability to include a Define-XML.
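
As one possible shape for this, the Define-XML path could be an optional argument appended to the command list from the sketch above when a path is supplied (the flag name below is a placeholder, not necessarily the engine's actual option):

```python
# Hypothetical: an optional Define-XML path appended to the validate command.
define_xml_path = "path/to/define.xml"  # would come from a new CLI option
if define_xml_path:
    command += ["--define-xml-path", define_xml_path]  # placeholder flag name
```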


for num_call in range(total_calls):
rule_path = os.path.join(rule_dir, rule)
command = [
Collaborator


See the previous comments.

return results, rule_results


def all_datset_against_each_rule(dataset_dir, rule_dir, total_calls):
Collaborator


Should be "dataset".

Comment on lines 483 to 499
writer, sheet_name=f"Rule_{sanitized_rule_name[:28]}", index=False
) # Truncate to 31 chars

# Overall collective dataset results
collective_dataset_df = pd.DataFrame(collective_dataset_result)
collective_dataset_df.to_excel(
writer, sheet_name="Collective Dataset Result", index=False
)

# Individual dataset results
for dataset_name, dataset_data in individual_dataset_result.items():
sanitized_dataset_name = re.sub(
r"[\\/*?:[\]]", "_", dataset_name
) # Replace invalid characters with '_'
dataset_df = pd.DataFrame(dataset_data)
dataset_df.to_excel(
writer, sheet_name=f"Dataset_{sanitized_dataset_name[:28]}", index=False
Collaborator


I'm still getting a warning that some titles are too long. The math might be off.
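
Excel caps sheet names at 31 characters, and the prefix counts toward that cap: "Rule_" is 5 characters, so a 28-character slice gives 33, and "Dataset_" plus 28 gives 36. A small helper that truncates the combined name would avoid the warning (a sketch, not the PR's code):

```python
MAX_SHEET_NAME = 31  # Excel's hard limit on sheet name length

def make_sheet_name(prefix: str, name: str) -> str:
    """Build a sheet name that never exceeds Excel's 31-character limit."""
    return f"{prefix}{name}"[:MAX_SHEET_NAME]

# "Rule_" (5 chars) leaves room for 26 name characters, "Dataset_" (8) for 23.
print(make_sheet_name("Rule_", "A" * 40))     # 31 characters
print(make_sheet_name("Dataset_", "B" * 40))  # 31 characters
```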

Comment on lines 253 to 263
for rule in rules:
rule_name = os.path.basename(rule)

# Initialize variables to collect times for the dataset across all rules
all_time_taken = []
all_preprocessing_times = []
all_operator_times = {}
all_operation_times = []

rule_names = []
for dataset_path in dataset_files:
Collaborator


It looks like this method is essentially running the same set of commands as the previous function, just in a different order. I think it is redundant.
I don't think it supports rules that require multiple datasets joined together, since only a single dataset is being passed at a time. I think the timing feedback needs to be given from within the engine. Maybe an additional logging mechanism can be used.
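
As a purely illustrative sketch of that idea (the names here are not the engine's actual API), per-step timing could be emitted from inside the engine through the standard logging module instead of being measured around the subprocess call:

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("performance")

@contextmanager
def timed(step: str):
    """Log how long a named step took, e.g. preprocessing or an operator."""
    start = time.perf_counter()
    try:
        yield
    finally:
        logger.info("%s took %.3f s", step, time.perf_counter() - start)

# Hypothetical usage inside the engine:
# with timed("preprocessing"):
#     preprocess(dataset)
```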

@RamilCDISC RamilCDISC requested a review from gerrycampion May 12, 2025 19:26
@OGarcia11 OGarcia11 modified the milestones: V0.12.0, v1.0.0 Jul 28, 2025