Benchmark baseline testing - Reports and test data #881

@gerrycampion

Description

Reporting

For each of the following, we want the type, name, number of calls, cumulative time, mean time, median time, min time, and max time. For example:

    Rule #       Dataset   Time (s)
    CORE-00001   LB        5.1

    Rule #       # of Calls   Cumulative Time (s)   Mean Time (s)   Median Time (s)   Min Time (s)   Max Time (s)
    CORE-00001   5            5.1                   1.1             1.0               0.5            8
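
A minimal sketch of how these columns could be derived from raw per-call timings; the class and field names are illustrative, not part of the engine:

    from dataclasses import dataclass
    from statistics import mean, median
    from typing import List

    @dataclass
    class RuleTimingSummary:
        """Aggregates raw per-call durations (in seconds) into the report columns."""
        rule_id: str
        durations: List[float]

        def as_row(self) -> dict:
            return {
                "rule": self.rule_id,
                "calls": len(self.durations),
                "cumulative": sum(self.durations),
                "mean": mean(self.durations),
                "median": median(self.durations),
                "min": min(self.durations),
                "max": max(self.durations),
            }
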
  • Time taken for each dataset #908
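
    One way to capture this, as a minimal sketch: a small context manager that records named wall-clock timings. time_block and the timings store are hypothetical names, not existing engine APIs.

    import time
    from collections import defaultdict
    from contextlib import contextmanager

    timings = defaultdict(list)  # phase name -> list of durations in seconds

    @contextmanager
    def time_block(name):
        # Record wall-clock time of the enclosed block under `name`.
        start = time.perf_counter()
        try:
            yield
        finally:
            timings[name].append(time.perf_counter() - start)

    # Hypothetical usage around one dataset's validation:
    # with time_block(f"dataset:{dataset_path}"):
    #     validate_dataset(rule, dataset_path)  # placeholder for the engine call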

  • Time taken to preprocess each dataset #909

    dataset = deepcopy(dataset)
    # preprocess dataset
    dataset_preprocessor = DatasetPreprocessor(
        dataset, domain, dataset_path, self.data_service, self.cache
    )
    dataset = dataset_preprocessor.preprocess(rule_copy, datasets)
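
    Reusing the hypothetical time_block helper from the sketch above, the preprocessing step could be timed at the call site:

    with time_block(f"preprocess:{dataset_path}"):
        dataset = dataset_preprocessor.preprocess(rule_copy, datasets)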

  • Time taken for each Operator #910
    Refer to the feature "valid_codelist operator working, needs extensibility logic" (#898): dataframe_operators.py has a new wrapper that can be used.
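
    As an illustration of that wrapper pattern only (names here are assumptions, not the actual dataframe_operators.py API), a decorator can record each operator call into the same hypothetical timings store from the sketch above:

    import functools
    import time

    def timed_operator(func):
        # Record execution time of each operator call under the function's name.
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                timings[f"operator:{func.__name__}"].append(
                    time.perf_counter() - start
                )
        return wrapper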

  • Time taken for each Operation #911

    result = self._execute_operation()
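
    A sketch of timing this call site with the same hypothetical helper:

    with time_block(f"operation:{type(self).__name__}"):
        result = self._execute_operation()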

  • Time taken for each Dataset Builder #912

    kwargs = {}
    builder = self.get_dataset_builder(rule, dataset_path, datasets, domain)
    dataset = builder.get_dataset()
    # Update rule for certain rule types
    # SPECIAL CASES FOR RULE TYPES ###############################
    # TODO: Handle these special cases better.
    if self.library_metadata:
        kwargs[
            "variable_codelist_map"
        ] = self.library_metadata.variable_codelist_map
        kwargs[
            "codelist_term_maps"
        ] = self.library_metadata.get_all_ct_package_metadata()
    if rule.get("rule_type") == RuleTypes.DEFINE_ITEM_METADATA_CHECK.value:
        if self.library_metadata:
            kwargs[
                "variable_codelist_map"
            ] = self.library_metadata.variable_codelist_map
            kwargs[
                "codelist_term_maps"
            ] = self.library_metadata.get_all_ct_package_metadata()
    elif (
        rule.get("rule_type")
        == RuleTypes.VARIABLE_METADATA_CHECK_AGAINST_DEFINE.value
    ):
        self.rule_processor.add_comparator_to_rule_conditions(
            rule, comparator=None, target_prefix="define_"
        )
    elif (
        rule.get("rule_type")
        == RuleTypes.VALUE_LEVEL_METADATA_CHECK_AGAINST_DEFINE.value
    ):
        value_level_metadata: List[dict] = self.get_define_xml_value_level_metadata(
            dataset_path, domain
        )
        kwargs["value_level_metadata"] = value_level_metadata
    elif (
        rule.get("rule_type")
        == RuleTypes.DATASET_CONTENTS_CHECK_AGAINST_DEFINE_AND_LIBRARY.value
    ):
        library_metadata: dict = self.library_metadata.variables_metadata.get(
            domain, {}
        )
        define_metadata: List[dict] = builder.get_define_xml_variables_metadata()
        targets: List[
            str
        ] = self.data_processor.filter_dataset_columns_by_metadata_and_rule(
            dataset.columns.tolist(), define_metadata, library_metadata, rule
        )
        rule_copy = deepcopy(rule)
        updated_conditions = RuleProcessor.duplicate_conditions_for_all_targets(
            rule_copy["conditions"], targets
        )
        rule_copy["conditions"].set_conditions(updated_conditions)
        # When duplicating conditions,
        # rule should be copied to prevent updates to concurrent rule executions
        return self.execute_rule(
            rule_copy, dataset, dataset_path, datasets, domain, **kwargs
        )
    kwargs["ct_packages"] = list(self.ct_packages)
    logger.info(f"Using dataset build by: {builder.__class__}")
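
    To attribute time to each builder, the builder call above could be wrapped with the hypothetical helper from the first sketch:

    builder = self.get_dataset_builder(rule, dataset_path, datasets, domain)
    with time_block(f"builder:{builder.__class__.__name__}"):
        dataset = builder.get_dataset()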

  • Any remaining time unaccounted for #913
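
    A sketch of deriving the residual once the named phases are recorded (assumes the hypothetical timings store above):

    def unaccounted_time(total_elapsed):
        # Total wall time minus everything attributed to a named phase.
        accounted = sum(sum(durations) for durations in timings.values())
        return total_elapsed - accounted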

  • Telemetry reports (OpenTelemetry) #914

    • Need to ensure that these reports do not contain any data or proprietary information
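
    A minimal sketch of emitting one span per rule with the OpenTelemetry Python SDK; the tracer name and attribute key are illustrative, and the comment marks the privacy constraint above:

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("benchmark-baseline")

    with tracer.start_as_current_span("validate_rule") as span:
        # Attach only non-proprietary identifiers; never dataset contents.
        span.set_attribute("rule.id", "CORE-00001")
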
  • Test Data #915
    Test data should include:

    • A set of small datasets and define-xml (Nick's certification data)
    • A set containing at least one large (~1GB) dataset
    • All currently published rules, given as a list of rules
  • Output #916

    • JSON will be the primary output
    • JSON converted to Excel
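
    A minimal sketch of the Excel conversion, assuming the primary JSON output is a list of row objects like the summary table above (file names are placeholders):

    import json
    import pandas as pd

    with open("benchmark_report.json") as f:
        rows = json.load(f)

    # One worksheet containing the per-rule summary table.
    pd.DataFrame(rows).to_excel("benchmark_report.xlsx", index=False)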

Labels: large datasets, testing (Unit, regression, performance, QA, test automation)