Merged
28 changes: 24 additions & 4 deletions README.md
Expand Up @@ -57,7 +57,7 @@ containing a list of the concept's children or parents in the hierarchy.
- Call `bias.display_concept_tree(parent_concept_tree)` and `bias.display_concept_tree(children_concept_tree)` to display
the concept hierarchical tree in an indented text format. If ipytree widget is installed and supported in a Jupyter notebook
environment, you can set `show_in_text_format` input parameter to `False`
(e.g., call `bias.display_concept_tree(parent_concept_tree, show_in_text_format=False)`)to leverage the tree widget for displaying
(e.g., call `bias.display_concept_tree(parent_concept_tree, show_in_text_format=False)`) to leverage the tree widget for displaying
the hierarchy in a tree that can be expanded and collapsed on demand interactively.

In addition to exploring the concepts using BiasAnalyzer APIs, the main functionality of BiasAnalyzer is
Expand Down Expand Up @@ -88,15 +88,35 @@ The following code snippets show some examples.
```
Note that the `get_stats()` method currently returns statistics only for the age, gender, race, and ethnicity of a cohort,
and the `get_distributions()` method currently returns distributions only for age and gender in a cohort.
- You can also get patient counts and prevalence with each diagnostic condition concept code in a cohort by accessing
- You can also explore concept prevalence within a cohort - a key step in identifying potential biases during
cohort selection. A concept refers to a coded term from a standardized medical vocabulary, uniquely identified by a
concept ID. All clinical events in OMOP, such as conditions, drug exposures, procedures, measurements, and events, are
represented as concepts. You can get patient counts and prevalence associated with each concept by accessing
the method `get_concept_stats()` with a code snippet example shown below.
```angular2html
cohort_concepts = baseline_cohort_data.get_concept_stats()
cohort_concepts = baseline_cohort_data.get_concept_stats(concept_type='condition_occurrence')
print(pd.DataFrame(cohort_concepts["condition_occurrence"]))
```
- You can also compare the distributions of two cohorts by calling `bias.compare_cohorts(cohort1_id, cohort2_id)`,
where `cohort1_id` and `cohort2_id` are integers that can be obtained from the metadata of a cohort object. Currently,
only Hellinger distances between the distributions of the two cohorts are computed.

- After all analysis is done, please make sure to close database connections and do necessary cleanups by calling
the API method `bias.cleanup()`.
the API method `bias.cleanup()`.
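
For readers unfamiliar with the Hellinger distance mentioned above, the following is a minimal sketch of its textbook definition for two discrete distributions. It is not taken from the BiasAnalyzer source: the function name `hellinger_distance` and the dict-based input format are assumptions made for illustration, and how `bias.compare_cohorts()` bins or normalizes cohort data is not shown in this diff.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions.

    p and q are dicts mapping a category (e.g. a gender or an age bin) to its
    probability. This input format is an illustrative assumption, not the
    BiasAnalyzer API. A category missing from one dict has probability 0.
    """
    categories = set(p) | set(q)
    squared_diff = sum(
        (math.sqrt(p.get(c, 0.0)) - math.sqrt(q.get(c, 0.0))) ** 2
        for c in categories
    )
    # Dividing by sqrt(2) keeps the distance in the range [0, 1].
    return math.sqrt(squared_diff) / math.sqrt(2)


baseline = {"female": 0.5, "male": 0.5}
study = {"female": 0.8, "male": 0.2}
print(hellinger_distance(baseline, study))
```

A distance of 0 means the two distributions are identical, and a distance of 1 means they share no categories with non-zero probability.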

---

## 📘 Tutorial Notebooks

To help users get started with the `BiasAnalyzer` Python package, four Jupyter notebooks are
provided in the [`notebooks/`](https://github.com/VACLab/BiasAnalyzer/tree/main/notebooks)
directory. These tutorials walk users through key features and workflows with illustrative examples.

| Tutorial | Description |
|----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| [BiasAnalyzerCohortsTutorial.ipynb](https://github.com/VACLab/BiasAnalyzer/blob/main/notebooks/BiasAnalyzerCohortsTutorial.ipynb) | Demonstrates how to create baseline and study cohorts, retrieve cohort statistics, and compare cohort distributions. |
| [BiasAnalyzerAsyncCohortsTutorial.ipynb](https://github.com/VACLab/BiasAnalyzer/blob/main/notebooks/BiasAnalyzerAsyncCohortsTutorial.ipynb) | As a companion to the Cohort tutorial above, demonstrates how to create and analyze cohorts asynchronously for improved performance and responsiveness when working with large datasets or complex cohort definitions. |
| [BiasAnalyzerCohortConceptTutorial.ipynb](https://github.com/VACLab/BiasAnalyzer/blob/main/notebooks/BiasAnalyzerCohortConceptTutorial.ipynb) | Demonstrates how to explore clinical concept prevalence within a cohort and use it to identify potential cohort selection biases. |
| [BiasAnalyzerConceptBrowsingTutorial.ipynb](https://github.com/VACLab/BiasAnalyzer/blob/main/notebooks/BiasAnalyzerConceptBrowsingTutorial.ipynb) | Guides users through browsing OMOP concepts, domains, and vocabularies, including how to retrieve and visualize concept hierarchies. |

These tutorials are designed to run in a Jupyter environment with access to an OMOP-compatible PostgreSQL or DuckDB database.
20 changes: 9 additions & 11 deletions biasanalyzer/api.py
Expand Up @@ -72,7 +72,6 @@ def _set_cohort_action(self):
return self.cohort_action

def get_domains_and_vocabularies(self):
print(f'self.omop_cdm_db: {self.omop_cdm_db}')
if self.omop_cdm_db is None:
notify_users('A valid OMOP CDM must be set before getting domains. '
'Call set_root_omop first to set a valid root OMOP CDM')
Expand All @@ -96,19 +95,18 @@ def get_concept_hierarchy(self, concept_id):
return None
return self.omop_cdm_db.get_concept_hierarchy(concept_id)

def display_concept_tree(self, concept_tree: dict, level: int = 0, show_in_text_format=True, tree_type=None):
def display_concept_tree(self, concept_tree: dict, level: int = 0, show_in_text_format=True):
"""
Recursively prints the concept hierarchy tree in an indented format for display.
"""
details = concept_tree.get("details", {})
if tree_type is None or tree_type not in ['parents', 'children']:
if 'parents' in concept_tree:
tree_type = 'parents'
elif 'children' in concept_tree:
tree_type = 'children'
else:
notify_users('The input concept tree must contain parents or children key as the type of the tree.')
return ''
if 'parents' in concept_tree:
tree_type = 'parents'
elif 'children' in concept_tree:
tree_type = 'children'
else:
notify_users('The input concept tree must contain parents or children key as the type of the tree.')
return ''

if show_in_text_format:
if details:
Expand All @@ -119,7 +117,7 @@ def display_concept_tree(self, concept_tree: dict, level: int = 0, show_in_text_

for child in concept_tree.get(tree_type, []):
if child:
self.display_concept_tree(child, level + 1, tree_type=tree_type, show_in_text_format=True)
self.display_concept_tree(child, level + 1, show_in_text_format=True)
# return an empty string to prevent None from being printed at the end of the printout
return ""
else:
Expand Down
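The change above drops the `tree_type` argument and infers the direction from whichever of the `parents` or `children` keys is present at each node. A minimal standalone sketch of that idea follows. It returns the indented lines instead of printing them, and raises `ValueError` where the real method calls `notify_users`. The function name `render_concept_tree` and the `concept_name` and `concept_id` keys inside `details` are assumptions for illustration, since the diff does not show the shape of `details`.

```python
def render_concept_tree(concept_tree, level=0):
    """Return the concept hierarchy as a list of indented text lines."""
    # Infer the tree direction from the keys present on this node.
    if "parents" in concept_tree:
        tree_type = "parents"
    elif "children" in concept_tree:
        tree_type = "children"
    else:
        raise ValueError("The concept tree must contain a 'parents' or 'children' key.")

    lines = []
    details = concept_tree.get("details", {})
    if details:
        # The keys inside 'details' are hypothetical, chosen for this sketch only.
        label = f"{details.get('concept_name')} (ID: {details.get('concept_id')})"
        lines.append("  " * level + label)

    # Recurse into each non-empty node one indentation level deeper.
    for node in concept_tree.get(tree_type, []):
        if node:
            lines.extend(render_concept_tree(node, level + 1))
    return lines


tree = {
    "details": {"concept_name": "Diabetes mellitus", "concept_id": 1},
    "children": [
        {"details": {"concept_name": "Type 2 diabetes mellitus", "concept_id": 2}, "children": []},
    ],
}
print("\n".join(render_concept_tree(tree)))
```

Because every node carries its own `parents` or `children` key (leaf nodes hold an empty list), the direction does not need to be passed down through the recursion.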
4 changes: 3 additions & 1 deletion biasanalyzer/cohort.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ def __init__(self, cohort_id: int, bias_db: BiasDatabase, omop_db: OMOPCDMDataba
self.omop_db = omop_db
self._cohort_data = None # cache the cohort data
self._metadata = None
self.query_builder = CohortQueryBuilder(cohort_creation=False)

@property
def data(self):
Expand Down Expand Up @@ -55,6 +56,7 @@ def get_concept_stats(self, concept_type='condition_occurrence', filter_count=0,
Get cohort concept statistics such as concept prevalence
"""
cohort_stats = self.bias_db.get_cohort_concept_stats(self.cohort_id,
self.query_builder,
concept_type=concept_type,
filter_count=filter_count,
vocab=vocab,
Expand Down Expand Up @@ -106,7 +108,7 @@ def create_cohort(self, cohort_name: str, description: str, query_or_yaml_file:
notify_users(f'cohort creation configuration yaml file is not valid with validation error: {ex}')
return None

query = self._query_builder.build_query(cohort_config)
query = self._query_builder.build_query_cohort_creation(cohort_config)
else:
query = clean_string(query_or_yaml_file)
progress.update(1)
Expand Down
57 changes: 43 additions & 14 deletions biasanalyzer/cohort_query_builder.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@


class CohortQueryBuilder:
def __init__(self):
def __init__(self, cohort_creation=True):
"""Get the path to SQL templates, whether running from source or installed."""
try:
if sys.version_info >= (3, 9): # pragma: no cover
Expand All @@ -19,12 +19,13 @@ def __init__(self):
except ModuleNotFoundError: # pragma: no cover
template_path = os.path.join(os.path.dirname(__file__), "sql_templates")

print(f'template_path: {template_path}')
print(f'template_path: {template_path}, cohort_creation: {cohort_creation}')
self.env = Environment(loader=FileSystemLoader(template_path), extensions=['jinja2.ext.do'])
self.env.globals.update(
demographics_filter=self._load_macro('demographics_filter'),
temporal_event_filter=self.temporal_event_filter
)
if cohort_creation:
self.env.globals.update(
demographics_filter=self._load_macro('demographics_filter'),
temporal_event_filter=self.temporal_event_filter
)

def _extract_domains(self, events):
domains = set()
Expand All @@ -42,16 +43,11 @@ def _load_macro(self, macro_name):
macros_template = self.env.get_template('macros.sql.j2')
return macros_template.module.__dict__[macro_name]


def build_query(self, cohort_config: dict) -> str:
def build_query_cohort_creation(self, cohort_config: dict) -> str:
"""
Build a SQL query from the CohortCreationConfig object.

Args:
cohort_config: dict object loaded from yaml file for building sql query.

Returns:
str: The rendered SQL query.
:param cohort_config: dict object loaded from yaml file for building sql query.
:return: The rendered SQL query.
"""
inclusion_criteria = cohort_config.get('inclusion_criteria')
exclusion_criteria = cohort_config.get('exclusion_criteria', {})
Expand All @@ -75,6 +71,39 @@ def build_query(self, cohort_config: dict) -> str:
temporal_events=temporal_events
)

def build_concept_prevalence_query(self, concept_type: str, cid: int, filter_count: int, vocab: str,
include_hierarchy: bool) -> str:
"""
Build a SQL query for concept prevalence statistics for a given domain and cohort.
:param concept_type: Domain from DOMAIN_MAPPING (e.g., 'condition_occurrence').
:param cid: Cohort definition ID.
:param filter_count: Minimum count threshold for concepts with 0 meaning no filtering
:param vocab: Vocabulary ID. Defaults to domain-specific vocabulary as defined in DOMAIN_MAPPING if set to None
:param include_hierarchy: Include concept hierarchy in results or not
:return: The rendered SQL query
:raises ValueError: if concept_type is invalid
"""

# Validate concept_type
if concept_type not in DOMAIN_MAPPING or DOMAIN_MAPPING[concept_type]["table"] is None:
valid_domains = [k for k in DOMAIN_MAPPING.keys() if DOMAIN_MAPPING[k]["table"] is not None]
raise ValueError(f"Invalid concept_type: {concept_type}. Must be one of {valid_domains}")

# The provided vocab is assumed to be already validated if it is not set to None. Otherwise,
# if set to None, use domain-specific default vocabulary
effective_vocab = vocab if vocab is not None else DOMAIN_MAPPING[concept_type]["default_vocab"]
# Load and render the template
template = self.env.get_template("cohort_concept_prevalence_query.sql.j2")
return template.render(
table_name=DOMAIN_MAPPING[concept_type]["table"],
concept_id_column=DOMAIN_MAPPING[concept_type]["concept_id"],
start_date_column=DOMAIN_MAPPING[concept_type]["start_date"],
cid=cid,
filter_count=filter_count,
vocab=effective_vocab,
include_hierarchy=include_hierarchy
)

@staticmethod
def render_event(event):
"""
Expand Down
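The validation and default-vocabulary fallback in `build_concept_prevalence_query` can be sketched without the Jinja2 template, which is not shown in this diff. The snippet below only builds the dictionary of values that would be handed to `template.render(...)`. The helper name `resolve_prevalence_params` is hypothetical, and the trimmed `DOMAIN_MAPPING` copies just three entries from `biasanalyzer/models.py` for illustration.

```python
# Trimmed copy of DOMAIN_MAPPING with only the keys this sketch needs.
DOMAIN_MAPPING = {
    "condition_occurrence": {
        "table": "condition_occurrence",
        "concept_id": "condition_concept_id",
        "start_date": "condition_start_date",
        "default_vocab": "SNOMED",
    },
    "measurement": {
        "table": "measurement",
        "concept_id": "measurement_concept_id",
        "start_date": "measurement_date",
        "default_vocab": "LOINC",
    },
    "date": {  # special case with no backing table
        "table": None,
        "concept_id": None,
        "start_date": "timestamp",
        "default_vocab": None,
    },
}


def resolve_prevalence_params(concept_type, cid, filter_count=0, vocab=None,
                              include_hierarchy=False):
    """Validate concept_type and return the values a template would receive."""
    entry = DOMAIN_MAPPING.get(concept_type)
    if entry is None or entry["table"] is None:
        valid_domains = [k for k, v in DOMAIN_MAPPING.items() if v["table"] is not None]
        raise ValueError(f"Invalid concept_type: {concept_type}. Must be one of {valid_domains}")

    return {
        "table_name": entry["table"],
        "concept_id_column": entry["concept_id"],
        "start_date_column": entry["start_date"],
        "cid": cid,
        "filter_count": filter_count,
        # Fall back to the domain-specific vocabulary only when vocab is None.
        "vocab": vocab if vocab is not None else entry["default_vocab"],
        "include_hierarchy": include_hierarchy,
    }


print(resolve_prevalence_params("measurement", cid=1))
```

Keeping the per-domain default vocabulary in `DOMAIN_MAPPING` means a newly supported domain needs only a mapping entry rather than a dedicated query constant.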
39 changes: 17 additions & 22 deletions biasanalyzer/database.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
from sqlalchemy.orm import sessionmaker
from sqlalchemy.exc import SQLAlchemyError
from sqlalchemy import create_engine, text
from biasanalyzer.models import Cohort, CohortDefinition
from biasanalyzer.models import CohortDefinition
from biasanalyzer.sql import *
from biasanalyzer.utils import build_concept_hierarchy, print_hierarchy, find_roots, notify_users

Expand All @@ -22,16 +22,6 @@ class BiasDatabase:
"race": RACE_STATS_QUERY,
"ethnicity": ETHNICITY_STATS_QUERY
}
cohort_concept_queries = {
'condition_occurrence': {
'query': COHORT_CONCEPT_CONDITION_PREVALENCE_QUERY,
'default_vocab': 'SNOMED'
},
'drug_exposure': {
'query': COHORT_CONCEPT_DRUG_PREVALENCE_QUERY,
'default_vocab': 'RxNorm'
}
}
_instance = None # indicating a singleton with only one instance of the class ever created
def __new__(cls, *args, **kwargs):
if cls._instance is None:
Expand Down Expand Up @@ -142,7 +132,7 @@ def get_cohort(self, cohort_definition_id):
return [dict(zip(headers, row)) for row in rows]

def _create_omop_table(self, table_name):
if self.omop_cdm_db_url is not None and not self.omop_cdm_db_url.endswith('.duckdb'):
if self.omop_cdm_db_url is not None and not self.omop_cdm_db_url.endswith('duckdb'):
# need to create person table from OMOP CDM postgreSQL database
self.conn.execute(f"""
CREATE TABLE IF NOT EXISTS {table_name} AS
Expand Down Expand Up @@ -237,25 +227,29 @@ def get_cohort_distributions(self, cohort_definition_id: int, variable: str):
notify_users(f"Error computing cohort {variable} distributions: {e}", level='error')
return None

def get_cohort_concept_stats(self, cohort_definition_id: int,
def get_cohort_concept_stats(self, cohort_definition_id: int, qry_builder,
concept_type='condition_occurrence', filter_count=0, vocab=None,
include_hierarchy=False):
"""
Get concept statistics for a cohort from the cohort table.
"""
concept_stats = {}
if concept_type not in self.__class__.cohort_concept_queries:
notify_users(f"input {concept_type} is not a valid concept type. "
f"Supported concept types are: {self.__class__.cohort_concept_queries.keys()}", level='error')
return concept_stats

try:
if (self._create_omop_table('concept') and self._create_omop_table('concept_ancestor')
and self._create_omop_table(concept_type)):
query_str = self.__class__.cohort_concept_queries[concept_type]['query']
if not vocab:
vocab = self.__class__.cohort_concept_queries[concept_type]['default_vocab']
query = query_str.format(cid=cohort_definition_id, filter_count=filter_count,
vocab=vocab, include_hierarchy=include_hierarchy)
# validate input vocab if it is not None
if vocab is not None:
valid_vocabs = self._execute_query("SELECT distinct vocabulary_id FROM concept")
valid_vocab_ids = [row['vocabulary_id'] for row in valid_vocabs]
if vocab not in valid_vocab_ids:
notify_users(f"input {vocab} is not a valid vocabulary in OMOP. "
f"Supported vocabulary ids are: {valid_vocab_ids}",
level='error')
return concept_stats

query = qry_builder.build_concept_prevalence_query(concept_type, cohort_definition_id,
filter_count, vocab, include_hierarchy)
concept_stats[concept_type] = self._execute_query(query)
cs_df = pd.DataFrame(concept_stats[concept_type])
# Combine concept_name and prevalence into a "details" column
Expand Down Expand Up @@ -510,6 +504,7 @@ def get_concept_hierarchy(self, concept_id: int):
ancestor_id, {"details": concept_details[ancestor_id], "parents": []})
desc_entry_rev["parents"].append(ancestor_entry_rev)
progress.update(1)
progress.close()

# Return the parent hierarchy and children hierarchy of the specified concept
return reverse_hierarchy[concept_id], hierarchy[concept_id]
Expand Down
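The new vocabulary check in `get_cohort_concept_stats` looks up the distinct `vocabulary_id` values in the `concept` table and rejects any other input. The sketch below reproduces that check against an in-memory SQLite database, which is only a stand-in for the DuckDB connection used by `BiasDatabase`. The helper name `is_valid_vocab` and the two-column `concept` table are assumptions for illustration.

```python
import sqlite3


def is_valid_vocab(conn, vocab):
    """Return True if vocab is None or appears as a vocabulary_id in concept."""
    if vocab is None:
        # None means "use the domain default", so there is nothing to validate.
        return True
    rows = conn.execute("SELECT DISTINCT vocabulary_id FROM concept").fetchall()
    valid_vocab_ids = {row[0] for row in rows}
    return vocab in valid_vocab_ids


# A tiny stand-in for the OMOP concept table (the real table has more columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concept (concept_id INTEGER, vocabulary_id TEXT)")
conn.executemany(
    "INSERT INTO concept VALUES (?, ?)",
    [(1, "SNOMED"), (2, "RxNorm"), (3, "SNOMED")],
)

print(is_valid_vocab(conn, "SNOMED"))
print(is_valid_vocab(conn, "NotAVocabulary"))
```

Validating against the database rather than a hard-coded list lets the check follow whichever vocabularies are actually loaded in the OMOP instance.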
7 changes: 7 additions & 0 deletions biasanalyzer/models.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,42 +9,49 @@
"concept_id": "condition_concept_id",
"start_date": "condition_start_date",
"end_date": "condition_end_date",
"default_vocab": "SNOMED" # for use by concept prevalence query
},
"drug_exposure": {
"table": "drug_exposure",
"concept_id": "drug_concept_id",
"start_date": "drug_exposure_start_date",
"end_date": "drug_exposure_end_date",
"default_vocab": "RxNorm" # for use by concept prevalence query
},
"procedure_occurrence": {
"table": "procedure_occurrence",
"concept_id": "procedure_concept_id",
"start_date": "procedure_date",
"end_date": "procedure_date",
"default_vocab": "SNOMED" # for use by concept prevalence query
},
"visit_occurrence": {
"table": "visit_occurrence",
"concept_id": "visit_concept_id",
"start_date": "visit_start_date",
"end_date": "visit_end_date",
"default_vocab": "SNOMED" # for use by concept prevalence query
},
"measurement": {
"table": "measurement",
"concept_id": "measurement_concept_id",
"start_date": "measurement_date",
"end_date": "measurement_date",
"default_vocab": "LOINC" # for use by concept prevalence query
},
"observation": {
"table": "observation",
"concept_id": "observation_concept_id",
"start_date": "observation_date",
"end_date": "observation_date",
"default_vocab": "SNOMED" # for use by concept prevalence query
},
"date": { # Special case for static timestamps
"table": None,
"concept_id": None,
"start_date": "timestamp",
"end_date": "timestamp",
"default_vocab": None
}
}

Expand Down