VACLab · hyi · Jun 20, 2025 · Jun 15, 2025 · Jun 16, 2025 · Jun 17, 2025
diff --git a/README.md b/README.md
@@ -57,7 +57,7 @@ containing a list of the concept's children or parents in the hierarchy.
 - Call `bias.display_concept_tree(parent_concept_tree)` and `bias.display_concept_tree(children_concept_tree)` to display 
 the concept hierarchical tree in an indented text format. If ipytree widget is installed and supported in a Jupyter notebook 
 environment, you can set `show_in_text_format` input parameter to `False` 
-(e.g., call `bias.display_concept_tree(parent_concept_tree,  show_in_text_format=False)`)to leverage the tree widget for displaying 
+(e.g., call `bias.display_concept_tree(parent_concept_tree,  show_in_text_format=False)`) to leverage the tree widget for displaying 
 the hierarchy in a tree that can be expanded and collapsed on demand interactively.   
 
 In addition to exploring the concepts using BiasAnalyzer APIs, the main functionalities of the BiasAnalyzer is 
@@ -88,15 +88,35 @@ The following code snippets show some examples.
   ```
   Note that currently the `get_stats()` method only returns statistics of age, gender, race, and ethinicity of a cohort 
 and `get_distributions()` method only returns distribution of age and gender in a cohort.
-- You can also get patient counts and prevalence with each diagnostic condition concept code in a cohort by accessing 
+- You can also explore concept prevalence within a cohort - a key step in identifying potential biases during 
+cohort selection. A concept refers to a coded term from a standardized medical vocabulary, uniquely identified by a 
+concept ID. All clinical events in OMOP, such as conditions, drug exposures, procedures, measurements, and events, are 
+represented as concepts. You can get patient counts and prevalence associated with each concept by accessing 
 the method `get_concept_stats()` with a code snippet example shown below.
   ```angular2html
-    cohort_concepts = baseline_cohort_data.get_concept_stats()
+    cohort_concepts = baseline_cohort_data.get_concept_stats(concept_type='condition_occurrence')
     print(pd.DataFrame(cohort_concepts["condition_occurrence"]))
   ```
 - There is also an API method that enables users to compare distributions of two cohorts by calling `bias.compare_cohorts(cohort1_id, cohort2_id)` 
 where cohort1_id and cohort2_id are integers and can be obtained from metadata of a cohort object. Currently, 
 only hellinger distances between distributions of two cohorts are computed.
 
 - After all analysis is done, please make sure to close database connections and do necessary cleanups by calling 
-the API method `bias.cleanup()`.
+the API method `bias.cleanup()`.
+
+---
+
+## 📘 Tutorial Notebooks
+
+To help users get started with the `BiasAnalyzer` Python package, four Jupyter notebooks are 
+provided in the [`notebooks/`](https://github.com/VACLab/BiasAnalyzer/tree/main/notebooks) 
+directory. These tutorials walk users through key features and workflows with illustrative examples.
+
+| Tutorial | Description                                                                                                                                                                                                           |
+|----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [BiasAnalyzerCohortsTutorial.ipynb](https://github.com/VACLab/BiasAnalyzer/blob/main/notebooks/BiasAnalyzerCohortsTutorial.ipynb) | Demonstrates how to create baseline and study cohorts, retrieve cohort statistics, and compare cohort distributions.                                                                                                  |
+| [BiasAnalyzerAsyncCohortsTutorial.ipynb](https://github.com/VACLab/BiasAnalyzer/blob/main/notebooks/BiasAnalyzerAsyncCohortsTutorial.ipynb) | As a companion to the Cohort tutorial above, demonstrates how to create and analyze cohorts asynchronously for improved performance and responsiveness when working with large datasets or complex cohort definitions. |
+| [BiasAnalyzerCohortConceptTutorial.ipynb](https://github.com/VACLab/BiasAnalyzer/blob/main/notebooks/BiasAnalyzerCohortConceptTutorial.ipynb) | Demonstrates how to explore clinical concept prevalence within a cohort, helping users identify potential cohort selection biases.                                                                                     |
+| [BiasAnalyzerConceptBrowsingTutorial.ipynb](https://github.com/VACLab/BiasAnalyzer/blob/main/notebooks/BiasAnalyzerConceptBrowsingTutorial.ipynb) | Guides users through browsing OMOP concepts, domains, and vocabularies, including how to retrieve and visualize concept hierarchies.                                                                                  |
+
+These tutorials are designed to run in a Jupyter environment with access to an OMOP-compatible PostgreSQL or DuckDB database. 
diff --git a/biasanalyzer/api.py b/biasanalyzer/api.py
@@ -72,7 +72,6 @@ def _set_cohort_action(self):
         return self.cohort_action
 
     def get_domains_and_vocabularies(self):
-        print(f'self.omop_cdm_db: {self.omop_cdm_db}')
         if self.omop_cdm_db is None:
             notify_users('A valid OMOP CDM must be set before getting domains. '
                          'Call set_root_omop first to set a valid root OMOP CDM')
@@ -96,19 +95,18 @@ def get_concept_hierarchy(self, concept_id):
             return None
         return self.omop_cdm_db.get_concept_hierarchy(concept_id)
 
-    def display_concept_tree(self, concept_tree: dict, level: int = 0, show_in_text_format=True, tree_type=None):
+    def display_concept_tree(self, concept_tree: dict, level: int = 0, show_in_text_format=True):
         """
         Recursively prints the concept hierarchy tree in an indented format for display.
         """
         details = concept_tree.get("details", {})
-        if tree_type is None or tree_type not in ['parents', 'children']:
-            if 'parents' in concept_tree:
-                tree_type = 'parents'
-            elif 'children' in concept_tree:
-                tree_type = 'children'
-            else:
-                notify_users('The input concept tree must contain parents or children key as the type of the tree.')
-                return ''
+        if 'parents' in concept_tree:
+            tree_type = 'parents'
+        elif 'children' in concept_tree:
+            tree_type = 'children'
+        else:
+            notify_users('The input concept tree must contain parents or children key as the type of the tree.')
+            return ''
 
         if show_in_text_format:
             if details:
@@ -119,7 +117,7 @@ def display_concept_tree(self, concept_tree: dict, level: int = 0, show_in_text_
 
             for child in concept_tree.get(tree_type, []):
                 if child:
-                    self.display_concept_tree(child, level + 1, tree_type=tree_type, show_in_text_format=True)
+                    self.display_concept_tree(child, level + 1, show_in_text_format=True)
             # return empty string to print None being printed at the end of printout
             return ""
         else:
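
With `tree_type` removed from the signature, `display_concept_tree` now infers the direction of the tree from each node's own keys. The following is a minimal sketch of that inference, assuming every node carries a `parents` or `children` key (as the nodes built in `get_concept_hierarchy` appear to). The helper names `infer_tree_type` and `render_tree` and the `name (id)` line format are hypothetical, chosen for illustration only; the actual method prints its output and notifies the user on invalid input.

```python
def infer_tree_type(concept_tree: dict):
    """Return 'parents' or 'children' based on the keys present, else None."""
    if 'parents' in concept_tree:
        return 'parents'
    if 'children' in concept_tree:
        return 'children'
    return None


def render_tree(concept_tree: dict, level: int = 0) -> list:
    """Collect indented text lines for a concept tree (hypothetical line format)."""
    tree_type = infer_tree_type(concept_tree)
    if tree_type is None:
        # The real method calls notify_users here; the sketch just yields nothing.
        return []
    lines = []
    details = concept_tree.get("details", {})
    if details:
        lines.append("  " * level + f"{details.get('concept_name')} ({details.get('concept_id')})")
    # Each child is inspected on its own, so no tree_type argument is threaded through.
    for child in concept_tree.get(tree_type, []):
        if child:
            lines.extend(render_tree(child, level + 1))
    return lines


example_tree = {
    "details": {"concept_name": "Diabetes mellitus", "concept_id": 1},
    "children": [
        {"details": {"concept_name": "Type 2 diabetes mellitus", "concept_id": 2}, "children": []},
    ],
}
print("\n".join(render_tree(example_tree)))
```

One consequence of per-node inference is that a node lacking both keys is treated as invalid, so leaf nodes need an empty `parents` or `children` list.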

diff --git a/biasanalyzer/cohort.py b/biasanalyzer/cohort.py
@@ -18,6 +18,7 @@ def __init__(self, cohort_id: int, bias_db: BiasDatabase, omop_db: OMOPCDMDataba
         self.omop_db = omop_db
         self._cohort_data = None # cache the cohort data
         self._metadata = None
+        self.query_builder = CohortQueryBuilder(cohort_creation=False)
 
     @property
     def data(self):
@@ -55,6 +56,7 @@ def get_concept_stats(self, concept_type='condition_occurrence', filter_count=0,
         Get cohort concept statistics such as concept prevalence
         """
         cohort_stats = self.bias_db.get_cohort_concept_stats(self.cohort_id,
+                                                             self.query_builder,
                                                              concept_type=concept_type,
                                                              filter_count=filter_count,
                                                              vocab=vocab,
@@ -106,7 +108,7 @@ def create_cohort(self, cohort_name: str, description: str, query_or_yaml_file:
                 notify_users(f'cohort creation configuration yaml file is not valid with validation error: {ex}')
                 return None
 
-            query = self._query_builder.build_query(cohort_config)
+            query = self._query_builder.build_query_cohort_creation(cohort_config)
         else:
             query = clean_string(query_or_yaml_file)
         progress.update(1)

diff --git a/biasanalyzer/cohort_query_builder.py b/biasanalyzer/cohort_query_builder.py
@@ -6,7 +6,7 @@
 
 
 class CohortQueryBuilder:
-    def __init__(self):
+    def __init__(self, cohort_creation=True):
         """Get the path to SQL templates, whether running from source or installed."""
         try:
             if sys.version_info >= (3, 9): # pragma: no cover
@@ -19,12 +19,13 @@ def __init__(self):
         except ModuleNotFoundError: # pragma: no cover
             template_path = os.path.join(os.path.dirname(__file__), "sql_templates")
 
-        print(f'template_path: {template_path}')
+        print(f'template_path: {template_path}, cohort_creation: {cohort_creation}')
         self.env = Environment(loader=FileSystemLoader(template_path), extensions=['jinja2.ext.do'])
-        self.env.globals.update(
-            demographics_filter=self._load_macro('demographics_filter'),
-            temporal_event_filter=self.temporal_event_filter
-        )
+        if cohort_creation:
+            self.env.globals.update(
+                demographics_filter=self._load_macro('demographics_filter'),
+                temporal_event_filter=self.temporal_event_filter
+            )
 
     def _extract_domains(self, events):
         domains = set()
@@ -42,16 +43,11 @@ def _load_macro(self, macro_name):
         macros_template = self.env.get_template('macros.sql.j2')
         return macros_template.module.__dict__[macro_name]
 
-
-    def build_query(self, cohort_config: dict) -> str:
+    def build_query_cohort_creation(self, cohort_config: dict) -> str:
         """
         Build a SQL query from the CohortCreationConfig object.
-
-        Args:
-            cohort_config: dict object loaded from yaml file for building sql query.
-
-        Returns:
-            str: The rendered SQL query.
+        :param cohort_config: dict object loaded from yaml file for building sql query.
+        :return: The rendered SQL query.
         """
         inclusion_criteria = cohort_config.get('inclusion_criteria')
         exclusion_criteria = cohort_config.get('exclusion_criteria', {})
@@ -75,6 +71,39 @@ def build_query(self, cohort_config: dict) -> str:
             temporal_events=temporal_events
         )
 
+    def build_concept_prevalence_query(self, concept_type: str, cid: int, filter_count: int, vocab: str,
+                                       include_hierarchy: bool) -> str:
+        """
+        Build a SQL query for concept prevalence statistics for a given domain and cohort.
+        :param concept_type: Domain from DOMAIN_MAPPING (e.g., 'condition_occurrence').
+        :param cid: Cohort definition ID.
+        :param filter_count: Minimum count threshold for concepts, with 0 meaning no filtering.
+        :param vocab: Vocabulary ID. If None, defaults to the domain-specific vocabulary defined in DOMAIN_MAPPING.
+        :param include_hierarchy: Whether to include the concept hierarchy in the results.
+        :return: The rendered SQL query.
+        :raises ValueError: if concept_type is not a valid domain.
+        """
+
+        # Validate concept_type
+        if concept_type not in DOMAIN_MAPPING or DOMAIN_MAPPING[concept_type]["table"] is None:
+            valid_domains = [k for k in DOMAIN_MAPPING.keys() if DOMAIN_MAPPING[k]["table"] is not None]
+            raise ValueError(f"Invalid concept_type: {concept_type}. Must be one of {valid_domains}")
+
+        # The provided vocab is assumed to be already validated if it is not set to None. Otherwise,
+        # if set to None, use domain-specific default vocabulary
+        effective_vocab = vocab if vocab is not None else DOMAIN_MAPPING[concept_type]["default_vocab"]
+        # Load and render the template
+        template = self.env.get_template("cohort_concept_prevalence_query.sql.j2")
+        return template.render(
+            table_name=DOMAIN_MAPPING[concept_type]["table"],
+            concept_id_column=DOMAIN_MAPPING[concept_type]["concept_id"],
+            start_date_column=DOMAIN_MAPPING[concept_type]["start_date"],
+            cid=cid,
+            filter_count=filter_count,
+            vocab=effective_vocab,
+            include_hierarchy=include_hierarchy
+        )
+
     @staticmethod
     def render_event(event):
         """

diff --git a/biasanalyzer/database.py b/biasanalyzer/database.py
@@ -6,7 +6,7 @@
 from sqlalchemy.orm import sessionmaker
 from sqlalchemy.exc import SQLAlchemyError
 from sqlalchemy import create_engine, text
-from biasanalyzer.models import Cohort, CohortDefinition
+from biasanalyzer.models import CohortDefinition
 from biasanalyzer.sql import *
 from biasanalyzer.utils import build_concept_hierarchy, print_hierarchy, find_roots, notify_users
 
@@ -22,16 +22,6 @@ class BiasDatabase:
         "race": RACE_STATS_QUERY,
         "ethnicity": ETHNICITY_STATS_QUERY
     }
-    cohort_concept_queries = {
-        'condition_occurrence': {
-            'query': COHORT_CONCEPT_CONDITION_PREVALENCE_QUERY,
-            'default_vocab': 'SNOMED'
-        },
-        'drug_exposure': {
-            'query': COHORT_CONCEPT_DRUG_PREVALENCE_QUERY,
-            'default_vocab': 'RxNorm'
-        }
-    }
     _instance = None  # indicating a singleton with only one instance of the class ever created
     def __new__(cls, *args, **kwargs):
         if cls._instance is None:
@@ -142,7 +132,7 @@ def get_cohort(self, cohort_definition_id):
         return [dict(zip(headers, row)) for row in rows]
 
     def _create_omop_table(self, table_name):
-        if self.omop_cdm_db_url is not None and not self.omop_cdm_db_url.endswith('.duckdb'):
+        if self.omop_cdm_db_url is not None and not self.omop_cdm_db_url.endswith('duckdb'):
             # need to create person table from OMOP CDM postgreSQL database
             self.conn.execute(f"""
                 CREATE TABLE IF NOT EXISTS {table_name} AS 
@@ -237,25 +227,29 @@ def get_cohort_distributions(self, cohort_definition_id: int, variable: str):
             notify_users(f"Error computing cohort {variable} distributions: {e}", level='error')
             return None
 
-    def get_cohort_concept_stats(self, cohort_definition_id: int,
+    def get_cohort_concept_stats(self, cohort_definition_id: int, qry_builder,
                                  concept_type='condition_occurrence', filter_count=0, vocab=None,
                                  include_hierarchy=False):
         """
         Get concept statistics for a cohort from the cohort table.
         """
         concept_stats = {}
-        if concept_type not in self.__class__.cohort_concept_queries:
-            notify_users(f"input {concept_type} is not a valid concept type. "
-                         f"Supported concept types are: {self.__class__.cohort_concept_queries.keys()}", level='error')
-            return concept_stats
+
         try:
             if (self._create_omop_table('concept') and self._create_omop_table('concept_ancestor')
                     and self._create_omop_table(concept_type)):
-                query_str = self.__class__.cohort_concept_queries[concept_type]['query']
-                if not vocab:
-                    vocab = self.__class__.cohort_concept_queries[concept_type]['default_vocab']
-                query = query_str.format(cid=cohort_definition_id, filter_count=filter_count,
-                                         vocab=vocab, include_hierarchy=include_hierarchy)
+                # validate input vocab if it is not None
+                if vocab is not None:
+                    valid_vocabs = self._execute_query("SELECT distinct vocabulary_id FROM concept")
+                    valid_vocab_ids = [row['vocabulary_id'] for row in valid_vocabs]
+                    if vocab not in valid_vocab_ids:
+                        notify_users(f"input {vocab} is not a valid vocabulary in OMOP. "
+                                     f"Supported vocabulary ids are: {valid_vocab_ids}",
+                                     level='error')
+                        return concept_stats
+
+                query = qry_builder.build_concept_prevalence_query(concept_type, cohort_definition_id,
+                                                                   filter_count, vocab, include_hierarchy)
                 concept_stats[concept_type] = self._execute_query(query)
                 cs_df = pd.DataFrame(concept_stats[concept_type])
                 # Combine concept_name and prevalence into a "details" column
@@ -510,6 +504,7 @@ def get_concept_hierarchy(self, concept_id: int):
                 ancestor_id, {"details": concept_details[ancestor_id], "parents": []})
             desc_entry_rev["parents"].append(ancestor_entry_rev)
         progress.update(1)
+        progress.close()
 
         # Return the parent hierarchy and children hierarchy of the specified concept
         return reverse_hierarchy[concept_id], hierarchy[concept_id]
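
The vocabulary check added to `get_cohort_concept_stats` compares the user-supplied `vocab` with the distinct `vocabulary_id` values in the `concept` table. A minimal sketch of that check follows, using an in-memory `sqlite3` database as a stand-in for the DuckDB connection. The helper name `is_valid_vocab` and the sample rows are hypothetical, and rows come back here as tuples rather than the dictionaries that `_execute_query` returns.

```python
import sqlite3


def is_valid_vocab(conn, vocab: str) -> bool:
    """Return True if vocab matches a vocabulary_id in the concept table (case-sensitive)."""
    rows = conn.execute("SELECT DISTINCT vocabulary_id FROM concept").fetchall()
    return vocab in {row[0] for row in rows}


# Stand-in concept table with a few illustrative rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concept (concept_id INTEGER, vocabulary_id TEXT)")
conn.executemany(
    "INSERT INTO concept VALUES (?, ?)",
    [(1, "SNOMED"), (2, "RxNorm"), (3, "SNOMED")],
)

print(is_valid_vocab(conn, "SNOMED"))
print(is_valid_vocab(conn, "snomed"))
```

Because the comparison is an exact string match, a caller passing `snomed` instead of `SNOMED` would be rejected under this scheme.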

diff --git a/biasanalyzer/models.py b/biasanalyzer/models.py
@@ -9,42 +9,49 @@
         "concept_id": "condition_concept_id",
         "start_date": "condition_start_date",
         "end_date": "condition_end_date",
+        "default_vocab": "SNOMED"  # for use by concept prevalence query
     },
     "drug_exposure": {
         "table": "drug_exposure",
         "concept_id": "drug_concept_id",
         "start_date": "drug_exposure_start_date",
         "end_date": "drug_exposure_end_date",
+        "default_vocab": "RxNorm"  # for use by concept prevalence query
     },
     "procedure_occurrence": {
         "table": "procedure_occurrence",
         "concept_id": "procedure_concept_id",
         "start_date": "procedure_date",
         "end_date": "procedure_date",
+        "default_vocab": "SNOMED"  # for use by concept prevalence query
     },
     "visit_occurrence": {
         "table": "visit_occurrence",
         "concept_id": "visit_concept_id",
         "start_date": "visit_start_date",
         "end_date": "visit_end_date",
+        "default_vocab": "SNOMED"  # for use by concept prevalence query
     },
     "measurement": {
         "table": "measurement",
         "concept_id": "measurement_concept_id",
         "start_date": "measurement_date",
         "end_date": "measurement_date",
+        "default_vocab": "LOINC"  # for use by concept prevalence query
     },
     "observation": {
         "table": "observation",
         "concept_id": "observation_concept_id",
         "start_date": "observation_date",
         "end_date": "observation_date",
+        "default_vocab": "SNOMED"  # for use by concept prevalence query
     },
     "date": {  # Special case for static timestamps
         "table": None,
         "concept_id": None,
         "start_date": "timestamp",
        "end_date": "timestamp",
+        "default_vocab": None
     }
 }
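
The `default_vocab` entries above feed the domain validation and vocabulary fallback in `build_concept_prevalence_query`. A minimal sketch of that lookup follows, using a trimmed copy of `DOMAIN_MAPPING` and omitting the Jinja2 rendering step. The function name `resolve_prevalence_params` is hypothetical; the returned keys mirror the template arguments shown in the diff.

```python
# Trimmed copy of DOMAIN_MAPPING for illustration; the full mapping lives in models.py.
DOMAIN_MAPPING = {
    "condition_occurrence": {
        "table": "condition_occurrence",
        "concept_id": "condition_concept_id",
        "start_date": "condition_start_date",
        "default_vocab": "SNOMED",
    },
    "measurement": {
        "table": "measurement",
        "concept_id": "measurement_concept_id",
        "start_date": "measurement_date",
        "default_vocab": "LOINC",
    },
    "date": {"table": None, "concept_id": None, "start_date": "timestamp", "default_vocab": None},
}


def resolve_prevalence_params(concept_type: str, vocab=None) -> dict:
    """Validate the domain and pick the effective vocabulary (hypothetical helper)."""
    entry = DOMAIN_MAPPING.get(concept_type)
    if entry is None or entry["table"] is None:
        valid_domains = [k for k, v in DOMAIN_MAPPING.items() if v["table"] is not None]
        raise ValueError(f"Invalid concept_type: {concept_type}. Must be one of {valid_domains}")
    return {
        "table_name": entry["table"],
        "concept_id_column": entry["concept_id"],
        "start_date_column": entry["start_date"],
        # Fall back to the domain default only when no vocabulary is supplied.
        "vocab": vocab if vocab is not None else entry["default_vocab"],
    }


print(resolve_prevalence_params("measurement"))
```

The `date` entry has no table, so it is rejected as a concept type even though it is a key in the mapping.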