138 changes: 138 additions & 0 deletions clinical_data_in_arcus.md
@@ -0,0 +1,138 @@
<!--
link: https://storage.googleapis.com/chop-dbhi-arcus-education-website-assets/css/styles.css
script: https://kit.fontawesome.com/83b2343bd4.js
title: Clinical Data and the ADR
-->

# Clinical Data in Arcus

In Arcus, we aim to make it simpler to access and work with clinical data for research purposes. This guide discusses the Arcus Data Repository: why it exists, some of what it contains, and how to get more information.

## Clinical data

For our purposes, when we say "clinical data", we mean data from the electronic health record that relates to patient encounters. We are not including data collected on research subjects for the purpose of research, even if these research subjects are also patients of CHOP. Please note, however, that data collected for research is also available through Arcus, as is any dataset available in CHOP's enterprise data catalog (https://gene.chop.edu).

Clinical data includes information about patients, medication administrations, procedures, and many, many other entities and encounters at CHOP!

There are quite a few names and acronyms related to clinical data at CHOP, some of which are listed below:

- EHR: Electronic health record
- Epic: The EHR that CHOP uses; this is where clinical data is recorded
- Hyperspace: Epic's user interface
- Chronicles: Real-time Epic database (the data you see in Hyperspace lives in Chronicles)
- Clarity: Reporting database for Epic
- Helix: CHOP's new cloud-based data warehouse
- ADR: Arcus Data Repository


## The Arcus Data Repository

Researchers who want to perform retrospective analyses on clinical data can face a variety of problems:

* The purposes are not the same: Clinical data is not collected with research in mind!
* Data organization: The organizational systems that make patient care more efficient might be quite different from those needed for research.
* Data access/privacy: Not all clinical data should be made available for all types of research.
* Performance: The database that stores patient data for clinical use (Chronicles) would be very inefficient for returning research-relevant information. Also, crucially, we can't risk burdening Chronicles with computationally intensive research queries, because doing so could pose a safety risk to patients.

These are some of the problems that the Arcus Data Repository (ADR) aims to address. So what is the ADR?

* A relational database, stored in Google BigQuery, of the most frequently requested EHR data, pulled from Clarity
* A source of identified or de-identified datasets, depending on your needs and regulatory status (IRB, non-human subjects, etc.)

**Important note**: This does not mean that the data are "pre-cleaned"! Data will still be messy or incomplete.

## Clinical data journey to the ADR

![Journey of clinical data at CHOP, beginning in Epic Chronicles, flowing into Epic Clarity, and from there into two branches, the CHOP Data Warehouse or the Arcus Data Repository.](media/chop_clinical_data_overview_updated.png)
Collaborator

We're not referencing the CDW anymore, I thought?

Contributor Author

Oh yep correct, missed this one!

Collaborator

Contributor Author

I did have an updated graphic I just forgot to copy to the media folder... but I think I like yours better for this anyway!


CHOP stores a _vast_ amount of clinical data, and it is incredibly complex! For efficient storage and access, data from Epic Chronicles is uploaded to Clarity (a SQL database) nightly. This database is where the ADR gets its data. The ADR documentation in the CHOP data catalog contains information about the lineage of the data; [check out this lineage of the contact date field in the encounter table](https://chop.alationcloud.com/attribute/933634/lineage/) as an example.

While Epic does have some data analysis tools, they are not built with research as the primary focus. The ADR _is_ designed for research, and because it is a curated list, it is much easier to find and deliver exactly what you need to answer your research question.
Collaborator

I think you're contrasting Hyperspace (slicerdicer, etc.) here with ADR, but we might want to make that explicit and also contrast ADR with Clarity and/or Helix, maybe as a separate point.


### Relational databases

The ADR is a relational database, stored in Google BigQuery. But what is a relational database?

A **relational database** is a storage solution in which tables are related by columns they have in common.

Because medical data contains many **one-to-many** relationships (for example, one patient can have many encounters) and a lot of interrelated information, relational databases are an efficient way to store these data. The process of organizing data into related tables so that each piece of information is stored in only one place is called **normalization**.
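
As a rough illustration of tables related by a shared column, here is a minimal Python sketch. It uses an in-memory SQLite database purely as a stand-in (the ADR itself lives in Google BigQuery), and it assumes that `encounter` carries a `pat_id` column linking back to `patient`. The `sex` and `encounter_type` columns and every value are made up for the example.

```python
import sqlite3

# In-memory SQLite database used only as a self-contained stand-in;
# the real ADR is stored in Google BigQuery.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, heavily simplified versions of two tables.
    CREATE TABLE patient (pat_id TEXT PRIMARY KEY, sex TEXT);
    CREATE TABLE encounter (
        encounter_id TEXT PRIMARY KEY,
        pat_id TEXT REFERENCES patient (pat_id),  -- assumed shared column
        encounter_type TEXT                       -- hypothetical column
    );
    INSERT INTO patient VALUES ('P1', 'F'), ('P2', 'M');
    INSERT INTO encounter VALUES
        ('E1', 'P1', 'Office visit'),
        ('E2', 'P1', 'Phone call'),
        ('E3', 'P2', 'Surgery');
""")

# Relate the two tables through the column they have in common.
rows = conn.execute("""
    SELECT p.pat_id, p.sex, e.encounter_id, e.encounter_type
    FROM patient AS p
    JOIN encounter AS e ON e.pat_id = p.pat_id
    ORDER BY e.encounter_id
""").fetchall()

for row in rows:
    print(row)
```

Patient `P1` appears once in `patient` but twice in the joined result, once per encounter. That is the one-to-many relationship: demographic details are stored a single time and reused for every related encounter.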

<div class = "learn-more">
<b style="color: rgb(var(--color-highlight));">Learning connection</b><br>

For more information about this topic, see our [database normalization module](https://liascript.github.io/course/?https://raw.githubusercontent.com/arcus/education_modules/main/database_normalization/database_normalization.md#1).

</div>

## Exploring the ADR

The ADR currently contains more than 50 tables, and it continues to grow. It represents a "greatest hits" selection of the most commonly requested data. There are identified and de-identified versions available, depending on your research needs.

You can read more about the ADR and look at some of the metadata about the tables in the ADR in the [CHOP Data Catalog](https://chop.alationcloud.com/data/23/).

The next few sections will go through two of the central tables in the ADR, to which many others connect: Patients and Encounters.
Collaborator

I think these next few sections are probably the most valuable piece for folks reading this content. I expect the two big questions folks will have in their mind for this content are "what data are available in the ADR" and "where do I find X in the ADR". To that end, we may want to significantly expand this section, to cover more than just patients and encounters, although doing so would be a big lift and we may not have bandwidth.

Here's the suggestion I had put on the PR in the notebooks repo, which I think applies here as well:

Instead of an ERD, what about a few examples that start with a general research question and then point to where those data would be? For example:

- incidence of asthma dx for pts who have vs have not received covid vaccine (where is dx info stored? where is vaccine history?)
- impact of SDOH (e.g. child opportunity index) on rehospitalization after surgery (where is geospatial info? where are surgery encounters? where are hospitalizations?)
- comparison of over- vs. under-weight BMI in referrals to nutrition specialists (where is BMI? where are referrals?)

For each example, I think the "where are the data" would be just at the level of table: "BMI for a given encounter is in flowsheets, as are a lot of other common measurements like vital signs. A surgery is an encounter, as is a hospitalization; you can filter by encounter type. And for inpatient encounters, there's lots of additional information in the ADT tables. etc." but no more detailed than that.
And also: I personally don't know with confidence where to find all these data points, so I'm definitely not expecting that you would either! :) But if you like this approach, I'd be happy to work with you on fleshing out some examples like these and get the ADR team to vet.


## Patient

The `patient` table in the ADR contains information about patient demographics, such as:

- Age
- Race
- Sex
- Contact information

The entity being represented is **patients**. The patient table contains one patient per row, with their unique identifiers and demographic information. Each row is identified by a unique `pat_id`, assigned by Arcus (this is not the same as the MRN). All patients included in the ADR have had at least one CHOP encounter.
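
One quick way to confirm the "one patient per row" property is to compare the total row count with the number of distinct `pat_id` values. The sketch below does this in Python against a toy in-memory SQLite table. The `sex` column and all values are invented, and against the real ADR the same SQL would be run in BigQuery instead.

```python
import sqlite3

# Toy stand-in for the `patient` table; values are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (pat_id TEXT, sex TEXT);
    INSERT INTO patient VALUES ('P1', 'F'), ('P2', 'M'), ('P3', 'F');
""")

# If every row is a different patient, these two counts are equal.
total_rows, distinct_patients = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT pat_id) FROM patient"
).fetchone()

one_row_per_patient = total_rows == distinct_patients
print(total_rows, distinct_patients, one_row_per_patient)
```

A check like this is worth running before joining to other tables, since an unexpected duplicate key would silently multiply rows in the joined result.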

### Patient: Entity relationship diagram

Below is an [entity relationship diagram](https://www.lucidchart.com/pages/er-diagrams#discoveryTop) for the `patient` table, which illustrates the relationships between the `patient` table and other tables. If this isn't helpful to you, feel free to move on ahead!

![Entity relationship diagram of the patient table in the ADR.](media/patient_erd.png)


## Encounter

The `encounter` table in the ADR contains information about encounters at CHOP. The term "encounter" can mean things like:

- Office visits
- Phone calls
- Surgeries

But there are many other types of encounters.

Each row is a single encounter (patients can have more than one!), and each row is identified by a unique `encounter_id`. The table contains information about the patient who had the encounter, the time, date, place, duration, admission status, and much more. The table also includes canceled visits and no-shows.
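
Because canceled visits and no-shows are stored alongside completed encounters, counts per patient usually need a filter. The following Python sketch shows the idea on a toy in-memory SQLite table. The `status` column name and its values (`'Completed'`, `'Canceled'`, `'No Show'`) are hypothetical placeholders, so check the data catalog for the actual field and codes.

```python
import sqlite3

# Toy stand-in for the `encounter` table; `status` is a hypothetical column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE encounter (
        encounter_id TEXT PRIMARY KEY,
        pat_id TEXT,
        status TEXT
    );
    INSERT INTO encounter VALUES
        ('E1', 'P1', 'Completed'),
        ('E2', 'P1', 'Canceled'),
        ('E3', 'P1', 'Completed'),
        ('E4', 'P2', 'No Show');
""")

# All encounters per patient, including canceled visits and no-shows.
all_counts = dict(conn.execute(
    "SELECT pat_id, COUNT(*) FROM encounter GROUP BY pat_id"
).fetchall())

# Only encounters that actually took place (under the assumed status values).
attended_counts = dict(conn.execute("""
    SELECT pat_id, COUNT(*)
    FROM encounter
    WHERE status NOT IN ('Canceled', 'No Show')
    GROUP BY pat_id
""").fetchall())

print(all_counts)
print(attended_counts)
```

In this toy data, patient `P2` drops out of the filtered result entirely because their only encounter was a no-show, which is the kind of difference that can change a cohort size.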

Besides the `encounter` table itself, there are several other tables in the encounter domain:

- `encounter_adr`
- `encounter_chief_complaint`
- `encounter_reason`
- `encounter_diagnosis`

These tables contain additional information related to specific encounters, such as admissions or diagnoses. For more information, check out the [ADR in the CHOP data catalog](https://chop.alationcloud.com/data/23/).
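
As a sketch of how one of these tables might be combined with `encounter`, the Python example below joins toy versions of `encounter` and `encounter_diagnosis` on `encounter_id`. It assumes that `encounter_diagnosis` shares the `encounter_id` key, and the `diagnosis_name` column and all values are hypothetical. SQLite is used only so the example is self-contained.

```python
import sqlite3

# Toy stand-ins for two encounter-domain tables; columns are simplified.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE encounter (encounter_id TEXT PRIMARY KEY, pat_id TEXT);
    CREATE TABLE encounter_diagnosis (
        encounter_id TEXT,     -- assumed link back to `encounter`
        diagnosis_name TEXT    -- hypothetical column
    );
    INSERT INTO encounter VALUES ('E1', 'P1'), ('E2', 'P2');
    INSERT INTO encounter_diagnosis VALUES
        ('E1', 'Asthma'),
        ('E1', 'Eczema');
""")

# LEFT JOIN keeps encounters that have no diagnosis rows (diagnosis is NULL).
rows = conn.execute("""
    SELECT e.encounter_id, e.pat_id, d.diagnosis_name
    FROM encounter AS e
    LEFT JOIN encounter_diagnosis AS d ON d.encounter_id = e.encounter_id
    ORDER BY e.encounter_id, d.diagnosis_name
""").fetchall()

for row in rows:
    print(row)
```

A `LEFT JOIN` is used here so that encounters without any recorded diagnosis still appear in the result. An inner `JOIN` would drop them, which may or may not be what a given research question needs.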

### Encounter: Entity relationship diagram
Collaborator

I don't know how helpful these diagrams really are... if they were interactive (click on a table and then it centers on that table, etc.) like the diagrams in Caboodle, I think it would be a lot more valuable, and then it would also potentially provide a way for you to sort of "walk through" the full data model, by clicking from one table to the next and seeing how they all can potentially connect.
But it's a little weird to me that the ERD here isn't complete -- they don't show all the tables you can link to from encounter, just a few examples, so it's not providing a birdseye view of the ADR's structure. I dunno. What do you think? I think maybe what I would prefer would be for there to be a complete ERD for every single blessed table in the ADR, and for the ERD to be included on the Gene page for that table. But to me, I think showing an incomplete ERD just underscores that it's a relational database -- you can join tables together by keys -- it doesn't necessarily communicate the data model itself.

Contributor Author

I think you're probably right--I actually don't really like them and don't find them helpful, but I do wish we had more visuals that would be useful. We can play around with other options though.

Collaborator

We could include screenshots of the cohort discovery tool and/or clinical data finder, potentially.


![Entity relationship diagram of the encounter table in the ADR.](media/encounter_erd.png)

## Getting more information

This was just a brief overview of the clinical data in the Arcus Data Repository, but you may have more questions! There are a few places you can get more information:

- [Explore data in Arcus](https://arcus.chop.edu/i-want-to/explore-data)
- [Book time with Arcus Education](https://outlook.office365.com/owa/calendar/[email protected]/bookings/)
- [Book time with the Arcus data team](https://outlook.office365.com/owa/calendar/[email protected]/bookings/)

## Starting a research project with Arcus

So what do you need to do next if you're interested in conducting research on CHOP's clinical data with Arcus?

- Confirm your Arcus access (CHOP credentials, [CITI training](https://forum.arcus.chop.edu/t/citi-training-requirement-for-arcus/174), [Arcus terms of use](https://arcus.chop.edu/terms-of-use)).

- Know at least [a little SQL](https://liascript.github.io/course/?https://raw.githubusercontent.com/arcus/education_modules/main/sql_basics/sql_basics.md#1), or hire someone who does.

- (Recommended) Use the [Cohort Discovery Tool](https://arcus.chop.edu/apps/cohort-discovery) to check the feasibility of your project.

- [Submit a request to Arcus!](https://pm.arcus.chop.edu/servicedesk/customer/portal/6/create/307)


Binary file added media/chop_clinical_data_overview_updated.png
Binary file added media/encounter_erd.png
Binary file added media/patient_erd.png