Bulk FHIR transformations that apply a standard set of annotations to the FHIR data models to better support the types of queries used in defining cohorts, calculating quality measures, and performing public health data analysis.
Prototype implementation (unit tests).
- Convert value to UTC
- For partial dates, convert to a start and end with sub-second precision. For example, '2018-05' is populated with a start of '2018-05-01T00:00:00.000Z' and an end of '2018-05-31T23:59:59.999Z', and '2017-03-01' is populated with a start of '2017-03-01T00:00:00.000Z' and an end of '2017-03-01T23:59:59.999Z'
- Instant types should have the same start and end
- Add to resource as {elementName}_aa.start and {elementName}_aa.end
- FHIR Timing elements are ignored at present due to their limited use and the complexity involved in converting them into a date range.
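The date handling above can be sketched in Python as follows. This is a minimal illustration, not the prototype's code: the names `to_range` and `annotate_date` are hypothetical, values without an offset are assumed to already be in UTC, and any full timestamp (not only an instant) is treated as a single point in time.

```python
import calendar
import re
from datetime import datetime, timezone

# Matches the partial FHIR date forms YYYY, YYYY-MM and YYYY-MM-DD.
PARTIAL_DATE = re.compile(r"^(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?$")


def _fmt(dt):
    # Render a UTC datetime with millisecond precision and a trailing "Z".
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 1000:03d}Z"


def to_range(value):
    """Return {"start": ..., "end": ...} for a FHIR date, dateTime or instant."""
    match = PARTIAL_DATE.match(value)
    if match:
        year, month, day = (int(g) if g else None for g in match.groups())
        start = datetime(year, month or 1, day or 1, tzinfo=timezone.utc)
        end_month = month or 12
        # monthrange gives the last day of the month, including leap years.
        end_day = day or calendar.monthrange(year, end_month)[1]
        end = datetime(year, end_month, end_day, 23, 59, 59, 999000,
                       tzinfo=timezone.utc)
        return {"start": _fmt(start), "end": _fmt(end)}
    # Full timestamp: convert to UTC and use it for both start and end.
    dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: no offset means UTC
    dt = dt.astimezone(timezone.utc)
    return {"start": _fmt(dt), "end": _fmt(dt)}


def annotate_date(resource, element_name):
    """Add {elementName}_aa.start and {elementName}_aa.end to a resource dict."""
    resource[element_name + "_aa"] = to_range(resource[element_name])
    return resource
```

For example, `annotate_date({"birthDate": "2018-05"}, "birthDate")` adds a `birthDate_aa` record holding the month's first and last millisecond.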
- Convert to uppercase
- Follow Unicode code point normalization guidelines (for JS: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize)
- Add to resource as {elementName}_aa
- Note: applied to text element in CodeableConcept and display element in Coding
- TODO: review string normalization implementations in HAPI and MS FHIR servers
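A minimal Python sketch of the string annotation above. The normalization form is not stated in these notes, so NFC is an assumption here (it is also the default of the JS `String.prototype.normalize` linked above), and the function names are hypothetical.

```python
import unicodedata


def normalize_string(value, form="NFC"):
    # Uppercase first, then normalize, so the result is in the chosen form.
    # The form (NFC) is an assumption; the notes do not name one.
    return unicodedata.normalize(form, value.upper())


def annotate_codeable_concept(concept):
    """Add text_aa to a CodeableConcept and display_aa to each of its Codings."""
    if "text" in concept:
        concept["text_aa"] = normalize_string(concept["text"])
    for coding in concept.get("coding", []):
        if "display" in coding:
            coding["display_aa"] = normalize_string(coding["display"])
    return concept
```

With this sketch, a decomposed "e" plus combining acute accent and a precomposed "é" both produce the same `_aa` value, so a single `LIKE` or equality filter matches either spelling.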
- Build URL with base url, resourceType, and id
- Remove scheme
- Hash with SHA1
- Update resource id to hash
- Retain original id in id_prev_aa
- TODO: is SHA1 the best hashing algorithm for this?
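The id steps above might look like this in Python. It is a sketch under assumptions: the URL is joined as `base/resourceType/id`, the digest is hex encoded, and the helper names are hypothetical. A hex SHA-1 digest is 40 characters from `[0-9a-f]`, which fits within the 64-character limit FHIR places on ids.

```python
import hashlib
import re


def strip_scheme(url):
    # Drop a leading "http://", "https://", etc.
    return re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)


def hashed_id(base_url, resource_type, resource_id):
    # Assumed URL layout: <base>/<resourceType>/<id>
    url = f"{base_url.rstrip('/')}/{resource_type}/{resource_id}"
    return hashlib.sha1(strip_scheme(url).encode("utf-8")).hexdigest()


def rehash_resource(resource, base_url):
    """Return a copy of the resource with a hashed id and the old id retained."""
    updated = dict(resource)
    updated["id_prev_aa"] = resource["id"]
    updated["id"] = hashed_id(base_url, resource["resourceType"], resource["id"])
    return updated
```

Because the scheme is removed before hashing, the same resource served over `http` and `https` receives the same id.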
- Omit from analytic dataset by default with option to include Narrative (optional inclusion not yet implemented in prototype)
- Build URL with base url, resourceType, parent resource id, and contained resource id
- Remove URL scheme
- Hash with SHA1
- Update resource id
- Retain original id in id_prev_aa
- Extract from parent resource
- Update internal references in former parent to new id
- TODO: is SHA1 the best hashing algorithm for this?
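A possible Python sketch of the contained-resource steps above. The exact URL layout is not spelled out in these notes; this sketch assumes `base/ParentType/parentId#containedId`, using the parent's original id, and only rewrites `#id` references found in the former parent. All function names are hypothetical.

```python
import copy
import hashlib
import re


def _hash(url):
    # Remove the scheme, then take a hex SHA-1 digest.
    return hashlib.sha1(re.sub(r"^\w+://", "", url).encode("utf-8")).hexdigest()


def _rewrite_refs(node, id_map):
    # Walk dicts and lists, pointing "#localId" references at the new ids.
    if isinstance(node, dict):
        ref = node.get("reference")
        if isinstance(ref, str) and ref.startswith("#") and ref[1:] in id_map:
            resource_type, new_id = id_map[ref[1:]]
            node["reference"] = f"{resource_type}/{new_id}"
        for value in node.values():
            _rewrite_refs(value, id_map)
    elif isinstance(node, list):
        for item in node:
            _rewrite_refs(item, id_map)


def extract_contained(parent, base_url):
    """Return (parent without contained resources, list of extracted resources)."""
    parent = copy.deepcopy(parent)  # leave the caller's resource untouched
    contained = parent.pop("contained", [])
    prefix = f"{base_url.rstrip('/')}/{parent['resourceType']}/{parent['id']}"
    id_map, extracted = {}, []
    for resource in contained:
        new_id = _hash(f"{prefix}#{resource['id']}")
        id_map[resource["id"]] = (resource["resourceType"], new_id)
        resource["id_prev_aa"] = resource["id"]
        resource["id"] = new_id
        extracted.append(resource)
    _rewrite_refs(parent, id_map)
    return parent, extracted
```

References between contained resources, or from a contained resource back to its parent, are not handled in this sketch.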
- If absolute URL, and base matches FHIR server base URL (ie not an external reference): id = hash of url without scheme
- If relative URL: id = hash of base url without scheme + relative url
- If contained URL: id = hash of base url without scheme + relative url + "#" + relative id
- Store previous Reference.reference as reference_prev_aa
- Update Reference.reference to [resourceType]/[hashed id]
- Populate Reference.type if not populated
- Populate reference_id_aa with the hashed id
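The reference rules above can be sketched in Python as follows. Several details are assumptions rather than documented behavior: parts are joined with "/", the base URL is compared with its scheme removed, external or unrecognized references (for example `urn:uuid:` values) are returned unchanged, and versioned references are not handled. `annotate_reference`, `parent_path` and `contained_types` are hypothetical names.

```python
import hashlib
import re

SCHEME = re.compile(r"^\w+://")
RELATIVE = re.compile(r"^[A-Za-z]+/[A-Za-z0-9.\-]+$")


def _sha1(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def annotate_reference(ref, base_url, parent_path=None, contained_types=None):
    """Annotate one Reference dict.

    parent_path: "Type/id" of the enclosing resource, needed for "#id" references.
    contained_types: {local id: resourceType} for the parent's contained resources.
    """
    value = ref.get("reference")
    if not value:
        return ref
    base = SCHEME.sub("", base_url.rstrip("/"))
    if value.startswith("#"):  # contained reference
        local_id = value[1:]
        hashed = _sha1(f"{base}/{parent_path}#{local_id}")
        resource_type = (contained_types or {}).get(local_id) or ref.get("type")
    elif SCHEME.match(value):  # absolute URL
        stripped = SCHEME.sub("", value)
        if not stripped.startswith(base + "/"):
            return ref  # external reference: leave as-is
        hashed = _sha1(stripped)
        resource_type = stripped[len(base) + 1:].split("/")[0]
    elif RELATIVE.match(value):  # relative URL such as "Patient/123"
        hashed = _sha1(f"{base}/{value}")
        resource_type = value.split("/")[0]
    else:
        return ref
    if not resource_type:
        return ref  # cannot build [resourceType]/[hashed id]
    out = dict(ref)
    out["reference_prev_aa"] = value
    out["reference"] = f"{resource_type}/{hashed}"
    out.setdefault("type", resource_type)  # only populated when missing
    out["reference_id_aa"] = hashed
    return out
```

Under these assumptions, `Patient/123` and `https://example.org/fhir/Patient/123` resolve to the same hashed id when the server base is `https://example.org/fhir`, so joins on `reference_id_aa` work regardless of how the source system wrote the reference.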
Flattened into a record structure for easy querying.
Note that BigQuery uses a typed schema and limits the number of fields in a table, so only FHIR types pre-defined for that extension path are included in the record.
- Make all extension URLs absolute
- Flatten to the following table, replacing the extension element:
extension []
  parent (eg. 0.1.2)
  url (absolute)
  value[x]
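One way to sketch the flattening in Python. Two points are assumptions: nested URLs are made absolute by appending them to the enclosing extension's URL (which matches the URLs used in the example queries), and `parent` holds the dot-separated index path of the enclosing extension, with `None` for top-level entries. `flatten_extensions` is a hypothetical name.

```python
def flatten_extensions(extensions, parent_url=None, parent_path=None):
    """Flatten nested FHIR extensions into a list of {parent, url, value[x]} rows."""
    rows = []
    for index, ext in enumerate(extensions):
        url = ext["url"]
        # Nested extensions use short names (eg. "ombCategory"); prefix them
        # with the enclosing extension's URL to make them absolute.
        if parent_url and ":" not in url:
            url = f"{parent_url}/{url}"
        path = str(index) if parent_path is None else f"{parent_path}.{index}"
        row = {"parent": parent_path, "url": url}
        # Copy whichever value[x] element is present (valueCoding, valueString, ...).
        row.update({k: v for k, v in ext.items() if k.startswith("value")})
        rows.append(row)
        rows.extend(flatten_extensions(ext.get("extension", []), url, path))
    return rows
```

Restricting each row to the FHIR types pre-defined for its extension path, as the BigQuery schema requires, is left out of this sketch.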
Example queries:
SELECT *
FROM Patient,
UNNEST (extension) AS pt_extension
WHERE pt_extension.url = "http://fhir.org/guides/argonaut/StructureDefinition/argo-race/ombCategory"
AND pt_extension.valueCoding.system = "http://fhir.org/guides/argonaut/v3/Race"
AND pt_extension.valueCoding.code = "1002-5"
SELECT *
FROM Patient,
UNNEST (extension) AS pt_extension
WHERE pt_extension.url = "http://fhir.org/guides/argonaut/StructureDefinition/argo-race/text"
AND pt_extension.valueString_aa LIKE "%MIXED%"
A few FHIR structures can be infinitely nested and need to be limited to fit in BigQuery and other schema-based data stores.
- Extensions can contain extensions (handled by flattening extensions as described above)
- Extensions can contain complex types that can contain extensions (currently handled by omitting these)
- Complex types can contain other complex types (eg. a Reference includes an Identifier, which has an assigner that is itself a Reference). This is handled by limiting the levels of recursion to a pre-defined number of levels by path (prototype defaults to 3).
- Content references that are circular (eg. Questionnaire.item.item). This is handled by limiting the levels of recursion to a pre-defined number of levels by path (prototype defaults to 3).
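The recursion limit can be illustrated with a short Python sketch. It prunes a single self-nesting element (such as `item` in Questionnaire) to a fixed depth; the prototype's per-path configuration is not reproduced here, and `limit_depth` is a hypothetical name.

```python
def limit_depth(node, key, max_depth=3, depth=1):
    """Return a copy of node where `key` nesting deeper than max_depth is dropped.

    `node` itself counts as level 1, its children under `key` as level 2, etc.
    """
    result = {k: v for k, v in node.items() if k != key}
    children = node.get(key, [])
    if children and depth < max_depth:
        result[key] = [limit_depth(child, key, max_depth, depth + 1)
                       for child in children]
    return result
```

Applied to a Questionnaire item nested five levels deep with the default of 3, the copy keeps the first three levels and omits the rest, which keeps the resulting schema finite.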
- Normalize city, district, state and country by converting to uppercase and fixing abbreviations where possible
- Optionally geocode
- Add to Address type:
city_aa
district_aa
state_aa
country_aa
geocode_aa
  longitude: Longitude with WGS84 datum
  latitude (decimal): Latitude with WGS84 datum
  altitude (decimal): Altitude with WGS84 datum
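A hedged Python sketch of the address annotation. The abbreviation tables below are tiny, hypothetical examples, expanding abbreviations (rather than abbreviating full names) is an assumption, and `geocoder` stands in for whatever geocoding service is used; it is expected to return a dict with `longitude`, `latitude` and optionally `altitude`.

```python
# Hypothetical lookup tables; a real implementation would need fuller lists.
STATE_ABBREVIATIONS = {"MA": "MASSACHUSETTS", "NY": "NEW YORK"}
COUNTRY_ABBREVIATIONS = {"US": "UNITED STATES", "USA": "UNITED STATES"}

FIELDS = (
    ("city", None),
    ("district", None),
    ("state", STATE_ABBREVIATIONS),
    ("country", COUNTRY_ABBREVIATIONS),
)


def _normalize(value, abbreviations=None):
    # Uppercase, collapse whitespace, then expand a known abbreviation.
    cleaned = " ".join(value.upper().split())
    return (abbreviations or {}).get(cleaned, cleaned)


def annotate_address(address, geocoder=None):
    """Return a copy of a FHIR Address with the *_aa fields added."""
    out = dict(address)
    for field, table in FIELDS:
        if address.get(field):
            out[field + "_aa"] = _normalize(address[field], table)
    if geocoder is not None:  # geocoding is optional
        position = geocoder(out)
        if position:
            out["geocode_aa"] = position
    return out
```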
If needed, primitive extensions could be flattened using a similar approach to that described above for other extensions.
- Is it worth grouping similar choice types (eg. date, dateTime, instant) into one type for querying or should this be handled by the query generator checking for the existence of each type?
- Support inclusion of timezone offset extension (https://www.hl7.org/fhir/extension-tz-offset.json.html) on date and time elements rather than assuming UTC?
- Build a schema for each upload tailored to the data, or build a general purpose schema (as is currently done)? Pro: would transparently support extensions and primitive extensions. Con: clients would have to check whether fields exist prior to querying.
- Is performance sufficient to do interval packing via window queries or does aggregation need to be done as a transformation on load (eg. merging multiple medication orders into a drug exposure "era")?
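For the last question, the load-time alternative to window queries could be as simple as the following Python sketch of interval packing. The name `pack_intervals` and the `max_gap_days` parameter are hypothetical; intervals are `(start, end)` pairs of `datetime.date` values.

```python
from datetime import date


def pack_intervals(intervals, max_gap_days=0):
    """Merge (start, end) date pairs that overlap or lie within max_gap_days."""
    eras = []
    for start, end in sorted(intervals):
        if eras and (start - eras[-1][1]).days <= max_gap_days:
            # Overlaps (or is close enough to) the current era: extend it.
            eras[-1] = (eras[-1][0], max(eras[-1][1], end))
        else:
            eras.append((start, end))
    return eras
```

For instance, medication orders covering 1 to 31 January and 15 January to 15 February would merge into a single era from 1 January to 15 February, while an order starting 1 March would begin a new era unless `max_gap_days` allows the gap.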