[SYNPY-1698] Add CSV data model tutorial #1292
Merged
Changes from 13 commits (of 29):
- 7dfa9c4 first draft of data model tutorial (andrewelamb)
- 1ca26dd fix typos (andrewelamb)
- 01e7091 fix typos (andrewelamb)
- 8c6328b clean up into (andrewelamb)
- baeef99 clean up into (andrewelamb)
- 31d7745 clean up into (andrewelamb)
- 8b87655 clean up column descriptions (andrewelamb)
- 2531a8d clean up column descriptions (andrewelamb)
- d745d3a Add conditonal dependencies section (andrewelamb)
- 052e3f3 fix conditonal dependencies section (andrewelamb)
- f571580 fix conditonal dependencies section (andrewelamb)
- 81010ae add conditonal dependencies example (andrewelamb)
- c24f203 move file and link in mkdcos (andrewelamb)
- ffd8061 various fixes to the tutorial (andrewelamb)
- aff17fc various fixes to the tutorial (andrewelamb)
- 47c669d various fixes to the tutorial (andrewelamb)
- d9092b2 various fixes to the tutorial (andrewelamb)
- 615c609 various fixes to the tutorial (andrewelamb)
- e86a481 Update docs/explanations/curator_data_model.md (andrewelamb)
- ce0bc19 Update docs/explanations/curator_data_model.md (andrewelamb)
- ac2f00e Update docs/explanations/curator_data_model.md (andrewelamb)
- 9c38077 Update docs/explanations/curator_data_model.md (andrewelamb)
- 1bc7334 Update docs/explanations/curator_data_model.md (andrewelamb)
- f377504 Update docs/explanations/curator_data_model.md (andrewelamb)
- 69711f1 Update docs/explanations/curator_data_model.md (andrewelamb)
- 8748db4 Update docs/explanations/curator_data_model.md (andrewelamb)
- 63758d7 copilot fixes (andrewelamb)
- 9746259 more fixes to tutorial (andrewelamb)
- c1514c6 more fixes to tutorial (andrewelamb)
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,347 @@ | ||||||
| # CSV data model description | ||||||
|
|
||||||
| The Curator-Extension (formerly Schematic) data model is used to create JSON Schemas for Curator. See [JSON Schema documentation](https://json-schema.org/). This format is intended for DCCs that prefer working in a tabular format (CSV) over JSON or LinkML. A data model is created in the format specified below; the Curator-Extension in the Synapse Python Client can then be used to convert it to JSON Schema. | ||||||
|
|
||||||
| A link will be provided here to documentation for converting CSV data models to JSON Schema in the near future. | ||||||
|
|
||||||
| ## Data model columns | ||||||
|
|
||||||
| A JSON Schema is made up of one data type (for example, a person) and the attributes that describe the data type (for example, age and gender). The CSV data model describes one or more data types. Each row describes either a data type or an attribute. | ||||||
|
|
||||||
| Data types: | ||||||
|
|
||||||
| - must have a unique name in the `Attribute` column | ||||||
| - must have at least one attribute in the `DependsOn` column | ||||||
| - may have a value in the `Description` column | ||||||
|
|
||||||
| Attributes: | ||||||
|
|
||||||
| - must have a unique name in the `Attribute` column | ||||||
| - may have values in any of the other columns except `DependsOn` | ||||||
|
|
||||||
| The following data model has one data type, `Person`, and that data type has one attribute, `Gender`. | ||||||
|
|
||||||
| | Attribute | DependsOn | | ||||||
| |---|---| | ||||||
| | Person | "Gender" | | ||||||
| | Gender | | | ||||||
|
|
||||||
| Converting the above data model to JSON Schema results in: | ||||||
|
|
||||||
| ```json | ||||||
| { | ||||||
| "description": "TBD", | ||||||
| "properties": { | ||||||
| "Gender": { | ||||||
| "description": "TBD", | ||||||
| "title": "Gender" | ||||||
| } | ||||||
| } | ||||||
| } | ||||||
| ``` | ||||||
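As a rough illustration of the mapping described above, the following sketch builds the same schema from the two-column table using only the Python standard library. It is not the Curator-Extension implementation: the function name `data_type_to_schema` and the inline CSV text are assumptions made for this example.

```python
import csv
import io
import json

# Hypothetical CSV text matching the example table above.
CSV_MODEL = '''Attribute,DependsOn
Person,"Gender"
Gender,
'''


def data_type_to_schema(csv_text: str, data_type: str) -> dict:
    """Build a minimal schema dict for one data type (illustrative only)."""
    rows = {row["Attribute"]: row for row in csv.DictReader(io.StringIO(csv_text))}
    # DependsOn holds a comma-separated list of attribute names.
    depends_on = [
        name.strip()
        for name in rows[data_type]["DependsOn"].split(",")
        if name.strip()
    ]
    properties = {}
    for name in depends_on:
        if name not in rows:
            raise ValueError(f"{name!r} is not defined in the data model")
        # A missing or blank Description falls back to "TBD".
        properties[name] = {
            "description": rows[name].get("Description") or "TBD",
            "title": name,
        }
    return {
        "description": rows[data_type].get("Description") or "TBD",
        "properties": properties,
    }


schema = data_type_to_schema(CSV_MODEL, "Person")
print(json.dumps(schema, indent=2))
```

The sketch raises an error when `DependsOn` names an attribute that has no row of its own, which mirrors the rule that every listed attribute must exist in the data model.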
|
|
||||||
| ### Attribute | ||||||
|
|
||||||
| The name of the data type or attribute described in this row. It must be unique within the file. For attributes, this name becomes the `title` in the JSON Schema. | ||||||
|
|
||||||
| ### DependsOn | ||||||
|
|
||||||
| The set of attributes this data type has. These must be attributes that exist in this data model. Each attribute will appear in the `properties` of the JSON Schema. This should be a comma-separated list in quotes. Example: "Patient ID, Sex, Year of Birth, Diagnosis" | ||||||
|
|
||||||
| ### Description | ||||||
|
|
||||||
| A description of the data type or attribute. This will appear as the `description` in the JSON Schema. If left blank, it will be filled with `TBD`. | ||||||
|
|
||||||
| ### Valid Values | ||||||
|
|
||||||
| Set of possible values for the current attribute. This attribute will be an enum in the JSON Schema, with the values here as the enum values. See [enum](https://json-schema.org/understanding-json-schema/reference/enum#enumerated-values). This should be a comma-separated list in quotes. Example: "Female, Male, Other" | ||||||
|
|
||||||
| Data Model: | ||||||
|
|
||||||
| | Attribute | DependsOn | Valid Values | | ||||||
| |---|---|---| | ||||||
| | Person | "Gender" | | | ||||||
| | Gender | | "Female, Male, Other" | | ||||||
|
|
||||||
| JSON Schema output: | ||||||
|
|
||||||
| ```json | ||||||
| { | ||||||
| "description": "TBD", | ||||||
| "properties": { | ||||||
| "Gender": { | ||||||
| "description": "TBD", | ||||||
| "title": "Gender", | ||||||
| "enum": ["Female", "Male", "Other"] | ||||||
| } | ||||||
| } | ||||||
| } | ||||||
| ``` | ||||||
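A minimal sketch of how a `Valid Values` cell could be split into enum values and used to check a value. The helper names are hypothetical and the splitting rule (comma-separated, surrounding whitespace trimmed) is an assumption based on the example above. JSON Schema `enum` comparison is exact, so the check is case-sensitive.

```python
def parse_valid_values(cell: str) -> list[str]:
    """Split a comma-separated Valid Values cell into enum values."""
    return [value.strip() for value in cell.split(",") if value.strip()]


def is_valid(value: str, cell: str) -> bool:
    """Check a value against the enum derived from the cell (exact match)."""
    return value in parse_valid_values(cell)


enum_values = parse_valid_values("Female, Male, Other")
print(enum_values)  # ['Female', 'Male', 'Other']
print(is_valid("Male", "Female, Male, Other"))  # True
print(is_valid("male", "Female, Male, Other"))  # False
```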
|
|
||||||
| ### Required | ||||||
|
|
||||||
| Whether a value must be set for this attribute. This field is boolean, i.e. valid values are ‘TRUE’ and ‘FALSE’. All attributes that are required will appear in the required list in the JSON Schema. See [required](https://json-schema.org/understanding-json-schema/reference/object#required). | ||||||
|
|
||||||
| Data Model: | ||||||
|
|
||||||
| | Attribute | DependsOn | Required | | ||||||
| |---|---|---| | ||||||
| | Person | "Gender, Age" | | | ||||||
| | Gender | | True | | ||||||
| | Age | | False | | ||||||
|
|
||||||
| JSON Schema output: | ||||||
|
|
||||||
| ```json | ||||||
| { | ||||||
| "description": "TBD", | ||||||
| "properties": { | ||||||
| "Gender": { | ||||||
| "description": "TBD", | ||||||
| "title": "Gender" | ||||||
| }, | ||||||
| "Age": { | ||||||
| "description": "TBD", | ||||||
| "title": "Age" | ||||||
| } | ||||||
| }, | ||||||
| "required": ["Gender"] | ||||||
| } | ||||||
| ``` | ||||||
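The `required` list can be sketched as a simple filter over the `Required` column. This is an illustration under two assumptions that the text does not confirm: the cell is compared case-insensitively (the prose says `TRUE`/`FALSE`, the table uses `True`/`False`), and a blank cell means "not required". The function names are hypothetical.

```python
def parse_required(cell: str) -> bool:
    """Interpret a Required cell (assumed case-insensitive, blank = False)."""
    text = cell.strip().upper()
    if text in ("", "FALSE"):
        return False
    if text == "TRUE":
        return True
    raise ValueError(f"Required must be TRUE or FALSE, got {cell!r}")


def missing_required(record: dict, required_names: list[str]) -> list[str]:
    """Return the required attributes that are absent from a record."""
    return [name for name in required_names if name not in record]


# Attribute name -> Required cell, as in the table above.
required_cells = {"Gender": "True", "Age": "False"}
required = [name for name, cell in required_cells.items() if parse_required(cell)]
print(required)  # ['Gender']
print(missing_required({"Age": 30}, required))  # ['Gender']
```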
|
|
||||||
| ### columnType | ||||||
|
|
||||||
| The data type of this attribute. See [type](https://json-schema.org/understanding-json-schema/reference/type). | ||||||
|
|
||||||
| Must be one of: | ||||||
|
|
||||||
| - `string` | ||||||
| - `number` | ||||||
| - `integer` | ||||||
| - `boolean` | ||||||
| - `string_list` | ||||||
| - `integer_list` | ||||||
| - `boolean_list` | ||||||
|
|
||||||
| Data Model: | ||||||
|
|
||||||
| | Attribute | DependsOn | columnType | | ||||||
| |---|---|---| | ||||||
| | Person | "Gender, Assays" | | | ||||||
| | Gender | | string | | ||||||
| | Assays | | string_list | | ||||||
|
|
||||||
| JSON Schema output: | ||||||
|
|
||||||
| ```json | ||||||
| { | ||||||
| "description": "TBD", | ||||||
| "properties": { | ||||||
| "Gender": { | ||||||
| "description": "TBD", | ||||||
| "title": "Gender", | ||||||
| "type": "string" | ||||||
| }, | ||||||
| "Assays": { | ||||||
| "description": "TBD", | ||||||
| "title": "Assays", | ||||||
| "type": "array", | ||||||
| "items": { | ||||||
| "type": "string" | ||||||
| } | ||||||
| } | ||||||
| } | ||||||
| } | ||||||
| ``` | ||||||
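The translation from `columnType` to JSON Schema keywords can be sketched as below. Only the `string` and `string_list` outputs are shown in the example above; treating `integer_list` and `boolean_list` the same way is an assumption by analogy, and `column_type_to_schema` is a hypothetical name.

```python
BASE_TYPES = {"string", "number", "integer", "boolean"}
LIST_TYPES = {"string_list", "integer_list", "boolean_list"}


def column_type_to_schema(column_type: str) -> dict:
    """Map a columnType value to a JSON Schema fragment (illustrative only)."""
    if column_type in BASE_TYPES:
        return {"type": column_type}
    if column_type in LIST_TYPES:
        # "string_list" -> an array whose items are strings, and so on.
        item_type = column_type[: -len("_list")]
        return {"type": "array", "items": {"type": item_type}}
    raise ValueError(f"Unsupported columnType: {column_type!r}")


print(column_type_to_schema("string"))
print(column_type_to_schema("string_list"))
```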
|
|
||||||
| ### Format | ||||||
|
|
||||||
| The format of this attribute. See [format](https://json-schema.org/understanding-json-schema/reference/type#format). The type of this attribute must be `string` or `string_list`. The value of this column will appear as the `format` of this attribute in the JSON Schema. Must be one of: | ||||||
|
|
||||||
| - `date-time` | ||||||
| - `email` | ||||||
| - `hostname` | ||||||
| - `ipv4` | ||||||
| - `ipv6` | ||||||
| - `uri` | ||||||
| - `uri-reference` | ||||||
| - `uri-template` | ||||||
| - `json-pointer` | ||||||
| - `date` | ||||||
| - `time` | ||||||
| - `regex` | ||||||
| - `relative-json-pointer` | ||||||
|
|
||||||
| Data Model: | ||||||
|
|
||||||
| | Attribute | DependsOn | columnType | Format | | ||||||
| |---|---|---|---| | ||||||
| | Person | "Gender, Date" | | | | ||||||
| | Gender | | string | | | ||||||
| | Date | | string | date | | ||||||
|
|
||||||
| JSON Schema output: | ||||||
|
|
||||||
| ```json | ||||||
| { | ||||||
| "description": "TBD", | ||||||
| "properties": { | ||||||
| "Gender": { | ||||||
| "description": "TBD", | ||||||
| "title": "Gender", | ||||||
| "type": "string" | ||||||
| }, | ||||||
| "Date": { | ||||||
| "description": "TBD", | ||||||
| "title": "Date", | ||||||
| "type": "string", | ||||||
| "format": "date" | ||||||
| } | ||||||
| } | ||||||
| } | ||||||
| ``` | ||||||
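In JSON Schema, the `date` format corresponds to an RFC 3339 full-date (`YYYY-MM-DD`), and whether a validator enforces `format` depends on the validator and its configuration. The sketch below is one way to check such a value by hand in Python; `is_full_date` is a hypothetical helper, not part of any library.

```python
import re
from datetime import date

# Four-digit year, two-digit month, two-digit day, separated by hyphens.
_FULL_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def is_full_date(value: str) -> bool:
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    match = _FULL_DATE.fullmatch(value)
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)  # rejects impossible dates such as Feb 30
    except ValueError:
        return False
    return True


print(is_full_date("2024-02-29"))  # True (leap year)
print(is_full_date("2023-02-29"))  # False
print(is_full_date("02/29/2024"))  # False
```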
|
|
||||||
| ### Pattern | ||||||
|
|
||||||
| The regex pattern this attribute must match. The type of this attribute must be `string` or `string_list`. See [pattern](https://json-schema.org/understanding-json-schema/reference/regular_expressions#regular-expressions). The value of this column will appear as the `pattern` of this attribute in the JSON Schema. Must be a legal regex pattern as determined by the Python `re` module. | ||||||
|
|
||||||
| Data Model: | ||||||
|
|
||||||
| | Attribute | DependsOn | columnType | Pattern | | ||||||
| |---|---|---|---| | ||||||
| | Person | "Gender, ID" | | | | ||||||
| | Gender | | string | | | ||||||
| | ID | | string | [a-f] | | ||||||
|
|
||||||
| JSON Schema output: | ||||||
|
|
||||||
| ```json | ||||||
| { | ||||||
| "description": "TBD", | ||||||
| "properties": { | ||||||
| "Gender": { | ||||||
| "description": "TBD", | ||||||
| "title": "Gender", | ||||||
| "type": "string" | ||||||
| }, | ||||||
| "ID": { | ||||||
| "description": "TBD", | ||||||
| "title": "ID", | ||||||
| "type": "string", | ||||||
| "pattern": "[a-f]" | ||||||
| } | ||||||
| } | ||||||
| } | ||||||
| ``` | ||||||
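Two points about patterns are easy to miss and can be checked with the Python `re` module. First, legality can be tested by compiling the pattern. Second, JSON Schema patterns are not implicitly anchored, so `[a-f]` accepts any string that contains at least one of those letters; anchors such as `^` and `$` are needed to constrain the whole value. The helper names below are hypothetical.

```python
import re


def is_legal_pattern(pattern: str) -> bool:
    """Return True if the pattern compiles with Python's re module."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


def matches(pattern: str, value: str) -> bool:
    """Unanchored match, mirroring how JSON Schema applies `pattern`."""
    return re.search(pattern, value) is not None


print(is_legal_pattern("[a-f]"))       # True
print(is_legal_pattern("[a-f"))        # False (unterminated character set)
print(matches("[a-f]", "xyz-b"))       # True: the string contains "b"
print(matches("^[a-f]+$", "xyz-b"))    # False: the whole string must be a-f
```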
|
|
||||||
| ### Minimum/Maximum | ||||||
|
|
||||||
| The range of numeric values this attribute must be in. The type of this attribute must be `integer`, `number`, or `integer_list`. See [range](https://json-schema.org/understanding-json-schema/reference/numeric#range). The values of these columns will appear as the `minimum` and `maximum` of this attribute in the JSON Schema. Both must be numeric values. | ||||||
|
|
||||||
| Data Model: | ||||||
|
|
||||||
| | Attribute | DependsOn | columnType | Minimum | Maximum | | ||||||
| |---|---|---|---|---| | ||||||
| | Person | "Age, Weight, Expression" | | | | | ||||||
| | Age | | integer | 0 | 120 | | ||||||
| | Weight | | number | 0.0 | | | ||||||
| | Expression | | number | 0.0 | 1.0 | | ||||||
|
|
||||||
| JSON Schema output: | ||||||
|
|
||||||
| ```json | ||||||
| { | ||||||
| "description": "TBD", | ||||||
| "properties": { | ||||||
| "Age": { | ||||||
| "description": "TBD", | ||||||
| "title": "Age", | ||||||
| "type": "integer", | ||||||
| "minimum": 0, | ||||||
| "maximum": 120 | ||||||
| }, | ||||||
| "Weight": { | ||||||
| "description": "TBD", | ||||||
| "title": "Weight", | ||||||
| "type": "number", | ||||||
| "minimum": 0.0 | ||||||
| }, | ||||||
| "Expression": { | ||||||
| "description": "TBD", | ||||||
| "title": "Expression", | ||||||
| "type": "number", | ||||||
| "minimum": 0.0, | ||||||
| "maximum": 1.0 | ||||||
| } | ||||||
| } | ||||||
| } | ||||||
| ``` | ||||||
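In JSON Schema, `minimum` and `maximum` are inclusive bounds, and either one may be omitted (as with `Weight` above). A minimal sketch of that check, with `in_range` as a hypothetical helper:

```python
from typing import Optional, Union

Number = Union[int, float]


def in_range(
    value: Number,
    minimum: Optional[Number] = None,
    maximum: Optional[Number] = None,
) -> bool:
    """Inclusive range check; a bound of None means 'no limit'."""
    if minimum is not None and value < minimum:
        return False
    if maximum is not None and value > maximum:
        return False
    return True


print(in_range(0, 0, 120))             # True: the bounds are inclusive
print(in_range(121, 0, 120))           # False
print(in_range(250.5, minimum=0.0))    # True: no maximum set
```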
|
|
||||||
| ### Validation Rules (deprecated) | ||||||
|
|
||||||
| This is a remnant from Schematic. It is still used (for now) to translate certain validation rules to other JSON Schema keywords. If you are starting a new data model, do not use it. If you have an existing data model using any of the following validation rules, follow these instructions to update it: | ||||||
|
|
||||||
| - `list`: Make sure you are using one of the list-types in the `columnType` column. | ||||||
| - `regex`: `regex <module> <pattern>`, move the `<pattern>` to the `Pattern` column. | ||||||
| - `inRange`: `inRange <minimum> <maximum>`, move the `<minimum>` and/or the `<maximum>` to the `Minimum` and `Maximum` columns respectively. | ||||||
| - `date`: Use the `Format` column with value `date` | ||||||
| - `url`: Use the `Format` column with value `uri` | ||||||
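The migration steps above could be scripted along these lines. This is a sketch only: `migrate_rule` is a hypothetical helper, it handles just the two-bound form of `inRange`, and the `search` value used for `<module>` in the usage lines is a placeholder.

```python
def migrate_rule(rule: str) -> dict:
    """Map one deprecated validation rule to replacement column values."""
    parts = rule.split()
    if not parts:
        return {}
    name = parts[0]
    if name == "regex" and len(parts) >= 3:
        # regex <module> <pattern>: only the pattern is carried over.
        return {"Pattern": rule.split(maxsplit=2)[2]}
    if name == "inRange" and len(parts) == 3:
        return {"Minimum": parts[1], "Maximum": parts[2]}
    if name == "date":
        return {"Format": "date"}
    if name == "url":
        return {"Format": "uri"}
    if name == "list":
        # No column value to set: pick a list type in columnType by hand.
        return {}
    raise ValueError(f"No migration known for rule: {rule!r}")


print(migrate_rule("regex search [a-f]"))  # {'Pattern': '[a-f]'}
print(migrate_rule("inRange 0 120"))       # {'Minimum': '0', 'Maximum': '120'}
print(migrate_rule("url"))                 # {'Format': 'uri'}
```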
|
|
||||||
| ## Conditional dependencies | ||||||
|
|
||||||
| The `DependsOn` and `Valid Values` columns can be used together to define conditional logic that determines which attributes are relevant for a data type. | ||||||
|
|
||||||
| Data Model: | ||||||
|
|
||||||
| | Attribute | DependsOn | Valid Values | Required | | ||||||
| |---|---|---|---| | ||||||
| | Patient | "Diagnosis, Cancer" | | | | ||||||
| | Diagnosis | | "Healthy, Cancer" | True | | ||||||
| | Cancer | "Cancer Type, Family History" | "Cancer Type, Family History"| | | ||||||
| | Cancer Type | | "Brain, Lung, Skin" | True | | ||||||
| | Family History | | | True | | ||||||
|
|
||||||
| The example above demonstrates this with the `Patient` and `Cancer` data types. Because we also want to know the `Cancer Type` and `Family History` for cancer patients (but not for healthy patients), `Healthy` and `Cancer` are the valid values for `Diagnosis`. (Note that `Cancer` is both a valid value and a data type.) `Cancer` has two required attributes, `Cancer Type` and `Family History`. | ||||||
|
|
||||||
| As a result, `Patient` data should include the columns `Diagnosis`, `Cancer Type`, and `Family History`, but the last two columns are only required if `Diagnosis` is set to `Cancer` for a given patient (and if the `Required` column is set to true for these two attributes). The conditional logic may define an arbitrary number of branching paths. For instance, in the above example, we could require a `Brain Biopsy Site` attribute if `Cancer Type` is set to `Brain`. | ||||||
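The rule described in the paragraph above can be restated in plain Python as a quick sanity check. The function names are hypothetical, and the sketch follows the prose description only; it does not generate or evaluate a JSON Schema.

```python
def required_attributes(record: dict) -> list[str]:
    """Attributes a Patient record must contain, per the rule described above."""
    required = ["Diagnosis"]
    if record.get("Diagnosis") == "Cancer":
        # The Cancer branch adds its own required attributes.
        required += ["Cancer Type", "Family History"]
    return required


def missing_attributes(record: dict) -> list[str]:
    """Required attributes that are absent from the record."""
    return [name for name in required_attributes(record) if name not in record]


print(missing_attributes({"Diagnosis": "Healthy"}))
# []
print(missing_attributes({"Diagnosis": "Cancer", "Cancer Type": "Lung"}))
# ['Family History']
```

Further branches, such as requiring `Brain Biopsy Site` when `Cancer Type` is `Brain`, would add one more conditional of the same shape.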
|
|
||||||
| The resulting JSON Schema: | ||||||
|
|
||||||
| ```json | ||||||
| { | ||||||
| "description": "TBD", | ||||||
| "properties": { | ||||||
| "Diagnosis": { | ||||||
| "description": "TBD", | ||||||
| "enum": ["Cancer", "Healthy"], | ||||||
| "title": "Diagnosis" | ||||||
| }, | ||||||
| "Cancer Type": { | ||||||
| "description": "TBD", | ||||||
| "enum": ["Brain", "Lung", "Skin"], | ||||||
| "title": "Cancer Type" | ||||||
| }, | ||||||
| "Family History": { | ||||||
| "description": "TBD", | ||||||
| "title": "Family History" | ||||||
| } | ||||||
| }, | ||||||
| "required": ["Diagnosis"], | ||||||
| "allOf": [ | ||||||
| { | ||||||
| "if": { | ||||||
| "properties": { | ||||||
| "Diagnosis": { | ||||||
| "enum": [ | ||||||
| "Cancer" | ||||||
| ] | ||||||
| } | ||||||
| } | ||||||
| }, | ||||||
| "then": { | ||||||
| "required": ["Cancer Type", "Family History"] | ||||||
| } | ||||||
| } | ||||||
| ] | ||||||
| } | ||||||
| ``` | ||||||