[SYNPY-1698] Add CSV data model tutorial #1292
Changes from 18 commits
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,389 @@ | ||||||
| # CSV data model description | ||||||
|
|
||||||
| The Curator-Extension (formerly Schematic) data model is used to create JSON Schemas for Curator. See [JSON Schema documentation](https://json-schema.org/). It is intended for DCCs that prefer working in a tabular format (CSV) over JSON or LinkML. A data model is created in the format specified below; the Curator-Extension in the Synapse Python Client can then convert it to JSON Schema. | ||||||
|
|
||||||
| A link will be provided here to documentation for converting CSV data models to JSON Schema in the near future. | ||||||
|
|
||||||
| ## Understanding the Structure | ||||||
|
|
||||||
| A data model describes real-world entities (data types) and the attributes you want to collect data for. For example, you might want to describe a Patient and collect their age, gender, and name. | ||||||
|
|
||||||
| The CSV data model described in this tutorial formalizes this structure: | ||||||
|
|
||||||
| - The CSV data model describes one or more data types. | ||||||
| - Each row describes either a data type, or an attribute. | ||||||
|
|
||||||
| Here is the Patient described above represented as a CSV data model: | ||||||
|
|
||||||
| | Attribute | DependsOn | | ||||||
| |---|---| | ||||||
| | Patient | "Age, Gender, Name" | | ||||||
| | Age | | | ||||||
| | Gender | | | ||||||
| | Name | | | ||||||
|
|
||||||
| The end goal is to create a JSON Schema that can be used in Curator. A JSON Schema describes only one data type and its attributes. Converting the above data model to JSON Schema results in: | ||||||
|
|
||||||
| ```json | ||||||
| { | ||||||
| "description": "TBD", | ||||||
| "properties": { | ||||||
| "Age": { | ||||||
| "description": "TBD", | ||||||
| "title": "Age" | ||||||
| }, | ||||||
| "Gender": { | ||||||
| "description": "TBD", | ||||||
| "title": "Gender" | ||||||
| }, | ||||||
| "Name": { | ||||||
| "description": "TBD", | ||||||
| "title": "Name" | ||||||
| } | ||||||
| } | ||||||
| } | ||||||
| ``` | ||||||
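To make the row-to-schema mapping concrete, here is a minimal Python sketch that reads the Patient model with the standard `csv` module and builds the same structure as the JSON above. It is not the Curator-Extension implementation: the function name `schema_for` and its behavior are assumptions made for illustration only.

```python
import csv
import io

# The Patient data model from above, as raw CSV text.
CSV_MODEL = '''Attribute,DependsOn
Patient,"Age, Gender, Name"
Age,
Gender,
Name,
'''


def schema_for(data_type: str, csv_text: str) -> dict:
    """Build a schema-like dict for one data type (hypothetical helper)."""
    rows = {row["Attribute"]: row for row in csv.DictReader(io.StringIO(csv_text))}
    # DependsOn is a comma-separated list stored inside one quoted CSV cell.
    attributes = [name.strip() for name in rows[data_type]["DependsOn"].split(",")]
    return {
        "description": "TBD",
        "properties": {
            name: {"description": "TBD", "title": name} for name in attributes
        },
    }


patient_schema = schema_for("Patient", CSV_MODEL)
```

Note that the data type row itself (`Patient`) does not become a property; only the attributes it lists in `DependsOn` do.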
|
|
||||||
| ## CSV Data model columns | ||||||
|
|
||||||
| Note: Individual columns are covered later on this page. | ||||||
|
|
||||||
| Defining data types: | ||||||
|
|
||||||
| - Put a unique data type name in the `Attribute` column | ||||||
| - List its attributes (minimum 1) in the `DependsOn` column (comma-separated) | ||||||
| - Optionally add a description to the `Description` column. | ||||||
|
|
||||||
| Defining attributes: | ||||||
|
|
||||||
| - Put a unique attribute name in the `Attribute` column | ||||||
| - Leave the `DependsOn` column empty | ||||||
| - All other columns are optional | ||||||
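The rules above can be checked mechanically before running a conversion. The following is a rough sketch using only the Python standard library; `check_model` is a hypothetical helper, not part of the Synapse Python Client, and it covers only name uniqueness and `DependsOn` references.

```python
import csv
import io


def check_model(csv_text: str) -> list[str]:
    """Return a list of problems found in a CSV data model (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    names = [row["Attribute"].strip() for row in rows]
    problems = []
    # Every name in the Attribute column must be unique.
    for name in sorted(set(names)):
        if names.count(name) > 1:
            problems.append(f"duplicate name: {name}")
    # A non-empty DependsOn cell marks a data type; each entry must be a row.
    for row in rows:
        depends_on = (row.get("DependsOn") or "").strip()
        if not depends_on:
            continue  # plain attribute row
        for attribute in (part.strip() for part in depends_on.split(",")):
            if attribute not in names:
                problems.append(f"{row['Attribute']}: unknown attribute {attribute}")
    return problems


good = 'Attribute,DependsOn\nPatient,"Age, Name"\nAge,\nName,\n'
bad = 'Attribute,DependsOn\nPatient,"Age, Height"\nAge,\nAge,\n'
```

Calling `check_model(good)` yields an empty list, while `check_model(bad)` reports the duplicated `Age` row and the undefined `Height` attribute.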
|
|
||||||
| ### Attribute | ||||||
|
|
||||||
| The name of the data type or attribute being described on this line. It must be unique within the file. For attributes, it becomes the `title` in the JSON Schema. | ||||||
|
|
||||||
| ### DependsOn | ||||||
|
|
||||||
| The set of attributes this data type has. These must be attributes that exist in this data model. Each attribute will appear in the `properties` of the JSON Schema. This should be a comma-separated list in quotes. Example: "Patient ID, Sex, Year of Birth, Diagnosis" | ||||||
|
|
||||||
| ### Description | ||||||
|
|
||||||
| A description of the data type or attribute. This will appear as the `description` in the JSON Schema. If left blank, it is filled with ‘TBD’. | ||||||
|
|
||||||
| ### Valid Values | ||||||
|
|
||||||
| Set of possible values for the current attribute. This attribute will be an enum in the JSON Schema, with the values here as the enum values. See [enum](https://json-schema.org/understanding-json-schema/reference/enum#enumerated-values). This should be a comma-separated list in quotes. Example: "Female, Male, Other" | ||||||
|
|
||||||
| Data Model: | ||||||
|
|
||||||
| | Attribute | DependsOn | Valid Values | | ||||||
| |---|---|---| | ||||||
| | Patient | "Gender" | | | ||||||
| | Gender | | "Female, Male, Other" | | ||||||
|
|
||||||
| JSON Schema output: | ||||||
|
|
||||||
| ```json | ||||||
| { | ||||||
| "description": "TBD", | ||||||
| "properties": { | ||||||
| "Gender": { | ||||||
| "description": "TBD", | ||||||
| "title": "Gender", | ||||||
| "enum": ["Female", "Male", "Other"] | ||||||
| } | ||||||
| } | ||||||
| } | ||||||
| ``` | ||||||
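As a small illustration of how a `Valid Values` cell relates to the `enum` above, the sketch below splits the cell on commas and checks a value against the result. Both helper names are hypothetical, and splitting on every comma assumes that no valid value contains a comma itself.

```python
def enum_from_cell(cell: str) -> list[str]:
    """Split a Valid Values cell such as 'Female, Male, Other' into enum values."""
    return [value.strip() for value in cell.split(",") if value.strip()]


def is_allowed(value: str, allowed: list[str]) -> bool:
    """An enum accepts only values that equal one of the listed entries exactly."""
    return value in allowed


gender_enum = enum_from_cell("Female, Male, Other")
```

Because `enum` comparison is exact, `is_allowed("male", gender_enum)` is false even though `"Male"` is listed.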
|
|
||||||
| ### Required | ||||||
|
|
||||||
| Whether a value must be set for this attribute. This field is boolean, i.e. valid values are `TRUE` and `FALSE`. All attributes that are required will appear in the required list in the JSON Schema. See [required](https://json-schema.org/understanding-json-schema/reference/object#required). | ||||||
|
|
||||||
| Note: Leaving this empty is the equivalent of `False`. | ||||||
|
|
||||||
| Data Model: | ||||||
|
|
||||||
| | Attribute | DependsOn | Required | | ||||||
| |---|---|---| | ||||||
| | Patient | "Gender, Age" | | | ||||||
| | Gender | | True | | ||||||
| | Age | | False | | ||||||
|
|
||||||
| JSON Schema output: | ||||||
|
|
||||||
| ```json | ||||||
| { | ||||||
| "description": "TBD", | ||||||
| "properties": { | ||||||
| "Gender": { | ||||||
| "description": "TBD", | ||||||
| "title": "Gender" | ||||||
| }, | ||||||
| "Age": { | ||||||
| "description": "TBD", | ||||||
| "title": "Age" | ||||||
| } | ||||||
| }, | ||||||
| "required": ["Gender"] | ||||||
| } | ||||||
| ``` | ||||||
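The `required` list can be thought of as a filter over the `Required` column. The sketch below shows one way to express that; `parse_required` is a hypothetical helper, and treating the cell case-insensitively is an assumption, not documented behavior of the converter.

```python
def parse_required(cell: str) -> bool:
    """Interpret a Required cell; an empty cell counts as False."""
    # Assumption: 'TRUE', 'True' and 'true' are all accepted.
    return cell.strip().lower() == "true"


rows = [
    {"Attribute": "Gender", "Required": "True"},
    {"Attribute": "Age", "Required": "False"},
    {"Attribute": "Name", "Required": ""},
]

# Only attributes whose Required cell is true end up in the list.
required = [row["Attribute"] for row in rows if parse_required(row["Required"])]
```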
|
|
||||||
| ### columnType | ||||||
|
|
||||||
| The data type of this attribute. See [type](https://json-schema.org/understanding-json-schema/reference/type). | ||||||
|
|
||||||
| Must be one of: | ||||||
|
|
||||||
| - `string` | ||||||
| - `number` | ||||||
| - `integer` | ||||||
| - `boolean` | ||||||
| - `string_list` | ||||||
| - `integer_list` | ||||||
| - `boolean_list` | ||||||
|
|
||||||
| Data Model: | ||||||
|
|
||||||
| | Attribute | DependsOn | columnType | | ||||||
| |---|---|---| | ||||||
| | Patient | "Gender, Hobbies" | | | ||||||
| | Gender | | string | | ||||||
| | Hobbies | | string_list | | ||||||
|
|
||||||
| JSON Schema output: | ||||||
|
|
||||||
| ```json | ||||||
| { | ||||||
| "description": "TBD", | ||||||
| "properties": { | ||||||
| "Gender": { | ||||||
| "description": "TBD", | ||||||
| "title": "Gender", | ||||||
| "type": "string" | ||||||
| }, | ||||||
| "Hobbies": { | ||||||
| "description": "TBD", | ||||||
| "title": "Hobbies", | ||||||
| "type": "array", | ||||||
| "items": { | ||||||
| "type": "string" | ||||||
| } | ||||||
| } | ||||||
| } | ||||||
| } | ||||||
| ``` | ||||||
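The mapping from `columnType` to JSON Schema keywords shown above can be sketched as a small function. This is an illustration only: `type_fragment` is a hypothetical name, and the output for `integer_list` and `boolean_list` is assumed to follow the same shape the tutorial shows for `string_list`.

```python
ALLOWED_TYPES = {
    "string", "number", "integer", "boolean",
    "string_list", "integer_list", "boolean_list",
}


def type_fragment(column_type: str) -> dict:
    """Map a columnType value to JSON Schema type keywords (hypothetical helper)."""
    if column_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported columnType: {column_type}")
    if column_type.endswith("_list"):
        # List types become arrays whose items carry the base type.
        base = column_type[: -len("_list")]
        return {"type": "array", "items": {"type": base}}
    return {"type": column_type}
```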
|
|
||||||
| ### Format | ||||||
|
|
||||||
| The format of this attribute. See [format](https://json-schema.org/understanding-json-schema/reference/type#format). The type of this attribute must be `string` or `string_list`. The value of this column will appear as the `format` of this attribute in the JSON Schema. Must be one of: | ||||||
|
|
||||||
| - `date-time` | ||||||
| - `email` | ||||||
| - `hostname` | ||||||
| - `ipv4` | ||||||
| - `ipv6` | ||||||
| - `uri` | ||||||
| - `uri-reference` | ||||||
| - `uri-template` | ||||||
| - `json-pointer` | ||||||
| - `date` | ||||||
| - `time` | ||||||
| - `regex` | ||||||
| - `relative-json-pointer` | ||||||
|
|
||||||
| Data Model: | ||||||
|
|
||||||
| | Attribute | DependsOn | columnType | Format | | ||||||
| |---|---|---|---| | ||||||
| | Patient | "Gender, Birth Date" | | | | ||||||
| | Gender | | string | | | ||||||
| | Birth Date | | string | date | | ||||||
|
|
||||||
| JSON Schema output: | ||||||
|
|
||||||
| ```json | ||||||
| { | ||||||
| "description": "TBD", | ||||||
| "properties": { | ||||||
| "Gender": { | ||||||
| "description": "TBD", | ||||||
| "title": "Gender", | ||||||
| "type": "string" | ||||||
| }, | ||||||
| "Birth Date": { | ||||||
| "description": "TBD", | ||||||
| "title": "Birth Date", | ||||||
| "type": "string", | ||||||
| "format": "date" | ||||||
| } | ||||||
| } | ||||||
| } | ||||||
| ``` | ||||||
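To give a feel for what the `date` format expects, here is a rough Python check for the `YYYY-MM-DD` shape. It is only a sketch: `looks_like_date` is a hypothetical helper, and whether a given JSON Schema validator actually enforces `format` depends on how that validator is configured.

```python
import re
from datetime import date


def looks_like_date(value: str) -> bool:
    """Check a string against the YYYY-MM-DD shape used by the `date` format."""
    if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value):
        return False
    try:
        date.fromisoformat(value)  # rejects impossible dates such as month 13
    except ValueError:
        return False
    return True
```

For example, `looks_like_date("1990-04-23")` is true, while `"23/04/1990"` and `"1990-13-01"` are both rejected.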
|
|
||||||
| ### Pattern | ||||||
|
|
||||||
| The regex pattern this attribute must match. The type of this attribute must be `string` or `string_list`. See [pattern](https://json-schema.org/understanding-json-schema/reference/regular_expressions#regular-expressions). The value of this column will appear as the `pattern` of this attribute in the JSON Schema. Must be a legal regex pattern as determined by the Python `re` module. | ||||||
|
|
||||||
| Data Model: | ||||||
|
|
||||||
| | Attribute | DependsOn | columnType | Pattern | | ||||||
| |---|---|---|---| | ||||||
| | Patient | "Gender, ID" | | | | ||||||
| | Gender | | string | | | ||||||
| | ID | | string | [a-f] | | ||||||
|
|
||||||
| JSON Schema output: | ||||||
|
|
||||||
| ```json | ||||||
| { | ||||||
| "description": "TBD", | ||||||
| "properties": { | ||||||
| "Gender": { | ||||||
| "description": "TBD", | ||||||
| "title": "Gender", | ||||||
| "type": "string" | ||||||
| }, | ||||||
| "ID": { | ||||||
| "description": "TBD", | ||||||
| "title": "ID", | ||||||
| "type": "string", | ||||||
| "pattern": "[a-f]" | ||||||
| } | ||||||
| } | ||||||
| } | ||||||
| ``` | ||||||
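Two details of `pattern` are easy to miss, and the sketch below illustrates both with the standard `re` module. First, a pattern can be checked for legality by compiling it. Second, JSON Schema patterns are not implicitly anchored, so `[a-f]` accepts any string that contains at least one of those letters; add `^` and `$` to constrain the whole value. The helper names are hypothetical.

```python
import re


def is_legal_pattern(pattern: str) -> bool:
    """Return True if Python's `re` module can compile the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


def matches(pattern: str, value: str) -> bool:
    """Mimic unanchored JSON Schema matching by using search, not fullmatch."""
    return re.search(pattern, value) is not None
```

With these helpers, `matches("[a-f]", "xyz-a")` is true because the value contains an `a`, whereas `matches("^[a-f]+$", "abcz")` is false.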
|
|
||||||
| ### Minimum/Maximum | ||||||
|
|
||||||
| The range of numeric values this attribute must fall within. The type of this attribute must be `integer`, `number`, or `integer_list`. See [range](https://json-schema.org/understanding-json-schema/reference/numeric#range). The values of these columns will appear as the `minimum` and `maximum` of this attribute in the JSON Schema. Both must be numeric values. | ||||||
|
|
||||||
| Data Model: | ||||||
|
|
||||||
| | Attribute | DependsOn | columnType | Minimum | Maximum | | ||||||
| |---|---|---|---|---| | ||||||
| | Patient | "Age, Weight, Health Score" | | | | | ||||||
| | Age | | integer | 0 | 120 | | ||||||
| | Weight | | number | 0.0 | | | ||||||
| | Health Score | | number | 0.0 | 1.0 | | ||||||
|
|
||||||
| JSON Schema output: | ||||||
|
|
||||||
| ```json | ||||||
| { | ||||||
| "description": "TBD", | ||||||
| "properties": { | ||||||
| "Age": { | ||||||
| "description": "TBD", | ||||||
| "title": "Age", | ||||||
| "type": "integer", | ||||||
| "minimum": 0, | ||||||
| "maximum": 120 | ||||||
| }, | ||||||
| "Weight": { | ||||||
| "description": "TBD", | ||||||
| "title": "Weight", | ||||||
| "type": "number", | ||||||
| "minimum": 0.0 | ||||||
| }, | ||||||
| "Health Score": { | ||||||
| "description": "TBD", | ||||||
| "title": "Health Score", | ||||||
| "type": "number", | ||||||
| "minimum": 0.0, | ||||||
| "maximum": 1.0 | ||||||
| } | ||||||
| } | ||||||
| } | ||||||
| ``` | ||||||
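In JSON Schema, `minimum` and `maximum` are inclusive bounds, and either one may be omitted (as with `Weight` above). A minimal Python sketch of that check follows; `in_range` is a hypothetical helper, not part of any library.

```python
from typing import Optional


def in_range(value: float, minimum: Optional[float] = None,
             maximum: Optional[float] = None) -> bool:
    """Inclusive range check mirroring JSON Schema `minimum` and `maximum`."""
    if minimum is not None and value < minimum:
        return False
    if maximum is not None and value > maximum:
        return False
    return True
```

So an `Age` of 120 passes with bounds 0 and 120, while 121 does not, and a `Weight` with only a minimum has no upper limit.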
|
|
||||||
| ### Validation Rules (deprecated) | ||||||
|
|
||||||
| This is a remnant from Schematic. It is still used (for now) to translate certain validation rules to other JSON Schema keywords. | ||||||
|
|
||||||
| If you are starting a new data model, DO NOT use this column. | ||||||
|
|
||||||
| If you have an existing data model using any of the following validation rules, follow these instructions to update it: | ||||||
|
|
||||||
| - `list`: Make sure you are using one of the list-types in the `columnType` column. | ||||||
| - `regex`: `regex <module> <pattern>`, move the `<pattern>` to the `Pattern` column. | ||||||
| - `inRange`: `inRange <minimum> <maximum>`, move the `<minimum>` and/or the `<maximum>` to the `Minimum` and `Maximum` columns respectively. | ||||||
| - `date`: Use the `Format` column with value `date` | ||||||
| - `url`: Use the `Format` column with value `uri` | ||||||
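For models with many legacy rules, the migration steps above could be scripted. The sketch below is one possible approach using plain string handling; `migrate_rule` is a hypothetical helper, and the rule strings in the usage note are example inputs written in the forms listed above.

```python
def migrate_rule(rule: str) -> dict:
    """Translate one legacy validation rule into the newer columns (sketch)."""
    parts = rule.strip().split(maxsplit=2)
    if not parts:
        return {}
    name = parts[0]
    if name == "regex" and len(parts) == 3:
        # `regex <module> <pattern>`: only the pattern is carried over.
        return {"Pattern": parts[2]}
    if name == "inRange":
        updates = {}
        if len(parts) > 1:
            updates["Minimum"] = parts[1]
        if len(parts) > 2:
            updates["Maximum"] = parts[2]
        return updates
    if name == "date":
        return {"Format": "date"}
    if name == "url":
        return {"Format": "uri"}
    # `list` has no direct column value: choose a list type in columnType by hand.
    return {}
```

For instance, `migrate_rule("inRange 0 120")` returns the `Minimum` and `Maximum` cell values, and `migrate_rule("url")` returns `{"Format": "uri"}`.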
|
|
||||||
| ## Conditional dependencies | ||||||
|
|
||||||
| The `DependsOn` and `Valid Values` columns can be used together to flexibly define conditional logic for determining the relevant attributes for a data type. | ||||||
|
|
||||||
| In this example we have the `Patient` data type. A `Patient` can be diagnosed as healthy or with cancer. For patients with cancer we also want to collect information about their cancer type and any cancers in their family history. | ||||||
|
|
||||||
| Data Model: | ||||||
|
|
||||||
| | Attribute | DependsOn | Valid Values | Required | columnType | | ||||||
| |---|---|---|---|---| | ||||||
| | Patient | "Diagnosis" | | | | | ||||||
| | Diagnosis | | "Healthy, Cancer" | True | string | | ||||||
| | Cancer | "Cancer Type, Family History" | | | | | ||||||
| | Cancer Type | | "Breast, Lung, Skin" | True | string | | ||||||
| | Family History | | "Breast, Lung, Skin" | True | string_list | | ||||||
|
|
||||||
| To demonstrate this, see the above example with the `Patient` and `Cancer` data types: | ||||||
|
|
||||||
| - `Diagnosis` is an attribute of `Patient`. | ||||||
| - `Diagnosis` has `Valid Values` of `Healthy` and `Cancer`. | ||||||
| - `Cancer` is also a data type. | ||||||
| - `Cancer Type` and `Family History` are attributes of `Cancer` and are both required. | ||||||
|
|
||||||
| As a result of the above data model, in the JSON Schema: | ||||||
|
|
||||||
| - `Cancer Type` and `Family History` become properties of `Patient`. | ||||||
| - For a given `Patient`, if `Diagnosis` == `Cancer` then `Cancer Type` and `Family History` become required for that `Patient`. | ||||||
| - The conditional logic is contained in the `allOf` array. | ||||||
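The effect of the conditional rule on a single record can be sketched in plain Python. This does not reproduce the generated schema or how a validator evaluates `allOf`; `missing_fields` is a hypothetical helper that hard-codes the same rule for the `Patient` example.

```python
def missing_fields(patient: dict) -> list[str]:
    """List required attributes that a Patient record is missing (sketch)."""
    required = ["Diagnosis"]
    # The conditional rule: a Cancer diagnosis makes the Cancer attributes required.
    if patient.get("Diagnosis") == "Cancer":
        required += ["Cancer Type", "Family History"]
    return [name for name in required if name not in patient]
```

A healthy patient needs only `Diagnosis`, whereas a record with `Diagnosis` set to `Cancer` is incomplete until both `Cancer Type` and `Family History` are present.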
|
|
||||||
| ```json | ||||||
| { | ||||||
| "description": "TBD", | ||||||
| "properties": { | ||||||
| "Diagnosis": { | ||||||
| "description": "TBD", | ||||||
| "enum": ["Cancer", "Healthy"], | ||||||
| "title": "Diagnosis", | ||||||
| "type": "string" | ||||||
| }, | ||||||
| "Cancer Type": { | ||||||
| "description": "TBD", | ||||||
| "enum": ["Breast","Lung","Skin"], | ||||||
| "title": "Cancer Type", | ||||||
| "type": "string" | ||||||
| }, | ||||||
| "Family History": { | ||||||
| "description": "TBD", | ||||||
| "title": "Family History", | ||||||
| "type": "array", | ||||||
| "items": { | ||||||
| "type": "string", | ||||||
| "enum": ["Breast","Lung","Skin"] | ||||||
Copilot AI commented on Dec 11, 2025:
Typo in the required array: "FamilyHistory" should be "Family History" (with a space) to match the property name defined at line 361.
| "required": ["Cancer Type", "FamilyHistory"] | |
| "required": ["Cancer Type", "Family History"] |