Ref/redcap #92

Merged
merged 11 commits into from
Mar 19, 2025

djarecka
Member

@djarecka djarecka commented Mar 3, 2025

  • refactoring redcap2reproschema
  • update to work with hbn example (previously we just used some random redcap)
  • update and refactor some smaller test from test_field_property.py (moved some tests from test_redcap2rs) and test_process_csv

NOTES:
There are still some things that are not solved:

  • what to do with BIOPORTAL:RXNORM in choices (perhaps I can treat it like other fields that we do not support: just add it to the notes..?)

  • choices can have …| 998, 99.8 | 0999, 99.9 | 1000, 100 , and the value 0999 is interpreted as a string by the converter. I thought for a moment that perhaps this is a typo, but I guess this might be done to distinguish it from the value for “I don’t know”, which is 999, so I keep it as a string, but I'm not sure if this is fine

  • currently test_rs2redcap_redcap2rs.py doesn't work, since we used to have a preamble for the activity, but I'm not sure how we want to deal with it based on the example from hbn. Should we treat the first descriptive item as the preamble to the activity?
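
A minimal sketch of the leading-zero behaviour described in the second note, assuming the choices column is a `|`-separated string of `value, label` pairs. The names `coerce_value` and `parse_choices` are hypothetical and are not the converter's actual functions; the output shape follows the `name`/`value` choice dicts used elsewhere in this PR.

```python
def coerce_value(raw: str):
    """Return an int only when the text round-trips exactly; otherwise keep the string."""
    text = raw.strip()
    try:
        number = int(text)
    except ValueError:
        return text
    # "0999" parses to 999, but str(999) != "0999", so it stays a string
    return number if str(number) == text else text


def parse_choices(choices: str):
    """Split 'value, label | value, label' into choice dicts (illustrative only)."""
    parsed = []
    for chunk in choices.split("|"):
        if not chunk.strip():
            continue
        # Split on the first comma only, so labels may contain commas
        value, _, label = chunk.partition(",")
        parsed.append({"name": {"en": label.strip()}, "value": coerce_value(value)})
    return parsed
```

With this rule, `0999` and `999` remain distinct values after conversion, which is the property the note above is worried about.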

@yibeichan
Contributor

thank you! some (not complete) comments:

what to do with BIOPORTAL:RXNORM in choices (perhaps I can treat it like other fields that we do not support: just add it to the notes..?)

this is some ontology mapping, if it's in the choice set, we keep it as choice? we can set choice label and value the same?

choices can have …| 998, 99.8 | 0999, 99.9 | 1000, 100 , and the value 0999 is interpreted as a string by the converter. I thought for a moment that perhaps this is a typo, but I guess this might be done to distinguish it from the value for “I don’t know”, which is 999, so I keep it as a string, but I'm not sure if this is fine

I think you're right, it should be different from 999, so string should be okay, but shouldn't 99.9 be the value not 0999? or if it's the logic that the first half is value, then we keep it as value.

currently test_rs2redcap_redcap2rs.py doesn't work, since we used to have a preamble for the activity, but I'm not sure how we want to deal with it based on the example from hbn. Should we treat the first descriptive item as the preamble to the activity?

so preamble is the only reason that causes this problem? hmmm, we can decide later

@djarecka
Member Author

djarecka commented Mar 6, 2025

@yibeichan - I forgot to change the way we deal with choices from BIOPORTAL after our discussion yesterday, but I just realized that the main reason the code was complaining was that the input type was text instead of radio or checkbox. I will change the code so that we accept choices for text, but you would have problems when using it with the ui...

You can test the code and the output

@yibeichan
Contributor

for BIOPORTAL we keep it there as choices, if we need both value and label for choices, we use BIOPORTAL for both of them
thank you! I'll test it soon

@yibeichan
Contributor

@djarecka i tested it, looks reasonable. and yes, we will probably have trouble using the UI for this one. but I guess the purpose of our UI is to serve our own reproschema files rather than those converted from REDCap?
@satra there are complicated logics/setups (like the ontology BIOPORTAL and others) in this HBCD redcap. for this type of conversion, our goal should be to keep as much information as possible rather than making it work through our UI, right?

@satra
Contributor

satra commented Mar 7, 2025

roundtrip is one goal. but we do want reuse, so we need to understand either why our ui doesn't support it or what part of the schema doesn't support something. a specific set of examples would be helpful to review.

@yibeichan
Contributor

0006-simplify-value-and-input-type-mapping-add-truefalse.patch
@djarecka i made some changes based on ABCD files (yes, ABCD not HBCD, but they are both REDCap), you can use the patch I attached here

  • Added support for datetime_mdy, datetime_dmy, and time validation types through a more flexible way
  • Added support for the truefalse input type, mapping it to radio buttons with "True" and "False" options

@yibeichan
Contributor

@satra here is an activity example, look at the isVis part
"isVis": "event - name == 'v06_arm_1'" we really have no idea what this event - name is. the first time it appears in the csv file is [event-name] = 'v04_arm_1'. i guess we will get complaints from the UI about event - name

{
    "id": "child_survey_introduction_schema",
    "category": "reproschema:Activity",
    "prefLabel": {
        "en": "child_survey_introduction"
    },
    "schemaVersion": "1.0.0",
    "ui": {
        "order": [
            "items/child_introduction_v4",
            "items/child_introduction_v5",
            "items/child_introduction_v6"
        ],
        "addProperties": [
            {
                "isAbout": "items/child_introduction_v4",
                "isVis": "event - name == 'v04_arm_1'",
                "valueRequired": false,
                "variableName": "child_introduction_v4"
            },
            {
                "isAbout": "items/child_introduction_v5",
                "isVis": "event - name == 'v05_arm_1'",
                "valueRequired": false,
                "variableName": "child_introduction_v5"
            },
            {
                "isAbout": "items/child_introduction_v6",
                "isVis": "event - name == 'v06_arm_1'",
                "valueRequired": false,
                "variableName": "child_introduction_v6"
            }
        ],
        "shuffle": false
    },
    "version": "v1.0",
    "@context": "https://raw.githubusercontent.com/ReproNim/reproschema/main/releases/1.0.0/reproschema"
}
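
The `event - name` in the `isVis` values above appears to come from the hyphen in `[event-name]` being read as a minus sign. A minimal sketch of one way to avoid that, assuming the target is a JavaScript-like expression: bracketed names are rewritten with underscores before any operator handling. The function name and the `event_name` spelling are hypothetical, operator coverage is partial (REDCap's `<>` is left untouched), and event-qualified references such as `[event][variable]` are deliberately rejected rather than guessed at.

```python
import re

# Quoted literals are split out so that their contents are never rewritten
_QUOTED = re.compile(r"('[^']*'|\"[^\"]*\")")


def normalize_branching_logic(expr: str) -> str:
    """Rewrite simple REDCap branching logic into a JavaScript-like expression."""
    if "][" in expr:
        raise ValueError("event-qualified references are not handled in this sketch")
    parts = _QUOTED.split(expr)
    for i, part in enumerate(parts):
        if i % 2 == 1:
            continue  # odd indices hold the captured quoted literals
        # [event-name] -> event_name, so the hyphen cannot be read as subtraction
        part = re.sub(
            r"\[([^\]]+)\]",
            lambda m: m.group(1).strip().replace("-", "_"),
            part,
        )
        # A lone "=" becomes "=="; "<=", ">=", "!=" and "==" are left alone
        part = re.sub(r"(?<![<>=!])=(?!=)", "==", part)
        part = re.sub(r"\bAND\b", "&&", part, flags=re.IGNORECASE)
        part = re.sub(r"\bOR\b", "||", part, flags=re.IGNORECASE)
        parts[i] = part
    return "".join(parts)
```

Even with the name kept intact, the UI would still need a value for `event_name` at run time, which is the open question in this thread.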

@yibeichan
Contributor

@djarecka btw i added get_value_type in redcap_mappings.py but just realized that maybe we should add it to convertutils.py, we can chat tmr morning

Contributor

@yibeichan yibeichan left a comment

i moved all content from my patch here

except:
    raise ValueError(f"Invalid input for HTML parsing: {input_string}")


Contributor

Suggested change
def get_value_type(validation_type):
    """
    Determine the XSD value type based on REDCap validation type

    Args:
        validation_type (str): Validation type from REDCap

    Returns:
        str: XSD value type for ReproSchema
    """
    if validation_type is None:
        return "xsd:string"

    # Handle date and time formats with pattern matching
    if validation_type.startswith("date_"):
        return "xsd:date"
    elif validation_type.startswith("datetime_"):
        return "xsd:dateTime"
    elif validation_type.startswith("time"):
        return "xsd:time"

    # For other types, use the mapping
    return VALUE_TYPE_MAP.get(validation_type, "xsd:string")
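
To illustrate how the prefix matching in the suggested function behaves, the sketch below repeats it in compact form next to a trimmed `VALUE_TYPE_MAP` (three entries picked for illustration, not the full mapping from the patch) and calls it on a few validation types. The sample inputs such as `date_ymd` are assumed REDCap validation names.

```python
# Trimmed mapping, for illustration only
VALUE_TYPE_MAP = {
    "text": "xsd:string",
    "number": "xsd:decimal",
    "integer": "xsd:integer",
}


def get_value_type(validation_type):
    """Map a REDCap validation type to an XSD value type (copied from the suggestion)."""
    if validation_type is None:
        return "xsd:string"
    # "datetime_mdy" does not start with "date_", so the order of checks is safe
    if validation_type.startswith("date_"):
        return "xsd:date"
    elif validation_type.startswith("datetime_"):
        return "xsd:dateTime"
    elif validation_type.startswith("time"):
        return "xsd:time"
    # Anything not in the mapping falls back to a plain string
    return VALUE_TYPE_MAP.get(validation_type, "xsd:string")


examples = ["date_ymd", "datetime_mdy", "time", "integer", "zipcode", None]
results = {str(name): get_value_type(name) for name in examples}
```

Because unknown validation types fall back to `xsd:string`, a typo in the data dictionary would be converted silently rather than reported.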

REDCAP_COLUMN_REQUIRED,
RESPONSE_COND,
VALUE_TYPE_MAP,
)
Contributor

Suggested change
)
get_value_type)

)


def process_input_value_types(input_type_rc, value_type_rc) -> (str, str):
Contributor

Suggested change
def process_input_value_types(input_type_rc, value_type_rc) -> (str, str):
    """
    Process input type and value type to determine the final input type and value type,
    that can be used by ReproSchema.

    Args:
        input_type_rc (str): Input type from redcap form
        value_type_rc (str): Value type from redcap form

    Returns:
        tuple: (input_type, value_type)
            input_type (str): Final input type for ReproSchema
            value_type (str): Final value type for ReproSchema
    """
    # If input type in redcap is set but not recognized, raise an error
    if input_type_rc not in INPUT_TYPE_MAP:
        raise ValueError(
            f"Input type '{input_type_rc}' from redcap is not currently supported, "
            f"supported types are: {', '.join(INPUT_TYPE_MAP.keys())}"
        )
    elif input_type_rc:
        input_type = INPUT_TYPE_MAP.get(input_type_rc)

    if value_type_rc:
        # Get value type using the new function
        value_type = get_value_type(value_type_rc)

        # Adjust input type based on validation
        if value_type_rc.startswith("date") or value_type_rc.startswith("datetime"):
            if input_type_rc == "text":
                input_type = "date"
        elif value_type_rc.startswith("time"):
            if input_type_rc == "text":
                input_type = "time"
        elif value_type_rc == "integer" and input_type_rc == "text":
            input_type = "number"
        elif value_type_rc in ["float", "number"] and input_type_rc == "text":
            input_type = "float"
        elif value_type_rc == "email" and input_type_rc == "text":
            input_type = "email"
        elif value_type_rc == "signature" and input_type_rc == "text":
            input_type = "sign"
    elif input_type_rc == "yesno":
        value_type = "xsd:boolean"
    elif input_type_rc == "truefalse":
        value_type = "xsd:boolean"
    elif input_type_rc in COMPUTE_LIST:
        value_type = "xsd:integer"
    else:  # if no validation type is set, default to string
        value_type = "xsd:string"

Member Author

response_options["choices"] = [
    {"name": {"en": "Yes"}, "value": 1},
    {"name": {"en": "No"}, "value": 0},
]
Contributor

Suggested change
]
]
elif input_type_rc == "truefalse":
    response_options["choices"] = [
        {"name": {"en": "True"}, "value": 1},
        {"name": {"en": "False"}, "value": 0},
    ]

INPUT_TYPE_MAP = {
    "calc": "number",
    "sql": "number",
    "yesno": "radio",
Contributor

Suggested change
    "yesno": "radio",
    "truefalse": "radio",

}

# Map certain field types directly to xsd types
VALUE_TYPE_MAP = {
Contributor

Suggested change
VALUE_TYPE_MAP = {
    # Basic types
    "text": "xsd:string",
    "email": "xsd:string",
    "phone": "xsd:string",
    "signature": "xsd:string",
    "zipcode": "xsd:string",
    "autocomplete": "xsd:string",
    # Numeric types
    "number": "xsd:decimal",
    "float": "xsd:decimal",
    "integer": "xsd:integer",
    # Date and time types will be handled by pattern matching in process_input_value_types
    # These entries are kept for backward compatibility
    "date_": "xsd:date",
    "time_": "xsd:time",
}


df = pd.read_csv(
csv_file, encoding="utf-8-sig"
Contributor

Suggested change
csv_file, encoding="utf-8-sig"
csv_file, encoding="utf-8-sig", low_memory=False

@satra
Contributor

satra commented Mar 7, 2025

see: https://www.ctsi.ufl.edu/files/2017/06/Repeating-Instruments-and-Events-1.pdf (events are part of a longitudinal setup)

@yibeichan
Contributor

see: https://www.ctsi.ufl.edu/files/2017/06/Repeating-Instruments-and-Events-1.pdf (events are part of a longitudinal setup)

thanks, makes sense. but in the csv we have examples in branching logic like

[event-name] = 'v01_arm_1' OR ([event-name] = 'v02_arm_1' AND [screening_arm_1][recruitment_period] = '2') OR [event-name] = 'v04_arm_1' OR [event-name] = 'v06_arm_1'

look here: we have names like v01_arm_1, v02_arm_1, v04_arm_1, v06_arm_1, screening_arm_1, and recruitment_period, and we don't have direct information about those variables either. if we want to make this work in the UI, we probably need to set up extra things, but we don't have the information for such variables. what should we do?
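
As a rough way to list what such an expression depends on, the sketch below pulls event names and variable names out of a branching-logic string with regular expressions. It assumes `[event-name]` refers to the current event and that a `[x][y]` pair means variable `y` as collected in event `x`; both readings are assumptions here, and `collect_references` is a hypothetical helper, not part of the converter.

```python
import re


def collect_references(expr: str):
    """Return (events, variables) named in a REDCap branching-logic string."""
    events, variables = set(), set()
    # [event][variable] pairs: a variable qualified by an event
    for event, variable in re.findall(r"\[([^\]]+)\]\[([^\]]+)\]", expr):
        events.add(event)
        variables.add(variable)
    remainder = re.sub(r"\[[^\]]+\]\[[^\]]+\]", "", expr)
    # [event-name] = 'x' compares the current event against a literal name
    for literal in re.findall(r"\[event-name\]\s*=\s*'([^']*)'", remainder):
        events.add(literal)
    # Any other bracketed name is treated as a plain variable reference
    for name in re.findall(r"\[([^\]]+)\]", remainder):
        if name != "event-name":
            variables.add(name)
    return events, variables


logic = (
    "[event-name] = 'v01_arm_1' OR ([event-name] = 'v02_arm_1' AND "
    "[screening_arm_1][recruitment_period] = '2') OR "
    "[event-name] = 'v04_arm_1' OR [event-name] = 'v06_arm_1'"
)
events, variables = collect_references(logic)
```

Comparing `variables` against the field names actually present in the csv would show which references the converted schema cannot resolve on its own.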

also, the ontology one, we currently set both the value and label as it, but obviously, it shouldn't be present as text.

"choices": [
            {
                "name": {
                    "en": "BIOPORTAL:ICD10"
                },
                "value": "BIOPORTAL:ICD10"
            }
{
    "id": "pex_bm_health_preg_i_illness_003_i_01",
    "category": "reproschema:Item",
    "additionalNotesObj": [
        {
            "column": "Custom Alignment",
            "source": "redcap",
            "value": "RH"
        }
    ],
    "preamble": {
        "en": "Illnesses During Pregnancy"
    },
    "prefLabel": {
        "en": "pex_bm_health_preg_i_illness_003_i_01"
    },
    "question": {
        "en": "Name of first illness (ICD-10)"
    },
    "responseOptions": {
        "choices": [
            {
                "name": {
                    "en": "BIOPORTAL:ICD10"
                },
                "value": "BIOPORTAL:ICD10"
            }
        ],
        "valueType": [
            "xsd:string"
        ]
    },
    "ui": {
        "inputType": "text"
    },
    "@context": "https://raw.githubusercontent.com/ReproNim/reproschema/main/releases/1.0.0/reproschema"
}
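
One way to keep the ontology reference apart from an ordinary choice during conversion, as a sketch only: detect a choices cell that consists solely of a `BIOPORTAL:<name>` token and record it as a reference instead of a label/value pair. The function name and the returned keys (`kind`, `source`, `ontology`, `raw`) are hypothetical and are not part of reproschema.

```python
import re

# Matches a cell that is nothing but an ontology token, e.g. "BIOPORTAL:ICD10"
ONTOLOGY_PATTERN = re.compile(r"^BIOPORTAL:(?P<ontology>[A-Za-z0-9_-]+)$")


def describe_choices_column(raw: str):
    """Classify a REDCap choices cell as an ontology reference or an enumerated list."""
    match = ONTOLOGY_PATTERN.match(raw.strip())
    if match:
        return {
            "kind": "ontology",
            "source": "BIOPORTAL",
            "ontology": match.group("ontology"),
        }
    # Everything else is left for the normal "value, label | ..." parsing
    return {"kind": "enumerated", "raw": raw}
```

Where the resulting reference should live in the item is a schema question that this sketch does not answer.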

@djarecka
Member Author

djarecka commented Mar 7, 2025

roundtrip is one goal. but we do want reuse, so we need to understand either why our ui doesn't support it or what part of the schema doesn't support something. a specific set of examples would be helpful to review.

@satra - I put several questions on Slack and here, but I'm happy to review later all the things that I believe would not work properly in the UI, or where the UI would not represent everything that I believe the author had in mind (although sometimes it's hard to figure out). I also tried to put warnings in the code.

@satra
Contributor

satra commented Mar 7, 2025

is BIOPORTAL:ICD10 really the value of that question, or the range of that question? it seems like an ontology from which values can be drawn.

@yibeichan
Contributor

i attached a screenshot here to help better understand what it does

[image: screenshot attached to the original comment]

@satra
Contributor

satra commented Mar 7, 2025

i think those are categorical values (i.e. anything from those ontologies would work), but it looks like that's for validation and the text input is free form.

@yibeichan
Contributor

@satra okay, i kind of understand what you mean, but if it's for "validation", then it means our UI should be capable of validating input in real time, right? Would we need such ontology to be part of our UI?

what's the action item for this current PR?

@djarecka
Member Author

djarecka commented Mar 7, 2025

i think those are categorical values (i.e. anything from those ontologies would work), but it looks like that's for validation and the test input is free form.

yes, I mentioned earlier that these choices are weird, since they have text as input_type. Are you saying that this is not a mistake, and they indeed expect free form text, and just run some validation later?
Where in the reproschema model could the value BIOPORTAL:ICD10 be added?

@satra
Contributor

satra commented Mar 7, 2025

Are you saying that this is not a mistake, and they indeed expect free form text, and just later run some validation?

i suspect so.

Where in reproschema model the value BIOPORTAL:ICD10 could be added?

reproschema currently does not have a clear option for range the way linkml does. we would need to introduce something like that. i was able to say this because of domain knowledge. this would be very hard to determine from the redcap schema without using some semantic+agentic approach. (redcap allows many custom schemas).

here is an example where we use a set of choices: https://github.com/ReproNim/reproschema-library/blob/25162084b702505cd7b6729ab569bd17c144213b/activities/demographics_and_background_information_v1/items/country_of_birth#L17

btw, what does redcap do when that schema is imported? that would be a way to check as well whether it is a wellformed schema.

@djarecka
Member Author

djarecka commented Mar 7, 2025

here is an example where we use a set of choices: https://github.com/ReproNim/reproschema-library/blob/25162084b702505cd7b6729ab569bd17c144213b/activities/demographics_and_background_information_v1/items/country_of_birth#L17
but you use the select as the input type, not text.

btw, what does redcap do when that schema is imported? that would be a way to check as well whether it is a wellformed schema.
not sure, maybe @yibeichan can check

djarecka added 2 commits March 8, 2025 20:02
…hanges; updates to redcap2rs: adding activity preamble, adding description for calculate; fixing tests
@djarecka
Member Author

@satra @yibeichan - here is a list of issues that I believe would prevent running the schema with reproschema-ui (without testing, just based on my understanding of the schema and the ui). The list is a collection of issues that I mentioned in various places and was able to find now, so it might not be complete

Choices / response options

  • no support for items from an ontology, e.g., BIOPORTAL:RXNORM. Satra pointed me to the example that uses a json file with values, but this is not the same. In addition, in our case the input type is text, not select as in that example.
  • choices can have …| 998, 99.8 | 0999, 99.9 | 1000, 100 , and the value 0999 is interpreted as a string by the converter. I thought for a moment that perhaps this is a typo, but it might be done to distinguish it from the value for “I don’t know”, which is 999, so I keep it as a string, but I'm not sure if this is fine
  • we don't support minValue/maxValue for dates
  • for sliders we take minValue/maxValue between 0 and 10 (assuming that this is the default?), but I didn't see any sliders in the example

Matrix Info and Preamble

  • We don't have a notion of matrix groups. I saw some notion of this in the old code, but we don't have it in the schema and, as far as I can tell, we never did, so all items are treated separately.
  • I try to use "Matrix Group Name" to group the elements that should have the same preamble. However, this is not consistent across all of those forms.
  • In general, the way I figure out which items have a preamble is pretty ugly, and based on guessing using HBCD as an example. I look at the matrix group; if there is none, I apply the preamble to all items below until there is another preamble, unless the inputType is "descriptive", etc.
  • If all the questions have the same preamble, I decided to move it to the Activity preamble.
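
A simplified sketch of the carry-down and hoisting rules in the list above, under stated assumptions: each row is a plain dict with hypothetical keys `name`, `input_type`, and an optional `preamble` on the row where a preamble starts. Matrix groups are ignored here, and descriptive rows simply receive no preamble, which is only one reading of the "unless the inputType is descriptive" rule.

```python
def assign_preambles(rows):
    """Carry each preamble down to the rows below it; hoist it if all items share it.

    Returns (activity_preamble, items).
    """
    current = None
    items = []
    for row in rows:
        if row.get("preamble"):
            current = row["preamble"]  # a new preamble replaces the previous one
        descriptive = row.get("input_type") == "descriptive"
        items.append({
            "name": row["name"],
            "descriptive": descriptive,
            "preamble": None if descriptive else current,
        })
    # Only non-descriptive items decide whether the preamble is shared
    shared = {item["preamble"] for item in items if not item["descriptive"]}
    if len(shared) == 1 and None not in shared:
        for item in items:
            item["preamble"] = None
        return shared.pop(), items
    return None, items
```

The hoisting step changes where the text is stored, so a round trip back to REDCap would need to push the activity preamble down to the items again.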

Input types and value types

  • there are different types for date, time and datetime, but in reproschema we have inputType only for date (and year), so if something has "time", I'm converting this to text
  • in the hbcd csv they are using number both for integers and floats, so I'm changing it to xsd:decimal
  • in the old converter there were multiple value types that were converted just to string, including email, phone, zipcode, signature and autocomplete. In hbcd I only had autocomplete, which I also change to string
  • we treat calc and sql from redcap the same way. I'm not sure if this is ok, and there is also no easy way to go back from reproschema to redcap
  • I am also not really sure if calc is parsed properly for use in the ui

Ignoring columns

These columns I don't map in any meaningful way to reproschema, so just adding to the additionalNotes:

ADDITIONAL_NOTES_LIST = [
    "Field Note",
    "Question Number (surveys only)",
    "Matrix Group Name",
    "Matrix Ranking?",
    "Text Validation Type OR Show Slider Number",
    "Text Validation Min",
    "Text Validation Max",
    "Identifier?",
    "Custom Alignment",
    "Question Number (surveys only)",
    "Field Annotation",
]
  • Field Annotation is there since I only look for @READONLY, @HIDDEN, or @CALCTEXT
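
A minimal sketch of how such columns could be carried into notes, assuming each csv row is available as a plain dict of strings keyed by column header. The output shape follows the `additionalNotesObj` entry shown earlier in this thread (`column`, `source`, `value`); the function name is hypothetical and the list is shortened for illustration, keeping the repeated entry so the de-duplication is visible.

```python
# Shortened list; "Question Number (surveys only)" is repeated as in the list above
ADDITIONAL_NOTES_LIST = [
    "Field Note",
    "Question Number (surveys only)",
    "Custom Alignment",
    "Question Number (surveys only)",
    "Field Annotation",
]


def build_additional_notes(row: dict):
    """Collect non-empty unmapped columns as note entries (illustrative only)."""
    notes = []
    # dict.fromkeys drops repeated column names while preserving order
    for column in dict.fromkeys(ADDITIONAL_NOTES_LIST):
        value = row.get(column)
        if value is None or str(value).strip() == "":
            continue
        notes.append({"column": column, "source": "redcap", "value": str(value).strip()})
    return notes
```

Keeping the original column name in each note is what makes it possible to write the value back to the same REDCap column later.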

@satra
Contributor

satra commented Mar 12, 2025

let's file away components that cannot be easily mapped, but please keep the following principles in play.

  • the ui should not dictate the schema. the schema should be implementable through the ui. even right now there are components of reproschema that are not implemented via the ui (scheduling for example)
  • for redcap, do make sure that the redcap schema is actually valid (this would be a conversation with the abcd/hbcd folks). many people use redcap for two reasons (collecting data and storing data collected elsewhere into redcap data elements). as a result sometimes forms are not actually usable. hence it would be helpful to ask them the questions where the response is a function of an ontology from BIOPORTAL or elsewhere. for this specific csv, i think it would be really helpful to load into redcap to see how its interpreting some of the fields. that will tell you if redcap is actually taking care of some of the issues you are finding that don't seem to have a clear consistency.
  • the reason i pointed to the json file was there is a place in reproschema that says a set of options can be described in a json file. this means that any choice from an ontology could potentially be implemented using that option.
  • input types should be defined in the schema (using standard forms like XSD) and then implementation can be improved. ui should not be dictating the schema (nor should redcap directly). the relevant information that should be represented is what should drive the schema's two components: content and ui. for example, matrix is a ui component i.e. present the elements as a table in a matrix rather than as one item at a time. for example the PHQ-9 pdf could be presented as a matrix. so that kind of info falls in the ui category.

perhaps to help wrap this up, file anything that cannot be addressed as new issues with a proper description of what's not available (schema, ui, etc.), and relax evaluation of matching to specific columns, etc. for round trips.

@djarecka
Member Author

djarecka commented Mar 13, 2025

let's file away components that cannot be easily mapped, but please keep the following principles in play.

  • the ui should not dictate the schema. the schema should be implementable through the ui. even right now there are components of reproschema that are not implemented via the ui (scheduling for example)

But how can I know what the allowed values for inputType or valueType in reproschema are? I can only look at the website/Readme or reproschema-ui. This is how I created, for example, the enum from this pr, and no one told me otherwise.

Do you want me to just add enums with all possible values that we might ever want? If that's the case I can do it. We can just keep it as free text, but I don't think that is very helpful.

  • for redcap, do make sure that the redcap schema is actually valid (this would be a conversation with the abcd/hbcd folks). many people use redcap for two reasons (collecting data and storing data collected elsewhere into redcap data elements). as a result sometimes forms are not actually usable. hence it would be helpful to ask them the questions where the response is a function of an ontology from BIOPORTAL or elsewhere. for this specific csv, i think it would be really helpful to load into redcap to see how its interpreting some of the fields. that will tell you if redcap is actually taking care of some of the issues you are finding that don't seem to have a clear consistency.

@yibeichan - any chances you checked if the file is a valid redcap schema? If not, can you do it?

  • the reason i pointed to the json file was there is a place in reproschema that says a set of options can be described in a json file. this means that any choice from an ontology could potentially be implemented using that option.

I understand. But do you want to do it now, or just record the ontology value that was used?

  • input types should be defined in the schema (using standard forms like XSD) and then implementation can be improved. ui should not be dictating the schema (nor should redcap directly). the relevant information that should be represented is what should drive the schema's two components: content and ui. for example, matrix is a ui component i.e. present the elements as a table in a matrix rather than as one item at a time. for example the PHQ-9 pdf could be presented as a matrix. so that kind of info falls in the ui category.

But we don't have matrix in our schema! Do you want to add it?

perhaps to help wrap this up, file anything that cannot be addressed as new issues with a proper description of what's not available (schema, ui, etc.), and relax evaluation of matching to specific columns, etc. for round trips.

In general, it seems to me that I was not aware of what the real task is here.
If you want me to add elements to reproschema that are missing to represent some information from redcap, e.g., matrix, I can do it. I can also create enums for inputType and valueType that have the elements that I believe would be useful. This could address some problems, and won't take a lot of time (implementing in the ui would be another story).

In terms of redcap, which shouldn't dictate the form: I thought the task was mostly to record all redcap information to later track changes, so I was not questioning the files. I also never used redcap, and don't have an account.

@yibeichan
Contributor

hi, thanks to both of you for the discussion!!

Our principle for conversion is to retain as much information as possible, prioritizing important elements such as questions and response choices. The UI serves the schema, not the other way around. If, during conversion, we encounter elements not available in the UI, we should first assess whether their schema logic is valid. If it is, we add them to the schema first, and the UI can be updated later.

For example, time should logically be included in reproschema, even though it's not yet available in the UI. We can add it to the schema first and update the UI later.

Regarding matrix_info, it is mainly for visualization. It does not add additional information to reproschema unless we can establish a consistent link between the preamble and the matrix, which we haven’t found in HBCD yet. So, we can set matrix_info aside for now.

As for checking whether a file is a valid REDCap schema, a CSV file may contain different forms serving different purposes—some for data collection, others as data elements. I can first perform the redcap2reproschema conversion and then determine the purpose of the generated activities.

For now, we can merge this PR and distribute the remaining issues, questions, and discussion points into separate issues.

@yibeichan yibeichan merged commit 5c1f898 into ReproNim:main Mar 19, 2025
9 checks passed
@djarecka
Member Author

ok, so I will not be making any changes now, I will create issues based on the discussion, and please also add to the issue if I miss something, since there are many things that are lost in the discussion and it would be good if we do update the schema/code at some point
