The I-ETL algorithm creates interoperable databases for the BETTER project. It relies on MongoDB to store the data and reads a `.env` file for its configuration. The code is written in Python and encapsulated in a Docker container.
- Docker Desktop is installed on the host machine.
- All input files are in the same folder on your machine, and their names match exactly the names specified in the metadata.
- Specifically, for VCF data:
  - You can group them in a sub-folder (or not).
  - If you grouped them in a folder, e.g., named `vcf-folder`, please provide `vcf-folder/*.vcf` for the data files in the `.env` file (see the example after this list).
  - If you did not group them, please provide `*.vcf` for the data files in the `.env` file.
  - The VCF files should be named with the patient ID (exactly and only the patient ID used in the other data files).
- Specifically, for tabular data files:
  - In your data files, variables are columns, patients are rows, and patients have identifiers (which will be further anonymized by I-ETL).
  - The column name for the patient ID (respectively the sample ID) is the same across all the files used to build the database.
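For illustration, the `DATA_FILES` parameter of the `.env` file (described in Section 3) could reference the VCF files as follows; the clinical filename is only an example:

```
# VCF files grouped in a sub-folder named vcf-folder
DATA_FILES=Baseline_Clinical_Table.xlsx,vcf-folder/*.vcf

# VCF files not grouped (placed directly in the data folder)
DATA_FILES=Baseline_Clinical_Table.xlsx,*.vcf
```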
- Get the I-ETL Docker image:
  - Either download the TAR image available in the repository (recommended):
    - Go to the deployment artifacts page: https://git.rwth-aachen.de/padme-development/external/better/data-cataloging/etl/-/artifacts
    - Click on the "folder" icon of the latest valid build (the most recent one with a green ☑)
    - Download the TAR archive named `the-ietl-image.tar` (be sure to download a `.tar`, not a `.zip`)
  - Or build it from the repository (not recommended, see section "For developers")
- Download the `compose.yaml` file, available in the repository (https://git.rwth-aachen.de/padme-development/external/better/data-cataloging/etl/-/blob/main/compose.yaml?ref_type=heads)
- Download the settings file `.env.example`, available in the repository (https://git.rwth-aachen.de/padme-development/external/better/data-cataloging/etl/-/blob/main/.env.example?ref_type=heads), and rename it to `.env`
- Download your metadata file from the Better Drive: https://drive.google.com/drive/u/1/folders/1J-3C2g06WbC1gUE_3KaDp3_v1uKHXxFV
  - `ES-HSJD-variables.xlsx` for SJD hospital
  - `IT-BUZZI-variables.xlsx` for BUZZI hospital in use-case 1
  - `RS-IMGGE-variables.xlsx` for IMGGE hospital
  - `UC2-variables.xlsx` for LAFE and HMC hospitals
  - `UC3-variables-02-04-2025.xlsx` for TERRASSA, UKK and BUZZI hospitals
- Create a folder, e.g., named `better`, with:
  - The I-ETL Docker (TAR) image
  - The `.env` file template
  - The `compose.yaml` file
  - The metadata file
- In that folder, load the TAR image within Docker: `docker load < the-ietl-image.tar`
- In that folder, fill the `.env` file with your own settings (see Section 3)
- In that folder, launch I-ETL by running the following commands (also summarized in the sketch after this list):
  - `export CONTEXT_MODE=DEV`
  - `export ETL_ENV_FILE_NAME=.env`
  - `export ABSOLUTE_PATH_ENV_FILE=X`, where `X` is the absolute path to your `.env` file
  - `docker compose --env-file ${ABSOLUTE_PATH_ENV_FILE} up -d` (`-d` stands for `--detach`, meaning that I-ETL will run as a background process)
- To check whether I-ETL has finished, you can run `docker ps`: if `the-etl` does not show in the list, this means that it is done.
- To check the logs of the ETL, you have two options:
  - If you have specified the parameter `SERVER_FOLDER_LOG_ETL` in your `.env`, you can look at the log files produced in the folder you specified in that parameter;
  - Otherwise, use `docker logs the-etl`.
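Putting these steps together, a typical run looks like the following sketch (the `/home/better` path is only an example):

```sh
# From the folder containing the TAR image, compose.yaml, the metadata and the .env file
cd /home/better
docker load < the-ietl-image.tar

export CONTEXT_MODE=DEV
export ETL_ENV_FILE_NAME=.env
export ABSOLUTE_PATH_ENV_FILE=/home/better/.env   # absolute path to your .env file

docker compose --env-file ${ABSOLUTE_PATH_ENV_FILE} up -d

docker ps             # I-ETL is done once 'the-etl' no longer appears in the list
docker logs the-etl   # or check the folder given in SERVER_FOLDER_LOG_ETL, if set
```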
The `.env` file specifies the parameters that I-ETL needs to run properly.
This includes parameters about the database, the files to use, the language, etc.
The provided `.env` file is a template: you have to fill each parameter with your own value.
| Parameter name | Description | Values | Example |
|---|---|---|---|
| `SERVER_FOLDER_METADATA` | The absolute path to the folder containing the metadata file. | A folder path | `/home/better/data` |
| `METADATA` | The metadata filename. | A filename | `ES-HSJD-variables.xlsx` |
| `SERVER_FOLDER_DATA` | The absolute path to the folder containing the datasets. | A folder path | `/home/better/data` |
| `DATA_FILES` | The list of comma-separated filenames. | Filename(s) | `Baseline_Clinical_Table.xlsx,Genomic_Table.xlsx,vcf-folder/*.vcf` |
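As an illustration, these parameters could be filled as follows in the `.env` file (the paths and filenames are examples only):

```
SERVER_FOLDER_METADATA=/home/better/data
METADATA=ES-HSJD-variables.xlsx
SERVER_FOLDER_DATA=/home/better/data
DATA_FILES=Baseline_Clinical_Table.xlsx,Genomic_Table.xlsx,vcf-folder/*.vcf
```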
| Parameter name | Description | Values | Example |
|---|---|---|---|
| `HOSPITAL_NAME` | The hospital name. | `it_buzzi_uc1`, `rs_imgge`, `es_hsjd`, `it_buzzi_uc3`, `es_terrassa`, `de_ukk`, `es_lafe`, `il_hmc` | Any value in the list |
| `DB_NAME` | The database name. | `better_database`, or any string without special characters except `_` (underscore). Please use `better_database` for any database created for the Better project. | `better_database` |
| `DB_DROP` | Whether to drop the database. WARNING: if `True`, this deletes the database before creating a new one; this action is NOT reversible! | `False`, `True` | |
| `SERVER_FOLDER_MONGODB` | The absolute path to the folder in which MongoDB will store its databases. | A folder path | `/home/mongodb-data` |
| `DOCKER_NETWORK_OF_STATION` | The name of the Docker network used by the PADME trains in the station. | Any Docker network name | `PADME_STATION_POLIMI86pht-net` |
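As an illustration, with the example values of the table above, these parameters could look like this in the `.env` file:

```
HOSPITAL_NAME=es_hsjd
DB_NAME=better_database
DB_DROP=False
SERVER_FOLDER_MONGODB=/home/mongodb-data
DOCKER_NETWORK_OF_STATION=PADME_STATION_POLIMI86pht-net
```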
If you already ran the ETL without specifying the Docker network name of your station, proceed as follows (a consolidated sketch is given after this list):
- Find the container ID of your currently running MongoDB container: `docker ps | grep the-mongo` should list one entry with the container ID of the ETL MongoDB container.
- Find the Docker network the MongoDB container is currently on: list the Docker networks with `docker network ls` and find the right one (usually, it contains `pht-net` in its name).
- Disconnect the MongoDB container from the currently used Docker network using `docker network disconnect <currently-used-docker-network> <container-id-of-your-mongo-db>`
- Find the Docker network name used by your PADME trains/station: list the Docker networks with `docker network ls` and find the right one (usually, it contains `padme` or `station` in its name).
- Connect your MongoDB container to your Docker network using `docker network connect <your-Docker-network-name> <container-id-of-your-mongo-db>`
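For instance, the sequence could look like the following sketch (the container ID and network names are placeholders; use the ones you found with `docker ps` and `docker network ls`):

```sh
docker ps | grep the-mongo          # note the container ID, e.g. a1b2c3d4e5f6
docker network ls                   # identify the current network and the station network
docker network disconnect old-pht-net a1b2c3d4e5f6
docker network connect PADME_STATION_POLIMI86pht-net a1b2c3d4e5f6
```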
| Parameter name | Description | Values | Example |
|---|---|---|---|
| `SERVER_FOLDER_LOG_ETL` | The absolute path to the folder in which I-ETL will write its log files. | A folder path | `/home/better/logs` |
| `USE_LOCALE` | The locale to be used for reading numerics and dates. | `en_GB`, `en_US`, `es_ES`, `it_IT`, `sr_RS` | `en_GB` |
| `COLUMNS_TO_REMOVE` | The list of columns that are too sensitive, thus NOT included in the database. | `[]` (empty list), or a list of strings being the column names | `[patient_address, date_of_birth]` |
| `RECORD_CARRIER_PATIENTS` | For BUZZI hospital in use-case 1 only: whether to record carrier and diseased patients, or only diseased ones. | `False`, `True` | |
| `PATIENT_ID` | The name of the column in the data containing patient IDs. | Any column name | `Patient ID` |
| `SAMPLE_ID` | The name of the column in the data containing sample IDs. | (empty) if you do not have sample data, else a column name | `sample_id` |
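As an illustration, using the example values of the table above, these parameters could look like this in the `.env` file:

```
SERVER_FOLDER_LOG_ETL=/home/better/logs
USE_LOCALE=en_GB
COLUMNS_TO_REMOVE=[patient_address, date_of_birth]
# only relevant for BUZZI hospital in use-case 1
RECORD_CARRIER_PATIENTS=False
PATIENT_ID=Patient ID
SAMPLE_ID=sample_id
```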
| Parameter name | Description | Values | Example |
|---|---|---|---|
| `CATALOGUE_ONLY` | Whether to only compute the metadata and aggregates and send them to the catalogue. | `True`, `False` | |
| `TOKEN` | The user token to push catalogue data to the web application. | A very long string | `12fgjo5-gjjos3` |
| `USECASE` | The use-case to which the database belongs. | `paediatric`, `retinal`, `asd` | `paediatric` |
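As an illustration (the token shown is the dummy example from the table; replace it with your own token):

```
CATALOGUE_ONLY=False
TOKEN=12fgjo5-gjjos3
USECASE=paediatric
```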
The token is mandatory to submit the catalogue data to the API that will ingest it and present it on the Web interface. Instructions for generating a token are available in Deliverable 4.5.
Given a user input (explained below), the class `DataRetriever` takes care of:
- Generating the MongoDB query to fetch the data or metadata from the database
- Loading the retrieved data (respectively metadata) into a Pandas DataFrame
The user input is:
- the information to connect to the database (MongoDB URL and database name)
- the features (also called "variables") the user is interested in when collecting data
- the post-process methods to "flatten"/"normalize" the data values if needed
The package is available on pip, meaning that it should be added to the requirements of the train (as in `pip install data-retriever`). To be sure that you have all the required packages (a setup sketch follows this list):
- Download the `requirements.txt` file in this repository
- Create a virtual Python environment (using Python venv or Anaconda)
- Activate that environment
- Install all the required packages with `pip install -r requirements.txt`
- Install the `data-retriever` package with `pip install data-retriever` (latest version is 1.7)
- Then, adapt the main file `query.py` with your own settings (database name, etc.)
- Finally, run the `query.py` file with `python3 query.py`
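For example, with a Python venv (paths are illustrative):

```sh
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install data-retriever      # latest version is 1.7
# adapt query.py to your own settings (database name, etc.), then:
python3 query.py
```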
A main example is available in the query.py file (https://git.rwth-aachen.de/padme-development/external/better/data-cataloging/etl/-/blob/main/src/query.py).
Lines 9 to 11, respectively 21 to 23, contain the user input (illustrated in the sketch after this list):
- The variable `FEATURE_CODES` is a dictionary (map) associating a variable name with the ontology term that has been associated to it in the metadata (https://drive.google.com/drive/u/1/folders/1J-3C2g06WbC1gUE_3KaDp3_v1uKHXxFV)
- The variable `FEATURE_FILTERS` is a dictionary (map) associating a variable name with (a) its code under the key `code`, and (b) the filter to apply under the key `filter`. The value of the `filter` should be a MongoDB `match`. Note that the variables mentioned in the filters may differ from the ones selected.
- The variable `FEATURES_VALUE_PROCESS` is a dictionary (map) associating a variable with a MongoDB operator to process/flatten the fetched values. The values in this dictionary can be either `get_label` to get the human-readable name of a category, or any other MongoDB operator. It should be used for the variables leading to non-atomic values (especially dictionaries). If no process is needed (because the value is atomic) or you do not know which MongoDB operator to choose, use `None`.
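The following sketch shows the expected shape of these three dictionaries; the variable names, ontology codes and filter are made up for illustration and must be taken from your metadata:

```python
# Illustrative values only: replace the names and codes with the ones from your metadata.
FEATURE_CODES = {
    "hypotonia": "hp:0001252",         # hypothetical ontology term from the metadata
    "vcf_path": "some:vcf_path_code",  # hypothetical code
}

FEATURE_FILTERS = {
    "sex": {
        "code": "some:sex_code",   # hypothetical code of the filtered variable
        "filter": {"$eq": "F"},    # a MongoDB match expression (assumed shape)
    },
}

FEATURES_VALUE_PROCESS = {
    "hypotonia": "get_label",  # get the human-readable name of the category
    "vcf_path": None,          # atomic value: no post-processing needed
}
```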
The next lines create a new `DataRetriever` with the information for the MongoDB connection, the user variables, and the query type, which is `data`. The method `run()` generates the MongoDB query to fetch the data from the specified database.
Then, it loads the fetched data into a DataFrame. This DataFrame is accessible in the variable `the_dataframe` (see `dataRetriever.the_dataframe`). Finally, the DataFrame is exported to a CSV file.
For instance, in the `query.py` file, the first query fetches, for all female patients, their associated VCF filepath and whether they have hypotonia. The second query retrieves the same information for the IMGGE hospital.
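As a rough sketch of this flow (the constructor arguments below are assumptions, not the actual `data-retriever` signature; refer to `query.py` for the exact arguments):

```python
from data_retriever import DataRetriever  # assumed import path

# Hypothetical constructor arguments; check query.py for the real signature.
data_retriever = DataRetriever(
    mongodb_url="mongodb://localhost:27017",  # illustrative connection URL
    db_name="better_database",
    features=FEATURE_CODES,
    filters=FEATURE_FILTERS,
    value_process=FEATURES_VALUE_PROCESS,
    query_type="data",
)
data_retriever.run()                                   # builds and runs the MongoDB query
data_retriever.the_dataframe.to_csv("data_query.csv")  # fetched data as a Pandas DataFrame
```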
A main example is available in the query.py file (https://git.rwth-aachen.de/padme-development/external/better/data-cataloging/etl/-/blob/main/src/query.py).
Line 33 contains the user input:
- The variable `FEATURE_CODES` is an array (list) of the variable codes we want to collect metadata for (to find the variable codes, look at the metadata: https://drive.google.com/drive/u/1/folders/1J-3C2g06WbC1gUE_3KaDp3_v1uKHXxFV)
The next lines create a new `DataRetriever` with the information for the MongoDB connection, the user variables, and the query type, which is `metadata`. The method `run()` generates the MongoDB query to fetch the metadata from the specified database.
Then, it loads the fetched metadata into a DataFrame. This DataFrame is accessible in the variable `the_dataframe` (see `dataRetriever.the_dataframe`). Finally, the DataFrame is exported to a CSV file.
For instance, in the `query.py` file, the last query fetches all metadata information about the variables `hypotonia`, `vcf_path` and `gene`. The retrieved information concerns the variable name, code, data type, categories, visibility, etc. (all information specified in the metadata; to find the variable codes, look at the metadata: https://drive.google.com/drive/u/1/folders/1J-3C2g06WbC1gUE_3KaDp3_v1uKHXxFV)
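A rough sketch of a metadata query follows (again, the constructor arguments and codes are assumptions; see `query.py` for the real ones):

```python
from data_retriever import DataRetriever  # assumed import path

# Hypothetical variable codes; take the real ones from the metadata file.
FEATURE_CODES = ["some:hypotonia_code", "some:vcf_path_code", "some:gene_code"]

metadata_retriever = DataRetriever(
    mongodb_url="mongodb://localhost:27017",  # illustrative connection URL
    db_name="better_database",
    features=FEATURE_CODES,
    query_type="metadata",
)
metadata_retriever.run()                                       # builds and runs the MongoDB query
metadata_retriever.the_dataframe.to_csv("metadata_query.csv")  # variable name, code, data type, etc.
```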
To be used when working with the I-ETL repository
- Install Docker Desktop and open it
- From the root of the project, run `docker build . --tag ietl`
  - If an error saying `ERROR: Cannot connect to the Docker daemon at XXX. Is the docker daemon running?` occurs, Docker Desktop has not started.
  - If an error saying `error getting credentials` occurs while building, go to your Docker config file (probably `~/.docker/config.json`), remove the line containing `credsStore`, save the file, and build the image again.
To be used when deploying I-ETL within a center
- Locally, build the Docker image: see the above section
- Locally, create a TAR image of I-ETL (only with the ETL, not with the mongo): `docker save ietl > the-ietl-image.tar`
- Send that TAR image to the host machine, e.g., using `scp the-ietl-image.tar "[email protected]:/somewhere/in/host/machine"`
- Send the `.env` file to the host machine, in the same folder as the TAR image, e.g., using `scp .env "[email protected]:/somewhere/in/host/machine"`
- Send the compose file to the host machine, in the same folder as the TAR image, e.g., using `scp compose.yaml "[email protected]:/somewhere/in/host/machine"`
- In the host machine, move to `/somewhere/in/host/machine/` using `cd /somewhere/in/host/machine`
- In the host machine, load the TAR image within the Docker of the host machine: `docker load < the-ietl-image.tar`
- In the host machine, follow any of the scenarios above, i.e., tune the `.env` file and run I-ETL