The I-ETL algorithm creates interoperable databases for the BETTER project. It relies on MongoDB to store the data and reads a `.env` file for its configuration. The code is written in Python and encapsulated in a Docker container.
- Docker Desktop is installed on the host machine.
- All input files are in the same folder on your machine, and their names match exactly the names specified in the metadata.
- Specifically, for VCF data:
  - You can group them in a sub-folder (or not).
  - If you grouped them in a folder, e.g., named `vcf-folder`, please provide `vcf-folder/*.vcf` for the data files in the `.env` file (see the example after this list).
  - If you did not group them, please provide `*.vcf` for the data files in the `.env` file.
  - The VCF files should be named with the patient ID (exactly and only the patient ID used in the other data files).
- Specifically, for tabular data files:
  - In your data files, variables are columns, patients are rows, and patients have identifiers (which will be further anonymized by I-ETL).
  - The column name for the patient ID (respectively the sample ID) is the same across all the files used to build the database.
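For illustration, the `DATA_FILES` parameter of the `.env` file (described in Section 3) could reference the VCF files as follows; the clinical filename is only an example:

```
# VCF files grouped in a sub-folder named vcf-folder
DATA_FILES=Baseline_Clinical_Table.xlsx,vcf-folder/*.vcf

# VCF files not grouped (placed directly in the data folder)
DATA_FILES=Baseline_Clinical_Table.xlsx,*.vcf
```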
- Get the I-ETL Docker image:
  - Either download the TAR image available in the repository (recommended):
    - Go to the deployment artifacts page: https://git.rwth-aachen.de/padme-development/external/better/data-cataloging/etl/-/artifacts
    - Click on the "folder" icon of the latest valid build (the most recent one with a green ☑)
    - Download the TAR archive named `the-ietl-image.tar` (be sure to download a `.tar`, not a `.zip`)
  - Or build it from the repository (not recommended, see section "For developers")
- Download the `compose.yaml` file, available in the repository (https://git.rwth-aachen.de/padme-development/external/better/data-cataloging/etl/-/blob/main/compose.yaml?ref_type=heads)
- Download the settings file `.env.example`, available in the repository (https://git.rwth-aachen.de/padme-development/external/better/data-cataloging/etl/-/blob/main/.env.example?ref_type=heads), and rename it to `.env`
- Download your metadata file from the Better Drive: https://drive.google.com/drive/u/1/folders/1J-3C2g06WbC1gUE_3KaDp3_v1uKHXxFV
  - `ES-HSJD-variables.xlsx` for SJD hospital
  - `IT-BUZZI-variables.xlsx` for BUZZI hospital in use-case 1
  - `RS-IMGGE-variables.xlsx` for IMGGE hospital
  - `UC2-variables.xlsx` for LAFE and HMC hospitals
  - `UC3-variables-02-04-2025.xlsx` for TERRASSA, UKK and BUZZI hospitals
- Create a folder, e.g., named `better`, with:
  - The I-ETL Docker (TAR) image
  - The `.env` file template
  - The `compose.yaml` file
  - The metadata file
- In that folder, load the TAR image within Docker: `docker load < the-ietl-image.tar`
- In that folder, fill the `.env` file with your own settings (see Section 3)
- In that folder, launch I-ETL by running the following commands (also summarized in the sketch after this list):
  - `export CONTEXT_MODE=DEV`
  - `export ETL_ENV_FILE_NAME=.env`
  - `export ABSOLUTE_PATH_ENV_FILE=X`, where `X` is the absolute path to your `.env` file
  - `docker compose --env-file ${ABSOLUTE_PATH_ENV_FILE} up -d` (`-d` stands for `--detach`, meaning that I-ETL will run as a background process)
- To check whether I-ETL has finished, you can run `docker ps`: if `the-etl` does not show in the list, this means that it is done.
- To check the logs of the ETL, you have two options:
  - If you have specified the parameter `SERVER_FOLDER_LOG_ETL` in your `.env`, you can look at the log files produced in the folder you specified in that parameter;
  - Otherwise, use `docker logs the-etl`.
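Putting these steps together, a typical run looks like the following sketch (the `/home/better` path is only an example):

```sh
# From the folder containing the TAR image, compose.yaml, the metadata and the .env file
cd /home/better
docker load < the-ietl-image.tar

export CONTEXT_MODE=DEV
export ETL_ENV_FILE_NAME=.env
export ABSOLUTE_PATH_ENV_FILE=/home/better/.env   # absolute path to your .env file

docker compose --env-file ${ABSOLUTE_PATH_ENV_FILE} up -d

docker ps             # I-ETL is done once 'the-etl' no longer appears in the list
docker logs the-etl   # or check the folder given in SERVER_FOLDER_LOG_ETL, if set
```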
The `.env` file specifies the parameters that I-ETL needs to run properly.
This includes parameters about the database, the files to use, the language, etc.
The provided `.env` file is a template: you have to fill each parameter with your own value.
| Parameter name | Description | Values | Example |
|---|---|---|---|
| `SERVER_FOLDER_METADATA` | The absolute path to the folder containing the metadata file. | A folder path | `/home/better/data` |
| `METADATA` | The metadata filename. | A filename | `ES-HSJD-variables.xlsx` |
| `SERVER_FOLDER_DATA` | The absolute path to the folder containing the datasets. | A folder path | `/home/better/data` |
| `DATA_FILES` | The list of comma-separated filenames. | Filename(s) | `Baseline_Clinical_Table.xlsx,Genomic_Table.xlsx,vcf-folder/*.vcf` |
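As an illustration, these parameters could be filled as follows in the `.env` file (the paths and filenames are examples only):

```
SERVER_FOLDER_METADATA=/home/better/data
METADATA=ES-HSJD-variables.xlsx
SERVER_FOLDER_DATA=/home/better/data
DATA_FILES=Baseline_Clinical_Table.xlsx,Genomic_Table.xlsx,vcf-folder/*.vcf
```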
| Parameter name | Description | Values | Example |
|---|---|---|---|
| `HOSPITAL_NAME` | The hospital name. | `it_buzzi_uc1`, `rs_imgge`, `es_hsjd`, `it_buzzi_uc3`, `es_terrassa`, `de_ukk`, `es_lafe`, `il_hmc` | Any value in the list |
| `DB_NAME` | The database name. | `better_database`, or any string without special characters except `_` (underscore). Please use `better_database` for any database created for the Better project. | `better_database` |
| `DB_DROP` | Whether to drop the database. WARNING: if `True`, this deletes the database before creating a new one; this action is NOT reversible! | `False`, `True` | |
| `SERVER_FOLDER_MONGODB` | The absolute path to the folder in which MongoDB will store its databases. | A folder path | `/home/mongodb-data` |
| `DOCKER_NETWORK_OF_STATION` | The name of the Docker network used by the PADME trains in the station. | Any Docker network name | `PADME_STATION_POLIMI86pht-net` |
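As an illustration, with the example values of the table above, these parameters could look like this in the `.env` file:

```
HOSPITAL_NAME=es_hsjd
DB_NAME=better_database
DB_DROP=False
SERVER_FOLDER_MONGODB=/home/mongodb-data
DOCKER_NETWORK_OF_STATION=PADME_STATION_POLIMI86pht-net
```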
If you already ran the ETL without specifying the Docker network name of your station, proceed as follows (a consolidated sketch is given after this list):
- Find the container ID of your currently running MongoDB container: `docker ps | grep the-mongo` should list one entry with the container ID of the ETL MongoDB container.
- Find the Docker network the MongoDB container is currently on: list the Docker networks with `docker network ls` and find the right one (usually, it contains `pht-net` in its name).
- Disconnect the MongoDB container from the currently used Docker network using `docker network disconnect <currently-used-docker-network> <container-id-of-your-mongo-db>`
- Find the Docker network name used by your PADME trains/station: list the Docker networks with `docker network ls` and find the right one (usually, it contains `padme` or `station` in its name).
- Connect your MongoDB container to your Docker network using `docker network connect <your-Docker-network-name> <container-id-of-your-mongo-db>`
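For instance, the sequence could look like the following sketch (the container ID and network names are placeholders; use the ones you found with `docker ps` and `docker network ls`):

```sh
docker ps | grep the-mongo          # note the container ID, e.g. a1b2c3d4e5f6
docker network ls                   # identify the current network and the station network
docker network disconnect old-pht-net a1b2c3d4e5f6
docker network connect PADME_STATION_POLIMI86pht-net a1b2c3d4e5f6
```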
| Parameter name | Description | Values | Example |
|---|---|---|---|
| `SERVER_FOLDER_LOG_ETL` | The absolute path to the folder in which I-ETL will write its log files. | A folder path | `/home/better/logs` |
| `USE_LOCALE` | The locale to be used for reading numerics and dates. | `en_GB`, `en_US`, `es_ES`, `it_IT`, `sr_RS` | `en_GB` |
| `COLUMNS_TO_REMOVE` | The list of columns that are too sensitive, thus NOT included in the database. | `[]` (empty list), or a list of strings being the column names | `[patient_address, date_of_birth]` |
| `RECORD_CARRIER_PATIENTS` | For BUZZI hospital in use-case 1 only: whether to record carrier and diseased patients, or only diseased ones. | `False`, `True` | |
| `PATIENT_ID` | The name of the column in the data containing patient IDs. | Any column name | `Patient ID` |
| `SAMPLE_ID` | The name of the column in the data containing sample IDs. | (empty) if you do not have sample data, else a column name | `sample_id` |
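As an illustration, using the example values of the table above, these parameters could look like this in the `.env` file:

```
SERVER_FOLDER_LOG_ETL=/home/better/logs
USE_LOCALE=en_GB
COLUMNS_TO_REMOVE=[patient_address, date_of_birth]
# only relevant for BUZZI hospital in use-case 1
RECORD_CARRIER_PATIENTS=False
PATIENT_ID=Patient ID
SAMPLE_ID=sample_id
```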
| Parameter name | Description | Values | Example |
|---|---|---|---|
| `CATALOGUE_ONLY` | Whether to only compute the metadata and aggregates and send them to the catalogue. | `True`, `False` | |
| `TOKEN` | The user token to push catalogue data to the web application. | A very long string | `12fgjo5-gjjos3` |
| `USECASE` | The use-case to which the database belongs. | `paediatric`, `retinal`, `asd` | `paediatric` |
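As an illustration (the token shown is the dummy example from the table; replace it with your own token):

```
CATALOGUE_ONLY=False
TOKEN=12fgjo5-gjjos3
USECASE=paediatric
```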
The token is mandatory to submit the catalogue data to the API that will ingest it and present it on the Web interface. Instructions for generating a token are available in Deliverable 4.5.
Given a user input (explained below), the class `DataRetriever` takes care of:
- Generating the MongoDB query to fetch the data or metadata from the database
- Loading the retrieved data (respectively metadata) into a Pandas DataFrame
The user input is:
- the information to connect to the database (MongoDB URL and database name)
- the features (also called "variables") the user is interested in when collecting data
- the post-process methods to "flatten"/"normalize" the data values if needed
The package is available on pip, meaning that it should be added to the requirements of the train (as in `pip install data-retriever`). To be sure that you have all the required packages (a setup sketch follows this list):
- Download the `requirements.txt` file in this repository
- Create a virtual Python environment (using Python venv or Anaconda)
- Activate that environment
- Install all the required packages with `pip install -r requirements.txt`
- Install the `data-retriever` package with `pip install data-retriever` (latest version is 1.7)
- Then, adapt the main file `query.py` with your own settings (database name, etc.)
- Finally, run the `query.py` file with `python3 query.py`
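For example, with a Python venv (paths are illustrative):

```sh
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install data-retriever      # latest version is 1.7
# adapt query.py to your own settings (database name, etc.), then:
python3 query.py
```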
A main example is available in the query.py file (https://git.rwth-aachen.de/padme-development/external/better/data-cataloging/etl/-/blob/main/src/query.py).
Lines 9 to 11, respectively 21 to 23, contain the user input (illustrated in the sketch after this list):
- The variable `FEATURE_CODES` is a dictionary (map) associating a variable name with the ontology term that has been associated to it in the metadata (https://drive.google.com/drive/u/1/folders/1J-3C2g06WbC1gUE_3KaDp3_v1uKHXxFV)
- The variable `FEATURE_FILTERS` is a dictionary (map) associating a variable name with (a) its code under the key `code`, and (b) the filter to apply under the key `filter`. The value of the `filter` should be a MongoDB `match`. Note that the variables mentioned in the filters may differ from the ones selected.
- The variable `FEATURES_VALUE_PROCESS` is a dictionary (map) associating a variable with a MongoDB operator to process/flatten the fetched values. The values in this dictionary can be either `get_label` to get the human-readable name of a category, or any other MongoDB operator. It should be used for the variables leading to non-atomic values (especially dictionaries). If no process is needed (because the value is atomic) or you do not know which MongoDB operator to choose, use `None`.
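The following sketch shows the expected shape of these three dictionaries; the variable names, ontology codes and filter are made up for illustration and must be taken from your metadata:

```python
# Illustrative values only: replace the names and codes with the ones from your metadata.
FEATURE_CODES = {
    "hypotonia": "hp:0001252",         # hypothetical ontology term from the metadata
    "vcf_path": "some:vcf_path_code",  # hypothetical code
}

FEATURE_FILTERS = {
    "sex": {
        "code": "some:sex_code",   # hypothetical code of the filtered variable
        "filter": {"$eq": "F"},    # a MongoDB match expression (assumed shape)
    },
}

FEATURES_VALUE_PROCESS = {
    "hypotonia": "get_label",  # get the human-readable name of the category
    "vcf_path": None,          # atomic value: no post-processing needed
}
```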
The next lines create a new `DataRetriever` with the information for the MongoDB connection, the user variables, and the query type, which is `data`. The method `run()` generates the MongoDB query to fetch the data from the specified database.
Then, it loads the fetched data into a DataFrame. This DataFrame is accessible in the variable `the_dataframe` (see `dataRetriever.the_dataframe`). Finally, the DataFrame is exported to a CSV file.
For instance, in the `query.py` file, the first query fetches, for all female patients, their associated VCF filepath and whether they have hypotonia. The second query retrieves the same information for the IMGGE hospital.
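As a rough sketch of this flow (the constructor arguments below are assumptions, not the actual `data-retriever` signature; refer to `query.py` for the exact arguments):

```python
from data_retriever import DataRetriever  # assumed import path

# Hypothetical constructor arguments; check query.py for the real signature.
data_retriever = DataRetriever(
    mongodb_url="mongodb://localhost:27017",  # illustrative connection URL
    db_name="better_database",
    features=FEATURE_CODES,
    filters=FEATURE_FILTERS,
    value_process=FEATURES_VALUE_PROCESS,
    query_type="data",
)
data_retriever.run()                                   # builds and runs the MongoDB query
data_retriever.the_dataframe.to_csv("data_query.csv")  # fetched data as a Pandas DataFrame
```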
A main example is available in the query.py file (https://git.rwth-aachen.de/padme-development/external/better/data-cataloging/etl/-/blob/main/src/query.py).
Line 33 contains the user input:
- The variable `FEATURE_CODES` is an array (list) of the variable codes we want to collect metadata for (to find the variable codes, look at the metadata: https://drive.google.com/drive/u/1/folders/1J-3C2g06WbC1gUE_3KaDp3_v1uKHXxFV)
The next lines create a new `DataRetriever` with the information for the MongoDB connection, the user variables, and the query type, which is `metadata`. The method `run()` generates the MongoDB query to fetch the metadata from the specified database.
Then, it loads the fetched metadata into a DataFrame. This DataFrame is accessible in the variable `the_dataframe` (see `dataRetriever.the_dataframe`). Finally, the DataFrame is exported to a CSV file.
For instance, in the `query.py` file, the last query fetches all metadata information about the variables `hypotonia`, `vcf_path` and `gene`. The retrieved information concerns the variable name, code, data type, categories, visibility, etc. (all information specified in the metadata; to find the variable codes, look at the metadata: https://drive.google.com/drive/u/1/folders/1J-3C2g06WbC1gUE_3KaDp3_v1uKHXxFV)
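A rough sketch of a metadata query follows (again, the constructor arguments and codes are assumptions; see `query.py` for the real ones):

```python
from data_retriever import DataRetriever  # assumed import path

# Hypothetical variable codes; take the real ones from the metadata file.
FEATURE_CODES = ["some:hypotonia_code", "some:vcf_path_code", "some:gene_code"]

metadata_retriever = DataRetriever(
    mongodb_url="mongodb://localhost:27017",  # illustrative connection URL
    db_name="better_database",
    features=FEATURE_CODES,
    query_type="metadata",
)
metadata_retriever.run()                                       # builds and runs the MongoDB query
metadata_retriever.the_dataframe.to_csv("metadata_query.csv")  # variable name, code, data type, etc.
```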
To be used when working with the I-ETL repository
- Install Docker Desktop and open it
- From the root of the project, run `docker build . --tag ietl`
  - If an error saying `ERROR: Cannot connect to the Docker daemon at XXX. Is the docker daemon running?` occurs, Docker Desktop has not started.
  - If an error saying `error getting credentials` occurs while building, go to your Docker config file (probably `~/.docker/config.json`), remove the line containing `credsStore`, save the file, and build the image again.
To be used when deploying I-ETL within a center
- Locally, build the Docker image: see the above section
- Locally, create a TAR image of I-ETL (only with the ETL, not with the mongo): `docker save ietl > the-ietl-image.tar`
- Send that TAR image to the host machine, e.g., using `scp the-ietl-image.tar "[email protected]:/somewhere/in/host/machine"`
- Send the `.env` file to the host machine, in the same folder as the TAR image, e.g., using `scp .env "[email protected]:/somewhere/in/host/machine"`
- Send the compose file to the host machine, in the same folder as the TAR image, e.g., using `scp compose.yaml "[email protected]:/somewhere/in/host/machine"`
- In the host machine, move to `/somewhere/in/host/machine/` using `cd /somewhere/in/host/machine`
- In the host machine, load the TAR image within the Docker of the host machine: `docker load < the-ietl-image.tar`
- In the host machine, follow any of the scenarios above, i.e., tune the `.env` file and run I-ETL