This tool scans the analysis objects held by specific projects in ENA for duplication.
The tool excludes suppressed and cancelled data and prints out three files:
- Summary of the duplicates
- List of accessions to be cancelled (if any exist)
- List of accessions to be suppressed (if any exist)
Note: Out of each set of duplicates, the tool sends the most recent submission to be suppressed/cancelled. The exception is when the duplicates include both private and public analyses: the tool then cancels the private analysis regardless of the submission date.
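The selection rule in the note above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's actual code: the Analysis record, its field names, and the pick_for_removal function are hypothetical, the accessions in the usage note are made up, and the sketch assumes that a public analysis is suppressed while a private one is cancelled. For groups of more than two duplicates it picks only the single most recent submission, which may differ from what the tool does.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class Analysis:
    """Hypothetical record for one analysis object in a duplicate group."""
    accession: str
    status: str      # assumed to be either "public" or "private"
    submitted: date  # submission date


def pick_for_removal(duplicates: List[Analysis]) -> Tuple[List[str], List[str]]:
    """Return (to_cancel, to_suppress) accession lists for one duplicate group."""
    if len(duplicates) < 2:
        # A single analysis (or none) is not a duplicate group.
        return [], []

    statuses = {a.status for a in duplicates}
    if statuses == {"public", "private"}:
        # Mixed group: cancel the private analyses regardless of submission date.
        return [a.accession for a in duplicates if a.status == "private"], []

    # Same status throughout: act on the most recent submission only.
    newest = max(duplicates, key=lambda a: a.submitted)
    if newest.status == "private":
        return [newest.accession], []  # assumption: private data is cancelled
    return [], [newest.accession]      # assumption: public data is suppressed
```

For example, calling pick_for_removal on two public analyses submitted in 2020 and 2021 places the 2021 accession in the second (suppress) list and leaves the cancel list empty.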
This script needs to be run through the VPN
Python 3
The ERA database is an Oracle database. To query the database, this script uses the cx_Oracle Python module, which requires a little setup.
- Install the module using:
  pip install cx_Oracle
- The Oracle Instant Client is a requirement of this module. The ‘Basic Light’ package is sufficient for our needs.
- Once the Instant Client is downloaded, set the location of this library using the $ORACLE_CLIENT_LIB environment variable before using this script.
Setting up the Environment
Note: no root access is needed.
- Unzip the instantclient
- Find the path for the unzipped instantclient and save it
- Edit the .bashrc file to set the Oracle environment
- Add the following lines to the end of the .bashrc file:
  export ORACLE_HOME=/path/to/oracle/instantclient
  export LD_LIBRARY_PATH=$ORACLE_HOME:$LD_LIBRARY_PATH
  export PATH=$ORACLE_HOME:$PATH
  export ORACLE_CLIENT_LIB=$ORACLE_HOME
- Source the .bashrc file:
  source $HOME/.bashrc
For more details, see: https://cx-oracle.readthedocs.io/en/latest/user_guide/installation.html
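Before running the script, it can help to check that ORACLE_CLIENT_LIB points at a real directory. The sketch below is a hypothetical helper and not part of the tool: the function names are illustrative, and it assumes cx_Oracle 8 or later, where cx_Oracle.init_oracle_client(lib_dir=...) is available.

```python
import os
from typing import Mapping, Optional


def resolve_client_lib(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Instant Client directory named by ORACLE_CLIENT_LIB.

    Raises RuntimeError if the variable is unset or is not a directory.
    """
    env = os.environ if env is None else env
    lib_dir = env.get("ORACLE_CLIENT_LIB", "")
    if not lib_dir:
        raise RuntimeError("ORACLE_CLIENT_LIB is not set")
    if not os.path.isdir(lib_dir):
        raise RuntimeError(f"ORACLE_CLIENT_LIB is not a directory: {lib_dir}")
    return lib_dir


def init_oracle_client() -> None:
    """Point cx_Oracle at the Instant Client libraries (assumes cx_Oracle >= 8)."""
    # Imported lazily so the path check above works even without the module.
    import cx_Oracle

    cx_Oracle.init_oracle_client(lib_dir=resolve_client_lib())
```

Calling resolve_client_lib() on its own gives a quick sanity check of the environment; init_oracle_client() would be called once, before any database connection is opened.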
- Provide your ERAPRO details and credentials in the config file (config.yaml)
- Provide the project accessions that you want to scan in the config file (config.yaml)
- Run the script using the following flags:
  --config/-c: path to the config file
  --output/-o: path to the output folder
- Example:
  python3 analysis_duplication_scan.py -c path/to/config.yaml -o path/to/output_folder
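The two flags listed above could be parsed along these lines. This is a minimal sketch using argparse from the standard library, not the script's actual argument handling: the function names are illustrative, and making both flags required is an assumption.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the --config/-c and --output/-o flags."""
    parser = argparse.ArgumentParser(
        description="Scan ENA analysis objects for duplicates (illustrative sketch)"
    )
    # Assumption: both flags are mandatory.
    parser.add_argument("-c", "--config", required=True,
                        help="path to the config file (config.yaml)")
    parser.add_argument("-o", "--output", required=True,
                        help="path to the output folder")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv, or sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)
```

With this sketch, parse_args(["-c", "path/to/config.yaml", "-o", "path/to/output_folder"]) returns a namespace whose config and output attributes hold the two paths.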