Warning: EnsemblLite is not ready for use! We will remove this notice when we are ready to post to PyPI, at which point it will be ready for trialling. In the meantime, you can check the project's progress towards being usable via the EnsemblLite roadmap.
🎬 Very early proof-of-concept demo and plan for a new style terminal user interface
demo-tui.mp4
NOTE: the command line name has changed since this early version. See text below for the new name.
Fork the repo and clone your fork to your local machine. In the terminal, create either a Python virtual environment or a new conda environment and activate it. In that environment, install flit:
$ pip install flit
Then do the flit version of a "developer install". (It basically creates a symlink to the repo's source directory.)
$ flit install -s --python `which python`
We suggest creating a conda environment or a Python virtual environment using Python 3.11. Then install directly into that environment from the GitHub repo:
$ python -m pip install "ensembl_lite @ git+https://github.com/cogent3/EnsemblLite.git@develop"
Then run for the first time using
$ elt tui
The first start takes a while because, behind the scenes, cogent3 is transpiling various functions into C and compiling them. Eventually, you get a very neat terminal interface you can click around in. To exit, make sure the "root" is selected on the left panel, then press ^+r.
The setup is (for now) controlled using a config file, defined in ini format. To get a starting template, use the exportrc subcommand.
Usage: elt exportrc [OPTIONS]

  exports sample config and species table to the nominated path

  setting an environment variable ENSEMBLDBRC with this path will force its
  contents to override the default ensembl_lite settings

Options:
  -o, --outpath PATH  path to directory to export all rc contents
  --help              Show this message and exit.
Below is a sample config file I've been using for development.
Using this config, it takes approximately 16 minutes to download (over a ~200MB/s WiFi connection) and ~45 minutes to install on my M2 MacBook Pro (note the install is incomplete). This step uses up to 10 CPU cores.
[remote path]
host=ftp.ensembl.org
path=pub
[local path]
staging_path=~/Desktop/Outbox/ensembl_download
install_path=~/Desktop/Outbox/ensembl_install
[release]
release=110
[Mouse Lemur]
db=core
[Macaque]
db=core
[Gibbon]
db=core
[Orangutan]
db=core
[Bonobo]
db=core
[Human]
db=core
[Chimp]
db=core
[Gorilla]
db=core
[compara]
align_names=10_primates.epo
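To make the layout of the config above more concrete, here is a minimal sketch that reads a trimmed copy of it with Python's standard-library configparser. This is not EnsemblLite's own parsing code. The `NON_SPECIES` set and the `summarise_config` helper are hypothetical names introduced for illustration, and treating `align_names` as a comma-separated list is an assumption.

```python
import configparser

# A trimmed copy of the sample config (two species only).
SAMPLE = """
[remote path]
host=ftp.ensembl.org
path=pub
[local path]
staging_path=~/Desktop/Outbox/ensembl_download
install_path=~/Desktop/Outbox/ensembl_install
[release]
release=110
[Human]
db=core
[Chimp]
db=core
[compara]
align_names=10_primates.epo
"""

# Assumption: every section not listed here names a species.
NON_SPECIES = {"remote path", "local path", "release", "compara"}


def summarise_config(text):
    """Return the release, species names and alignment names in a config."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    species = [name for name in parser.sections() if name not in NON_SPECIES]
    # Assumption: multiple alignment names would be comma-separated.
    raw_names = parser.get("compara", "align_names", fallback="")
    align_names = [name.strip() for name in raw_names.split(",") if name.strip()]
    return {
        "release": parser.get("release", "release"),
        "species": species,
        "align_names": align_names,
    }


summary = summarise_config(SAMPLE)
print(summary)
```

Note that configparser returns every value as a string, so the release number would need an explicit `int()` conversion if it is used numerically.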
Downloads the following for the species indicated in the config file:
- genome sequences in fasta format
- annotations in gff3 format
- gene homologies for individual genomes in tsv format
Alignments indicated in the config file will be downloaded in .maf format.
Downloads are written to a local directory, specified in the config file. Downloads are done in parallel (using threads).
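The thread-based pattern described above can be sketched with the standard-library `concurrent.futures` module. This is only an illustration of the general technique, not EnsemblLite's download code: `fetch_one` is a hypothetical stand-in that writes a placeholder file instead of contacting a server, and the file names are made up.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
import tempfile


def fetch_one(remote_name, dest_dir):
    """Stand-in for a real download: writes a placeholder file locally."""
    dest = Path(dest_dir) / remote_name
    dest.write_text(f"placeholder for {remote_name}\n")
    return dest


def fetch_all(remote_names, dest_dir, max_workers=4):
    """Fetch every file using a pool of threads and return the local paths."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_one, name, dest_dir) for name in remote_names]
        # result() re-raises any exception raised inside a worker thread
        return sorted(future.result() for future in as_completed(futures))


# Hypothetical file names, for illustration only.
names = ["genome.fa.gz", "annotation.gff3.gz", "homology.tsv.gz"]
with tempfile.TemporaryDirectory() as staging:
    downloaded = fetch_all(names, staging)
    print([path.name for path in downloaded])
```

Threads suit this kind of work because each task spends most of its time waiting on the network rather than using the CPU.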
"Installation" presently involves transforming downloaded files into local sqlite3 databases which are saved to the location specified in the config file.
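As a rough picture of what turning a downloaded file into a sqlite3 database can look like, the sketch below loads a few made-up gff3 records into an in-memory table. The table name, the column layout, and the `install_annotations` helper are assumptions made for this example and do not reflect EnsemblLite's actual schema.

```python
import sqlite3

# Three made-up GFF3 records (tab-separated, nine columns each).
GFF3_LINES = [
    "1\tensembl\tgene\t100\t900\t.\t+\t.\tID=gene:G1",
    "1\tensembl\tmRNA\t100\t900\t.\t+\t.\tID=transcript:T1;Parent=gene:G1",
    "2\tensembl\tgene\t50\t400\t.\t-\t.\tID=gene:G2",
]


def install_annotations(lines, db_path=":memory:"):
    """Load GFF3 records into a sqlite3 table and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feature ("
        "seqid TEXT, source TEXT, biotype TEXT, "
        "start INTEGER, stop INTEGER, strand TEXT, attributes TEXT)"
    )
    rows = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        fields = line.rstrip("\n").split("\t")
        seqid, source, biotype, start, stop, _score, strand, _phase, attrs = fields
        rows.append((seqid, source, biotype, int(start), int(stop), strand, attrs))
    conn.executemany("INSERT INTO feature VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


conn = install_annotations(GFF3_LINES)
genes = conn.execute(
    "SELECT seqid, start, stop FROM feature WHERE biotype = 'gene' ORDER BY seqid"
).fetchall()
print(genes)
```

Passing a file path instead of `":memory:"` would write the database to disk, which is closer to the installed form described above.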
From the maf alignment files, the "ancestral" sequences are discarded and for every aligned sequence only the gap data is stored (i.e. gap position and length) along with the genomic coordinates. These alignments will be reconstructable by combining this information with the whole genome sequence. (This approach reduces storage requirements ~5-fold).
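The gap-only storage idea can be illustrated with a small sketch. It assumes gaps are recorded as (position, length) pairs, where the position is counted in ungapped sequence coordinates; the function names and the toy genome are hypothetical, and EnsemblLite's actual representation may differ.

```python
def extract_gaps(aligned, gap_char="-"):
    """Return [(position, length), ...] with positions in ungapped coordinates."""
    gaps = []
    seq_pos = 0  # number of non-gap characters seen so far
    run = 0      # length of the current run of gap characters
    for char in aligned:
        if char == gap_char:
            run += 1
            continue
        if run:
            gaps.append((seq_pos, run))
            run = 0
        seq_pos += 1
    if run:  # the aligned sequence ends with a gap
        gaps.append((seq_pos, run))
    return gaps


def rebuild_aligned(ungapped, gaps, gap_char="-"):
    """Re-insert gaps into an ungapped sequence to recover the aligned form."""
    parts = []
    prev = 0
    for pos, length in gaps:
        parts.append(ungapped[prev:pos])
        parts.append(gap_char * length)
        prev = pos
    parts.append(ungapped[prev:])
    return "".join(parts)


aligned = "AC--GT-A"
gaps = extract_gaps(aligned)

# Only the gaps and the genomic coordinates are kept; the bases themselves
# are looked up in the genome when the alignment is needed again.
genome = "TTACGTAGG"  # toy genome
start, stop = 2, 7    # genomic coordinates of the aligned segment
restored = rebuild_aligned(genome[start:stop], gaps)
print(gaps, restored)
```

Because the bases come from the genome sequence that is stored anyway, each aligned sequence only costs a handful of integers.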
Installation is done in parallel on multiple CPUs (since the data need to be decompressed on the fly).
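Decompression is CPU-bound, which is why spreading the work across processes rather than threads helps here. The following sketch shows the general pattern with the standard library: each worker process streams one gzipped file and does some work on it (here, just counting fasta records). The function names and file names are hypothetical, and this is not EnsemblLite's installer.

```python
import gzip
import tempfile
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def count_fasta_records(path):
    """Decompress a gzipped fasta file as a stream and count its records."""
    count = 0
    # Lines are decompressed on the fly; nothing is written back to disk.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return str(path), count


def process_all(paths, max_workers=2):
    """Process each compressed file in a separate worker process."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(count_fasta_records, paths))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(3):
            path = Path(tmp) / f"sample_{i}.fa.gz"
            with gzip.open(path, "wt") as out:
                out.write(">seq\nACGT\n" * (i + 1))
            paths.append(path)
        print(process_all(paths))
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes are started by re-importing the main module.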