Enable docker.
linh35-rss committed Nov 10, 2021
1 parent 8e940e0 commit b7ff740
Showing 54 changed files with 636 additions and 235 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
__pycache__/
.mypy_cache/
.vscode/
19 changes: 19 additions & 0 deletions Dockerfile
@@ -0,0 +1,19 @@
FROM continuumio/miniconda3:4.10.3

RUN apt-get update \
&& apt-get install -y build-essential procps

#RUN useradd -ms /bin/bash daedalus
#USER daedalus
#WORKDIR /home/daedalus

COPY environment.yml .
COPY install.sh .

RUN conda env create -f environment.yml

RUN chmod 755 /root
RUN echo "conda activate Daedalus" >> ~/.bashrc
SHELL ["/bin/bash", "--login", "-c"]

RUN export GIT_SSL_NO_VERIFY=1 && bash install.sh
11 changes: 11 additions & 0 deletions Dockerfile_bcl2fastq
@@ -0,0 +1,11 @@
FROM continuumio/miniconda3:4.10.3

RUN apt-get update \
&& apt-get install -y build-essential procps

COPY environment_bcl2fastq.yml .

RUN conda env create -f environment_bcl2fastq.yml

RUN chmod 755 /root
RUN echo "conda activate bcl2fastq" >> ~/.bashrc
122 changes: 71 additions & 51 deletions README.md
@@ -3,78 +3,106 @@
Nextflow pipeline for analysis of libraries prepared using the ImmunoPETE assay.

- [Daedalus](#daedalus)
- [Download](#download)
- [Build Conda Environment](#build-conda-environment)
- [Test Pipeline](#test-pipeline)
- [Run Pipeline](#run-pipeline)
- [Load Environment](#load-the-environment)
- [Generate Manifest from Sample Sheet](#generate-manifest-from-sample-sheet)
- [Submit Pipeline Run](#submit-pipeline-run)
- [Output](#output)
- [Workflow](#workflow)
- [Methods](#methods)
- [Install and Configure](#install-and-configure)
- [Software Requirements](#software-requirements)
- [Download git repo](#download-git-repo)
- [Build Conda Environment (Optional)](#build-conda-environment-optional)
- [Install](#install)
- [Build Docker images](#build-docker-images)
- [Configure images](#configure-images)
- [Configure the pipeline](#configure-the-pipeline)
- [Test Pipeline on a single sample](#test-pipeline-on-a-single-sample)
- [Running Pipeline](#running-pipeline)
- [Generate Manifest from Sample Sheet](#generate-manifest-from-sample-sheet)
- [Submit Pipeline Run](#submit-pipeline-run)
- [Output](#output)
- [Workflow](#workflow)
- [Methods](#methods)

## Install and Configure

Note: the Nextflow config file must be configured for your queue.

## Software Requirements
- built on a linux server: CentOS Linux release 7.7.1908 (Core)
- miniconda3, for package management
- nextflow 19.07.0, to run the pipeline
- uge, for cluster job submission
### Software Requirements

- Python 3.6
- Java 8
- Nextflow 19.07.0, to run the pipeline
- UGE, for cluster job submission
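
A quick way to confirm these are available on your `PATH` before installing (a minimal sketch; exact version strings will differ between systems):

```bash
# Sanity-check the core requirements for running Daedalus.
python --version       # expect Python 3.6.x
java -version          # expect a Java 8 runtime
nextflow -version      # expect Nextflow 19.07.0
qstat -help | head -1  # UGE client tools used for cluster job submission
```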

### Download git repo

## Download git repo
```bash
git clone [email protected]:bioinform/Daedalus.git
cd Daedalus
git checkout tags/${release-version}
```
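
The `${release-version}` placeholder refers to a tagged release of the repository; listing the available tags first is a small helper step, not part of the original instructions:

```bash
# Show the release tags you can pass to `git checkout tags/...`.
git tag --list
```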

### Build Conda Environment (Optional)

It's recommended to create a conda environment:

```bash
conda create -n Daedalus python=3.6
conda activate Daedalus
```

## Install SWIFR aligner
A Smith-Waterman alignment implementation (C++) was developed and is used to identify primers and V/J gene segments from FASTQ-formatted reads. Please read the full README for swifr in the packages folder `./packages/swifr/` for instructions on how to install it.
### Install

## Build Conda Environment
Within the Daedalus directory, execute the following command.

Build the conda environment for running the pipeline:
```bash
conda env create -f environment.yml
pip install .
```

## Install python packages in the loaded conda env
### Build Docker images

Due to license restrictions, you will have to build the bcl2fastq image yourself using the provided Dockerfile (`Dockerfile_bcl2fastq`).
Please refer to [Docker Hub](https://docs.docker.com/docker-hub/) for creating a repo and pushing images.

```bash
conda activate Daedalus_env
./install_packages.sh
docker build -t {dockerhub_username}/bcl2fastq:{version} -f Dockerfile_bcl2fastq .
docker push {dockerhub_username}/bcl2fastq:{version}
```
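
After the build, an optional sanity check can confirm the tool is reachable inside the image (a sketch; `conda run` ships with the miniconda3 base image, and `bcl2fastq` is the environment name defined in `environment_bcl2fastq.yml`):

```bash
# Print the bcl2fastq version from inside the freshly built image.
docker run --rm {dockerhub_username}/bcl2fastq:{version} \
    conda run -n bcl2fastq bcl2fastq --version
```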

## Nextflow configuration
Nextflow must be configured for each system. The ipete profile in the nextflow config file `./nextflow/nextflow.config` should be updated accordingly.
### Configure images

After building your own images, point the following params in `nextflow/defaults-ipete.config` at them.

```javascript
params.bcl2fastq_docker = "{dockerhub_username}/bcl2fastq:{version}"
```
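
If you also build the main pipeline image from the top-level `Dockerfile`, the `params.daedalus_docker` value referenced in `nextflow/config/VDJ_detect/VDJ_detect.config` will likely need the same treatment. A sketch, assuming a hypothetical `daedalus` image name and the same config file:

```bash
# Hypothetical: point the pipeline at your own Daedalus image as well.
echo 'params.daedalus_docker = "{dockerhub_username}/daedalus:{version}"' \
    >> nextflow/defaults-ipete.config
```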

### Configure the pipeline

The pipeline runs on a UGE cluster by default. If you install it on a different system, modify the cluster settings in `nextflow/nextflow.config` accordingly.

```javascript
ipete_docker {
process.clusterOptions = { "-l h_vmem=${task.ext.vmem} -S /bin/bash -l docker_version=new -V" }
}
docker.runOptions = "-u=\$UID --rm -v /path/to/input_and_output:/path/to/input_and_output"
```

## Test Pipeline on a single sample
### Test Pipeline on a single sample
Once all the software has been installed and Nextflow has been configured, the pipeline bats test can be run. The bats test runs the pipeline on a single sample, using the paired FASTQ files provided:
- PBMC_1000ng_25ul_2_S6_R1_001.fastq.gz
- PBMC_1000ng_25ul_2_S6_R2_001.fastq.gz

In order to run the test, download both files from Dropbox and move them into the data folder `Daedalus/data`. Once the data is available, run the test using the following commands:

```bash
conda activate Daedalus_env
cd test
bats single-sample-ipete.bats
```

An example of the pipeline output has also been provided: `PBMC_1000ng_25ul_2.tar.gz`.
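
To compare against your own test run, the example archive can be unpacked with standard tar usage:

```bash
# Unpack the provided example output next to your test results.
tar -xzf PBMC_1000ng_25ul_2.tar.gz
```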


## Running Pipeline
Running the pipeline requires a complete flowcell's worth of ImmunoPETE libraries.

### Load the Environment

```bash
conda activate Daedalus_env
```
Running the pipeline requires a complete flowcell's worth of ImmunoPETE libraries.

### Generate Manifest for ImmunoPETE Run from the Sample Sheet
### Generate Manifest from Sample Sheet

```bash
manifestGenerator=/path/to/Daedalus/pipeline_runner/manifest_generator.py
@@ -84,48 +112,40 @@ sampleSheet = /path/to/sampleSheet.csv
python ${manifestGenerator} \
--pipeline_run_id Daedalus_example_run \
--sequencing_run_folder ${illuminaDir} \
--sequencing_platform NextSeq \
--output Daedalus_example_manifest.csv \
--subsample 1 \
--umi_mode True \
--umi2 'NNNNNNNNN' \
--umi_type R2 \
${sampleSheet}

${sampleSheet}
```

The manifest file contains all parameters needed for the pipeline to run. Sample-specific tuning of parameters, or any updates to the parameters, can be achieved by editing the generated manifest file. After edits are complete, the pipeline can be submitted using the manifest file alone.
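
Before editing, it can help to list the manifest columns one per line (a minimal sketch; the exact column names are whatever the generator emits):

```bash
# Show the manifest header as one parameter name per line for review.
head -n 1 Daedalus_example_manifest.csv | tr ',' '\n'
```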

### Submit the Pipeline Run on the cluster
### Submit Pipeline Run

Using the output from Manifest Generator `Daedalus_example_manifest.csv` pipeline runs are submitted using the script: pipeline_runner.py.
Using the output from the manifest generator, `Daedalus_example_manifest.csv`, pipeline runs can be submitted using the script `pipeline_runner.py`.

```bash
pipelineRunner=/path/to/Daedalus/pipeline_runner/pipeline_runner.py
outDir=/path/to/analysis/output

python ${pipelineRunner} -g rssprbf --wait --resume -o ${outDir} Daedalus_example_manifest.csv
python ${pipelineRunner} -g rssprbfprj --wait --resume -o ${outDir} Daedalus_example_manifest.csv
```

A `-g $group` option needs to be provided to submit jobs to the SGE cluster on SC1.

### Pipeline Output
### Output

In the specified output directory `${outDir}`, the analysis folder will be written using the `pipeline_run_id` "Daedalus_example_run":

```bash
${outDir}/Daedalus_example_run
```

## Nextflow Workflow DAG
## Workflow

![workflow](docs/img/flowchart.png)

## Methods
Overview of the [Pipeline Methods](docs/Daedalus_methods.md) for key processing steps.






5 changes: 3 additions & 2 deletions database/daedalus_db/pipeline_logger.py
@@ -113,8 +113,9 @@ def check_log_status(self, log_file):
if re.search("Execution status: failed", line):
return("fail")
elif re.search("Execution status: OK", line):
return("pass")

return("pass")
return("fail")

def get_analysis_info(self):
"""
Summary of Pipeline/Analysis status
4 changes: 4 additions & 0 deletions database/daedalus_db/run_info.py
@@ -34,10 +34,14 @@ def _parse(self, fname):
self.flowcell_id = run.find('Flowcell').text
self.instrument = run.find('Instrument').text
self.sequencing_date = run.find('Date').text
if len(self.sequencing_date) > 10:
self.sequencing_date = self.sequencing_date.split(' ')[0]
if len(self.sequencing_date) == 6:
self.sequencing_date = datetime.strptime(self.sequencing_date, '%y%m%d').date()
elif len(self.sequencing_date) == 8:
self.sequencing_date = datetime.strptime(self.sequencing_date, '%Y%m%d').date()
elif len(self.sequencing_date) == 9:
self.sequencing_date = datetime.strptime(self.sequencing_date, '%m/%d/%Y').date()
else:
self.logger.warning(
'Unrecognized sequencing date format: {}. Record raw string instead.'.format(self.sequencing_date)
4 changes: 2 additions & 2 deletions docs/Daedalus_methods.md
@@ -1,6 +1,6 @@
## Methods
- [V-J segment detection](alignment.md)
- [Alignment Parsing, CDR3 detection](alignment_parsing.md)
- [Alignment](alignment.md)
- [V-D-J recombinant detection](alignment_parsing.md)
- [UMI-CDR3 deduplication](deduplication.md)
- [UMI-CDR3 consensus](consensus.md)

2 changes: 1 addition & 1 deletion docs/alignment.md
@@ -1,4 +1,4 @@
## V-J segment detection
## V-D-J recombinant detection
V and J gene segments are aligned against reads using [swifr](http://ghe-rss.roche.com/plsRED-Bioinformatics/swifr).
SWIFR (Smith Waterman Implementation for Fast Read identification) utilizes a `kmer similarity index` and the `Smith Waterman` alignment algorithm to identify gene segments within reads.

2 changes: 1 addition & 1 deletion docs/alignment_parsing.md
@@ -1,4 +1,4 @@
## Alignment Parsing
## V-D-J recombinant detection
After V and J gene alignment, read alignments are parsed along with the reference annotation to find the CDR3 sequence.

CDR3 boundary elements are annotated in the ImmunoDB reference: http://ghe-rss.roche.com/plsRED-Bioinformatics/immunoDB
5 changes: 4 additions & 1 deletion environment.yml
@@ -8,6 +8,7 @@ channels:
- defaults
- bioconda
- conda-forge
- dranew
dependencies:
- bats=0.4.0=1
- blas=1.0=mkl
@@ -49,9 +50,11 @@ dependencies:
- wrapt=1.10.11=py36_0
- xz=5.2.3=0
- zlib=1.2.11=0
- samtools=1.9=h46bd0b3_0
- bzip2=1.0.6=3
- pip:
- argparse==1.4.0
- bio==0.1.0
- bio==1.2.8
- boto==2.49.0
- boto3==1.12.19
- botocore==1.15.19
49 changes: 49 additions & 0 deletions environment_bcl2fastq.yml
@@ -0,0 +1,49 @@
name: bcl2fastq
channels:
- bioconda/label/cf201901
- https://repo.continuum.io/pkgs/free
- conda-forge/label/cf201901
- anaconda
- guillaumecharbonnier
- defaults
- bioconda
- conda-forge
- dranew
dependencies:
- blas=1.0=mkl
- boltons=19.2.0=py_0
- ca-certificates=2018.11.29=ha4d7672_0
- certifi=2018.11.29=py36_1000
- contextlib2=0.6.0.post1=py_0
- debtcollector=1.21.0=py36_0
- decorator=4.4.1=py_0
- dit=1.2.3=py_1
- httplib2=0.12.0=py36_1000
- libgcc-ng=9.1.0=hdf63c60_0
- libgfortran=3.0.0=1
- libgfortran-ng=7.3.0=hdf63c60_0
- libstdcxx-ng=9.1.0=hdf63c60_0
- mkl=2017.0.3=0
- ncurses=5.9=10
- networkx=2.4=py_0
- numpy=1.13.3=py36ha266831_3
- openssl=1.0.2p=h470a237_2
- pandas=0.25.3=py36he6710b0_0
- pbr=5.4.2=py_0
- pip=19.3.1=py36_0
- python=3.6.2=0
- python-dateutil=2.8.1=py_0
- pytz=2019.3=py_0
- readline=6.2=2
- requests=2.12.5=py36_0
- setuptools=36.4.0=py36_1
- six=1.13.0=py36_0
- sqlite=3.13.0=0
- tk=8.5.18=0
- wheel=0.29.0=py36_0
- wrapt=1.10.11=py36_0
- xz=5.2.3=0
- zlib=1.2.11=0
- samtools=1.9=h46bd0b3_0
- bcl2fastq=2.19.0=1
- bzip2=1.0.6=3
15 changes: 15 additions & 0 deletions install.sh
@@ -0,0 +1,15 @@
pip install --upgrade git+http://ghe-rss.roche.com/pls-red-packages/fastq-streamer#v0.1.0
pip install --upgrade git+http://ghe-rss.roche.com/pls-red-packages/bam-streamer#v0.1.0
pip install --upgrade git+http://ghe-rss.roche.com/pls-red-packages/VDJ-recombinant-detection#v0.1.0
pip install --upgrade git+http://ghe-rss.roche.com/pls-red-packages/parse-umi#v0.1.0
pip install --upgrade git+http://ghe-rss.roche.com/pls-red-packages/extract-umi#v0.1.0
pip install --upgrade git+http://ghe-rss.roche.com/pls-red-packages/SeqNetworks#v0.1.0
pip install --upgrade git+http://ghe-rss.roche.com/pls-red-packages/ipete-dedup#v0.1.1
pip install --upgrade git+http://ghe-rss.roche.com/pls-red-packages/ipete-metrics#v0.1.2
pip install --upgrade git+http://ghe-rss.roche.com/pls-red-packages/ipete-reporter#v0.1.2
pip install --upgrade git+http://ghe-rss.roche.com/pls-red-packages/spikein-split#v0.1.0
pip install --upgrade git+http://ghe-rss.roche.com/pls-red-packages/trim-primers#v0.1.0




8 changes: 4 additions & 4 deletions nextflow/config/VDJ_detect/VDJ_detect.config
@@ -1,10 +1,10 @@
process {
withName: VDJ_detect {
memory = '30 GB'
cpus = 1
module = [
'miniconda3'
]
cpus = 1
container = {
"${params.daedalus_docker}"
}
ext {
command = {"$workflow.projectDir/config/VDJ_detect/VDJ_detect.sh"}
vmem = '30G'
6 changes: 4 additions & 2 deletions nextflow/config/VDJ_detect/VDJ_detect.sh
@@ -1,5 +1,7 @@
#!/bin/bash
source activate Daedalus_env
#!/bin/bash -e

# Load conda env within Docker
source /root/.bashrc || echo "Failed to source /root/.bashrc" >&2

VDJdetector -v $vSortBam -j $jSortBam -b ${sample} -i ${percentId} -r ${referenceData}
