TB Pipeline on Amazon Genomics CLI #24

Open · wants to merge 60 commits into base: main

60 commits
72066c3
agc-project.yaml added
mmueller76 Sep 26, 2022
ee7d1f2
AGC context name changed
mmueller76 Sep 28, 2022
fb1dd53
agc profile added to nextflow.config
mmueller76 Sep 28, 2022
f2dd0e7
AGC MANIFEST and inputs.json for test scenario1 added
mmueller76 Sep 28, 2022
28006bf
test manifest and input.json moved to test/agc
mmueller76 Sep 28, 2022
4fdad13
input.json added and MANIFEST.json modified for full workflow run
mmueller76 Sep 30, 2022
e7412e3
python script to generate test scenario is being added
omrkha Sep 30, 2022
5dd02c7
fixed input file path bug in software-json.py
mmueller76 Sep 30, 2022
324b66f
make singularity directory a workflow input
mmueller76 Sep 30, 2022
7a4f76f
make resources workflow input
mmueller76 Sep 30, 2022
50de0d0
white space change
mmueller76 Sep 30, 2022
5de881a
remove resources from project
mmueller76 Sep 30, 2022
f3442e4
tests-scenario-script.py removed
mmueller76 Sep 30, 2022
6952490
add build_and_push.sh script to build Docker containers and push them…
mmueller76 Sep 30, 2022
008145d
made resources input of clockwork workflow
mmueller76 Oct 2, 2022
e501286
pull_and_push.sh script to push Docker containers to ECR added
mmueller76 Oct 2, 2022
4dbff96
fix binary name in gnomon process
mmueller76 Oct 3, 2022
e8fe6a7
use Docker containers in ECR
mmueller76 Oct 3, 2022
6f3a482
script to upload resources (bowtie index, KrakenDB and resources conta…
mmueller76 Oct 3, 2022
14b4199
agc added to profile section of help message
mmueller76 Oct 4, 2022
fa36883
defaultCtx name changed to ondemand
mmueller76 Oct 6, 2022
9f36818
tag construction in build_and_push.sh refactored
mmueller76 Oct 6, 2022
793f243
report, trace, timeline and dag generation added to nextflow.config; …
mmueller76 Oct 6, 2022
df9f7ad
output_dir param removed from testing.config
mmueller76 Oct 6, 2022
b0dc499
fixed gnomonicus JSON output file name
mmueller76 Oct 12, 2022
5600c84
changed configuration of report, timeline, trace and dag generation
mmueller76 Oct 12, 2022
b236cfd
updated script to provision resources
mmueller76 Oct 12, 2022
bac7e3d
kraken_db and bowtie2_index parameters in testing.config updated
mmueller76 Oct 12, 2022
b5df9ba
script to submit test workflow runs
omrkha Oct 26, 2022
fa0bbb2
script to upload resources to s3 finalised
omrkha Oct 26, 2022
ac8a2c9
output file parameter added to test scenarios generation
omrkha Oct 26, 2022
36550b7
session ID added to runtime report
mmueller76 Oct 28, 2022
99be17a
typo in context name in agc-project.yaml fixed
mmueller76 Oct 28, 2022
638c1f0
workflow.sessionId added to output directory path
mmueller76 Oct 28, 2022
3ccdcab
defaultCtx removed from AGC project YAML
mmueller76 Oct 28, 2022
83b7506
tidy up dryrun-test-agc script
mmueller76 Oct 28, 2022
b469dd5
renamed ash-script-workflow.sh -> dryrun-test-agc.sh
mmueller76 Oct 28, 2022
ee033ae
fixed indentation in main.nf
mmueller76 Oct 28, 2022
08d5c99
cleaned up provision_resources.sh script
mmueller76 Oct 28, 2022
1fb0696
merge conflict in modules/vcfpredictModules.nf resolved
mmueller76 Nov 2, 2022
8a76832
made python scripts executable and updated script path in modules
mmueller76 Nov 3, 2022
7e296b0
make Nextflow execution report path configurable
mmueller76 Nov 3, 2022
cc70ebf
deleted tests/agc/MANIFEST.scenario1.json
mmueller76 Nov 3, 2022
0c146ce
refactored tests/agc/generate-test-scenario-inputs.py
mmueller76 Nov 3, 2022
f9c271e
consolidated tests/agc/generate-test-scenario-inputs.py and dryrun-te…
mmueller76 Nov 3, 2022
e68eedd
resources directory add back into project directory
mmueller76 Nov 3, 2022
06e061c
workflow sessionId removed from output path
mmueller76 Nov 4, 2022
953c7ad
workflow run submission command in dry run script fixed
mmueller76 Nov 4, 2022
dbfa697
scenarios uncommented
mmueller76 Nov 4, 2022
99e602f
resources_dir and container_registry params added to nextflow.config
mmueller76 Nov 4, 2022
7503bf4
container_registry param added to inputs.json
mmueller76 Nov 4, 2022
3561f7c
resources_dir and container_registry added to main.nf
mmueller76 Nov 4, 2022
82cc05a
refactored provision_resources.sh script
mmueller76 Nov 4, 2022
3e6a825
data section removed from agc-project.yaml file
mmueller76 Nov 4, 2022
3e9104f
AGC section added to README.md
mmueller76 Nov 4, 2022
2c78db1
inputs.json templated
mmueller76 Nov 4, 2022
6dd9483
changes in testing.config reverted
mmueller76 Nov 4, 2022
8f633a9
kraken_db and bowtie2_index params add to dryrun-test-agc.py
mmueller76 Nov 4, 2022
f19ec2b
output_dir reinstated in testing.config
mmueller76 Nov 4, 2022
2f03654
white space change testing.config
mmueller76 Nov 4, 2022
7 changes: 7 additions & 0 deletions MANIFEST.json
@@ -0,0 +1,7 @@
{
"mainWorkflowURL": "main.nf",
"inputFileURLs": [
"inputs.json"
],
"engineOptions": "-profile agc"
}
121 changes: 121 additions & 0 deletions README.md
@@ -58,6 +58,127 @@ For more information on the parameters run `nextflow run main.nf --help`

The path to the singularity images can also be changed in the singularity profile in `nextflow.config`. Default value is `${baseDir}/singularity`


## Amazon Genomics CLI ##
The workflow can be executed on Amazon Web Services infrastructure using the [Amazon Genomics CLI](https://aws.github.io/amazon-genomics-cli/) (AGC). See the [prerequisites](https://aws.github.io/amazon-genomics-cli/docs/getting-started/prerequisites/).

### Prepare workflow execution through AGC ###

1. Download and install AGC following the [installation instructions](https://aws.github.io/amazon-genomics-cli/docs/getting-started/installation).

2. Activate the AWS account for use with AGC. This will deploy the AGC core infrastructure in your AWS account.

```
agc account activate
```

3. Define a username by setting your email address

```
agc configure email [email protected]
```

4. Configure additional S3 buckets (optional)

AGC creates an S3 bucket to store logs and outputs and for input caching. If you want to use a separate bucket for resources and inputs, this needs to be configured in `agc-project.yaml`:

```
data:
- location: s3://<your-bucket-name>
readOnly: true
```

Please note that AGC can write to the bucket provisioned on account activation; access to any other bucket is read-only. If you are not using additional S3 buckets, delete the `data` section from `agc-project.yaml`.

5. Provision resources. Run `provision_resources.sh` to upload the Kraken database files, Bowtie2 index files and TB pipeline resource files to S3, e.g.:

```
./provision_resources.sh s3://<agc-bucket-name>/project/tbpipeline/resources/
```

Note that the `resources` folder in the project directory will be moved out of the directory to `../tb-pipeline-resources`. This avoids the `resources` directory being packaged up with the project and uploaded to AGC every time a run is submitted.

6. Deploy the AGC context. This will deploy the compute environment to execute workflows in your AWS account. Two contexts are defined in `agc-project.yaml`: `ondemand` for execution on on-demand EC2 instances and `spot` for execution on spot instances.

To deploy the `ondemand` context:

```
agc context deploy --context ondemand
```

7. Edit the `inputs.json` file as required. The `inputs.json` file defines the workflow parameters used by Nextflow to run the workflow, e.g.:

```
{
"input_dir": "s3://<agc-bucket-name>/input/sequencing/mtuberculosis",
"filetype": "fastq",
"pattern": "*_{1,2}.fastq.gz",
"species": "tuberculosis",
"unmix_myco": "yes",
"resources_dir": "s3://<agc-bucket-name>/project/tbpipeline/resources/tbpipeline",
"kraken_db": "s3://<agc-bucket-name>/project/tbpipeline/resources/kraken_db/k2_pluspf_16gb_20220607",
"bowtie2_index": "s3:///project/tbpipeline/resources/bowtie2_index/hg19_1kgmaj",
"bowtie_index_name": "hg19_1kgmaj",
"output_dir": "s3://<agc-bucket-name>/project/tbpipeline/output",
"vcfmix": "yes",
"gnomon": "yes",
"report_dir": "s3://<agc-bucket-name>/project/tbpipeline/reports",
"container_registry": "<ecr-container-registry>/tb-pipeline"
}
```

The `container_registry` and `report_dir` parameters are optional. If not provided, the `container_registry` parameter defaults to `quay.io/pathogen-genomics-cymru`.


### Execute and track workflows through AGC ###

1. Submit a workflow run

```
agc workflow run tbpipeline -c ondemand
```

2. Check workflow status

```
agc workflow status -c ondemand -r <workflow-instance-id>
```

3. Check Nextflow engine logs

```
agc logs engine -c ondemand -r <workflow-instance-id>
```

4. Check workflow logs

```
agc logs workflow tbpipeline -r <workflow-instance-id>
```

5. Stop a workflow run

```
agc workflow stop <workflow-instance-id>
```

See the [AGC command reference](https://aws.github.io/amazon-genomics-cli/docs/reference/) for all agc commands.

### Clean up ###

1. Destroy the context. This will remove the resources associated with the named context from your account but will keep any S3 outputs and CloudWatch logs.

```
agc context destroy ondemand
```

2. Deactivate the account. If you want to stop using Amazon Genomics CLI in your AWS account entirely and remove all resources created by AGC, you need to deactivate it.

```
agc account deactivate
```


## Stub-run ##
To test the stub run:
```
18 changes: 18 additions & 0 deletions agc-project.yaml
@@ -0,0 +1,18 @@
name: tbpipeline
schemaVersion: 1
workflows:
tbpipeline:
type:
language: nextflow
version: dsl2
sourceURL: ./
contexts:
ondemand:
engines:
- type: nextflow
engine: nextflow
spot:
requestSpotInstances: true
engines:
- type: nextflow
engine: nextflow
2 changes: 2 additions & 0 deletions bin/create_final_json.py
100644 → 100755
@@ -1,3 +1,5 @@
#!/usr/bin/env python3

import json
import os
import sys
2 changes: 2 additions & 0 deletions bin/identify_tophit_and_contaminants2.py
100644 → 100755
@@ -1,3 +1,5 @@
#!/usr/bin/env python3

import json
import os
import sys
2 changes: 2 additions & 0 deletions bin/parse_kraken_report2.py
100644 → 100755
@@ -1,3 +1,5 @@
#!/usr/bin/env python3

import json
import os
import sys
2 changes: 1 addition & 1 deletion bin/software-json.py
100644 → 100755
@@ -17,7 +17,7 @@ def go(path):
for filename in glob.glob(os.path.join(path, "Singularity.*")):
extension = filename.split('.', 1)[1]
version = filename.split('-')[-1]
with open(os.path.join(path, filename), 'r') as infile:
with open(os.path.join(filename), 'r') as infile:
copy = False
for line in infile:
if line.strip() == "%environment":
Empty file modified bin/vcfmix.py
100644 → 100755
Empty file.
68 changes: 68 additions & 0 deletions docker/build_and_push.sh
@@ -0,0 +1,68 @@
#!/usr/bin/env bash

# This script builds a pipeline Docker image and pushes it to ECR so it can be
# used by the pipeline.

# The arguments to this script are the image name and version. These are used to
# tag the image on the local machine and, combined with the account and region,
# to form the repository name for ECR.
image=$1
version=$2

if [ "$image" == "" ]
then
echo "Usage: $0 <image-name> <image-version>"
exit 1
fi

if [ "$version" == "" ]
then
echo "Usage: $0 <image-name> <image-version>"
exit 1
fi


# Get the account number associated with the current IAM credentials
account=$(aws sts get-caller-identity --query Account --output text)

if [ $? -ne 0 ]
then
exit 255
fi


# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

dockerfile="Dockerfile.${image}-${version}"
ecr_repo="${account}.dkr.ecr.${region}.amazonaws.com"
local_tag="tb-pipeline/${image}:${version}"
ecr_tag="${ecr_repo}/tb-pipeline/${image}:${version}"


echo "AWS Region: ${region}"
echo "Dockerfile: ${dockerfile}"
echo "Local tag : ${local_tag}"
echo "ECR tag : ${ecr_tag}"


# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "tb-pipeline/${image}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
echo "The repository with name tb-pipeline/${image} does not exist in the registry ${ecr_repo}. Creating repository."
aws ecr create-repository --repository-name "tb-pipeline/${image}"
# > /dev/null
fi

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region "${region}" | docker login --username AWS --password-stdin "${account}".dkr.ecr."${region}".amazonaws.com

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build -t "${local_tag}" -f "${dockerfile}" ./
docker tag "${local_tag}" "${ecr_tag}"

docker push "${ecr_tag}"
66 changes: 66 additions & 0 deletions docker/pull_and_push.sh
@@ -0,0 +1,66 @@
#!/usr/bin/env bash

# This script pulls a published pipeline Docker image from quay.io, retags it
# and pushes it to ECR for use by the pipeline.

# The arguments to this script are the image name and version. The image name is
# combined with the account and region to form the repository name for ECR.
image=$1
version=$2

if [ "$image" == "" ]
then
echo "Usage: $0 <image-name> <image-version>"
exit 1
fi

if [ "$version" == "" ]
then
echo "Usage: $0 <image-name> <image-version>"
exit 1
fi


# Get the account number associated with the current IAM credentials
account=$(aws sts get-caller-identity --query Account --output text)

if [ $? -ne 0 ]
then
exit 255
fi


# Get the region defined in the current configuration (default to eu-west-2 if none defined)
region=$(aws configure get region)
region=${region:-eu-west-2}

ecr_repo="${account}.dkr.ecr.${region}.amazonaws.com"
remote_tag="quay.io/pathogen-genomics-cymru/${image}:${version}"
ecr_tag="${ecr_repo}/tb-pipeline/${image}:${version}"


echo "AWS Region : ${region}"
echo "Source image tag: $remote_tag"
echo "Target image tag: $ecr_tag"


# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "tb-pipeline/${image}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
echo "The repository with name tb-pipeline/${image} does not exist in the registry ${ecr_repo}. Creating repository."
aws ecr create-repository --repository-name "tb-pipeline/${image}"
# > /dev/null
fi

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region "${region}" | docker login --username AWS --password-stdin "${account}".dkr.ecr."${region}".amazonaws.com

# Pull the image from quay.io, retag it with the full ECR name and push it.

docker pull "${remote_tag}"
docker tag "${remote_tag}" "${ecr_tag}"

docker push "${ecr_tag}"