
Commit c24fd61

moschetti and samanvp authored
Add location flag to documentation (#682)
* Update README.md for location flag
* vcf_files_preprocessor.md for location flag
* bigquery_to_vcf.md for location flag
* Add docs for --use_public_ips, --subnetwork, & --location
* Clarify docker worker in setting_region.md
* Clarify location flag as not required

Co-authored-by: Saman Vaisipour <[email protected]>
1 parent 92197df commit c24fd61

File tree

* README.md
* docs/bigquery_to_vcf.md
* docs/setting_region.md
* docs/vcf_files_preprocessor.md

4 files changed: +69 -3 lines changed

Diff for: README.md

+7

@@ -55,6 +55,11 @@ Run the script below and replace the following parameters:
* `GOOGLE_CLOUD_REGION`: You must choose a geographic region for Cloud Dataflow
  to process your data, for example: `us-west1`. For more information please refer to
  [Setting Regions](docs/setting_region.md).
+* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for the Cloud Life
+  Sciences API to orchestrate the job from. This is not where the data will be processed,
+  but where some operation metadata will be stored. This can be the same as or different from
+  the region chosen for Cloud Dataflow. If this is not set, the metadata will be stored in
+  us-central1. See the list of [Currently Available Locations](https://cloud.google.com/life-sciences/docs/concepts/locations).
* `TEMP_LOCATION`: This can be any folder in Google Cloud Storage that your
  project has write access to. It's used to store temporary files and logs
  from the pipeline.

@@ -72,6 +77,7 @@ Run the script below and replace the following parameters:
# Parameters to replace:
GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
GOOGLE_CLOUD_REGION=GOOGLE_CLOUD_REGION
+GOOGLE_CLOUD_LOCATION=GOOGLE_CLOUD_LOCATION
TEMP_LOCATION=gs://BUCKET/temp
INPUT_PATTERN=gs://BUCKET/*.vcf
OUTPUT_TABLE=GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE

@@ -85,6 +91,7 @@ COMMAND="vcf_to_bq \
docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
+  --location "${GOOGLE_CLOUD_LOCATION}" \
  --region "${GOOGLE_CLOUD_REGION}" \
  --temp_location "${TEMP_LOCATION}" \
  "${COMMAND}"

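For orientation, the snippet below is a minimal sketch (not part of the diff) of how the updated README example reads once assembled. The project, region, and bucket values are hypothetical placeholders, the full `vcf_to_bq` arguments are elided as in the docs, and `--location` only controls where Life Sciences operation metadata is stored while `--region` controls where Dataflow processes the data.

```bash
# Hypothetical values; substitute your own project, region, and bucket.
GOOGLE_CLOUD_PROJECT=my-project
GOOGLE_CLOUD_REGION=us-west1        # where Cloud Dataflow processes the data
GOOGLE_CLOUD_LOCATION=us-central1   # where Life Sciences operation metadata is stored
TEMP_LOCATION=gs://my-bucket/temp

# Full pipeline arguments elided; see the README for the complete COMMAND.
COMMAND="vcf_to_bq ..."

docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --location "${GOOGLE_CLOUD_LOCATION}" \
  --region "${GOOGLE_CLOUD_REGION}" \
  --temp_location "${TEMP_LOCATION}" \
  "${COMMAND}"
```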
Diff for: docs/bigquery_to_vcf.md

+7

@@ -21,6 +21,11 @@ Run the script below and replace the following parameters:
* `GOOGLE_CLOUD_REGION`: You must choose a geographic region for Cloud Dataflow
  to process your data, for example: `us-west1`. For more information please refer to
  [Setting Regions](docs/setting_region.md).
+* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for the Cloud Life
+  Sciences API to orchestrate the job from. This is not where the data will be processed,
+  but where some operation metadata will be stored. This can be the same as or different from
+  the region chosen for Cloud Dataflow. If this is not set, the metadata will be stored in
+  us-central1. See the list of [Currently Available Locations](https://cloud.google.com/life-sciences/docs/concepts/locations).
* `TEMP_LOCATION`: This can be any folder in Google Cloud Storage that your
  project has write access to. It's used to store temporary files and logs
  from the pipeline.

@@ -35,6 +40,7 @@ Run the script below and replace the following parameters:
# Parameters to replace:
GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
GOOGLE_CLOUD_REGION=GOOGLE_CLOUD_REGION
+GOOGLE_CLOUD_LOCATION=GOOGLE_CLOUD_LOCATION
TEMP_LOCATION=gs://BUCKET/temp
INPUT_TABLE=GOOGLE_CLOUD_PROJECT:DATASET.TABLE
OUTPUT_FILE=gs://BUCKET/loaded_file.vcf

@@ -48,6 +54,7 @@ COMMAND="bq_to_vcf \
docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
+  --location "${GOOGLE_CLOUD_LOCATION}" \
  --region "${GOOGLE_CLOUD_REGION}" \
  --temp_location "${TEMP_LOCATION}" \
  "${COMMAND}"

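The new `GOOGLE_CLOUD_LOCATION` paragraph notes that the Life Sciences location may be the same as or different from the Dataflow region. As a small illustrative sketch (the region value is only an example, not part of the diff), the two can be tied together whenever the chosen region is also a supported Life Sciences location, so a single edit keeps them in sync:

```bash
# Reuse one value when the Dataflow region is also a supported Life Sciences
# location (europe-west4 here is only an example); otherwise set them separately.
GOOGLE_CLOUD_REGION=europe-west4
GOOGLE_CLOUD_LOCATION="${GOOGLE_CLOUD_REGION}"
```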
Diff for: docs/setting_region.md

+48 -3

@@ -13,22 +13,37 @@ are located in the same region:
* Your pipeline's temporary location set by `--temp_location` flag.
* Your output BigQuery dataset set by `--output_table` flag.
* Your Dataflow pipeline set by `--region` flag.
+* Your Life Sciences API location set by `--location` flag.

## Running jobs in a particular region
The Dataflow API [requires](https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#configuring-pipelineoptions-for-execution-on-the-cloud-dataflow-service)
setting a [GCP
region](https://cloud.google.com/compute/docs/regions-zones/#available) via
-`--region` flag to run. In addition to this requirment you might also
+`--region` flag to run.
+
+When running from Docker, the Cloud Life Sciences API is used to spin up a
+worker that launches and monitors the Dataflow job. The Cloud Life Sciences API
+is a [regionalized service](https://cloud.google.com/life-sciences/docs/concepts/locations)
+that runs in multiple regions. This is set with the `--location` flag. The
+Life Sciences API location is where metadata about the pipeline's progress
+will be stored, and it can be different from the region where the data is
+processed. Note that the Cloud Life Sciences API is not available in all regions,
+and if this flag is left out, the metadata will be stored in us-central1. See
+the list of [Currently Available Locations](https://cloud.google.com/life-sciences/docs/concepts/locations).
+
+In addition to this requirement you might also
choose to run Variant Transforms in a specific region following your project’s
security and compliance requirements. For example, in order
-to restrict your processing job to Europe west, set the region as follows:
+to restrict your processing job to europe-west4 (Netherlands), set the region
+and location as follows:

```bash
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
-  --region europe-west1 \
+  --region europe-west4 \
+  --location europe-west4 \
  --temp_location "${TEMP_LOCATION}" \
  "${COMMAND}"
```

@@ -77,3 +92,33 @@ You can choose the region for the BigQuery dataset at dataset creation time.

![BigQuery dataset region](images/bigquery_dataset_region.png)

+## Advanced Flags
+
+Variant Transforms supports specifying a subnetwork to use with the `--subnetwork` flag.
+This can be used to start the processing VMs in a specific network of your Google Cloud
+project as opposed to the default network.
+
+Variant Transforms allows disabling the use of external IP addresses with the
+`--use_public_ips` flag. If not specified, this defaults to true, so to restrict the
+use of external IP addresses, use `--use_public_ips false`. Note that without external
+IP addresses, VMs can only send packets to other internal IP addresses. To allow these
+VMs to connect to the external IP addresses used by Google APIs and services, you can
+[enable Private Google Access](https://cloud.google.com/vpc/docs/configure-private-google-access)
+on the subnet.
+
+For example, to run Variant Transforms in a VPC you already created called
+`custom-network-eu-west` with no public IP addresses, you can add these flags to the
+example above as follows:
+
+```bash
+COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...
+
+docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
+  --project "${GOOGLE_CLOUD_PROJECT}" \
+  --region europe-west4 \
+  --location europe-west4 \
+  --temp_location "${TEMP_LOCATION}" \
+  --subnetwork custom-network-eu-west \
+  --use_public_ips false \
+  "${COMMAND}"
+```

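As a complement to the `--use_public_ips false` example, the Private Google Access note above can be acted on with a single gcloud command. This is a sketch, not part of the diff; the subnet and region names are assumptions chosen to match the hypothetical `--subnetwork` value used in the example.

```bash
# Enable Private Google Access on the subnet the worker VMs will use, so that
# instances without external IPs can still reach Google APIs and services.
# Subnet and region names here are assumptions matching the example above.
gcloud compute networks subnets update custom-network-eu-west \
  --region=europe-west4 \
  --enable-private-ip-google-access
```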
Diff for: docs/vcf_files_preprocessor.md

+7

@@ -46,6 +46,11 @@ Run the script below and replace the following parameters:
* `GOOGLE_CLOUD_REGION`: You must choose a geographic region for Cloud Dataflow
  to process your data, for example: `us-west1`. For more information please refer to
  [Setting Regions](docs/setting_region.md).
+* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for the Cloud Life
+  Sciences API to orchestrate the job from. This is not where the data will be processed,
+  but where some operation metadata will be stored. This can be the same as or different from
+  the region chosen for Cloud Dataflow. If this is not set, the metadata will be stored in
+  us-central1. See the list of [Currently Available Locations](https://cloud.google.com/life-sciences/docs/concepts/locations).
* `TEMP_LOCATION`: This can be any folder in Google Cloud Storage that your
  project has write access to. It's used to store temporary files and logs
  from the pipeline.

@@ -71,6 +76,7 @@ records.
# Parameters to replace:
GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
GOOGLE_CLOUD_REGION=GOOGLE_CLOUD_REGION
+GOOGLE_CLOUD_LOCATION=GOOGLE_CLOUD_LOCATION
TEMP_LOCATION=gs://BUCKET/temp
INPUT_PATTERN=gs://BUCKET/*.vcf
REPORT_PATH=gs://BUCKET/report.tsv

@@ -87,6 +93,7 @@ COMMAND="vcf_to_bq_preprocess \
docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
+  --location "${GOOGLE_CLOUD_LOCATION}" \
  --region "${GOOGLE_CLOUD_REGION}" \
  --temp_location "${TEMP_LOCATION}" \
  "${COMMAND}"

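As a small follow-up sketch (not part of the diff), once the preprocess job finishes, the report written to `REPORT_PATH` can be inspected straight from Cloud Storage; this assumes `gsutil` is installed and authenticated for the project.

```bash
# Print the preprocessor report after the job completes.
# REPORT_PATH is the same parameter defined in the script above.
gsutil cat "${REPORT_PATH}"
```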