
Commit c24fd61

moschetti and samanvp authored
Add location flag to documentation (#682)
* Update README.md for location flag
* vcf_files_preprocessor.md for location flag
* bigquery_to_vcf.md for location flag
* Add docs for --use_public_ips, --subnetwork, & --location
* Clarify docker worker in setting_region.md
* Clarify location flag as not required

Co-authored-by: Saman Vaisipour <[email protected]>
1 parent 92197df commit c24fd61

File tree

* README.md
* docs/bigquery_to_vcf.md
* docs/setting_region.md
* docs/vcf_files_preprocessor.md

4 files changed: +69 -3 lines changed

Diff for: README.md

+7

@@ -55,6 +55,11 @@ Run the script below and replace the following parameters:
* `GOOGLE_CLOUD_REGION`: You must choose a geographic region for Cloud Dataflow
  to process your data, for example: `us-west1`. For more information please refer to
  [Setting Regions](docs/setting_region.md).
+* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for the Cloud Life
+  Sciences API to orchestrate the job from. This is not where the data will be processed,
+  but where some operation metadata will be stored. This can be the same as or different from
+  the region chosen for Cloud Dataflow. If this is not set, the metadata will be stored in
+  us-central1. See the list of [Currently Available Locations](https://cloud.google.com/life-sciences/docs/concepts/locations).
* `TEMP_LOCATION`: This can be any folder in Google Cloud Storage that your
  project has write access to. It's used to store temporary files and logs
  from the pipeline.

@@ -72,6 +77,7 @@ Run the script below and replace the following parameters:
# Parameters to replace:
GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
GOOGLE_CLOUD_REGION=GOOGLE_CLOUD_REGION
+GOOGLE_CLOUD_LOCATION=GOOGLE_CLOUD_LOCATION
TEMP_LOCATION=gs://BUCKET/temp
INPUT_PATTERN=gs://BUCKET/*.vcf
OUTPUT_TABLE=GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE

@@ -85,6 +91,7 @@ COMMAND="vcf_to_bq \
docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
+  --location "${GOOGLE_CLOUD_LOCATION}" \
  --region "${GOOGLE_CLOUD_REGION}" \
  --temp_location "${TEMP_LOCATION}" \
  "${COMMAND}"

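For orientation, the snippet below is a minimal sketch (not part of the diff) of how the updated README example reads once assembled. The project, region, and bucket values are hypothetical placeholders, the full `vcf_to_bq` arguments are elided as in the docs, and `--location` only controls where Life Sciences operation metadata is stored while `--region` controls where Dataflow processes the data.

```bash
# Hypothetical values; substitute your own project, region, and bucket.
GOOGLE_CLOUD_PROJECT=my-project
GOOGLE_CLOUD_REGION=us-west1        # where Cloud Dataflow processes the data
GOOGLE_CLOUD_LOCATION=us-central1   # where Life Sciences operation metadata is stored
TEMP_LOCATION=gs://my-bucket/temp

# Full pipeline arguments elided; see the README for the complete COMMAND.
COMMAND="vcf_to_bq ..."

docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --location "${GOOGLE_CLOUD_LOCATION}" \
  --region "${GOOGLE_CLOUD_REGION}" \
  --temp_location "${TEMP_LOCATION}" \
  "${COMMAND}"
```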
Diff for: docs/bigquery_to_vcf.md

+7

@@ -21,6 +21,11 @@ Run the script below and replace the following parameters:
* `GOOGLE_CLOUD_REGION`: You must choose a geographic region for Cloud Dataflow
  to process your data, for example: `us-west1`. For more information please refer to
  [Setting Regions](docs/setting_region.md).
+* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for the Cloud Life
+  Sciences API to orchestrate the job from. This is not where the data will be processed,
+  but where some operation metadata will be stored. This can be the same as or different from
+  the region chosen for Cloud Dataflow. If this is not set, the metadata will be stored in
+  us-central1. See the list of [Currently Available Locations](https://cloud.google.com/life-sciences/docs/concepts/locations).
* `TEMP_LOCATION`: This can be any folder in Google Cloud Storage that your
  project has write access to. It's used to store temporary files and logs
  from the pipeline.

@@ -35,6 +40,7 @@ Run the script below and replace the following parameters:
# Parameters to replace:
GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
GOOGLE_CLOUD_REGION=GOOGLE_CLOUD_REGION
+GOOGLE_CLOUD_LOCATION=GOOGLE_CLOUD_LOCATION
TEMP_LOCATION=gs://BUCKET/temp
INPUT_TABLE=GOOGLE_CLOUD_PROJECT:DATASET.TABLE
OUTPUT_FILE=gs://BUCKET/loaded_file.vcf

@@ -48,6 +54,7 @@ COMMAND="bq_to_vcf \
docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
+  --location "${GOOGLE_CLOUD_LOCATION}" \
  --region "${GOOGLE_CLOUD_REGION}" \
  --temp_location "${TEMP_LOCATION}" \
  "${COMMAND}"

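The new `GOOGLE_CLOUD_LOCATION` paragraph notes that the Life Sciences location may be the same as or different from the Dataflow region. As a small illustrative sketch (the region value is only an example, not part of the diff), the two can be tied together whenever the chosen region is also a supported Life Sciences location, so a single edit keeps them in sync:

```bash
# Reuse one value when the Dataflow region is also a supported Life Sciences
# location (europe-west4 here is only an example); otherwise set them separately.
GOOGLE_CLOUD_REGION=europe-west4
GOOGLE_CLOUD_LOCATION="${GOOGLE_CLOUD_REGION}"
```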
Diff for: docs/setting_region.md

+48 -3

@@ -13,22 +13,37 @@ are located in the same region:
* Your pipeline's temporary location set by `--temp_location` flag.
* Your output BigQuery dataset set by `--output_table` flag.
* Your Dataflow pipeline set by `--region` flag.
+* Your Life Sciences API location set by `--location` flag.

## Running jobs in a particular region
The Dataflow API [requires](https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#configuring-pipelineoptions-for-execution-on-the-cloud-dataflow-service)
setting a [GCP
region](https://cloud.google.com/compute/docs/regions-zones/#available) via
-`--region` flag to run. In addition to this requirment you might also
+`--region` flag to run.
+
+When running from Docker, the Cloud Life Sciences API is used to spin up a
+worker that launches and monitors the Dataflow job. The Cloud Life Sciences API
+is a [regionalized service](https://cloud.google.com/life-sciences/docs/concepts/locations)
+that runs in multiple regions. This is set with the `--location` flag. The
+Life Sciences API location is where metadata about the pipeline's progress
+will be stored, and it can be different from the region where the data is
+processed. Note that the Cloud Life Sciences API is not available in all regions,
+and if this flag is left out, the metadata will be stored in us-central1. See
+the list of [Currently Available Locations](https://cloud.google.com/life-sciences/docs/concepts/locations).
+
+In addition to this requirement you might also
choose to run Variant Transforms in a specific region following your project’s
security and compliance requirements. For example, in order
-to restrict your processing job to Europe west, set the region as follows:
+to restrict your processing job to europe-west4 (Netherlands), set the region
+and location as follows:

```bash
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
-  --region europe-west1 \
+  --region europe-west4 \
+  --location europe-west4 \
  --temp_location "${TEMP_LOCATION}" \
  "${COMMAND}"
```

@@ -77,3 +92,33 @@ You can choose the region for the BigQuery dataset at dataset creation time.

![BigQuery dataset region](images/bigquery_dataset_region.png)

+## Advanced Flags
+
+Variant Transforms supports specifying a subnetwork to use with the `--subnetwork` flag.
+This can be used to start the processing VMs in a specific network of your Google Cloud
+project as opposed to the default network.
+
+Variant Transforms allows disabling the use of external IP addresses with the
+`--use_public_ips` flag. If not specified, this defaults to true, so to restrict the
+use of external IP addresses, use `--use_public_ips false`. Note that without external
+IP addresses, VMs can only send packets to other internal IP addresses. To allow these
+VMs to connect to the external IP addresses used by Google APIs and services, you can
+[enable Private Google Access](https://cloud.google.com/vpc/docs/configure-private-google-access)
+on the subnet.
+
+For example, to run Variant Transforms in a VPC you already created called
+`custom-network-eu-west` with no public IP addresses, you can add these flags to the
+example above as follows:
+
+```bash
+COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...
+
+docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
+  --project "${GOOGLE_CLOUD_PROJECT}" \
+  --region europe-west4 \
+  --location europe-west4 \
+  --temp_location "${TEMP_LOCATION}" \
+  --subnetwork custom-network-eu-west \
+  --use_public_ips false \
+  "${COMMAND}"
+```

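As a complement to the `--use_public_ips false` example, the Private Google Access note above can be acted on with a single gcloud command. This is a sketch, not part of the diff; the subnet and region names are assumptions chosen to match the hypothetical `--subnetwork` value used in the example.

```bash
# Enable Private Google Access on the subnet the worker VMs will use, so that
# instances without external IPs can still reach Google APIs and services.
# Subnet and region names here are assumptions matching the example above.
gcloud compute networks subnets update custom-network-eu-west \
  --region=europe-west4 \
  --enable-private-ip-google-access
```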
Diff for: docs/vcf_files_preprocessor.md

+7

@@ -46,6 +46,11 @@ Run the script below and replace the following parameters:
* `GOOGLE_CLOUD_REGION`: You must choose a geographic region for Cloud Dataflow
  to process your data, for example: `us-west1`. For more information please refer to
  [Setting Regions](docs/setting_region.md).
+* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for the Cloud Life
+  Sciences API to orchestrate the job from. This is not where the data will be processed,
+  but where some operation metadata will be stored. This can be the same as or different from
+  the region chosen for Cloud Dataflow. If this is not set, the metadata will be stored in
+  us-central1. See the list of [Currently Available Locations](https://cloud.google.com/life-sciences/docs/concepts/locations).
* `TEMP_LOCATION`: This can be any folder in Google Cloud Storage that your
  project has write access to. It's used to store temporary files and logs
  from the pipeline.

@@ -71,6 +76,7 @@ records.
# Parameters to replace:
GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
GOOGLE_CLOUD_REGION=GOOGLE_CLOUD_REGION
+GOOGLE_CLOUD_LOCATION=GOOGLE_CLOUD_LOCATION
TEMP_LOCATION=gs://BUCKET/temp
INPUT_PATTERN=gs://BUCKET/*.vcf
REPORT_PATH=gs://BUCKET/report.tsv

@@ -87,6 +93,7 @@ COMMAND="vcf_to_bq_preprocess \
docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
+  --location "${GOOGLE_CLOUD_LOCATION}" \
  --region "${GOOGLE_CLOUD_REGION}" \
  --temp_location "${TEMP_LOCATION}" \
  "${COMMAND}"

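As a small follow-up sketch (not part of the diff), once the preprocess job finishes, the report written to `REPORT_PATH` can be inspected straight from Cloud Storage; this assumes `gsutil` is installed and authenticated for the project.

```bash
# Print the preprocessor report after the job completes.
# REPORT_PATH is the same parameter defined in the script above.
gsutil cat "${REPORT_PATH}"
```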