# Handling large inputs

The Variant Transforms pipeline can process hundreds of thousands of files,
millions of samples, and billions of records. There are a few settings that
may need to be adjusted depending on the size of the input files. Each of
these settings is explained in the sections below.

Example usage:

```
/opt/gcp_variant_transforms/bin/vcf_to_bq ... \
  --max_num_workers <default is automatically determined> \
  --worker_machine_type <default n1-standard-1> \
  --disk_size_gb <default 250> \
  --worker_disk_type <default PD> \
  --keep_intermediate_avro_files \
  --sharding_config_path <default gcp_variant_transforms/data/sharding_configs/homo_sapiens_default.yaml> \
```

#### `--max_num_workers`

By default, Dataflow uses its autoscaling algorithm to adjust the number of
workers assigned to each job (limited by your Compute Engine quota). You may
adjust the maximum number of workers using `--max_num_workers`. You may also
use `--num_workers` to specify the initial number of workers to assign to the
job.
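
For example, a minimal sketch combining these two flags (the worker counts
here are illustrative, not recommendations):

```
/opt/gcp_variant_transforms/bin/vcf_to_bq ... \
  --num_workers 20 \
  --max_num_workers 200
```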

#### `--worker_machine_type`

By default, Dataflow uses the `n1-standard-1` machine, which has 1 vCPU and
3.75GB of RAM. You may need to request a larger machine for large datasets.
Please see [this page](https://cloud.google.com/compute/pricing#predefined_machine_types)
for a list of available machine types.

We have observed that Dataflow performs more efficiently when running with a
large number of small machines rather than a small number of large machines
(e.g. 200 `n1-standard-4` workers instead of 25 `n1-standard-32` workers).
This is due to disk/network IOPS (input/output operations per second) being
limited for each machine, especially if [merging](variant_merging.md) is
enabled.

Using a large number of workers may not always be possible due to disk and IP
quotas. As a result, we recommend using SSDs (see
[`--worker_disk_type`](#--worker_disk_type)) when choosing a large
(`n1-standard-16` or larger) machine, which yields higher disk IOPS and can
avoid idle CPU cycles. Note that disk is significantly cheaper than CPU, so
always try to optimize for high CPU utilization rather than disk usage.
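
As an illustration, a run sized along these lines might look like the sketch
below (the machine type and worker count are assumptions to adapt to your
quota and input size, not prescriptions):

```
/opt/gcp_variant_transforms/bin/vcf_to_bq ... \
  --worker_machine_type n1-standard-4 \
  --max_num_workers 200
```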

#### `--disk_size_gb`

By default, each worker has 250GB of disk. The aggregate amount of disk space
from all workers must be at least as large as the uncompressed size of all
VCF files being processed. However, to accommodate the intermediate stages of
the pipeline and the additional overhead introduced by the transforms (e.g.
the sample name is repeated in every record in the BigQuery output rather
than just being specified once as in the VCF header), you typically need 3 to
4 times the total size of the raw VCF files.

In addition, if [merging](variant_merging.md) is enabled, you may need more
disk space per worker (e.g. 500GB), as the same variants need to be
aggregated together on one machine.
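
As a rough worked example (the 1TB input size below is an assumption, purely
for illustration):

```
# 1TB of uncompressed VCFs, using the 3-4x rule of thumb:
#   3 x 1TB = 3TB ... 4 x 1TB = 4TB of aggregate disk needed
#   4TB / 250GB per worker -> at least 16 workers at the default --disk_size_gb
#   4TB / 500GB per worker -> at least 8 workers with --disk_size_gb 500
```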

#### `--worker_disk_type`

SSDs provide significantly more IOPS than standard persistent disks, but are
more expensive. However, when choosing a large machine (e.g.
`n1-standard-16`), they can reduce cost as they can avoid idle CPU cycles due
to disk IOPS limitations.

As a result, we recommend using SSDs if [merging](variant_merging.md) is
enabled: merging requires "shuffling" the data (i.e. redistributing it among
workers), which involves significant disk I/O. Add the following flag to use
SSDs:

```
--worker_disk_type compute.googleapis.com/projects//zones//diskTypes/pd-ssd
```
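
Putting these recommendations together, a large-machine run with merging
enabled might look roughly like this (the machine type and disk size are
illustrative assumptions, not requirements):

```
/opt/gcp_variant_transforms/bin/vcf_to_bq ... \
  --worker_machine_type n1-standard-16 \
  --disk_size_gb 500 \
  --worker_disk_type compute.googleapis.com/projects//zones//diskTypes/pd-ssd
```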

### Adjusting Quotas and Limits

Compute Engine enforces quotas on the maximum amount of resources that can be
used at any time; please see [this page](https://cloud.google.com/compute/quotas)
for more details. As a result, you may need to adjust your quotas to satisfy
the job requirements. The flags mentioned above will not be effective if you
do not have enough quota. In other words, Dataflow autoscaling will not be
able to raise the number of workers to the target number if you don't have
enough quota for one of the required resources. One way to confirm this is to
check the *current usage* of your quotas. The following image shows a
situation where the `Persistent Disk SSD` quota in the `us-central1` region
has reached its maximum value:

To resolve situations like this, increase the following Compute Engine quotas:

* `In-use IP addresses`: One per worker. If you set `--use_public_ips false`,
  then Dataflow workers use private IP addresses for all communication.
* `CPUs`: At least one per worker. More if a larger machine type is used.
* `Persistent Disk Standard (GB)`: At least 250GB per worker. More if a larger
  disk is used.
* `Persistent Disk SSD (GB)`: Only needed if `--worker_disk_type` is set to SSD.
  Required quota size is the same as `Persistent Disk Standard`.

For more information, please refer to the
[Dataflow quotas guidelines](https://cloud.google.com/dataflow/quotas#compute-engine-quotas).
The values assigned to these quotas are the upper limit of available resources
for your job. For example, if the quota for `In-use IP addresses` is 10 but
you try to run with `--max_num_workers 20`, your job will run with at most 10
workers because that's all your GCP project is allowed to use.

Please note that you need to set quotas for the region your Dataflow pipeline
runs in. For more information about regions, please refer to our
[region documentation](setting_region.md).
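
For example, one way to check a region's quota limits and current usage from
the command line is sketched below; it uses the standard `gcloud` CLI, and the
region is a placeholder to replace with your own:

```
# Prints quota metrics such as CPUS, IN_USE_ADDRESSES, DISKS_TOTAL_GB and
# SSD_TOTAL_GB with their limits and current usage for the region.
gcloud compute regions describe us-central1 --format="yaml(quotas)"
```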

## Other options to consider

### Running preprocessor/validator tool

Because processing large inputs can take a long time and can be costly, we
highly recommend running the
[preprocessor/validator tool](vcf_files_preprocessor.md) before running the
full VCF to BigQuery pipeline to check for any invalid or inconsistent
records. Doing so can avoid failures due to invalid records and can save you
time and money. Depending on the quality of the input files, you may consider
running with `--report_all_conflicts` to get the full report. Running with
this flag takes longer, but it is more accurate and is highly recommended
when you're not sure about the quality of the input files.
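
A minimal sketch of such a run is shown below. The bucket paths are
placeholders, and the exact flag names should be double-checked against the
preprocessor/validator documentation linked above:

```
/opt/gcp_variant_transforms/bin/vcf_to_bq_preprocess \
  --input_pattern "gs://YOUR_BUCKET/*.vcf" \
  --report_path "gs://YOUR_BUCKET/preprocess_report.tsv" \
  --report_all_conflicts
```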

### Sharding

Sharding the output significantly reduces query costs once the data is in
BigQuery. It also optimizes the cost and time of the pipeline. As a result,
we enforce sharding for all runs of Variant Transforms; please see the
[sharding documentation](sharding.md) for more details.

For very large inputs, you can use `--sharding_config_path` to only process
and import a small region of the genome into BigQuery. For example, the
following sharding config file produces an output table that only contains
variants of chromosome 1 in the range `[1000000, 2000000]`:

```
-  output_table:
     table_name_suffix: "chr1_1M_2M"
     regions:
       - "chr1:1,000,000-2,000,000"
       - "1"
     partition_range_end: 2,000,000
```
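
To use a custom config like this instead of the default, point the pipeline
at it with the same flag (the file path below is a placeholder):

```
/opt/gcp_variant_transforms/bin/vcf_to_bq ... \
  --sharding_config_path path/to/chr1_1M_2M.yaml
```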

### Saving AVRO files

If you are processing large inputs, you can set the
`--keep_intermediate_avro_files` flag as a safety measure to ensure that the
result of your Dataflow pipeline is stored in Google Cloud Storage in case
something goes wrong while the AVRO files are copied into BigQuery. Doing so
will not increase your compute cost, because most of the cost of running
Variant Transforms comes from resources used in the Dataflow pipeline, and
loading AVRO files into BigQuery is free. Storing the intermediate AVRO files
ensures that the output of the Dataflow pipeline is not wasted. For more
information about this flag, please refer to the
[importing VCF files](vcf_to_bigquery.md) docs.

The downside of this approach is the extra cost of storing the AVRO files in
a Google Cloud Storage bucket. To avoid this cost, we recommend deleting the
AVRO files after they have been loaded into BigQuery. If your import job
fails and you need help loading the AVRO files into BigQuery, please let us
know by
[submitting an issue](https://github.com/googlegenomics/gcp-variant-transforms/issues).
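
For instance, loading the saved AVRO files manually and cleaning them up
afterwards might look roughly like this. The bucket path, dataset, and table
names are placeholders, and the exact layout of the intermediate files
depends on your run:

```
# Load the intermediate AVRO files into a BigQuery table.
bq load --source_format=AVRO YOUR_DATASET.YOUR_TABLE "gs://YOUR_BUCKET/avro_output/*"

# Once the load succeeds, delete the AVRO files to stop paying for their storage.
gsutil -m rm -r "gs://YOUR_BUCKET/avro_output"
```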