# Handling large inputs

The Variant Transforms pipeline can process hundreds of thousands of files,
millions of samples, and billions of records. There are a few settings that
may need to be adjusted depending on the size of the input files. Each of
these settings is explained in the sections below.

Example usage:

```
/opt/gcp_variant_transforms/bin/vcf_to_bq ... \
  --max_num_workers <default is automatically determined> \
  --worker_machine_type <default n1-standard-1> \
  --disk_size_gb <default 250> \
  --worker_disk_type <default PD> \
  --keep_intermediate_avro_files \
  --sharding_config_path <default gcp_variant_transforms/data/sharding_configs/homo_sapiens_default.yaml> \
```

#### `--max_num_workers`

By default, Dataflow uses its autoscaling algorithm to adjust the number of
workers assigned to each job (limited by your Compute Engine quota). You may
adjust the maximum number of workers using `--max_num_workers`. You may also
use `--num_workers` to specify the initial number of workers to assign to the
job.
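
For example, a minimal sketch combining these two flags (the worker counts
here are illustrative, not recommendations):

```
/opt/gcp_variant_transforms/bin/vcf_to_bq ... \
  --num_workers 20 \
  --max_num_workers 200
```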

#### `--worker_machine_type`

By default, Dataflow uses the `n1-standard-1` machine, which has 1 vCPU and
3.75GB of RAM. You may need to request a larger machine for large datasets.
Please see [this page](https://cloud.google.com/compute/pricing#predefined_machine_types)
for a list of available machine types.

We have observed that Dataflow performs more efficiently when running with a
large number of small machines rather than a small number of large machines
(e.g. 200 `n1-standard-4` workers instead of 25 `n1-standard-32` workers).
This is due to disk/network IOPS (input/output operations per second) being
limited for each machine, especially if [merging](variant_merging.md) is
enabled.

Using a large number of workers may not always be possible due to disk and IP
quotas. As a result, we recommend using SSDs (see
[`--worker_disk_type`](#--worker_disk_type)) when choosing a large
(`n1-standard-16` or larger) machine, which yields higher disk IOPS and can
avoid idle CPU cycles. Note that disk is significantly cheaper than CPU, so
always try to optimize for high CPU utilization rather than disk usage.
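
As an illustration, a run sized along these lines might look like the sketch
below (the machine type and worker count are assumptions to adapt to your
quota and input size, not prescriptions):

```
/opt/gcp_variant_transforms/bin/vcf_to_bq ... \
  --worker_machine_type n1-standard-4 \
  --max_num_workers 200
```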

#### `--disk_size_gb`

By default, each worker has 250GB of disk. The aggregate amount of disk space
from all workers must be at least as large as the uncompressed size of all
VCF files being processed. However, to accommodate the intermediate stages of
the pipeline and the additional overhead introduced by the transforms (e.g.
the sample name is repeated in every record in the BigQuery output rather
than just being specified once as in the VCF header), you typically need 3 to
4 times the total size of the raw VCF files.

In addition, if [merging](variant_merging.md) is enabled, you may need more
disk space per worker (e.g. 500GB), as the same variants need to be
aggregated together on one machine.
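
As a rough worked example (the 1TB input size below is an assumption, purely
for illustration):

```
# 1TB of uncompressed VCFs, using the 3-4x rule of thumb:
#   3 x 1TB = 3TB ... 4 x 1TB = 4TB of aggregate disk needed
#   4TB / 250GB per worker -> at least 16 workers at the default --disk_size_gb
#   4TB / 500GB per worker -> at least 8 workers with --disk_size_gb 500
```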

#### `--worker_disk_type`

SSDs provide significantly more IOPS than standard persistent disks, but are
more expensive. However, when choosing a large machine (e.g.
`n1-standard-16`), they can reduce cost as they can avoid idle CPU cycles due
to disk IOPS limitations.

As a result, we recommend using SSDs if [merging](variant_merging.md) is
enabled: merging requires "shuffling" the data (i.e. redistributing it among
workers), which involves significant disk I/O. Add the following flag to use
SSDs:

```
--worker_disk_type compute.googleapis.com/projects//zones//diskTypes/pd-ssd
```
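
Putting these recommendations together, a large-machine run with merging
enabled might look roughly like this (the machine type and disk size are
illustrative assumptions, not requirements):

```
/opt/gcp_variant_transforms/bin/vcf_to_bq ... \
  --worker_machine_type n1-standard-16 \
  --disk_size_gb 500 \
  --worker_disk_type compute.googleapis.com/projects//zones//diskTypes/pd-ssd
```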

### Adjusting Quotas and Limits

Compute Engine enforces quotas on the maximum amount of resources that can be
used at any time; please see [this page](https://cloud.google.com/compute/quotas)
for more details. As a result, you may need to adjust your quotas to satisfy
the job requirements. The flags mentioned above will not be effective if you
do not have enough quota. In other words, Dataflow autoscaling will not be
able to raise the number of workers to the target number if you don't have
enough quota for one of the required resources. One way to confirm this is to
check the *current usage* of your quotas. The following image shows a
situation where the `Persistent Disk SSD` quota in the `us-central1` region
has reached its maximum value:

To resolve situations like this, increase the following Compute Engine quotas:

* `In-use IP addresses`: One per worker. If you set `--use_public_ips false`,
  then Dataflow workers use private IP addresses for all communication.
* `CPUs`: At least one per worker. More if a larger machine type is used.
* `Persistent Disk Standard (GB)`: At least 250GB per worker. More if a larger
  disk is used.
* `Persistent Disk SSD (GB)`: Only needed if `--worker_disk_type` is set to SSD.
  Required quota size is the same as `Persistent Disk Standard`.

For more information, please refer to the
[Dataflow quotas guidelines](https://cloud.google.com/dataflow/quotas#compute-engine-quotas).
The values assigned to these quotas are the upper limit of available resources
for your job. For example, if the quota for `In-use IP addresses` is 10 but
you try to run with `--max_num_workers 20`, your job will run with at most 10
workers because that's all your GCP project is allowed to use.

Please note that you need to set quotas for the region your Dataflow pipeline
runs in. For more information about regions, please refer to our
[region documentation](setting_region.md).
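
For example, one way to check a region's quota limits and current usage from
the command line is sketched below; it uses the standard `gcloud` CLI, and the
region is a placeholder to replace with your own:

```
# Prints quota metrics such as CPUS, IN_USE_ADDRESSES, DISKS_TOTAL_GB and
# SSD_TOTAL_GB with their limits and current usage for the region.
gcloud compute regions describe us-central1 --format="yaml(quotas)"
```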

## Other options to consider

### Running preprocessor/validator tool

Because processing large inputs can take a long time and can be costly, we
highly recommend running the
[preprocessor/validator tool](vcf_files_preprocessor.md) before running the
full VCF to BigQuery pipeline to check for any invalid or inconsistent
records. Doing so can avoid failures due to invalid records and can save you
time and money. Depending on the quality of the input files, you may consider
running with `--report_all_conflicts` to get the full report. Running with
this flag takes longer, but it is more accurate and is highly recommended
when you're not sure about the quality of the input files.
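
A minimal sketch of such a run is shown below. The bucket paths are
placeholders, and the exact flag names should be double-checked against the
preprocessor/validator documentation linked above:

```
/opt/gcp_variant_transforms/bin/vcf_to_bq_preprocess \
  --input_pattern "gs://YOUR_BUCKET/*.vcf" \
  --report_path "gs://YOUR_BUCKET/preprocess_report.tsv" \
  --report_all_conflicts
```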

### Sharding

Sharding the output significantly reduces query costs once the data is in
BigQuery. It also optimizes the cost and time of the pipeline. As a result,
we enforce sharding for all runs of Variant Transforms; please see the
[sharding documentation](sharding.md) for more details.

For very large inputs, you can use `--sharding_config_path` to only process
and import a small region of the genome into BigQuery. For example, the
following sharding config file produces an output table that only contains
variants of chromosome 1 in the range `[1000000, 2000000]`:

```
-  output_table:
     table_name_suffix: "chr1_1M_2M"
     regions:
       - "chr1:1,000,000-2,000,000"
       - "1"
     partition_range_end: 2,000,000
```
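
To use a custom config like this instead of the default, point the pipeline
at it with the same flag (the file path below is a placeholder):

```
/opt/gcp_variant_transforms/bin/vcf_to_bq ... \
  --sharding_config_path path/to/chr1_1M_2M.yaml
```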

### Saving AVRO files

If you are processing large inputs, you can set the
`--keep_intermediate_avro_files` flag as a safety measure to ensure that the
result of your Dataflow pipeline is stored in Google Cloud Storage in case
something goes wrong while the AVRO files are copied into BigQuery. Doing so
will not increase your compute cost, because most of the cost of running
Variant Transforms comes from resources used in the Dataflow pipeline, and
loading AVRO files into BigQuery is free. Storing the intermediate AVRO files
ensures that the output of the Dataflow pipeline is not wasted. For more
information about this flag, please refer to the
[importing VCF files](vcf_to_bigquery.md) docs.

The downside of this approach is the extra cost of storing the AVRO files in
a Google Cloud Storage bucket. To avoid this cost, we recommend deleting the
AVRO files after they have been loaded into BigQuery. If your import job
fails and you need help loading the AVRO files into BigQuery, please let us
know by
[submitting an issue](https://github.com/googlegenomics/gcp-variant-transforms/issues).
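
For instance, loading the saved AVRO files manually and cleaning them up
afterwards might look roughly like this. The bucket path, dataset, and table
names are placeholders, and the exact layout of the intermediate files
depends on your run:

```
# Load the intermediate AVRO files into a BigQuery table.
bq load --source_format=AVRO YOUR_DATASET.YOUR_TABLE "gs://YOUR_BUCKET/avro_output/*"

# Once the load succeeds, delete the AVRO files to stop paying for their storage.
gsutil -m rm -r "gs://YOUR_BUCKET/avro_output"
```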