
Commit 8103253

Update docs ahead of new schema launch (#624)
* Update docs ahead of new schema launch

  This commit includes the following updates:
  * sharding.md
  * large_inputs.md

  and adds a new page:
  * query_cost.md

* First round of comments

* Modify our docs according to the offline comments

  Via a shared Google Doc.
1 parent 34221e6 commit 8103253

7 files changed (+451, -341 lines)

docs/images/console_quotas.png (262 KB)

docs/images/query_cost_partitioned_table.svg (1 addition, 0 deletions)

docs/images/query_cost_unpartitioned_table.svg (1 addition, 0 deletions)

docs/images/sharding_per_chromosome.svg (1 addition, 0 deletions)

docs/large_inputs.md (142 additions, 119 deletions)
@@ -1,149 +1,172 @@
 # Handling large inputs
 
-The Variant Transforms pipeline can process hunderds of thousands of files,
-millions of samples, and billions of records. There are a few settings that
-may need to be adjusted depending on the size of the input files. Each of these
-settings are explained in the sections below.
+The Variant Transforms pipeline can process hundreds of thousands of files,
+millions of samples, and billions of records. There are a few settings that
+may need to be adjusted depending on the size of the input files. Each of
+these settings is explained in the sections below.
 
-Default settings:
+Example usage:
 
 ```
 /opt/gcp_variant_transforms/bin/vcf_to_bq ... \
---optimize_for_large_inputs <default false> \
 --max_num_workers <default is automatically determined> \
 --worker_machine_type <default n1-standard-1> \
 --disk_size_gb <default 250> \
 --worker_disk_type <default PD> \
---num_bigquery_write_shards <default 1> \
+--keep_intermediate_avro_files \
 --sharding_config_path <default gcp_variant_transforms/data/
 sharding_configs/homo_sapiens_default.yaml> \
 ```
 
-### Important notes
+#### `--max_num_workers`
 
-#### Running preprocessor/validator tool
+By default, Dataflow uses its autoscaling algorithm to adjust the number of
+workers assigned to each job (limited by your Compute Engine quota).
+You may adjust the maximum number of workers using `--max_num_workers`.
+You may also use `--num_workers` to specify the initial number of workers
+to assign to the job.
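
For illustration, a minimal sketch of overriding both knobs (the worker counts here are placeholder values, not recommendations):

```
/opt/gcp_variant_transforms/bin/vcf_to_bq ... \
  --num_workers 10 \
  --max_num_workers 100
```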
 
-Since processing large inputs can take a long time and can be costly, we highly
-recommend running the [preprocessor/validator tool](vcf_files_preprocessor.md)
-prior to loading the full VCF to BigQuery pipeline to find out about any
-invalid/inconsistent records. This can avoid failures due to invalid records
-and can save time/cost. Depending on the quality of the input files, you may
-consider running with `--report_all_conflicts` to get the full report (it takes
-longer, but is more accurate and is highly recommended when you're not sure
-about the quality of the input files).
+#### `--worker_machine_type`
 
-#### Adjusting quota
+By default, Dataflow uses the `n1-standard-1` machine, which has 1 vCPU
+and 3.75GB of RAM. You may need to request a larger machine for large
+datasets. Please see [this page](https://cloud.google.com/compute/pricing#predefined_machine_types)
+for a list of available machine types.
+
+We have observed that Dataflow performs more efficiently when running
+with a large number of small machines rather than a small number of
+large machines (e.g. 200 `n1-standard-4` workers instead of 25
+`n1-standard-32` workers). This is due to disk/network IOPS
+(input/output operations per second) being limited for each machine,
+especially if [merging](variant_merging.md) is enabled.
+
+Using a large number of workers may not always be possible due to disk
+and IP quotas. As a result, we recommend using SSDs
+(see [`--worker_disk_type`](#--worker_disk_type))
+when choosing a large (`n1-standard-16` or larger) machine,
+which yields higher disk IOPS and can avoid idle CPU cycles.
+Note that disk is significantly cheaper than CPU, so always try to optimize
+for high CPU utilization rather than disk usage.
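
As a sketch of the many-small-machines guidance above, reusing the figures from the example in the previous paragraph (actual values depend on your input size and quota):

```
/opt/gcp_variant_transforms/bin/vcf_to_bq ... \
  --worker_machine_type n1-standard-4 \
  --max_num_workers 200
```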
+
+#### `--disk_size_gb`
+
+By default, each worker has 250GB of disk. The aggregate amount of
+disk space from all workers must be at least as large as the uncompressed
+size of all VCF files being processed. However, to accommodate the
+intermediate stages of the pipeline and the additional overhead introduced
+by the transforms (e.g. the sample name is repeated in every record in the
+BigQuery output rather than just being specified once as in the VCF header),
+you typically need 3 to 4 times the total size of the raw VCF files.
+
+In addition, if [merging](variant_merging.md) is enabled, you may
+need more disk space per worker (e.g. 500GB), as the same variants need
+to be aggregated together on one machine.
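
As a hypothetical sizing example, for roughly 1TB of uncompressed VCFs the 3 to 4x rule above suggests budgeting about 4TB of aggregate disk:

```
# ~8 workers x 500GB = 4TB of aggregate disk. Illustrative numbers only;
# autoscaling to fewer workers also shrinks the aggregate disk available.
/opt/gcp_variant_transforms/bin/vcf_to_bq ... \
  --max_num_workers 8 \
  --disk_size_gb 500
```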
+
+#### `--worker_disk_type`
+
+SSDs provide significantly more IOPS than standard persistent disks,
+but are more expensive. However, when choosing a large machine
+(e.g. `n1-standard-16`), they can reduce cost as they can avoid idle
+CPU cycles due to disk IOPS limitations.
+
+As a result, we recommend using SSDs if [merging](variant_merging.md) is
+enabled: this operation requires "shuffling" the data (i.e. redistributing
+the data among workers), which requires significant disk I/O. Add the
+following flag to use SSDs:
+
+```
+--worker_disk_type compute.googleapis.com/projects//zones//diskTypes/pd-ssd
+```
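
For instance, a hypothetical run combining the recommendations above (the machine type and disk size are placeholder values taken from the examples in this doc):

```
/opt/gcp_variant_transforms/bin/vcf_to_bq ... \
  --worker_machine_type n1-standard-16 \
  --disk_size_gb 500 \
  --worker_disk_type compute.googleapis.com/projects//zones//diskTypes/pd-ssd
```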
+
+### Adjusting Quotas and Limits
+
+Compute Engine enforces quota on the maximum amount of resources that can
+be used at any time; please see
+[this page](https://cloud.google.com/compute/quotas)
+for more details. As a result, you may need to adjust your quota to satisfy
+the job requirements. The flags mentioned above will not be effective if you
+do not have enough quota. In other words, Dataflow autoscaling will not be
+able to raise the number of workers to reach the target number if you don't
+have enough quota for one of the required resources. One way to confirm this
+is to check the *current usage* of your quotas. The following image shows a
+situation where `Persistent Disk SSD` in the `us-central1` region has reached
+its maximum value:
+
+![quotas](images/console_quotas.png)
 
-Compute Engine enforces quota on maximum amount of resources that can be used
-at any time for variety of reasons. Please see
-https://cloud.google.com/compute/quotas for more details. As a result, you may
-need to adjust the quota to satisfy the job requirements.
+To resolve situations like this, increase the following Compute Engine quotas:
 
-The main Compute Engine quotas to be adjusted are:
-* `In-use IP addresses`: One per worker.
-* `CPUs`: At least one per worker. More if larger machine type is used.
-* `Persistent Disk Standard (GB)`: At least 250GB per worker. More if larger
+* `In-use IP addresses`: One per worker. If you set `--use_public_ips false`
+  then Dataflow workers use private IP addresses for all communication.
+* `CPUs`: At least one per worker. More if a larger machine type is used.
+* `Persistent Disk Standard (GB)`: At least 250GB per worker. More if a larger
   disk is used.
 * `Persistent Disk SSD (GB)`: Only needed if `--worker_disk_type` is set to SSD.
   Required quota size is the same as `Persistent Disk Standard`.
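
For illustration, a minimal sketch of the `--use_public_ips` option mentioned in the first bullet (this assumes your subnetwork has Private Google Access enabled, so workers without public IPs can still reach Google APIs):

```
/opt/gcp_variant_transforms/bin/vcf_to_bq ... \
  --use_public_ips false
```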
 
-Note that the value assigned to these quotas will be the upper limit of
-available resources for your job. For example, if the quota for
-`In-use IP addresses` is 10, but you try to run with `--max_num_workers 20`,
-your job will be running with at most 10 workers because that's all your GCP
-project is allowed to use.
+For more information, please refer to the
+[Dataflow quotas guidelines](https://cloud.google.com/dataflow/quotas#compute-engine-quotas).
+Values assigned to these quotas are the upper limit of available resources for
+your job. For example, if the quota for `In-use IP addresses` is 10, but you try
+to run with `--max_num_workers 20`, your job will be running with at most 10
+workers because that's all your GCP project is allowed to use.
 
-### `--optimize_for_large_inputs`
+Please note that you need to set quotas in the region where your Dataflow
+pipeline runs. For more information about regions, please refer to our
+[region documentation](setting_region.md).
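
As a command-line alternative to the console screenshot above, a minimal sketch assuming the `gcloud` CLI is installed and `us-central1` is your region:

```
# Prints the region's quotas as metric/limit/usage entries; look for
# IN_USE_ADDRESSES, CPUS, DISKS_TOTAL_GB and SSD_TOTAL_GB.
gcloud compute regions describe us-central1
```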
 
-This flag should be set to true when loading more than 50,000 files and/or
-[merging](variant_merging.md) is enabled for a large number (>3 billion)
-of variants. This flag optimizes the Dataflow pipeline for large inputs, which
-can save significant cost/time, but the additional overhead may hurt cost/time
-for small inputs.
+## Other options to consider
 
-### `--max_num_workers`
+### Running the preprocessor/validator tool
 
-By default, Dataflow uses its autoscaling algorithm to adjust the number of
-workers assigned to each job (limited by the Compute Engine quota). You may
-adjust the maximum number of workers using `--max_num_workers`. You may also use
-`--num_workers` to specify the initial number of workers to assign to the job.
+Because processing large inputs can take a long time and can be costly,
+we highly recommend running the
+[preprocessor/validator tool](vcf_files_preprocessor.md)
+before running the full VCF to BigQuery pipeline, to check for any
+invalid/inconsistent records. Doing so can avoid failures due to invalid
+records and can save you time and money. Depending on the quality of the
+input files, you may consider running with `--report_all_conflicts` to
+get the full report. Running with this flag will take longer, but it is
+more accurate and is highly recommended when you're not sure about the
+quality of the input files.
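
For illustration, a hypothetical preprocessor run, assuming the docker image exposes a `vcf_to_bq_preprocess` entry point alongside `vcf_to_bq` (see the linked preprocessor page for the authoritative flags; the bucket paths are placeholders):

```
/opt/gcp_variant_transforms/bin/vcf_to_bq_preprocess \
  --input_pattern "gs://my_bucket/*.vcf" \
  --report_path "gs://my_bucket/preprocess_report.tsv" \
  --report_all_conflicts true
```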
 
-### `--worker_machine_type`
+### Sharding
 
-By default, Dataflow uses the `n1-standard-1` machine, which has 1 vCPU and
-3.75GB of RAM. You may need to request a larger machine for large datasets.
-Please see https://cloud.google.com/compute/pricing#predefined_machine_types
-for a list of available machine types.
+Sharding the output significantly reduces query costs once the data
+is in BigQuery. It also optimizes the cost and time of the pipeline.
+As a result, we enforce sharding for all runs of Variant Transforms;
+please see the [sharding documentation](sharding.md) for more details.
+
+
+For very large inputs, you can use `--sharding_config_path` to only
+process and import a small region of the genome into BigQuery. For example,
+the following sharding config file produces an output table that only
+contains variants of chromosome 1 in the range `[1000000, 2000000]`:
+
+```
+- output_table:
+    table_name_suffix: "chr1_1M_2M"
+    regions:
+      - "chr1:1,000,000-2,000,000"
+      - "1"
+    partition_range_end: 2,000,000
+```
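
A hypothetical invocation pointing the pipeline at such a config (the file path is a placeholder):

```
/opt/gcp_variant_transforms/bin/vcf_to_bq ... \
  --sharding_config_path path/to/chr1_1M_2M.yaml
```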
+
+### Saving AVRO files
+
+If you are processing large inputs, you can set the
+`--keep_intermediate_avro_files` flag as a safety measure to ensure that the
+result of your Dataflow pipeline is stored in Google Cloud Storage in
+case something goes wrong while the AVRO files are copied into BigQuery.
+Doing so will not increase your compute cost, because most of the cost
+of running Variant Transforms is due to resources used in the Dataflow
+pipeline, and loading AVRO files to BigQuery is free. Storing the
+intermediate AVRO files avoids wasting the output of Dataflow. For more
+information about this flag, please refer to the
+[importing VCF files](vcf_to_bigquery.md) docs.
+
+The downside of this approach is the extra cost of storing AVRO files in a
+Google Cloud Storage bucket. To avoid this cost, we recommend deleting
+the AVRO files after they have been loaded into BigQuery. If your import
+job fails and you need help with loading AVRO files into BigQuery,
+please let us know by
+[submitting an issue](https://github.com/googlegenomics/gcp-variant-transforms/issues).
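
For illustration, a minimal sketch of loading kept AVRO files by hand and then deleting them (the dataset, table, and bucket names are placeholders):

```
# Load the intermediate AVRO files into a BigQuery table; per the text
# above, BigQuery load jobs from AVRO are free of charge.
bq load --source_format=AVRO my_dataset.my_variants "gs://my_bucket/avro/*"

# Once the load succeeds, delete the files to stop paying for storage.
gsutil -m rm -r "gs://my_bucket/avro"
```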
 
-We have observed that Dataflow performs more efficiently when running
-with a large number of small machines rather than a small number of large
-machines (e.g. 200 `n1-standard-4` workers instead of 25 `n1-standard-32`
-workers). This is due to disk/network IOPS (input/output operations per second)
-being limited for each machine especially if [merging](variant_merging.md) is
-enabled.
-
-Using a large number of workers may not always be possible due to disk and
-IP quotas. As a result, we recommend using SSDs (see
-[`--worker_disk_type`](#--worker_disk_type)) when choosing a large (
-`n1-standard-16` or larger) machine, which yields higher disk IOPS and can avoid
-idle CPU cycles. Note that disk is significantly cheaper than CPU, so always try
-to optimize for high CPU utilization rather than disk usage.
-
-### `--disk_size_gb`
-
-By default, each worker has 250GB of disk. The aggregate amount of disk space
-from all workers must be at least as large as the uncompressed size of all VCF
-files being processed. However, to accomoddate for intermediate stages of the
-pipeline and also to account for the additional overhead introduced by the
-transforms (e.g. the sample name is repeated in every record in the BigQuery
-output rather than just being specified once as in the VCF header), you
-typically need 3 to 4 times the total size of the raw VCF files.
-
-In addition, if [merging](variant_merging.md) or
-[--num_bigquery_write_shards](#--num_bigquery_write_shards) is enabled, you may
-need more disk per worker (e.g. 500GB) as the same variants need to be
-aggregated together on one machine.
-
-### `--worker_disk_type`
-
-SSDs provide significantly more IOPS than standard persistent disks, but are
-more expensive. However, when choosing a large machine (e.g. `n1-standard-16`),
-they can reduce cost as they can avoid idle CPU cycles due to disk IOPS
-limitations.
-
-As a result, we recommend using SSDs if [merging](variant_merge.md) or
-[--num_bigquery_write_shards](#--num_bigquery_write_shards) is enabled: these
-operations require "shuffling" the data (i.e. redistributing the data among
-workers), which require significant disk I/O.
-
-Set
-`--worker_disk_type compute.googleapis.com/projects//zones//diskTypes/pd-ssd`
-to use SSDs.
-
-### `--num_bigquery_write_shards`
-
-Currently, the write operation to BigQuery in Dataflow is performed as a
-postprocessing step after the main transforms are done. As a workaround for
-BigQuery write limitations (more details
-[here](https://github.com/googlegenomics/gcp-variant-transforms/issues/199)),
-we have added "sharding" when writing to BigQuery. This makes the data load
-to BigQuery significantly faster as it parallelizes the process and enables
-loading large (>5TB) data to BigQuery at once.
-
-As a result, we recommend setting `--num_bigquery_write_shards 20` when loading
-any data that has more than 1 billion rows (after merging) or 1TB of final
-output. You may use a smaller number of write shards (e.g. 5) when using
-[sharded output](#--sharding_config_path) as each partition also acts as a
-"shard". Note that using a larger value (e.g. 50) can cause BigQuery write to
-fail as there is a maximum limit on the number of concurrent writes per table.
-
-### `--sharding_config_path`
-
-Sharding the output can save significant query costs once the data is in
-BigQuery. It can also optimize the cost/time of the pipeline (e.g. it natively
-shards the BigQuery output per partition and merging can also occur per
-partition).
-
-As a result, we recommend setting the partition config for very large data
-where possible. Please see the [documentation](sharding.md) for more
-details.
