Commit 0497492

Document flags for robust Variant Transforms. (#357)
* Document flags for robust Variant Transforms. Fixes issue #184
1 parent 6440818 commit 0497492

File tree

2 files changed

+92
-48
lines changed

docs/malformed_files.md

+85
@@ -0,0 +1,85 @@
1+
# Dealing with malformed files
2+
3+
4+
## Field compatibility
5+
6+
When loading multiple files, the `INFO` and `FORMAT` fields
7+
from all VCF files are merged together to generate a "representative header",
8+
which is then used to generate a single BigQuery schema. If the same key is
9+
defined in multiple files, then its definition must be compatible across files.
10+
The compatibility rules are as follows:
11+
12+
* Fields are compatible if they have the same `Number` and `Type` fields.
13+
Annotation fields (i.e. those specified by `--annotation_fields`) must also
14+
have the same `Description`.
15+
16+
* Fields with different `Type` values are compatible in the following cases:
17+
18+
* `Integer` and `Float` fields are compatible and are converted to `Float`.
19+
* You must run the pipeline with `--allow_incompatible_records` to
20+
automatically resolve conflicts between incompatible fields (e.g. `String`
21+
and `Integer`). This is to ensure incompatible types are not silently
22+
ignored. See
23+
[below](#specifying---allow_incompatible_records) for more details.
24+
25+
* Fields with different `Number` values are compatible in the following cases:
26+
27+
* "Repeated" numbers are compatible with each other. They include:
28+
* `Number=.` (unknown number)
29+
* Any `Number` greater than 1.
30+
* `Number=G` (one value per genotype) and `Number=R` (one value for each
31+
alternate and reference).
32+
* `Number=A` (one value for each alternate) only if running with
33+
`--split_alternate_allele_info_fields False`.
34+
35+
* You must run the pipeline with `--allow_incompatible_records` to
36+
automatically resolve conflicts between incompatible fields (e.g.
37+
`Number=1` and `Number=.`). This is to ensure incompatible types
38+
are not silently ignored.
39+
See [below](#specifying---allow_incompatible_records) for more details.
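
The compatibility rules above can be sketched in Python as follows. This is an illustrative sketch only, not the Variant Transforms implementation: the function names `is_repeated` and `merge_definitions` are hypothetical, resolving two different repeated numbers to `Number=.` is an assumption, and so is treating `--split_alternate_allele_info_fields` as enabled by default. The `Description` check for annotation fields is not modeled.

```python
def is_repeated(number, split_alternate_allele_info_fields=True):
    """Return True if a VCF `Number` value (as a string) is a "repeated" number."""
    if number in (".", "G", "R"):
        return True
    if number == "A":
        # `Number=A` only counts as repeated when alternates are not split.
        return not split_alternate_allele_info_fields
    return number.isdigit() and int(number) > 1


def merge_definitions(first, second, split_alternate_allele_info_fields=True):
    """Merge two (number, type) definitions of the same key.

    Raises ValueError when the definitions are incompatible, which mirrors
    the default behavior without `--allow_incompatible_records`.
    """
    number_a, type_a = first
    number_b, type_b = second

    if type_a == type_b:
        merged_type = type_a
    elif {type_a, type_b} == {"Integer", "Float"}:
        merged_type = "Float"
    else:
        raise ValueError(f"Incompatible types: {type_a} and {type_b}")

    split = split_alternate_allele_info_fields
    if number_a == number_b:
        merged_number = number_a
    elif is_repeated(number_a, split) and is_repeated(number_b, split):
        # Assumption: distinct repeated numbers resolve to "unknown number".
        merged_number = "."
    else:
        raise ValueError(f"Incompatible numbers: {number_a} and {number_b}")

    return merged_number, merged_type
```

For example, `merge_definitions(("1", "Integer"), ("1", "Float"))` yields a `Float` field, while `merge_definitions(("1", "String"), ("1", "Integer"))` raises `ValueError`.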
40+
41+
You can run the preprocessing tool to get a summary of malformed/incompatible
42+
records. Please refer to
43+
[VCF files preprocessor](./vcf_files_preprocessor.md) for more details.
44+
45+
## Specifying `--representative_header_file`
46+
47+
The headers in the `--representative_header_file <path_to_file>` essentially
48+
specify the merged headers from all files being loaded to BigQuery. This file is
49+
used to directly generate the BigQuery schema. Note that we only read the
50+
header info from the file and ignore VCF records, so the
51+
`representative_header_file` can either be a file containing *just* the header
52+
fields or can point to an actual VCF file. Providing this file can be useful
53+
for:
54+
55+
* Speeding up the pipeline especially if a large number of files are provided.
56+
The pipeline will use the provided file to generate the BigQuery schema and
57+
will skip merging headers across files. This is particularly useful if all
58+
files have identical VCF headers.
59+
* Providing definitions for missing header fields. See the
60+
[troubleshooting page](./troubleshooting.md) for more details.
61+
* Resolving incompatible field definitions across files. See
62+
[below](#specifying---allow_incompatible_records) for an alternative.
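
Because only the header is read, a header-only file is enough. As a minimal sketch (the helper name `extract_header_lines` is hypothetical and not part of Variant Transforms), the leading `#` lines of an existing VCF file could be collected like this:

```python
def extract_header_lines(vcf_lines):
    """Return the leading header lines of a VCF file.

    VCF header lines start with '#'; the first line that does not is the
    first variant record, so we stop reading there.
    """
    header = []
    for line in vcf_lines:
        if not line.startswith("#"):
            break
        header.append(line)
    return header
```

Applied to an open file object, the returned lines can be written out to a new file and passed via `--representative_header_file`.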
63+
64+
## Specifying `--infer_headers`
65+
66+
If this flag is set, the pipeline will infer `Type` and `Number` for undefined
67+
fields based on field values seen in VCF files. It will also output a
68+
representative header that contains inferred definitions as well as definitions
69+
from headers. Use this flag if there are fields with missing definitions or if
70+
the pipeline should ignore header definitions that are incompatible with field
71+
values, and instead should infer the correct header definitions for
72+
the corresponding fields.
73+
74+
Note that this makes the pipeline do two passes over the data, which results
75+
in ~30% more compute.
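
To illustrate the idea of inferring a definition from observed values, here is a rough sketch. It is not the pipeline's actual inference logic: the names `infer_type` and `infer_definition` are hypothetical, and the choices of `Number=1` versus `Number=.` and of widening `Integer` to `Float` to `String` are assumptions made for the example.

```python
# Assumed widening order: a wider type can represent every narrower value.
_TYPE_RANK = {"Integer": 0, "Float": 1, "String": 2}


def infer_type(value):
    """Infer a VCF `Type` for a single raw string value."""
    try:
        int(value)
        return "Integer"
    except ValueError:
        pass
    try:
        float(value)
        return "Float"
    except ValueError:
        return "String"


def infer_definition(raw_values):
    """Infer (number, type) from raw field strings such as "3" or "1,2".

    Assumption: `Number` is "1" when every occurrence holds exactly one
    value, and "." (unknown number) otherwise.
    """
    number = "1"
    widest = "Integer"
    for raw in raw_values:
        parts = raw.split(",")
        if len(parts) != 1:
            number = "."
        for part in parts:
            candidate = infer_type(part)
            if _TYPE_RANK[candidate] > _TYPE_RANK[widest]:
                widest = candidate
    return number, widest
```

Seeing every value before deciding is what makes an extra pass over the data necessary.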
76+
77+
## Specifying `--allow_incompatible_records`
78+
79+
The pipeline will fail by default if there is a mismatch between field definition
80+
and actual values or if a field has two inconsistent definitions in two
81+
different VCF files.
82+
By specifying `--allow_incompatible_records`, the pipeline will resolve conflicts
83+
in header definitions. It will also cast field values to match BigQuery schema if
84+
there is a mismatch between field definition and field value (e.g. `Integer` field
85+
value is cast to `String` to match a field schema of type `String`).
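
The value casting described above can be pictured with a small sketch. The helper name `cast_to_schema` is hypothetical, and how the real pipeline treats values that cannot be cast is not covered here; in this sketch such values simply raise `ValueError`.

```python
def cast_to_schema(value, schema_type):
    """Cast a parsed field value to the VCF type used by the schema."""
    if schema_type == "String":
        return str(value)
    if schema_type == "Float":
        return float(value)
    if schema_type == "Integer":
        return int(value)
    raise ValueError(f"Unsupported schema type: {schema_type}")
```

With this sketch, an `Integer` value `7` loaded into a `String` column becomes `"7"`.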

docs/multiple_files.md

+7-48
@@ -12,8 +12,7 @@ The following operations are performed when loading multiple files:
1212
* A merged BigQuery schema is created that contains data from all matching
1313
files. Particularly, all `INFO` and `FORMAT` fields in all VCF files will
1414
be merged together. This assumes that fields with the same key that are
15-
defined in multiple files are compatible (see [below](#field-compatibility)
16-
for field compatiblility).
15+
defined in multiple files are compatible.
1716
* Records from all files are loaded into the single table. Missing fields
1817
are set to `null` in the associated column(s).
1918
* Samples can be optionally merged together using a merging strategy. See
@@ -27,51 +26,11 @@ is included in a folder starting with `a`, `b`, or `c`. See the
2726
[Cloud Storage documentation](https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames)
2827
for more details.
2928

30-
## Field compatibility
29+
## Incompatible VCF files
3130

32-
As mentioned above, when loading multiple files, the `INFO` and `FORMAT` fields
33-
from all VCF files are merged together to generate a "representative header",
34-
which is then used to generate a single BigQuery schema. If the same key is
35-
defined in multiple files, then its definition must be compatible across files.
36-
The compatibility rules are as follows:
37-
38-
* Fields are compatible if they have the same `Number` and `Type` fields.
39-
Annotation fields (i.e. those specified by `--annotation_fields`) must also
40-
have the same `Description`.
41-
42-
* Fields with different `Type` values are compatible in the following cases:
43-
44-
* `Integer` and `Float` fields are compatible and are converted to `Float`.
45-
* We are adding more compatibility options (e.g. resolving `String` and
46-
`Integer` to `String`). Stay tuned!
47-
48-
* Fields with different `Number` values are compatible in the following cases:
49-
50-
* "Repeated" numbers are compatible with each other. They include:
51-
* `Number=.` (unknown number)
52-
* Any `Number` greater than 1.
53-
* `Number=G` (one value per genotype) and `Number=R` (one value for each
54-
alternate and reference).
55-
* `Number=A` (one value for each alternate) only if running with
56-
`--split_alternate_allele_info_fields False`.
57-
58-
* We are adding more compatibility options (e.g. resolving `Number=1` and
59-
`Number=.` with `Number=.`). Stay tuned!
60-
61-
## Specifying `--representative_header_file`
62-
63-
The headers in the `--representative_header_file <path_to_file>` essentially
64-
specify the merged headers from all files being loaded to BigQuery. This file is
65-
used to directly generate the BigQuery schema. Note that we only read the
66-
header info from the file and ignore VCF records, so the
67-
`representative_header_file` can either be a file containing *just* the header
68-
fields or can point to an actual VCF file. Providing this file can be useful
69-
for:
31+
When loading multiple files, field definitions and their values must be
32+
consistent across all VCF files, or else the pipeline fails by default.
33+
However, the Variant Transforms pipeline can handle such malformed files if
34+
instructed to do so. See [Dealing with malformed files](./malformed_files.md)
35+
for more details.
7036

71-
* Speeding up the pipeline especially if a large number of files are provided.
72-
The pipeline will use the provided file to generate the BigQuery schema and
73-
will skip merging headers across files. This is particularly useful if all
74-
files have identical VCF headers.
75-
* Providing definitions for missing header fields. See the
76-
[troubleshooting page](./troubleshooting.md) for more details.
77-
* Coming soon: resolving incompatible fields across files.
