Document flags for robust Variant Transforms. (#357)

nmousavi · web-flow · commit 0497492571e1 · 2018-09-10T11:33:33.000-04:00
* Document flags for robust Variant Transforms. Fixes issue #184
diff --git a/docs/malformed_files.md b/docs/malformed_files.md
@@ -0,0 +1,85 @@
+# Dealing with malformed files
+
+
+## Field compatibility
+
+When loading multiple files, the `INFO` and `FORMAT` fields
+from all VCF files are merged together to generate a "representative header",
+which is then used to generate a single BigQuery schema. If the same key is
+defined in multiple files, then its definition must be compatible across files.
+The compatibility rules are as follows:
+
+* Fields are compatible if they have the same `Number` and `Type` fields.
+  Annotation fields (i.e. those specified by `--annotation_fields`) must also
+  have the same `Description`.
+
+* Fields with different `Type` values are compatible in the following cases:
+
+  * `Integer` and `Float` fields are compatible and are converted to `Float`.
+  * Any other `Type` mismatch (e.g. `String` and `Integer`) is incompatible.
+    To resolve such conflicts automatically, run the pipeline with
+    `--allow_incompatible_records`. The flag is opt-in so that incompatible
+    types are never silently ignored. See
+    [below](#specifying---allow_incompatible_records) for more details.
+
+* Fields with different `Number` values are compatible in the following cases:
+
+  * "Repeated" numbers are compatible with each other. They include:
+    * `Number=.` (unknown number)
+    * Any `Number` greater than 1.
+    * `Number=G` (one value per genotype) and `Number=R` (one value for each
+      alternate and reference).
+    * `Number=A` (one value for each alternate) only if running with
+      `--split_alternate_allele_info_fields False`.
+
+  * Any other `Number` mismatch (e.g. `Number=1` and `Number=.`) is
+    incompatible. To resolve such conflicts automatically, run the pipeline
+    with `--allow_incompatible_records`. The flag is opt-in so that
+    incompatible definitions are never silently ignored.
+    See [below](#specifying---allow_incompatible_records) for more details.
+
+You can run the preprocessing tool to get a summary of malformed or
+incompatible records. See the
+[VCF files preprocessor](./vcf_files_preprocessor.md) documentation for details.
+
+## Specifying `--representative_header_file`
+
+The headers in the `--representative_header_file <path_to_file>` essentially
+specify the merged headers from all files being loaded to BigQuery. This file is
+used to directly generate the BigQuery schema. Note that we only read the
+header info from the file and ignore VCF records, so the
+`representative_header_file` can either be a file containing *just* the header
+fields or can point to an actual VCF file. Providing this file can be useful
+for:
+
+* Speeding up the pipeline, especially if a large number of files are provided.
+  The pipeline will use the provided file to generate the BigQuery schema and
+  will skip merging headers across files. This is particularly useful if all
+  files have identical VCF headers.
+* Providing definitions for missing header fields. See the
+  [troubleshooting page](./troubleshooting.md) for more details.
+* Resolving incompatible field definitions across files. See
+  [below](#specifying---allow_incompatible_records) for an alternative.
+
+## Specifying `--infer_headers`
+
+If this flag is set, the pipeline infers `Type` and `Number` for undefined
+fields based on the field values seen in the VCF files. It also outputs a
+representative header that contains the inferred definitions as well as the
+definitions from the headers. Use this flag if some fields are missing
+definitions, or if the pipeline should ignore header definitions that are
+incompatible with the field values and instead infer the correct definitions
+for the corresponding fields.
+
+Note that this makes the pipeline do two passes over the data, which results
+in ~30% more compute.
+
+## Specifying `--allow_incompatible_records`
+
+By default, the pipeline fails if there is a mismatch between a field
+definition and its actual values, or if a field has inconsistent definitions
+in different VCF files.
+If you specify `--allow_incompatible_records`, the pipeline resolves conflicts
+in header definitions. It also casts field values to match the BigQuery schema
+if a field value does not match its definition (e.g. an `Integer` value is
+cast to `String` to match a schema field of type `String`).
diff --git a/docs/multiple_files.md b/docs/multiple_files.md
@@ -12,8 +12,7 @@ The following operations are performed when loading multiple files:
 * A merged BigQuery schema is created that contains data from all matching
   files. Particularly, all `INFO` and `FORMAT` fields in all VCF files will
   be merged together. This assumes that fields with the same key that are
-  defined in multiple files are compatible (see [below](#field-compatibility)
-  for field compatiblility).
+  defined in multiple files are compatible.
 * Records from all files are loaded into the single table. Missing fields
   are set to `null` in the associated column(s).
 * Samples can be optionally merged together using a merging strategy. See
@@ -27,51 +26,11 @@ is included in a folder starting with `a`, `b`, or `c`. See the
 [Cloud Storage documentation](https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames)
 for more details.
 
-## Field compatibility
+## Incompatible VCF files
 
-As mentioned above, when loading multiple files, the `INFO` and `FORMAT` fields
-from all VCF files are merged together to generate a "representative header",
-which is then used to generate a single BigQuery schema. If the same key is
-defined in multiple files, then its definition must be compatible across files.
-The compatibility rules are as follows:
-
-* Fields are compatible if they have the same `Number` and `Type` fields.
-  Annotation fields (i.e. those specified by `--annotation_fields`) must also
-  have the same `Description`.
-
-* Fields with different `Type` values are compatible in the following cases:
-
-  * `Integer` and `Float` fields are compatible and are converted to `Float`.
-  * We are adding more compatibility options (e.g. resolving `String` and
-  `Integer` to `String`). Stay tuned!
-
-* Fields with different `Number` values are compatible in the following cases:
-
-  * "Repeated" numbers are compatible with each other. They include:
-    * `Number=.` (unknown number)
-    * Any `Number` greater than 1.
-    * `Number=G` (one value per genotype) and `Number=R` (one value for each
-      alternate and reference).
-    * `Number=A` (one value for each alternate) only if running with
-      `--split_alternate_allele_info_fields False`.
-
-  * We are adding more compatibility options (e.g. resolving `Number=1` and
-    `Number=.` with `Number=.`). Stay tuned!
-
-## Specifying `--representative_header_file`
-
-The headers in the `--representative_header_file <path_to_file>` essentially
-specify the merged headers from all files being loaded to BigQuery. This file is
-used to directly generate the BigQuery schema. Note that we only read the
-header info from the file and ignore VCF records, so the
-`representative_header_file` can either be a file containing *just* the header
-fields or can point to an actual VCF file. Providing this file can be useful
-for:
+When loading multiple files, field definitions and their values must be
+consistent across all VCF files; otherwise the pipeline fails by default.
+However, the Variant Transforms pipeline can handle such malformed files if
+instructed to do so. See [Dealing with malformed files](./malformed_files.md)
+for more details.
 
-* Speeding up the pipeline especially if a large number of files are provided.
-  The pipeline will use the provided file to generate the BigQuery schema and
-  will skip merging headers across files. This is particularly useful if all
-  files have identical VCF headers.
-* Providing definitions for missing header fields. See the
-  [troubleshooting page](./troubleshooting.md) for more details.
-* Coming soon: resolving incompatible fields across files.
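
Reviewer note: the field compatibility rules documented in `docs/malformed_files.md` above can be sketched in Python as follows. This is a minimal illustration, not the Variant Transforms implementation. `FieldDef`, `is_repeated`, and `merge_definitions` are hypothetical names. The resolved values (`Number=.` for mixed repeated numbers, and `String` / `Number=.` when incompatible records are allowed) are assumptions based on the examples in the removed text. The default of `True` for the split flag is also an assumption.

```python
from typing import NamedTuple, Union


class FieldDef(NamedTuple):
    """Hypothetical stand-in for one INFO/FORMAT header definition."""
    number: Union[int, str]  # 0, 1, 2, ... or ".", "A", "G", "R"
    type: str                # "Integer", "Float", "String", ...


def is_repeated(number, split_alternate_allele_info_fields=True):
    """Return True if `number` counts as a "repeated" Number per the docs."""
    if number in (".", "G", "R"):
        return True
    if number == "A":
        # Number=A is repeated only when alternate alleles are not split.
        return not split_alternate_allele_info_fields
    return isinstance(number, int) and number > 1


def merge_definitions(a, b, allow_incompatible_records=False,
                      split_alternate_allele_info_fields=True):
    """Merge two definitions of the same key, or raise ValueError."""
    # Resolve Type: identical types pass through; Integer/Float becomes Float.
    if a.type == b.type:
        merged_type = a.type
    elif {a.type, b.type} == {"Integer", "Float"}:
        merged_type = "Float"
    elif allow_incompatible_records:
        merged_type = "String"  # assumption: fall back to String
    else:
        raise ValueError(f"Incompatible types: {a.type} vs {b.type}")

    # Resolve Number: identical numbers pass through; repeated numbers merge.
    split = split_alternate_allele_info_fields
    if a.number == b.number:
        merged_number = a.number
    elif is_repeated(a.number, split) and is_repeated(b.number, split):
        merged_number = "."  # assumption: mixed repeated numbers become "."
    elif allow_incompatible_records:
        merged_number = "."  # assumption: fall back to unknown number
    else:
        raise ValueError(f"Incompatible numbers: {a.number} vs {b.number}")

    return FieldDef(merged_number, merged_type)
```

For example, `merge_definitions(FieldDef(1, "Integer"), FieldDef(1, "Float"))` yields a `Float` field with `Number=1`, while merging `String` with `Integer` raises `ValueError` unless `allow_incompatible_records=True` is passed.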