| 1 | +# Dealing with malformed files |
| 2 | + |
| 3 | + |
| 4 | +## Field compatibility |
| 5 | + |
| 6 | +When loading multiple files, the `INFO` and `FORMAT` fields |
| 7 | +from all VCF files are merged together to generate a "representative header", |
| 8 | +which is then used to generate a single BigQuery schema. If the same key is |
| 9 | +defined in multiple files, then its definition must be compatible across files. |
| 10 | +The compatibility rules are as follows: |
| 11 | + |
| 12 | +* Fields are compatible if they have the same `Number` and `Type` fields. |
| 13 | + Annotation fields (i.e. those specified by `--annotation_fields`) must also |
| 14 | + have the same `Description`. |
| 15 | + |
| 16 | +* Fields with different `Type` values are compatible in the following cases: |
| 17 | + |
| 18 | + * `Integer` and `Float` fields are compatible and are converted to `Float`. |
| 19 | + * You must run the pipeline with `--allow_incompatible_records` to |
| 20 | + automatically resolve conflicts between incompatible fields (e.g. `String` |
| 21 | + and `Integer`). This is to ensure incompatible types are not silently |
| 22 | + ignored. See |
| 23 | + [below](#specifying---allow_incompatible_records) for more details. |
| 24 | + |
| 25 | +* Fields with different `Number` values are compatible in the following cases: |
| 26 | + |
| 27 | + * "Repeated" numbers are compatible with each other. They include: |
| 28 | + * `Number=.` (unknown number) |
| 29 | + * Any `Number` greater than 1. |
| 30 | + * `Number=G` (one value per genotype) and `Number=R` (one value for each |
| 31 | + alternate and reference). |
| 32 | + * `Number=A` (one value for each alternate) only if running with |
| 33 | + `--split_alternate_allele_info_fields False`. |
| 34 | + |
| 35 | + * You must run the pipeline with `--allow_incompatible_records` to |
| 36 | + automatically resolve conflicts between incompatible fields (e.g. |
| 37 | + `Number=1` and `Number=.`). This is to ensure incompatible `Number` values |
| 38 | + are not silently ignored. |
| 39 | + See [below](#specifying---allow_incompatible_records) for more details. |
| 40 | + |
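The compatibility rules above can be sketched as a small check. This is only an illustration of the rules as written in this section, not the pipeline's actual implementation; the function names `merge_types`, `is_repeated`, and `numbers_compatible` are hypothetical, and the `split_alternate_allele_info_fields` argument simply mirrors the flag of the same name.

```python
from typing import Optional


def merge_types(type_a: str, type_b: str) -> Optional[str]:
    """Return the merged Type, or None if the two types are incompatible."""
    if type_a == type_b:
        return type_a
    # Integer and Float are compatible and are converted to Float.
    if {type_a, type_b} == {"Integer", "Float"}:
        return "Float"
    # Anything else (e.g. String vs. Integer) is a conflict.
    return None


def is_repeated(number: str, split_alternate_allele_info_fields: bool) -> bool:
    """Return True if a Number value counts as "repeated" under the rules above."""
    if number in (".", "G", "R"):
        return True
    if number == "A":
        # Number=A is only treated as repeated when alternate allele
        # INFO fields are not split.
        return not split_alternate_allele_info_fields
    # Any numeric Number greater than 1.
    return number.isdigit() and int(number) > 1


def numbers_compatible(number_a: str, number_b: str,
                       split_alternate_allele_info_fields: bool) -> bool:
    """Return True if two Number values are identical or both repeated."""
    if number_a == number_b:
        return True
    return (is_repeated(number_a, split_alternate_allele_info_fields)
            and is_repeated(number_b, split_alternate_allele_info_fields))
```

Under this sketch, `Number=1` versus `Number=.` and `String` versus `Integer` are reported as conflicts, which are the cases the text says require `--allow_incompatible_records`.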
| 41 | +You can run the preprocessing tool to get a summary of malformed/incompatible |
| 42 | +records. Please refer to |
| 43 | +[VCF files preprocessor](./vcf_files_preprocessor.md) for more details. |
| 44 | + |
| 45 | +## Specifying `--representative_header_file` |
| 46 | + |
| 47 | +The headers in the `--representative_header_file <path_to_file>` essentially |
| 48 | +specify the merged headers from all files being loaded to BigQuery. This file is |
| 49 | +used to directly generate the BigQuery schema. Note that we only read the |
| 50 | +header info from the file and ignore VCF records, so the |
| 51 | +`representative_header_file` can either be a file containing *just* the header |
| 52 | +fields or can point to an actual VCF file. Providing this file can be useful |
| 53 | +for: |
| 54 | + |
| 55 | +* Speeding up the pipeline especially if a large number of files are provided. |
| 56 | + The pipeline will use the provided file to generate the BigQuery schema and |
| 57 | + will skip merging headers across files. This is particularly useful if all |
| 58 | + files have identical VCF headers. |
| 59 | +* Providing definitions for missing header fields. See the |
| 60 | + [troubleshooting page](./troubleshooting.md) for more details. |
| 61 | +* Resolving incompatible field definitions across files. See |
| 62 | + [below](#specifying---allow_incompatible_records) for an alternative. |
| 63 | + |
| 64 | +## Specifying `--infer_headers` |
| 65 | + |
| 66 | +If this flag is set, the pipeline infers `Type` and `Number` for undefined |
| 67 | +fields based on the field values seen in the VCF files. It also outputs a |
| 68 | +representative header that contains the inferred definitions as well as the |
| 69 | +definitions from the headers. Use this flag if there are fields with missing |
| 70 | +definitions, or if the pipeline should ignore header definitions that are |
| 71 | +incompatible with the field values and instead infer the correct header |
| 72 | +definitions for the corresponding fields. |
| 73 | + |
| 74 | +Note that this makes the pipeline do two passes over the data, which results |
| 75 | +in ~30% more compute. |
| 76 | + |
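To make the idea of inferring a definition from values concrete, here is a minimal sketch. It is an assumption about what such inference could look like, not the pipeline's actual algorithm: the function `infer_definition` and its rules (try `Integer`, then `Float`, else `String`; `Number=1` only when every record has exactly one value) are hypothetical.

```python
from typing import Iterable, List, Tuple


def _parses_as(value: str, caster) -> bool:
    """Return True if caster(value) succeeds without a ValueError."""
    try:
        caster(value)
        return True
    except ValueError:
        return False


def infer_definition(observed: Iterable[List[str]]) -> Tuple[str, str]:
    """Infer (Type, Number) for one field.

    `observed` holds, for each record, the list of raw string values
    seen for the field in that record.
    """
    rows = list(observed)
    values = [value for row in rows for value in row]

    # Pick the narrowest type that every observed value fits.
    if values and all(_parses_as(v, int) for v in values):
        inferred_type = "Integer"
    elif values and all(_parses_as(v, float) for v in values):
        inferred_type = "Float"
    else:
        inferred_type = "String"

    # Hypothetical rule: a single value in every record means Number=1,
    # otherwise fall back to the unknown count.
    single_valued = bool(rows) and all(len(row) == 1 for row in rows)
    inferred_number = "1" if single_valued else "."
    return inferred_type, inferred_number
```

Because every value has to be inspected before a definition can be chosen, a sketch like this needs its own pass over the records, which is consistent with the extra pass mentioned above.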
| 77 | +## Specifying `--allow_incompatible_records` |
| 78 | + |
| 79 | +By default, the pipeline fails if there is a mismatch between a field definition |
| 80 | +and its actual values, or if a field has two inconsistent definitions in two |
| 81 | +different VCF files. |
| 82 | +If you specify `--allow_incompatible_records`, the pipeline resolves conflicts |
| 83 | +in header definitions. It also casts field values to match the BigQuery schema if |
| 84 | +there is a mismatch between field definition and field value (e.g. an `Integer` field |
| 85 | +value is cast to `String` to match a field schema of type `String`). |
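The casting described above can be pictured with a short sketch. This is illustrative only and makes assumptions: the helper name `cast_to_schema_type` is hypothetical, only the `String`, `Integer`, and `Float` types are handled, and raising `ValueError` for values that cannot be cast is a choice made for this example rather than documented pipeline behavior.

```python
from typing import Any

# Map schema type names to Python casters (assumed subset of types).
_CASTERS = {"String": str, "Integer": int, "Float": float}


def cast_to_schema_type(value: Any, schema_type: str) -> Any:
    """Cast a field value (or each element of a list of values) to the schema type."""
    if schema_type not in _CASTERS:
        raise ValueError("Unsupported schema type: %s" % schema_type)
    caster = _CASTERS[schema_type]
    if isinstance(value, list):
        # Repeated fields: cast each element individually.
        return [caster(item) for item in value]
    return caster(value)
```

For example, `cast_to_schema_type(7, "String")` yields the string `"7"`, matching the `Integer` to `String` case mentioned in the text.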