Duplications
Duplication - an insertion in the query sequence of an extra copy of some reference sequence not adjacent to this region, creating an interspersed repeat, or increasing the copy number of an interspersed repeat.

Figure 1: Duplication example. a) The query sequence has the same direction as the reference sequence. b) The query sequence is reverse complemented. Duplication, Repeat_q and Repeat_r are identical or near-identical repeats.
Duplications are reported in the query_struct.gff and ref_struct.gff files. The locations of the repeated regions involved in each difference are given in the query_additional.gff and ref_additional.gff files.
An example of duplication entries in query_struct.gff:
##gff-version 3
##sequence-region query_1 1 75565
query_1 NucDiff_v2.0 SO:0000667 2561 2640 . . . ID=SV_1;Name=insertion;ins_len=80;query_dir=1;ref_sequence=ref_1;ref_coord=2515;color=#EE0000
query_1 NucDiff_v2.0 SO:1000035 2641 2645 . . . ID=SV_2;Name=duplication;ins_len=5;query_dir=1;ref_sequence=ref_1;ref_coord=2515;query_repeated_region=2556-2560;color=#EE0000
query_1 NucDiff_v2.0 SO:0000667 18768 18792 . . . ID=SV_3;Name=insertion;ins_len=25;query_dir=1;ref_sequence=ref_1;ref_coord=15907;color=#EE0000
query_1 NucDiff_v2.0 SO:1000035 18793 18877 . . . ID=SV_4;Name=duplication;ins_len=85;query_dir=1;ref_sequence=ref_1;ref_coord=15907;query_repeated_region=18683-18767;color=#EE0000
query_1 NucDiff_v2.0 SO:0000667 25426 25825 . . . ID=SV_5;Name=insertion;ins_len=400;query_dir=1;ref_sequence=ref_1;ref_coord=21305;color=#EE0000
query_1 NucDiff_v2.0 SO:1000035 25826 25905 . . . ID=SV_6;Name=duplication;ins_len=80;query_dir=1;ref_sequence=ref_1;ref_coord=21305;query_repeated_region=25346-25425;color=#EE0000
The query_struct.gff file contains the following information (see Figure 1 for notations):
| GFF3 fields | Content | Notes |
|---|---|---|
| col 1 | Query_seq | |
| col 2 | NucDiff_v2.0 | name and current version of the tool |
| col 3 | SO:1000035 | Sequence Ontology accession number corresponding to the "duplication" SO term |
| col 4 | Ins_st | |
| col 5 | Ins_end | |
| col 6/col 7/col 8 | . | score/strand/phase fields are not used |
| col 9, ID | "SV_1" | ID in query_struct.gff is equal to ID in ref_struct.gff |
| col 9, Name | "duplication" | |
| col 9, ins_len | Length(Duplication) | |
| col 9, query_dir | "1" or "-1" | -1 if the duplicated fragment must be reverse complemented before insertion into Ref_seq |
| col 9, ref_sequence | Ref_seq | |
| col 9, ref_coord | Ref_pos | |
| col 9, query_repeated_region | St_q - End_q | Provided only for duplications detected during the local difference detection step. Not provided for duplications detected between reshuffled or inverted fragments. |
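
The field layout in the table above can be read with a few lines of Python. The following is a minimal sketch, not part of NucDiff: the helper names `parse_gff3_line` and `duplications` are hypothetical, and the sample lines assume that columns are separated by tabs, as the GFF3 format specifies.

```python
def parse_gff3_line(line):
    """Split one GFF3 feature line into its nine columns; column 9 becomes a dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    # Column 9 is a ';'-separated list of key=value pairs.
    attrs = dict(pair.split("=", 1) for pair in cols[8].split(";") if pair)
    return {
        "seqid": cols[0],
        "source": cols[1],
        "type": cols[2],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "attrs": attrs,
    }


def duplications(lines):
    """Yield parsed records whose Name attribute is 'duplication'."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and ## directives
        rec = parse_gff3_line(line)
        if rec["attrs"].get("Name") == "duplication":
            yield rec


# Lines copied from the query_struct.gff example, with tabs between columns (assumed).
sample = [
    "##gff-version 3",
    "query_1\tNucDiff_v2.0\tSO:0000667\t2561\t2640\t.\t.\t.\t"
    "ID=SV_1;Name=insertion;ins_len=80;query_dir=1;ref_sequence=ref_1;ref_coord=2515;color=#EE0000",
    "query_1\tNucDiff_v2.0\tSO:1000035\t2641\t2645\t.\t.\t.\t"
    "ID=SV_2;Name=duplication;ins_len=5;query_dir=1;ref_sequence=ref_1;ref_coord=2515;"
    "query_repeated_region=2556-2560;color=#EE0000",
]

dups = list(duplications(sample))
for rec in dups:
    attrs = rec["attrs"]
    # query_repeated_region may be absent, so use .get() rather than indexing.
    print(attrs["ID"], rec["start"], rec["end"], attrs.get("query_repeated_region"))
```

Because query_repeated_region is only present for duplications found during the local difference detection step, the sketch reads it with `.get()` instead of assuming it exists.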
An example of duplication entries in ref_struct.gff:
##gff-version 3
##sequence-region ref_1 1 57855
ref_1 NucDiff_v2.0 SO:0000667 2515 2515 . . . ID=SV_1;Name=insertion;ins_len=80;query_dir=1;query_sequence=query_1;query_coord=2561-2640;color=#EE0000
ref_1 NucDiff_v2.0 SO:1000035 2515 2515 . . . ID=SV_2;Name=duplication;ins_len=5;query_dir=1;query_sequence=query_1;query_coord=2641-2645;ref_repeated_region=2511-2515;color=#EE0000
ref_1 NucDiff_v2.0 SO:0000667 15907 15907 . . . ID=SV_3;Name=insertion;ins_len=25;query_dir=1;query_sequence=query_1;query_coord=18768-18792;color=#EE0000
ref_1 NucDiff_v2.0 SO:1000035 15907 15907 . . . ID=SV_4;Name=duplication;ins_len=85;query_dir=1;query_sequence=query_1;query_coord=18793-18877;ref_repeated_region=15823-15907;color=#EE0000
ref_1 NucDiff_v2.0 SO:0000667 21305 21305 . . . ID=SV_5;Name=insertion;ins_len=400;query_dir=1;query_sequence=query_1;query_coord=25426-25825;color=#EE0000
ref_1 NucDiff_v2.0 SO:1000035 21305 21305 . . . ID=SV_6;Name=duplication;ins_len=80;query_dir=1;query_sequence=query_1;query_coord=25826-25905;ref_repeated_region=21226-21305;color=#EE0000
The ref_struct.gff file contains the following information (see Figure 1 for notations):
| GFF3 fields | Content | Notes |
|---|---|---|
| col 1 | Ref_seq | |
| col 2 | NucDiff_v2.0 | name and current version of the tool |
| col 3 | SO:1000035 | Sequence Ontology accession number corresponding to the "duplication" SO term |
| col 4 | Ref_pos | |
| col 5 | Ref_pos | |
| col 6/col 7/col 8 | . | score/strand/phase fields are not used |
| col 9, ID | "SV_1" | ID in ref_struct.gff is equal to ID in query_struct.gff |
| col 9, Name | "duplication" | |
| col 9, ins_len | Length(Duplication) | |
| col 9, query_dir | "1" or "-1" | -1 if the duplicated fragment must be reverse complemented before insertion into Ref_seq |
| col 9, query_sequence | Query_seq | |
| col 9, query_coord | Ins_st - Ins_end | |
| col 9, ref_repeated_region | St_r - Ref_pos | Provided only for duplications detected during the local difference detection step. Not provided for duplications detected between reshuffled or inverted fragments. |
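
Since the same difference carries the same ID in query_struct.gff and ref_struct.gff, the two entries can be cross-checked against each other. The sketch below is an illustration only: `parse_attrs` and `check_pair` are hypothetical helpers, each row is passed as a (col 4, col 5, col 9) tuple, and the check that ins_len equals the query span is inferred from the examples on this page rather than stated by the tool.

```python
def parse_attrs(col9):
    """Turn a GFF3 column-9 string into a dict of attribute name -> value."""
    return dict(pair.split("=", 1) for pair in col9.split(";") if pair)


def check_pair(query_row, ref_row):
    """Return a list of inconsistencies between two rows that should share an ID.

    Each row is a (start, end, column-9 string) tuple.
    """
    q_start, q_end, q_col9 = query_row
    r_start, r_end, r_col9 = ref_row
    q, r = parse_attrs(q_col9), parse_attrs(r_col9)
    problems = []
    if q["ID"] != r["ID"]:
        problems.append("IDs differ")
    # query_coord in ref_struct.gff should repeat cols 4-5 of query_struct.gff.
    if r["query_coord"] != f"{q_start}-{q_end}":
        problems.append("query_coord does not match query cols 4-5")
    # ref_coord in query_struct.gff should equal Ref_pos, stored in both ref columns.
    if int(q["ref_coord"]) != r_start or r_start != r_end:
        problems.append("ref_coord does not match ref cols 4-5")
    # Assumption drawn from the examples: ins_len equals the query span length.
    if q["ins_len"] != r["ins_len"] or int(q["ins_len"]) != q_end - q_start + 1:
        problems.append("ins_len mismatch")
    return problems


# SV_4 as it appears in the query_struct.gff and ref_struct.gff examples.
query_row = (
    18793, 18877,
    "ID=SV_4;Name=duplication;ins_len=85;query_dir=1;ref_sequence=ref_1;"
    "ref_coord=15907;query_repeated_region=18683-18767;color=#EE0000",
)
ref_row = (
    15907, 15907,
    "ID=SV_4;Name=duplication;ins_len=85;query_dir=1;query_sequence=query_1;"
    "query_coord=18793-18877;ref_repeated_region=15823-15907;color=#EE0000",
)

problems = check_pair(query_row, ref_row)
print(problems or "consistent")
```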
An example of the additional information in query_additional.gff:
##gff-version 3
##sequence-region query_1 1 75565
query_1 NucDiff_v2.0 SO:0000657 2556 2560 . . . ID=Region_1;Name=Repeated_region;query_repeat_len=5;difference_type=duplication;difference_coord_query=2641-2645;difference_len=5
query_1 NucDiff_v2.0 SO:0000657 18683 18767 . . . ID=Region_2;Name=Repeated_region;query_repeat_len=85;difference_type=duplication;difference_coord_query=18793-18877;difference_len=85
query_1 NucDiff_v2.0 SO:0000657 25346 25425 . . . ID=Region_3;Name=Repeated_region;query_repeat_len=80;difference_type=duplication;difference_coord_query=25826-25905;difference_len=80
The query_additional.gff file contains the following information (see Figure 1 for notations):
| GFF3 fields | Content | Notes |
|---|---|---|
| col 1 | Query_seq | |
| col 2 | NucDiff_v2.0 | name and current version of the tool |
| col 3 | SO:0000657 | Sequence Ontology accession number corresponding to the "repeat_region" SO term |
| col 4 | St_q | |
| col 5 | End_q | |
| col 6/col 7/col 8 | . | score/strand/phase fields are not used |
| col 9, ID | "Region_1" | IDs in query_additional.gff and ref_additional.gff are independent |
| col 9, Name | "Repeated_region" | |
| col 9, query_repeat_len | Length(Repeat_q) | |
| col 9, difference_type | "duplication" | |
| col 9, difference_coord_query | Ins_st - Ins_end | |
| col 9, difference_len | Length(Duplication) | |
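
Because the IDs in the additional files are independent of the SV IDs, a repeated region has to be linked to its duplication through the coordinates. One possible way, sketched below with hypothetical helper names, is to index the query_additional.gff rows by their difference_coord_query value and look up a duplication by cols 4 and 5 of its query_struct.gff entry.

```python
def parse_attrs(col9):
    """Turn a GFF3 column-9 string into a dict of attribute name -> value."""
    return dict(pair.split("=", 1) for pair in col9.split(";") if pair)


def index_repeats(additional_rows):
    """Map each duplication's query coordinates to its repeated region.

    additional_rows holds (start, end, column-9 string) tuples taken from
    query_additional.gff; the key is the difference_coord_query value.
    """
    index = {}
    for start, end, col9 in additional_rows:
        attrs = parse_attrs(col9)
        if attrs.get("difference_type") == "duplication":
            index[attrs["difference_coord_query"]] = (start, end)
    return index


# Region_1 and Region_2 from the query_additional.gff example.
additional_rows = [
    (2556, 2560,
     "ID=Region_1;Name=Repeated_region;query_repeat_len=5;difference_type=duplication;"
     "difference_coord_query=2641-2645;difference_len=5"),
    (18683, 18767,
     "ID=Region_2;Name=Repeated_region;query_repeat_len=85;difference_type=duplication;"
     "difference_coord_query=18793-18877;difference_len=85"),
]

index = index_repeats(additional_rows)

# SV_4 in query_struct.gff spans 18793-18877 (cols 4 and 5).
repeat_start, repeat_end = index["18793-18877"]
print(f"{repeat_start}-{repeat_end}")
```

In the examples on this page, the region found this way matches the query_repeated_region attribute of the corresponding query_struct.gff entry.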
An example of the additional information in ref_additional.gff:
##gff-version 3
##sequence-region ref_1 1 57855
ref_1 NucDiff_v2.0 SO:0000657 2511 2515 . . . ID=Region_1;Name=Repeated_region;ref_repeat_len=5;difference_type=duplication;difference_coord_ref=2515-2515;difference_len=5;color=#DB0101
ref_1 NucDiff_v2.0 SO:0000657 15823 15907 . . . ID=Region_2;Name=Repeated_region;ref_repeat_len=85;difference_type=duplication;difference_coord_ref=15907-15907;difference_len=85;color=#DB0101
ref_1 NucDiff_v2.0 SO:0000657 21226 21305 . . . ID=Region_3;Name=Repeated_region;ref_repeat_len=80;difference_type=duplication;difference_coord_ref=21305-21305;difference_len=80;color=#DB0101
The ref_additional.gff file contains the following information (see Figure 1 for notations):
| GFF3 fields | Content | Notes |
|---|---|---|
| col 1 | Ref_seq | |
| col 2 | NucDiff_v2.0 | name and current version of the tool |
| col 3 | SO:0000657 | Sequence Ontology accession number corresponding to the "repeat_region" SO term |
| col 4 | St_r | |
| col 5 | Ref_pos | |
| col 6/col 7/col 8 | . | score/strand/phase fields are not used |
| col 9, ID | "Region_1" | IDs in query_additional.gff and ref_additional.gff are independent |
| col 9, Name | "Repeated_region" | |
| col 9, ref_repeat_len | Length(Repeat_r) | |
| col 9, difference_type | "duplication" | |
| col 9, difference_coord_ref | Ref_pos - Ref_pos | |
| col 9, difference_len | Length(Duplication) | |
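
The relationships in this table can be verified on the example rows: cols 4 and 5 should span ref_repeat_len bases, and the region should end at Ref_pos, which difference_coord_ref stores twice. The following sketch assumes rows are passed as (col 4, col 5, col 9) tuples; `check_ref_region` is a hypothetical helper, not a NucDiff function.

```python
def parse_attrs(col9):
    """Turn a GFF3 column-9 string into a dict of attribute name -> value."""
    return dict(pair.split("=", 1) for pair in col9.split(";") if pair)


def check_ref_region(start, end, col9):
    """Return True if a ref_additional.gff row is internally consistent."""
    attrs = parse_attrs(col9)
    # difference_coord_ref has the form "Ref_pos-Ref_pos".
    pos_a, pos_b = (int(x) for x in attrs["difference_coord_ref"].split("-"))
    length_ok = end - start + 1 == int(attrs["ref_repeat_len"])  # cols 4-5 span Repeat_r
    position_ok = pos_a == pos_b == end                          # region ends at Ref_pos
    return length_ok and position_ok


# The three rows from the ref_additional.gff example.
rows = [
    (2511, 2515,
     "ID=Region_1;Name=Repeated_region;ref_repeat_len=5;difference_type=duplication;"
     "difference_coord_ref=2515-2515;difference_len=5;color=#DB0101"),
    (15823, 15907,
     "ID=Region_2;Name=Repeated_region;ref_repeat_len=85;difference_type=duplication;"
     "difference_coord_ref=15907-15907;difference_len=85;color=#DB0101"),
    (21226, 21305,
     "ID=Region_3;Name=Repeated_region;ref_repeat_len=80;difference_type=duplication;"
     "difference_coord_ref=21305-21305;difference_len=80;color=#DB0101"),
]

results = [check_ref_region(*row) for row in rows]
print(results)
```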