Duplications

kseniakh edited this page Mar 10, 2017 · 1 revision

Duplication - an insertion into the query sequence of an extra copy of a reference region, placed at a position not adjacent to that region. It either creates an interspersed repeat or increases the copy number of an existing interspersed repeat.



Figure 1: Duplication example. (a) The query sequence has the same direction as the reference sequence. (b) The query sequence is reverse complemented. Duplication, Repeat_q and Repeat_r are similar or near-similar repeats.



Duplications are reported in the query_struct.gff and ref_struct.gff files. The locations of the repeated regions involved in each duplication are reported in the query_additional.gff and ref_additional.gff files.

An example with duplication entries in query_struct.gff:

##gff-version 3
##sequence-region	query_1	1	75565
query_1	NucDiff_v2.0	SO:0000667	2561	2640	.	.	.	ID=SV_1;Name=insertion;ins_len=80;query_dir=1;ref_sequence=ref_1;ref_coord=2515;color=#EE0000
query_1	NucDiff_v2.0	SO:1000035	2641	2645	.	.	.	ID=SV_2;Name=duplication;ins_len=5;query_dir=1;ref_sequence=ref_1;ref_coord=2515;query_repeated_region=2556-2560;color=#EE0000
query_1	NucDiff_v2.0	SO:0000667	18768	18792	.	.	.	ID=SV_3;Name=insertion;ins_len=25;query_dir=1;ref_sequence=ref_1;ref_coord=15907;color=#EE0000
query_1	NucDiff_v2.0	SO:1000035	18793	18877	.	.	.	ID=SV_4;Name=duplication;ins_len=85;query_dir=1;ref_sequence=ref_1;ref_coord=15907;query_repeated_region=18683-18767;color=#EE0000
query_1	NucDiff_v2.0	SO:0000667	25426	25825	.	.	.	ID=SV_5;Name=insertion;ins_len=400;query_dir=1;ref_sequence=ref_1;ref_coord=21305;color=#EE0000
query_1	NucDiff_v2.0	SO:1000035	25826	25905	.	.	.	ID=SV_6;Name=duplication;ins_len=80;query_dir=1;ref_sequence=ref_1;ref_coord=21305;query_repeated_region=25346-25425;color=#EE0000



The query_struct.gff file contains the following information (see Figure 1 for notations):

| GFF3 fields | Content | Notes |
|---|---|---|
| col 1 | Query_seq | |
| col 2 | NucDiff_v2.0 | name and current version of the tool |
| col 3 | SO:1000035 | Sequence Ontology accession number corresponding to the "duplication" SO term |
| col 4 | Ins_st | |
| col 5 | Ins_end | |
| col 6 / col 7 / col 8 | . | score/strand/phase fields are not used |
| col 9, ID | "SV_1" | ID in query_struct.gff is equal to ID in ref_struct.gff |
| col 9, Name | "duplication" | |
| col 9, ins_len | Length(Duplication) | |
| col 9, query_dir | "1" or "-1" | -1 if the duplicated fragment should be reverse complemented before its insertion into Ref_seq |
| col 9, ref_sequence | Ref_seq | |
| col 9, ref_coord | Ref_pos | |
| col 9, query_repeated_region | St_q - End_q | Only for a duplication detected during the local difference detection step. For a duplication detected between reshuffled or inverted fragments, this information is not provided |
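
As an illustration of the layout above, here is a minimal Python sketch that reads query_struct.gff lines and keeps only the duplication records. The function names `parse_gff3_line` and `query_duplications` are hypothetical helpers, not part of NucDiff. The length check assumes that ins_len equals Ins_end - Ins_st + 1, which holds for the example entries but is not stated explicitly in the table.

```python
def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict; col 9 becomes an 'attrs' dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attrs = dict(pair.split("=", 1) for pair in cols[8].split(";") if pair)
    return {
        "seqid": cols[0],
        "source": cols[1],
        "type": cols[2],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "attrs": attrs,
    }


def query_duplications(lines):
    """Yield duplication records (col 3 == SO:1000035), skipping directives."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        rec = parse_gff3_line(line)
        if rec["type"] == "SO:1000035":
            yield rec


# Two entries copied from the query_struct.gff example above.
example = [
    "##gff-version 3",
    "query_1\tNucDiff_v2.0\tSO:0000667\t2561\t2640\t.\t.\t.\t"
    "ID=SV_1;Name=insertion;ins_len=80;query_dir=1;ref_sequence=ref_1;"
    "ref_coord=2515;color=#EE0000",
    "query_1\tNucDiff_v2.0\tSO:1000035\t2641\t2645\t.\t.\t.\t"
    "ID=SV_2;Name=duplication;ins_len=5;query_dir=1;ref_sequence=ref_1;"
    "ref_coord=2515;query_repeated_region=2556-2560;color=#EE0000",
]

dups = list(query_duplications(example))
for rec in dups:
    a = rec["attrs"]
    # Assumption: ins_len is the inclusive length of the col 4 - col 5 interval.
    assert int(a["ins_len"]) == rec["end"] - rec["start"] + 1
    # query_repeated_region is optional, so use .get() rather than indexing.
    print(a["ID"], rec["start"], rec["end"], a.get("query_repeated_region"))
```

Because query_repeated_region is only present for duplications found during the local difference detection step, downstream code should treat it as optional.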



An example with duplication entries in ref_struct.gff:

##gff-version 3
##sequence-region	ref_1	1	57855
ref_1	NucDiff_v2.0	SO:0000667	2515	2515	.	.	.	ID=SV_1;Name=insertion;ins_len=80;query_dir=1;query_sequence=query_1;query_coord=2561-2640;color=#EE0000
ref_1	NucDiff_v2.0	SO:1000035	2515	2515	.	.	.	ID=SV_2;Name=duplication;ins_len=5;query_dir=1;query_sequence=query_1;query_coord=2641-2645;ref_repeated_region=2511-2515;color=#EE0000
ref_1	NucDiff_v2.0	SO:0000667	15907	15907	.	.	.	ID=SV_3;Name=insertion;ins_len=25;query_dir=1;query_sequence=query_1;query_coord=18768-18792;color=#EE0000
ref_1	NucDiff_v2.0	SO:1000035	15907	15907	.	.	.	ID=SV_4;Name=duplication;ins_len=85;query_dir=1;query_sequence=query_1;query_coord=18793-18877;ref_repeated_region=15823-15907;color=#EE0000
ref_1	NucDiff_v2.0	SO:0000667	21305	21305	.	.	.	ID=SV_5;Name=insertion;ins_len=400;query_dir=1;query_sequence=query_1;query_coord=25426-25825;color=#EE0000
ref_1	NucDiff_v2.0	SO:1000035	21305	21305	.	.	.	ID=SV_6;Name=duplication;ins_len=80;query_dir=1;query_sequence=query_1;query_coord=25826-25905;ref_repeated_region=21226-21305;color=#EE0000



The ref_struct.gff file contains the following information (see Figure 1 for notations):

| GFF3 fields | Content | Notes |
|---|---|---|
| col 1 | Ref_seq | |
| col 2 | NucDiff_v2.0 | name and current version of the tool |
| col 3 | SO:1000035 | Sequence Ontology accession number corresponding to the "duplication" SO term |
| col 4 | Ref_pos | |
| col 5 | Ref_pos | |
| col 6 / col 7 / col 8 | . | score/strand/phase fields are not used |
| col 9, ID | "SV_1" | ID in ref_struct.gff is equal to ID in query_struct.gff |
| col 9, Name | "duplication" | |
| col 9, ins_len | Length(Duplication) | |
| col 9, query_dir | "1" or "-1" | -1 if the duplicated fragment should be reverse complemented before its insertion into Ref_seq |
| col 9, query_sequence | Query_seq | |
| col 9, query_coord | Ins_st - Ins_end | |
| col 9, ref_repeated_region | St_r - Ref_pos | Only for a duplication detected during the local difference detection step. For a duplication detected between reshuffled or inverted fragments, this information is not provided |
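
Since the same ID is used in query_struct.gff and ref_struct.gff, the two views of one duplication can be paired and cross-checked. The sketch below is one possible way to do that in Python; `read_features` and `check_pair` are hypothetical helper names, and the checks simply restate the two tables (col 4 = col 5 = Ref_pos on the reference side, query_coord = Ins_st-Ins_end, and so on).

```python
def read_features(text):
    """Map feature ID -> record dict for every feature line in GFF3 text."""
    features = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if kv)
        features[attrs["ID"]] = {
            "seqid": cols[0],
            "type": cols[2],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "attrs": attrs,
        }
    return features


def check_pair(q, r):
    """Return a list of mismatches between paired query/ref records."""
    problems = []
    qa, ra = q["attrs"], r["attrs"]
    if r["start"] != r["end"]:
        problems.append("ref col 4 and col 5 differ (both should be Ref_pos)")
    if int(qa["ref_coord"]) != r["start"]:
        problems.append("ref_coord does not equal Ref_pos")
    if ra["query_coord"] != f"{q['start']}-{q['end']}":
        problems.append("query_coord does not equal Ins_st-Ins_end")
    if qa["ref_sequence"] != r["seqid"] or ra["query_sequence"] != q["seqid"]:
        problems.append("sequence names do not match")
    if qa["ins_len"] != ra["ins_len"] or qa["query_dir"] != ra["query_dir"]:
        problems.append("ins_len or query_dir differ")
    return problems


# The SV_4 entries copied from the two examples above.
QUERY_STRUCT = (
    "##gff-version 3\n"
    "query_1\tNucDiff_v2.0\tSO:1000035\t18793\t18877\t.\t.\t.\t"
    "ID=SV_4;Name=duplication;ins_len=85;query_dir=1;ref_sequence=ref_1;"
    "ref_coord=15907;query_repeated_region=18683-18767;color=#EE0000\n"
)
REF_STRUCT = (
    "##gff-version 3\n"
    "ref_1\tNucDiff_v2.0\tSO:1000035\t15907\t15907\t.\t.\t.\t"
    "ID=SV_4;Name=duplication;ins_len=85;query_dir=1;query_sequence=query_1;"
    "query_coord=18793-18877;ref_repeated_region=15823-15907;color=#EE0000\n"
)

query = read_features(QUERY_STRUCT)
ref = read_features(REF_STRUCT)
# Only IDs present in both files are compared.
report = {sv_id: check_pair(query[sv_id], ref[sv_id])
          for sv_id in query.keys() & ref.keys()}
print(report)
```

An empty list for an ID means that the query-side and reference-side records agree on every field checked here.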



An example with additional information in query_additional.gff:

##gff-version 3
##sequence-region	query_1	1	75565
query_1	NucDiff_v2.0	SO:0000657	2556	2560	.	.	.	ID=Region_1;Name=Repeated_region;query_repeat_len=5;difference_type=duplication;difference_coord_query=2641-2645;difference_len=5
query_1	NucDiff_v2.0	SO:0000657	18683	18767	.	.	.	ID=Region_2;Name=Repeated_region;query_repeat_len=85;difference_type=duplication;difference_coord_query=18793-18877;difference_len=85
query_1	NucDiff_v2.0	SO:0000657	25346	25425	.	.	.	ID=Region_3;Name=Repeated_region;query_repeat_len=80;difference_type=duplication;difference_coord_query=25826-25905;difference_len=80



The query_additional.gff file contains the following information (see Figure 1 for notations):

| GFF3 fields | Content | Notes |
|---|---|---|
| col 1 | Query_seq | |
| col 2 | NucDiff_v2.0 | name and current version of the tool |
| col 3 | SO:0000657 | Sequence Ontology accession number corresponding to the "repeat_region" SO term |
| col 4 | St_q | |
| col 5 | End_q | |
| col 6 / col 7 / col 8 | . | score/strand/phase fields are not used |
| col 9, ID | "Region_1" | IDs in query_additional.gff and ref_additional.gff are independent |
| col 9, Name | "Repeated_region" | |
| col 9, query_repeat_len | Length(Repeat_q) | |
| col 9, difference_type | "duplication" | |
| col 9, difference_coord_query | Ins_st - Ins_end | |
| col 9, difference_len | Length(Duplication) | |
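
The table does not describe a shared ID between a Repeated_region record and its duplication, so the sketch below links them by sequence name plus the difference_coord_query value instead. That matching rule is an assumption drawn from the examples, and `feature_rows` and `repeats_by_difference` are hypothetical helper names.

```python
def feature_rows(text):
    """Yield (seqid, start, end, attrs) for each feature line of GFF3 text."""
    for line in text.splitlines():
        if line.strip() and not line.startswith("#"):
            cols = line.split("\t")
            attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if kv)
            yield cols[0], int(cols[3]), int(cols[4]), attrs


def repeats_by_difference(additional_text):
    """Index repeat intervals by (seqid, difference_coord_query)."""
    index = {}
    for seqid, start, end, attrs in feature_rows(additional_text):
        if attrs.get("difference_type") == "duplication":
            index[(seqid, attrs["difference_coord_query"])] = (start, end)
    return index


# Entries copied from the query_additional.gff and query_struct.gff examples.
QUERY_ADDITIONAL = (
    "query_1\tNucDiff_v2.0\tSO:0000657\t2556\t2560\t.\t.\t.\t"
    "ID=Region_1;Name=Repeated_region;query_repeat_len=5;"
    "difference_type=duplication;difference_coord_query=2641-2645;"
    "difference_len=5\n"
)
QUERY_STRUCT = (
    "query_1\tNucDiff_v2.0\tSO:1000035\t2641\t2645\t.\t.\t.\t"
    "ID=SV_2;Name=duplication;ins_len=5;query_dir=1;ref_sequence=ref_1;"
    "ref_coord=2515;query_repeated_region=2556-2560;color=#EE0000\n"
)

index = repeats_by_difference(QUERY_ADDITIONAL)
links = {}
for seqid, start, end, attrs in feature_rows(QUERY_STRUCT):
    if attrs["Name"] == "duplication":
        # None means no repeated region was reported for this duplication.
        links[attrs["ID"]] = index.get((seqid, f"{start}-{end}"))
print(links)
```

For SV_2 the linked interval matches the query_repeated_region attribute of the duplication record itself, so the additional file mainly serves to expose the repeat as a separate feature.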



An example with additional information in ref_additional.gff:

##gff-version 3
##sequence-region	ref_1	1	57855
ref_1	NucDiff_v2.0	SO:0000657	2511	2515	.	.	.	ID=Region_1;Name=Repeated_region;ref_repeat_len=5;difference_type=duplication;difference_coord_ref=2515-2515;difference_len=5;color=#DB0101
ref_1	NucDiff_v2.0	SO:0000657	15823	15907	.	.	.	ID=Region_2;Name=Repeated_region;ref_repeat_len=85;difference_type=duplication;difference_coord_ref=15907-15907;difference_len=85;color=#DB0101
ref_1	NucDiff_v2.0	SO:0000657	21226	21305	.	.	.	ID=Region_3;Name=Repeated_region;ref_repeat_len=80;difference_type=duplication;difference_coord_ref=21305-21305;difference_len=80;color=#DB0101



The ref_additional.gff file contains the following information (see Figure 1 for notations):

| GFF3 fields | Content | Notes |
|---|---|---|
| col 1 | Ref_seq | |
| col 2 | NucDiff_v2.0 | name and current version of the tool |
| col 3 | SO:0000657 | Sequence Ontology accession number corresponding to the "repeat_region" SO term |
| col 4 | St_r | |
| col 5 | Ref_pos | |
| col 6 / col 7 / col 8 | . | score/strand/phase fields are not used |
| col 9, ID | "Region_1" | IDs in query_additional.gff and ref_additional.gff are independent |
| col 9, Name | "Repeated_region" | |
| col 9, ref_repeat_len | Length(Repeat_r) | |
| col 9, difference_type | "duplication" | |
| col 9, difference_coord_ref | Ref_pos - Ref_pos | |
| col 9, difference_len | Length(Duplication) | |
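
The reference-side layout can be sanity-checked in a few lines. This sketch, with the hypothetical helper `check_ref_repeat`, tests that the repeat interval ends at Ref_pos and that ref_repeat_len matches the interval. The inclusive length formula (end - start + 1) is an assumption that agrees with the example entries.

```python
def check_ref_repeat(line):
    """Return True if one ref_additional.gff line matches the table above."""
    cols = line.rstrip("\n").split("\t")
    start, end = int(cols[3]), int(cols[4])
    attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if kv)
    # difference_coord_ref is written as Ref_pos-Ref_pos.
    pos_a, pos_b = (int(x) for x in attrs["difference_coord_ref"].split("-"))
    return (
        pos_a == pos_b == end  # col 5 is Ref_pos
        # Assumption: ref_repeat_len is the inclusive interval length.
        and int(attrs["ref_repeat_len"]) == end - start + 1
    )


# The Region_2 entry copied from the ref_additional.gff example above.
region_2 = (
    "ref_1\tNucDiff_v2.0\tSO:0000657\t15823\t15907\t.\t.\t.\t"
    "ID=Region_2;Name=Repeated_region;ref_repeat_len=85;"
    "difference_type=duplication;difference_coord_ref=15907-15907;"
    "difference_len=85;color=#DB0101"
)
ok = check_ref_repeat(region_2)
print(ok)
```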
