Skip to content

Problem while trying to filter gtf file by longest transcript #71

@eduardopdev

Description

@eduardopdev

First of all, i downloaded Araport11 from www.arabidopsis.org
Then I used cgat-apps to generate a gtf file out of the file i downloaded
$ cat Araport11_GFF3_genes_transposons.201606.gff | cgat gff32gtf > Araport.gtf
and had no problems.

I saw that input gtf files must be sorted. so i sorted Araport.gtf:
cat Araport.gtf | cgat gtf2gtf > Araport_sort.gtf --method=sort --sort-order=gene+transcript

and then, I filtered the sorted file by longest_transcript:
cat Araport_sort.gtf | cgat gtf2gtf > Araport_longest.gtf --method=filter --filter-method=longest-transcript

the problem is that gtf2gtf is not always selecting the transcript with the bigger length. On gene AT1G01030, for example, the sum of the length of all exons in transcript AT1G01030.1 is bigger than that of AT1G01030.2, but on Araport_longest.gtf the AT1G01030.2 transcript is selected.

Here is the AT1G01030 gene in the original Araport_sort.gtf:

Chr1 Araport11 three_prime_UTR 11649 11863 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.1"; ID "AT1G01030:three_prime_UTR:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:three_prime_UTR:1";
Chr1 Araport11 exon 11649 13173 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.1"; ID "AT1G01030:exon:3"; Parent "AT1G01030.1"; Name "NGA3:exon:3";
Chr1 Araport11 CDS 11864 12940 . - 0 gene_id "AT1G01030"; transcript_id "AT1G01030.1"; ID "AT1G01030:CDS:2"; Parent "AT1G01030.1"; Name "NGA3:CDS:2";
Chr1 Araport11 five_prime_UTR 12941 13173 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.1"; ID "AT1G01030:five_prime_UTR:2"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:five_prime_UTR:2";
Chr1 Araport11 exon 13335 13714 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.1"; ID "AT1G01030:exon:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:exon:1";
Chr1 Araport11 five_prime_UTR 13335 13714 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.1"; ID "AT1G01030:five_prime_UTR:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:five_prime_UTR:1";
Chr1 Araport11 exon 11649 12354 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:exon:4"; Parent "AT1G01030.2"; Name "NGA3:exon:4";
Chr1 Araport11 three_prime_UTR 11649 11863 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:three_prime_UTR:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:three_prime_UTR:1";
Chr1 Araport11 CDS 11864 12354 . - 2 gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:CDS:3"; Parent "AT1G01030.2"; Name "NGA3:CDS:3";
Chr1 Araport11 CDS 12424 12940 . - 0 gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:CDS:1"; Parent "AT1G01030.2"; Name "NGA3:CDS:1";
Chr1 Araport11 exon 12424 13173 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:exon:2"; Parent "AT1G01030.2"; Name "NGA3:exon:2";
Chr1 Araport11 five_prime_UTR 12941 13173 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:five_prime_UTR:2"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:five_prime_UTR:2";
Chr1 Araport11 exon 13335 13714 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:exon:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:exon:1";
Chr1 Araport11 five_prime_UTR 13335 13714 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:five_prime_UTR:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:five_prime_UTR:1";

And here is the same gene on the Araport_longest.gtf:

Chr1 Araport11 exon 11649 12354 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:exon:4"; Parent "AT1G01030.2"; Name "NGA3:exon:4";
Chr1 Araport11 CDS 11864 12354 . - 2 gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:CDS:3"; Parent "AT1G01030.2"; Name "NGA3:CDS:3";
Chr1 Araport11 CDS 12424 12940 . - 0 gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:CDS:1"; Parent "AT1G01030.2"; Name "NGA3:CDS:1";
Chr1 Araport11 exon 12424 13173 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:exon:2"; Parent "AT1G01030.2"; Name "NGA3:exon:2";
Chr1 Araport11 exon 13335 13714 . - . gene_id "AT1G01030"; transcript_id "AT1G01030.2"; ID "AT1G01030:exon:1"; Parent "AT1G01030.2,AT1G01030.1"; Name "NGA3:exon:1";

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions