Enhancements in vfdb_parser.py for VFDB full dataset support #320

lknegendorf · 2022-01-19T15:30:09Z

Currently, after downloading the full VFDB dataset with the `getref vfdb_full (...)` command, it is not possible to proceed with `ariba prepareref (...)` on the resulting reference data without manual changes to both the .fa and the .tsv files. This is because the reference dataset contains several pitfalls that are not yet addressed:

Duplicate sequence IDs, which raise errors in `ariba prepareref`.
Sequences with stop codons, which are filtered out. (The metadata.tsv file created by vfdb_parser currently declares every sequence from the dataset as a gene.)
Gene symbols that include brackets or blank spaces, so the intended naming does not work for every sequence, which complicates the creation of meaningful cluster names.

The modifications proposed here address all shortcomings mentioned above.
Furthermore, the xls-derived metadata from VFDB that explains the function and mechanism of each virulence gene (VFs.xls.gz, see the VFs description file on the VFDB download page) is now included in the metadata.tsv produced by vfdb_parser. This allows a more comprehensive view of the ariba variant calling results when working with VFDB.

Thank you for considering this for merging into a future release.

The change in the <vfdb_id> group fixes the handling of VFDB gene IDs that carry no GenBank accession. The change in the <name> group fixes the handling of gene names that include whitespace characters or brackets (e.g. `cryIA(a)`). Validated with VFDB_setB_nt.fas.gz downloaded on 2022-01-14: VFdbParser._fa_header_to_name_pieces did not return any None values.
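To illustrate the kind of header parsing described here, the following is a minimal sketch using only the standard library. The pattern `HEADER_RE`, the function `header_to_pieces`, and the sample headers in the usage note are hypothetical and are not taken from vfdb_parser.py; only the group names mirror the ones mentioned above. It assumes the gene symbol never contains the two-character sequence `") "`.

```python
import re

# Hypothetical pattern for illustration; NOT the regex used in vfdb_parser.py.
HEADER_RE = re.compile(
    r"^(?P<vfdb_id>VFG\d+(?:\([^)]*\))?) "  # gene ID, accession suffix optional
    r"\((?P<name>.+?)\) "                   # gene symbol, may hold brackets/spaces
    r"(?P<description>.*\]) "               # free text ending in a bracketed part
    r"\[(?P<genus_etc>[^\]]*)\]$"           # final bracketed organism field
)


def header_to_pieces(header):
    """Return the named groups as a dict, or None if the header does not match."""
    match = HEADER_RE.match(header)
    return match.groupdict() if match else None
```

With a made-up header such as `VFG000001(gb|AAA00001) (cryIA(a)) example toxin [Example toxin (VF0001)] [Genus species strain X]`, the lazy `name` group stops at the first `") "`, so the inner brackets of `cryIA(a)` stay part of the name.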

Newly implemented functions extract VFIDs from the <description> part of seq.id in the VFDB .fa file, download the VFs.xls.gz file from VFDB, and link the VFIDs to it to create a metadata file with more information. Validated with VFDB_setB_nt.fas.gz downloaded on 2022-01-17. Manual changes to the resulting .fa file are still needed, as VFDB contains duplicates.
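The linking step can be sketched roughly as follows. This is not the code from the pull request: `VFID_RE`, `extract_vfid`, `annotate`, and the table contents are hypothetical, and the sketch assumes that a VFID appears in the description as `(VF` followed by digits and `)`. Downloading and reading VFs.xls.gz is left out; a plain dict stands in for the parsed table.

```python
import re

# Assumed VFID shape: "VF" plus digits, enclosed in parentheses.
VFID_RE = re.compile(r"\((VF\d+)\)")


def extract_vfid(description):
    """Return the first VFID found in a description string, or None."""
    match = VFID_RE.search(description)
    return match.group(1) if match else None


def annotate(description, vf_table):
    """Merge the VFID with whatever the (hypothetical) table holds for it."""
    vfid = extract_vfid(description)
    return {"vfid": vfid, **vf_table.get(vfid, {})}


# Stand-in for rows parsed from VFs.xls.gz; keys and values are invented.
vf_table = {"VF0001": {"function": "example function"}}
row = annotate("example toxin [Example toxin (VF0001)]", vf_table)
```

A description without a recognizable VFID yields `{"vfid": None}`, so such sequences can still be written to the metadata file without the extra columns.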

Included a list-based filter for duplicate sequence IDs in the downloaded VFDB fasta file. As a consequence, `ariba prepareref` can be run after execution of vfdb_parser without manual deletion of duplicate entries. Validated with VFDB_setB_nt.fas.gz downloaded on 2022-01-18. The `ariba prepareref` command no longer raises an error because of duplicate seq.id values (although 1254 sequences are still filtered out because they are not recognized as genes).
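A duplicate filter of this kind could look like the sketch below. It is an illustration rather than the pull request's implementation: the function name `drop_duplicate_ids` is hypothetical, records are assumed to be `(seq_id, sequence)` tuples, and a set is used for membership tests where the commit describes a list.

```python
def drop_duplicate_ids(records):
    """Yield each (seq_id, sequence) pair whose ID has not been seen before.

    The first occurrence of an ID wins; later records with the same ID are
    skipped, even if their sequences differ.
    """
    seen = set()
    for seq_id, sequence in records:
        if seq_id in seen:
            continue
        seen.add(seq_id)
        yield seq_id, sequence


records = [("geneA", "ATGAAATAA"), ("geneB", "ATGCCCTAA"), ("geneA", "ATGAAATAA")]
unique = list(drop_duplicate_ids(records))
```

Keeping the first occurrence preserves the original file order, which makes the filtered fasta easy to compare with the download.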

Included a check of whether a sequence can be translated, making use of methods from pyfastaq. If a sequence cannot be translated, it is declared as non-coding in the resulting metadata file, which allows processing with `ariba prepareref` without such sequences being filtered out. Included a function that reports the maximum sequence length, giving advice on the choice of parameters for further processing. Validated with VFDB_setB_nt.fas.gz downloaded on 2022-01-19. The `ariba prepareref` command does not remove any sequence from the dataset (when run with the parameters advised by vfdb_parser).
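The idea behind the translation check can be sketched without pyfastaq as shown below. This is a simplified stand-in, not the pull request's code and not pyfastaq's API: `translate`, `is_coding`, and `longest_length` are hypothetical names, and the criteria used here (length divisible by three, a single stop codon at the very end, standard genetic code, forward strand only) are assumptions that may be stricter or looser than what `ariba prepareref` applies.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def is_coding(seq):
    """Assumed criteria: whole codons only, one stop codon, at the end."""
    protein = translate(seq)
    return (
        len(seq) % 3 == 0
        and len(protein) >= 2
        and protein.endswith("*")
        and "*" not in protein[:-1]
    )


def longest_length(sequences):
    """Length of the longest sequence, or 0 for an empty collection."""
    return max((len(s) for s in sequences), default=0)
```

Sequences for which `is_coding` returns False would be marked as non-coding in the metadata instead of being dropped, and the value from `longest_length` can guide the length-related parameters chosen for later steps.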

lknegendorf added 4 commits January 14, 2022 19:04
