Ilya 12/7/2018
Scrape GIDEON database for more up-to-date dataset of mammal hosts and diseases.
Integrate data on host species, bacterial species and traits, and human disease outcomes. Visually summarize bacteria causing disease in mammals. Apply generalized boosted models (GBM) to predict transmissibility and human disease outcomes based on bacterial traits.
6. Data visualization: summary “state of knowledge” on bacteria causing disease in mammals or humans
- GIDEON dataset of bacterial zoonotic diseases and their mammalian hosts
- Dataset relating mammals to pathogens (GMPD)
- Dataset matching diseases to pathogens (to be collected)
- Bacterial trait datasets (Brbic et al. 2016; Barberan et al. 2017; EID2; GMPD)
- Human disease outcomes (GIDEON) and transmissibility (Han)
- Mammalian host ranges (IUCN)
## Removing package from '/Library/Frameworks/R.framework/Versions/3.4/Resources/library'
## (as 'lib' is unspecified)
## Removing package from '/Library/Frameworks/R.framework/Versions/3.4/Resources/library'
## (as 'lib' is unspecified)
##
## Attaching package: 'rlang'
## The following object is masked from 'package:data.table':
##
## :=
##
## There is a binary version available but the source version is
## later:
## binary source needs_compilation
## tibble 1.4.2 2.0.1 TRUE
## installing the source package 'tibble'
##
## The downloaded binary packages are in
## /var/folders/0d/qm_pqljx11s_ddc42g1_yscr0000gn/T//RtmpW5aBIA/downloaded_packages
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:glue':
##
## collapse
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'reshape2'
## The following objects are masked from 'package:data.table':
##
## dcast, melt
## corrplot 0.84 loaded
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## CHNOSZ version 1.1.3 (2017-11-13)
## Please run data(thermo) to create the "thermo" object
##
## Attaching package: 'CHNOSZ'
## The following objects are masked from 'package:Hmisc':
##
## mtitle, spearman
## The following object is masked from 'package:dplyr':
##
## slice
## Downloading GitHub repo TIBHannover/BacDiveR@master
## from URL https://api.github.com/repos/TIBHannover/BacDiveR/zipball/master
## Installing BacDiveR
## '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file \
## --no-environ --no-save --no-restore --quiet CMD INSTALL \
## '/private/var/folders/0d/qm_pqljx11s_ddc42g1_yscr0000gn/T/RtmpW5aBIA/devtools1da6534663f7/TIBHannover-BacDiveR-7108220' \
## --library='/Library/Frameworks/R.framework/Versions/3.4/Resources/library' \
## --install-tests
##
## Installation failed: Command failed (3)
## Skipping install of 'taxizedb' from a github remote, the SHA1 (7ee9741a) has not changed since last install.
## Use `force = TRUE` to force installation
##
## Attaching package: 'taxizedb'
## The following objects are masked from 'package:taxize':
##
## children, classification, downstream
Save as GIDEON.Rdata, including unique diseases associated with each mammal taxon
source("GIDEON_read.R")
Subset GIDEON data (on mammalian hosts and diseases) so that diseases can then be matched with pathogens. This zdx-pathogen matching has already been done for carnivores and primates (in part), so exclude carnivores and primates. This saves animal_dx_parasites.Rdata and outputs animal-dx-parasites.csv.
#Do this with Mammal Species of the World
#(http://www.departments.bucknell.edu/biology/resources/msw3/)
#Note that this outputs animal-dx-parasites.csv, which includes only those mammal hosts that have been associated with a bacterial disease (Label = 1)
source("GIDEON_subset_exclude_carnivores_primates.R")
## [1] "number of species / zoonosis pairs in GIDEON including all orders"
## [1] 2256 2
## [1] "number of non-carnivore mammal species without records in GIDEON"
## [1] 7638
## [1] "number of species / zoonosis pairs in GIDEON (all orders) after merge with mammals of the world checklist"
## [1] 15687 3
## [1] "number of *unique* species / zoonosis pairs in GIDEON (all orders) after merge with mammals of the world checklist"
## [1] 2246 3
## [1] 2246 3
## [1] "number of species / zoonosis records in GIDEON excluding primates and carnivores"
## [1] 1452
## [1] "records of non-carnivore mammals, with zdx and including one row for each mammal w/o a recorded zdx"
## [1] 9090
## [1] "check that size of animal-dx-parasites matches number GIDEON records plus number of other mammals"
## [1] TRUE
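The size check reported above can be sketched on toy data. This is a minimal illustration, not the contents of GIDEON_subset_exclude_carnivores_primates.R: the data frames `gideon_pairs` and `checklist`, their columns, and their values are hypothetical stand-ins.

```r
# Toy stand-ins (hypothetical names and values) for the GIDEON species/disease
# pairs and the mammal checklist after excluding carnivores and primates.
gideon_pairs <- data.frame(
  species = c("Bos taurus", "Bos taurus", "Ovis aries"),
  disease = c("Anthrax", "Brucellosis", "Anthrax"),
  stringsAsFactors = FALSE
)
checklist <- data.frame(
  species = c("Bos taurus", "Ovis aries", "Mus musculus", "Lama glama"),
  stringsAsFactors = FALSE
)

# Species on the checklist with no GIDEON record get one row with Label = 0.
no_record <- setdiff(checklist$species, gideon_pairs$species)
unlabeled <- data.frame(species = no_record, disease = NA_character_,
                        Label = 0, stringsAsFactors = FALSE)

# Species/disease pairs from GIDEON get Label = 1.
labeled <- cbind(gideon_pairs, Label = 1)

animal_dx <- rbind(labeled, unlabeled)

# The combined table should have one row per GIDEON pair plus one row per
# mammal without a record.
size_ok <- nrow(animal_dx) == nrow(gideon_pairs) + length(no_record)
print(size_ok)
```

In the run above the same identity holds for the reported counts: 1452 records plus 7638 mammals without records gives 9090 rows.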
#comment out this version that excludes carnivores
# source("GIDEON_subset_non_carnivores.R")
#Commenting out this subset that only includes ungulates
#source("GIDEON_subset_ungulates.R")
#This next way of doing the subset is wrong because it assumes ungulates must be in GMPD; however, there could be records in GIDEON for ungulate / disease for which associated pathogen has not been recorded in GMPD.
# source("GIDEON_subset.R")
Construct dataset animal-dx-parasites (google sheet) matching bacteria to zdx. Follow protocol in mammal-zdx-parasites instructions (google doc).
This includes data on bacterial dx that affect people but not other animals. Make a separate row for each bacterial species. Output: human_bacteria.Rdata
source("parse_species_bacteria.R")
Parse vectors
source("parse_vector.R")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Saving 7 x 5 in image
## Saving 7 x 5 in image
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plot1
plot2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Assemble data on primates (prim-zdx-parasites.xls, google sheet), carnivores (carnivore-zdx-parasites.csv, dropbox), other mammals ("animal-dx-parasites - animal-dx-parasites.csv", exported from google sheet).
Output list of parasites for checking in NCBI, parasiteGMPD_tax_report.txt. Output df_parasite.Rdata (mammals with parasites), df_no_parasite.Rdata (no parasite), and df_all (mammals w/ and w/o parasite)
source("mammal_zdx_assemble.R")
## [1] 9181 9
## [1] 666 5
## [1] 499 11
In GIDEON, filter: Disease --> Agent --> bacterium. Copy diseases into google sheet GIDEON_bacterium_dx. Add bacteria there. Subset df_all.Rdata with GIDEON_bacterium_dx so that GIDEON contains only bacterial diseases
source("GIDEON_subset_bacterial.R")
## [1] "rows including all types of diseases"
## [1] 10346
## [1] "number of bacterial diseases in df_all"
## [1] 129
## [1] "rows in df_all -- only bacterial diseases"
## [1] 676
## [1] "rows in df_all -- including mammals with Label = 0"
## [1] 9772
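The subsetting step can be illustrated as follows. This is a sketch under assumptions, not the code in GIDEON_subset_bacterial.R: the toy `df_all`, the `bacterial_dx` lookup (standing in for the GIDEON_bacterium_dx sheet), and all column names are hypothetical.

```r
# Hypothetical toy versions of df_all and the GIDEON_bacterium_dx sheet.
df_all <- data.frame(
  species = c("Bos taurus", "Bos taurus", "Ovis aries", "Mus musculus"),
  disease = c("Anthrax", "Rabies", "Brucellosis", NA),
  Label   = c(1, 1, 1, 0),
  stringsAsFactors = FALSE
)
bacterial_dx <- data.frame(
  disease  = c("Anthrax", "Brucellosis"),
  bacteria = c("Bacillus anthracis", "Brucella melitensis"),
  stringsAsFactors = FALSE
)

# Rows whose disease appears in the bacterial disease list.
is_bacterial <- df_all$disease %in% bacterial_dx$disease
df_bacterial <- df_all[is_bacterial, ]

# Keep mammals with no recorded disease (Label = 0) alongside the bacterial rows.
df_subset <- df_all[is_bacterial | df_all$Label == 0, ]

print(nrow(df_bacterial))
print(nrow(df_subset))
```

How the real script treats a mammal whose only records are non-bacterial diseases is not shown in this document; the sketch simply drops those rows.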
Assemble mammal (df_all) and human data (human_bacteria). Save as df_all. Note that human_bacteria also includes bacteria found in other mammals.
source("human_mammal.R")
## [1] "rows in df_all"
## [1] 10004
Graph vectors associated with each host order. This assigns vectors from human data to non-human mammals, but does not resave the result as a new dataframe.
source("host_vector.R")
## Saving 7 x 5 in image
plot
Version using taxizedb. This works; if it fails, try restarting R.
source("taxizedb_children.R")
## user system elapsed
## 0 0 0
Version using Catalog of Life. This works, but is commented out because it returns only ~9000 species, which seems like a small number, and because it is not based on NCBI.
# source("taxize_children_col.R")
Version using taxize and NCBI with downstream_ncbi. Comment out because returns an error.
#source("taxize_children.R")
Version using dev version of taxize. Need to restart R before running this version if the CRAN version of taxize is installed. This runs into errors.
# source("taxize_dev.R")
Read in NCBI taxonomy. This uses parent and child relationships to build up the species list. This code is incomplete: it would need to use "while" instead to be comprehensive with respect to parent-child relationships.
# source("ncbi_taxonomy_read.R")
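A "while"-based traversal of the kind described above might look like the following. This is a minimal sketch, not the logic of ncbi_taxonomy_read.R: the `nodes` table, its column names, and the helper `get_descendants` are hypothetical.

```r
# Hypothetical parent-child table: each row links a taxon id to its parent id.
nodes <- data.frame(
  tax_id    = c(2, 10, 11, 100, 101, 200),
  parent_id = c(1, 2, 2, 10, 10, 11),
  rank      = c("superkingdom", "genus", "genus",
                "species", "species", "species"),
  stringsAsFactors = FALSE
)

# Collect every descendant of `root` by expanding the frontier of newly found
# children until no new children remain.
get_descendants <- function(nodes, root) {
  found <- c()
  frontier <- root
  while (length(frontier) > 0) {
    children <- nodes$tax_id[nodes$parent_id %in% frontier]
    # Skip ids already seen, so a self-referencing root cannot loop forever.
    children <- setdiff(children, c(found, root))
    found <- c(found, children)
    frontier <- children
  }
  nodes[nodes$tax_id %in% found, ]
}

descendants <- get_descendants(nodes, root = 2)
species_ids <- descendants$tax_id[descendants$rank == "species"]
print(species_ids)
```

The loop ends when a pass finds no new children, so it reaches species at any depth below the root rather than a fixed number of levels.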
Fix taxonomy in df_all.Rdata using ncbi_taxonomy.Rdata. Commenting out because ncbi_taxonomy_read.R didn't work.
#source("taxonomy_correct.Rdata")
Get all species and classify. Get species in NCBI; then use "classification" in package "taxize" to get the full classification of each species. Add the classification of each species to the dataframe. This solution is not practical because it would take 44 hours even with an API key.
#create list of species
#source("species_classify.R")
#classify each species
#source("R_species_classify1.R")
Upload "parasiteGMPD.csv" to NCBI website (https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi). Choose option to save to file from website. Save file to working directory as "parasiteGMPD_tax_report.txt". Use "parasiteGMPD_tax_report.txt" to correct pathogen species names by merging with df_parasite, with new field "preferred.name". Note that some of the preferred names (e.g. Borreliella) do not match GMPD names (Borrelia). Save df_parasite.Rdata that includes records for mammals without any parasites. Commented out; use full taxonomy from NCBI instead.
# source("parasite_zdx_ncbi.R")
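The name-correction merge described above can be sketched on toy data. The column names, the placeholder species names, and the layout of `tax_report` are assumptions for illustration; they are not the actual layout of the NCBI report or of df_parasite.

```r
# Hypothetical toy data: parasite names as recorded (NA = mammal with no
# parasite), and a report giving the preferred name for each submitted name.
df_parasite <- data.frame(
  host     = c("Host A", "Host B", "Host C"),
  parasite = c("Oldgenus alpha", "Samegenus beta", NA),
  stringsAsFactors = FALSE
)
tax_report <- data.frame(
  name           = c("Oldgenus alpha", "Samegenus beta"),
  preferred.name = c("Newgenus alpha", "Samegenus beta"),
  stringsAsFactors = FALSE
)

# Left join: keep every row of df_parasite, including mammals without a
# parasite, and attach the preferred name where one exists.
merged <- merge(df_parasite, tax_report,
                by.x = "parasite", by.y = "name", all.x = TRUE)

print(merged)
```

Using `all.x = TRUE` is what keeps the rows for mammals without any parasite; their `preferred.name` is simply NA.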
Subset df_all by bacterial diseases (excluding mammals with no diseases). Save df_all.Rdata. This is repeated here from up above, comment out.
# source("GIDEON_subset_bacterial.R")
Outputs: bacteria_species.Rdata (master list of bacteria); out_synonym.Rdata (synonyms of species that were not found in master list but are in NCBI); df_all.Rdata (mammals with and without bacteria, with bacteria names corrected and assigned to taxonomic level); not_found.csv (bacteria not found in NCBI). This uses stri_detect_fixed, from stringi.
source("R_bacteria_lists_compare.R")
## [1] 99441 3
## [1] 99353 3
Use classification in taxize to classify to order all bacteria in df_all. Input: df_all.Rdata. Output: df_all.Rdata
#add taxonomy id from ncbi
source("R_name2taxid.R")
## [1] 755
#use tax_id to get pathogen order, family, genus
source("R_classify_bacteria_observed.R")
## user system elapsed
## 0.266283333 0.009516667 0.284333333
Make graph of pathogenic bacteria by bacteria order, with different colors by bacteria family
source("R_graph_pathogen_order_family.R")
plot
Make graph of pathogenic bacteria by bacteria order, with different colors by bacteria genus (no legend for genus)
source("R_graph_pathogen_order_genus.R")
plot
Make graph of pathogenic bacteria by bacteria order, with different colors by host order
source("R_graph_pathogen_order_host_order.R")
plot
Make graph of host order, with different colors by bacteria order
source("R_graph_host_order_pathogen_order.R")
plot
Use classification in taxize to classify to order all bacteria in master list. Note this takes ~8 hours to do all. Input: bacteria_species.Rdata. Output: bacteria_species_out.Rdata. Commented out for now; this has been run on the workstation.
#source("R_classify_bacteria.R")
Assign pathogen status to bacteria_species_out. Make graph of frequency of bacteria by order, for pathogenic and non-pathogenic bacteria. Input: bacteria_species_out.Rdata; df_all.Rdata. Output: bacteria_species_out1.Rdata
source("R_graph_bacteria_order_pathogen_status.R")
source("R_graph_bacteria_order_pathogen_status1.R")
plot
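Assigning pathogen status and tabulating it by order can be sketched as below. The toy master list, the `observed_pathogens` vector, and the column names are hypothetical; the real scripts work from bacteria_species_out.Rdata and df_all.Rdata.

```r
# Hypothetical master list of bacterial species with their order.
bacteria_species <- data.frame(
  species = c("sp_a", "sp_b", "sp_c", "sp_d"),
  order   = c("Order1", "Order1", "Order1", "Order2"),
  stringsAsFactors = FALSE
)

# Hypothetical set of species observed as pathogens in the host data.
observed_pathogens <- c("sp_a", "sp_d")

# Pathogen status: 1 if the species appears among observed pathogens, else 0.
bacteria_species$pathogen <-
  as.integer(bacteria_species$species %in% observed_pathogens)

# Frequency of pathogenic and non-pathogenic species within each order.
order_counts <- table(bacteria_species$order, bacteria_species$pathogen)
print(order_counts)
```

A table of this shape (order by status) is the kind of input a stacked or side-by-side bar chart of bacteria by order would take.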
Graph counts across mammalian orders. Use df_all.Rdata. Includes humans among primates
source("mammal_orders_graph.R")
## Saving 7 x 5 in image
plot
Graph counts across mammalian orders, with different colors for different bacterial orders.
source("mammal_orders_graph_stacked.R")
## Saving 7 x 5 in image
plot
Graph counts across bacterial species of how many mammals they associate with. Use df_all.Rdata
source("bacteria_host_species_hist_graph.R")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Saving 7 x 5 in image
plot
Graph histogram of number of pathogens associated with each host species
source("host_pathogen_histogram.R")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
## Saving 7 x 5 in image
plot
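The counts behind the two histograms above (hosts per bacterial species, and pathogens per host species) can be sketched as follows. The toy `df_all` and its column names are hypothetical and do not reflect the real df_all.Rdata.

```r
# Hypothetical toy host-pathogen records; NA marks a host with no pathogen.
df_all <- data.frame(
  host     = c("Host A", "Host A", "Host B", "Host B", "Host C", "Host D"),
  pathogen = c("Bact 1", "Bact 2", "Bact 1", "Bact 1", "Bact 1", NA),
  stringsAsFactors = FALSE
)

# Drop hosts without a pathogen and collapse duplicate records.
pairs <- unique(df_all[!is.na(df_all$pathogen), c("host", "pathogen")])

# Number of distinct host species per bacterial species.
hosts_per_pathogen <- table(pairs$pathogen)

# Number of distinct bacterial species per host species.
pathogens_per_host <- table(pairs$host)

print(hosts_per_pathogen)
print(pathogens_per_host)
```

Collapsing duplicates first matters: without `unique`, repeated records of the same host-pathogen pair would inflate both counts.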
Read in GMPD taxonomy and subset bacteria. Save gmpd_taxonomy_bacteria.Rdata
source("GMPD_bacteria.R")
Merge gmpd_taxonomy_bacteria.Rdata with GMPD parasite traits. Save gmpd.Rdata
source("GMPD_traits.R")
Assign GMPD traits to bacteria associated with zdx. Save df_parasite_gmpd.Rdata. Note: this version could be updated to use df_all, but better to integrate with traits in BacDive
source("bacteria_zdx_gmpd_traits.R")
6. Data visualization: summary “state of knowledge” on bacteria causing disease in mammals or humans
Note: this is imperfect because only GMPD-represented species are present. Comment out
# source("bacteria_order.R")
# plot
source("bacteria_traits_gmpd_graph.R")
## Saving 7 x 5 in image
plot
Assemble all GMPD datasets
Source: ProTraits: http://protraits.irb.hr/data.html. We are using the version of the data in which traits have been binarized. Read in data and save as p1.Rdata.
#source("read_data_pathogen_traits1.R")
Output data to file (species.csv) to upload to NCBI website (https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi). After running taxonomy_ncbi.R, go to NCBI website, upload species.csv, and choose option to save to file from website. Save file to working directory.
#source("taxonomy_ncbi.R")
Read back in file outputted from NCBI website ("tax_report.txt"). Merge tax_report.txt with p1 (original Brbic et al. data), with new field "preferred.name"
#source("taxonomy_ncbi_out.R")
Also attempted, didn't work: R package taxize, using function synonyms and "col" (Catalog of Life). Note: taxize does not include NCBI. This requires a lot of interaction for species with multiple matches.
# source("taxonomy1.R")
Also attempted, didn't work: R package myTAI. This option requires interaction while the code is looping through species, for species with more than one entry in NCBI.
# source("taxonomy_p1_myTAI.R")
Note: commenting out corrplot for now because it throws an error.
# source("corrplot_bacteria.R")
#source("read_data_pathogen_traits2.R")
Output file as species2.csv. Make field Organism_name. Some species have multiple entries. There is no explanation about this (https://figshare.com/articles/International_Journal_of_Systematic_and_Evolutionary_Microbiology_IJSEM_phenotypic_database/4272392). Executive decision: filter data so that there are only records for species with one line of data, because these are the data in which we have most confidence. After running taxonomy_ncbi_2.R, go to NCBI website, upload species2.csv, and choose option to save to file from website. Save file to working directory.
#source("taxonomy_ncbi_2.R")
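The "one line of data per species" filter can be sketched as below. The toy `p2` data frame, its trait column, and the placeholder species names are hypothetical; only the field name Organism_name comes from the note above.

```r
# Hypothetical toy trait table in which one species has two entries.
p2 <- data.frame(
  Organism_name = c("Aus bus", "Aus bus", "Cus dus", "Eus fus"),
  trait_value   = c(1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Count rows per species and keep only species that occur exactly once.
rows_per_species <- table(p2$Organism_name)
single_entry <- names(rows_per_species)[rows_per_species == 1]
p2_single <- p2[p2$Organism_name %in% single_entry, ]

print(p2_single)
```

This drops every row of a multiply-listed species rather than picking one of them, which matches the stated decision to keep only the records with a single line of data.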
Read back in file output from NCBI website ("tax_report2.txt"). Merge tax_report2.txt with p2 (original Barberan et al. data), with new field "preferred.name"
#source("taxonomy_ncbi_out2.R")
#source("GMPD_pathogen.R")
#source("common_p.R")