Bacterial traits and human disease outcomes

Ilya 12/7/2018

To do

Scrape the GIDEON database for a more up-to-date dataset of mammal hosts and diseases.

Strategy

Integrate data on host species, bacterial species and traits, and human disease outcomes. Visually summarize bacteria causing disease in mammals. Apply generalized boosted models (GBM) to predict transmissibility and human disease outcomes based on bacterial traits.

1. Get bacteria-caused dx in GIDEON

2. Match dx to pathogen spp. names

3. Match spp. names (from GIDEON zdx and GMPD) to traits (in GMPD & other)

4. Compile master list of bacteria spp & traits

5. Feature construction with bacterial traits

6. Data visualization: summary “state of knowledge” on bacteria causing disease in mammals or humans

7. Use traits to predict transmissibility and human disease outcomes

Data sources

GIDEON dataset of bacterial zoonotic diseases and their mammalian hosts
Dataset relating mammals to pathogens (GMPD)
Dataset matching diseases to pathogens (to be collected)
Bacterial trait datasets (Brbic et al. 2016; Barberan et al. 2017; EID2; GMPD)
Human disease outcomes (GIDEON) and transmissibility (Han)
Mammalian host ranges (IUCN)

Study design

Install and load required packages

## Removing package from '/Library/Frameworks/R.framework/Versions/3.4/Resources/library'
## (as 'lib' is unspecified)
## Removing package from '/Library/Frameworks/R.framework/Versions/3.4/Resources/library'
## (as 'lib' is unspecified)

## 
## Attaching package: 'rlang'

## The following object is masked from 'package:data.table':
## 
##     :=

## 
##   There is a binary version available but the source version is
##   later:
##        binary source needs_compilation
## tibble  1.4.2  2.0.1              TRUE

## installing the source package 'tibble'

## 
## The downloaded binary packages are in
##  /var/folders/0d/qm_pqljx11s_ddc42g1_yscr0000gn/T//RtmpW5aBIA/downloaded_packages

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:glue':
## 
##     collapse

## The following objects are masked from 'package:data.table':
## 
##     between, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## 
## Attaching package: 'reshape2'

## The following objects are masked from 'package:data.table':
## 
##     dcast, melt

## corrplot 0.84 loaded

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

## CHNOSZ version 1.1.3 (2017-11-13)

## Please run data(thermo) to create the "thermo" object

## 
## Attaching package: 'CHNOSZ'

## The following objects are masked from 'package:Hmisc':
## 
##     mtitle, spearman

## The following object is masked from 'package:dplyr':
## 
##     slice

## Downloading GitHub repo TIBHannover/BacDiveR@master
## from URL https://api.github.com/repos/TIBHannover/BacDiveR/zipball/master

## Installing BacDiveR

## '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file  \
##   --no-environ --no-save --no-restore --quiet CMD INSTALL  \
##   '/private/var/folders/0d/qm_pqljx11s_ddc42g1_yscr0000gn/T/RtmpW5aBIA/devtools1da6534663f7/TIBHannover-BacDiveR-7108220'  \
##   --library='/Library/Frameworks/R.framework/Versions/3.4/Resources/library'  \
##   --install-tests

## 

## Installation failed: Command failed (3)

## Skipping install of 'taxizedb' from a github remote, the SHA1 (7ee9741a) has not changed since last install.
##   Use `force = TRUE` to force installation

## 
## Attaching package: 'taxizedb'

## The following objects are masked from 'package:taxize':
## 
##     children, classification, downstream

1. Get bacteria-caused dx in GIDEON

Read in GIDEON data from scrape

Save as GIDEON.Rdata, including unique diseases associated with each mammal taxon

source("GIDEON_read.R")

2. Match bacterial dx to pathogen spp names

GIDEON data: subset to include only non-carnivores and non-primates

Subset GIDEON data (on mammalian hosts and diseases) to then match up diseases with pathogens. This zdx-pathogen matching has already been done for carnivores and primates (in part), so exclude carnivores and primates. This saves animal_dx_parasites.Rdata and outputs animal-dx-parasites.csv

# Do this with Mammal Species of the World
# (http://www.departments.bucknell.edu/biology/resources/msw3/)

#Note that this outputs animal-dx-parasites.csv, which includes only those mammal hosts that have been associated with a bacterial disease (Label = 1) 
source("GIDEON_subset_exclude_carnivores_primates.R")

## [1] "number of species / zoonosis pairs in GIDEON including all orders"
## [1] 2256    2
## [1] "number of non-carnivore mammal species without records in GIDEON"
## [1] 7638
## [1] "number of species / zoonosis pairs in GIDEON (all orders) after merge with mammals of the world checklist"
## [1] 15687     3
## [1] "number of *unique* species / zoonosis pairs in GIDEON (all orders) after merge with mammals of the world checklist"
## [1] 2246    3
## [1] 2246    3
## [1] "number of species / zoonosis records in GIDEON excluding primates and carnivores"
## [1] 1452
## [1] "records of non-carnivore mammals, with zdx and including one row for each mammal w/o a recorded zdx"
## [1] 9090
## [1] "check that size of animal-dx-parasites matches number GIDEON records plus number of other mammals"
## [1] TRUE
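
The subsetting step above can be sketched in base R as follows. This is a minimal illustration on toy data, not the contents of GIDEON_subset_exclude_carnivores_primates.R; the data frames `gideon_toy` and `checklist_toy` and their column names are hypothetical.

```r
# Hypothetical toy data; the real GIDEON and checklist fields may differ
gideon_toy <- data.frame(
  species = c("Bos taurus", "Bos taurus", "Canis lupus", "Macaca mulatta"),
  disease = c("Anthrax", "Brucellosis", "Leptospirosis", "Tuberculosis"),
  stringsAsFactors = FALSE
)
checklist_toy <- data.frame(
  species = c("Bos taurus", "Canis lupus", "Macaca mulatta", "Ovis aries"),
  order = c("Artiodactyla", "Carnivora", "Primates", "Artiodactyla"),
  stringsAsFactors = FALSE
)

# Left join from the checklist so that hosts without any record are kept
merged <- unique(merge(checklist_toy, gideon_toy, by = "species", all.x = TRUE))

# Exclude the orders that were already matched by hand
subset_df <- merged[!merged$order %in% c("Carnivora", "Primates"), ]

# Label = 1 if the host has a recorded disease, 0 otherwise
subset_df$Label <- as.integer(!is.na(subset_df$disease))

# Size check in the spirit of the one printed above
n_records <- sum(subset_df$Label == 1)
n_without <- sum(subset_df$Label == 0)
nrow(subset_df) == n_records + n_without
```

Joining from the checklist side (`all.x = TRUE`) is what keeps one row for each mammal without a recorded disease.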

#comment out this version that excludes carnivores
# source("GIDEON_subset_non_carnivores.R")

#Commenting out this subset that only includes ungulates
#source("GIDEON_subset_ungulates.R")

#This next way of doing the subset is wrong because it assumes ungulates must be in GMPD; however, there could be records in GIDEON for ungulate / disease for which associated pathogen has not been recorded in GMPD. 
# source("GIDEON_subset.R")

Construct dataset animal-dx-parasites (google sheet) matching bacteria to zdx. Follow protocol in mammal-zdx-parasites instructions (google doc).

Bacterial diseases and bacteria: clean data in GIDEON_bacterium_dx.

This includes data on bacterial dx that affect people but not other animals. Make a separate row for each bacterial species. Output: human_bacteria.Rdata

source("parse_species_bacteria.R")
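
Making one row per bacterial species might look like the sketch below. It assumes that multiple species are stored in a single comma-separated field; that delimiter, the data frame `dx_toy`, and its column names are assumptions for illustration, not taken from parse_species_bacteria.R.

```r
# Hypothetical input: one row per disease, species packed into one field
dx_toy <- data.frame(
  disease = c("Brucellosis", "Anthrax"),
  bacteria = c("Brucella abortus, Brucella melitensis", "Bacillus anthracis"),
  stringsAsFactors = FALSE
)

# Split the packed field on commas (with optional trailing whitespace)
parts <- strsplit(dx_toy$bacteria, ",\\s*")

# Repeat each disease once per species and flatten the species list
dx_long <- data.frame(
  disease = rep(dx_toy$disease, lengths(parts)),
  bacterium = unlist(parts),
  stringsAsFactors = FALSE
)
dx_long
```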

Parse vectors

source("parse_vector.R")

## Warning: Ignoring unknown parameters: binwidth, bins, pad

## Saving 7 x 5 in image
## Saving 7 x 5 in image

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot1

plot2

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Assemble data on primates (prim-zdx-parasites.xls, google sheet), carnivores (carnivore-zdx-parasites.csv, dropbox), other mammals ("animal-dx-parasites - animal-dx-parasites.csv", exported from google sheet).

Output list of parasites for checking in NCBI, parasiteGMPD_tax_report.txt. Output df_parasite.Rdata (mammals with parasites), df_no_parasite.Rdata (no parasite), and df_all (mammals w/ and w/o parasite)

source("mammal_zdx_assemble.R")

## [1] 9181    9
## [1] 666   5
## [1] 499  11

Get bacterial diseases in GIDEON

In GIDEON, filter: Disease --> Agent --> bacterium. Copy diseases into google sheet GIDEON_bacterium_dx. Add bacteria there. Subset df_all.Rdata with GIDEON_bacterium_dx so that GIDEON contains only bacterial diseases

source("GIDEON_subset_bacterial.R")

## [1] "rows including all types of diseases"
## [1] 10346
## [1] "number of bacterial diseases  in df_all"
## [1] 129
## [1] "rows in df_all -- only bacterial diseases"
## [1] 676
## [1] "rows in df_all -- including mammals with Label = 0"
## [1] 9772
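
A minimal sketch of this kind of subset, assuming a vector of bacterial disease names taken from the GIDEON_bacterium_dx sheet. The toy objects `df_all_toy` and `bacterial_dx_toy` are hypothetical, and how GIDEON_subset_bacterial.R treats hosts whose only diseases are non-bacterial is not shown here.

```r
# Hypothetical host-disease table; Label = 0 marks hosts with no recorded disease
df_all_toy <- data.frame(
  species = c("Bos taurus", "Vulpes vulpes", "Ovis aries", "Rattus rattus"),
  disease = c("Anthrax", "Rabies", NA, "Plague"),
  Label = c(1, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Hypothetical list of diseases flagged as bacterial
bacterial_dx_toy <- c("Anthrax", "Plague")

# Rows with bacterial diseases only
bacterial_only <- df_all_toy[df_all_toy$disease %in% bacterial_dx_toy, ]

# Bacterial diseases plus hosts with Label = 0
keep <- df_all_toy$Label == 0 | df_all_toy$disease %in% bacterial_dx_toy
bacterial_with_zeros <- df_all_toy[keep, ]

nrow(bacterial_only)
nrow(bacterial_with_zeros)
```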

Assemble mammal (df_all) and human data (human_bacteria). Save as df_all. Note that human_bacteria also includes bacteria found in other mammals.

source("human_mammal.R")

## [1] "rows in df_all"
## [1] 10004

Graph vectors associated with each host order. This assigns vectors from human data to non-human hosts, but does not resave the result as a new data frame.

source("host_vector.R")

## Saving 7 x 5 in image

plot

Get id and children of bacteria.

Version using taxizedb. This is the version that works; if it fails, try restarting R.

source("taxizedb_children.R")

##    user  system elapsed 
##       0       0       0

Version using Catalog of Life. This works, but it is commented out because it returns only ~9000 species, which seems like a small number, and because the source is not NCBI.

# source("taxize_children_col.R")

Version using taxize and NCBI with downstream_ncbi. Comment out because returns an error.

#source("taxize_children.R")

Version using the dev version of taxize. Need to restart R before running this version if the CRAN version of taxize is installed. This runs into errors.

# source("taxize_dev.R")

Read in NCBI taxonomy. This uses parent and child relationships to build up the species list. This code is incomplete: it would need to use "while" instead to be comprehensive with respect to parent-child relationships.

# source("ncbi_taxonomy_read.R")
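
A "while"-based traversal of the kind described above could be sketched as follows. It assumes a node table with one row per taxon and columns for the taxon id, its parent id, and its rank; the toy table `nodes_toy`, its ids, and the helper `get_descendants` are hypothetical and are not taken from ncbi_taxonomy_read.R.

```r
# Hypothetical node table: each taxon points to its parent
nodes_toy <- data.frame(
  tax_id = c(2, 10, 11, 100, 101, 110),
  parent_id = c(1, 2, 2, 10, 10, 11),
  rank = c("superkingdom", "genus", "genus", "species", "species", "species"),
  stringsAsFactors = FALSE
)

# Collect all descendants of `root`, one level per pass of the while loop
get_descendants <- function(nodes, root) {
  found <- numeric(0)
  frontier <- root
  while (length(frontier) > 0) {
    kids <- nodes$tax_id[nodes$parent_id %in% frontier]
    # Drop ids already seen so that a self-parented root cannot loop forever
    kids <- setdiff(kids, c(found, root))
    found <- c(found, kids)
    frontier <- kids
  }
  found
}

descendants <- get_descendants(nodes_toy, root = 2)
species_ids <- nodes_toy$tax_id[nodes_toy$tax_id %in% descendants &
                                  nodes_toy$rank == "species"]
species_ids
```

The loop stops when a pass finds no new children, so it reaches every depth of the hierarchy rather than a fixed number of levels.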

Fix taxonomy in df_all.Rdata using ncbi_taxonomy.Rdata. Commenting out because ncbi_taxonomy_read.R didn't work.

#source("taxonomy_correct.Rdata")

Get all species and classify. Get species in NCBI; then use "classification" in package "taxize" to get the full classification of each species. Add the classification of each species to the dataframe. This solution is not practical because it would take 44 hours even with an API key.

#create list of species
#source("species_classify.R")
#classify each species 
#source("R_species_classify1.R")

Upload "parasiteGMPD.csv" to the NCBI website (https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi). Choose the option to save to file from the website. Save the file to the working directory as "parasiteGMPD_tax_report.txt". Use "parasiteGMPD_tax_report.txt" to correct pathogen species names by merging with df_parasite, with new field "preferred.name". Note that some of the preferred.names (e.g. Borreliella) do not match GMPD names (Borrelia). Save df_parasite.Rdata that includes records for mammals without any parasites. Commented out; use the full taxonomy from NCBI instead.

# source("parasite_zdx_ncbi.R")
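
The merge described above might look like the following sketch. The toy data frames and the column name `name` are hypothetical stand-ins (only `preferred.name` comes from the text), and the name pair is illustrative rather than copied from an actual NCBI report.

```r
# Hypothetical host-parasite records with names as they appear in the source data
df_parasite_toy <- data.frame(
  host = c("Bos taurus", "Ovis aries"),
  parasite = c("Borrelia burgdorferi", "Bacillus anthracis"),
  stringsAsFactors = FALSE
)

# Hypothetical, already-parsed tax report: submitted name and preferred name
tax_report_toy <- data.frame(
  name = c("Borrelia burgdorferi", "Bacillus anthracis"),
  preferred.name = c("Borreliella burgdorferi", "Bacillus anthracis"),
  stringsAsFactors = FALSE
)

# Keep every parasite record, adding preferred.name where the report has one
merged_names <- merge(df_parasite_toy, tax_report_toy,
                      by.x = "parasite", by.y = "name", all.x = TRUE)

# Flag records whose preferred name differs from the original name
merged_names$name_changed <- merged_names$parasite != merged_names$preferred.name
merged_names
```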

Subset df_all by bacterial diseases (excluding mammals with no diseases). Save df_all.Rdata. This repeats the step above, so it is commented out.

# source("GIDEON_subset_bacterial.R")

Compare pathogenic bacteria to bacteria in NCBI

Outputs: bacteria_species.Rdata (master list of bacteria); out_synonym.Rdata (synonyms of species that were not found in master list but are in NCBI); df_all.Rdata (mammals with and without bacteria, with bacteria names corrected and assigned to taxonomic level); not_found.csv (bacteria not found in NCBI). This uses stri_detect_fixed from stringi.

source("R_bacteria_lists_compare.R")

## [1] 99441     3
## [1] 99353     3
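
The comparison can be illustrated with the sketch below. It swaps in base R `grepl(..., fixed = TRUE)` for stringi's stri_detect_fixed so that it runs without extra packages, and the vectors `master_toy` and `observed_toy` are hypothetical; the matching rules in R_bacteria_lists_compare.R may differ.

```r
# Hypothetical master list and observed pathogen names
master_toy <- c("Bacillus anthracis", "Brucella abortus", "Yersinia pestis")
observed_toy <- c("Bacillus anthracis", "Brucella", "Vibrio cholerae")

# Exact species-level matches
exact <- observed_toy %in% master_toy

# Fixed-string (non-regex) substring matches, e.g. a genus name inside a species name
partial <- vapply(
  observed_toy,
  function(x) any(grepl(x, master_toy, fixed = TRUE)),
  logical(1),
  USE.NAMES = FALSE
)

status <- ifelse(exact, "species match",
                 ifelse(partial, "partial match", "not found"))

# Names that would go to a not-found list for manual checking
not_found <- observed_toy[status == "not found"]
data.frame(observed = observed_toy, status = status)
```

Fixed-string matching avoids treating characters such as "." or "(" in taxon names as regular-expression syntax.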

Classify bacteria

Use classification in taxize to classify to order all bacteria in df_all. Input: df_all.Rdata. Output: df_all.Rdata

#add taxonomy id from ncbi
source("R_name2taxid.R")

## [1] 755

#use tax_id to get pathogen order, family, genus
source("R_classify_bacteria_observed.R")

##        user      system     elapsed 
## 0.266283333 0.009516667 0.284333333
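
Once a classification is available, pulling out order, family, and genus can be sketched as below. The toy list `cls_toy` imitates the shape returned by `classification` in taxize (one data frame per taxon with `name` and `rank` columns); the lineages are illustrative, the second one is deliberately missing its family row, and the helper `pick_rank` is hypothetical.

```r
# Hypothetical classification results, one data frame per species
cls_toy <- list(
  "Bacillus anthracis" = data.frame(
    name = c("Bacteria", "Bacillales", "Bacillaceae", "Bacillus", "Bacillus anthracis"),
    rank = c("superkingdom", "order", "family", "genus", "species"),
    stringsAsFactors = FALSE
  ),
  "Yersinia pestis" = data.frame(
    name = c("Bacteria", "Enterobacterales", "Yersinia", "Yersinia pestis"),
    rank = c("superkingdom", "order", "genus", "species"),
    stringsAsFactors = FALSE
  )
)

# Return the name at a given rank, or NA when that rank is absent
pick_rank <- function(df, rank) {
  hit <- df$name[df$rank == rank]
  if (length(hit) == 0) NA_character_ else hit[1]
}

# One row per species with the ranks of interest
taxonomy_toy <- do.call(rbind, lapply(names(cls_toy), function(sp) {
  df <- cls_toy[[sp]]
  data.frame(
    species = sp,
    order = pick_rank(df, "order"),
    family = pick_rank(df, "family"),
    genus = pick_rank(df, "genus"),
    stringsAsFactors = FALSE
  )
}))
taxonomy_toy
```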

Make graph of pathogenic bacteria by bacteria order, with different colors by bacteria family

source("R_graph_pathogen_order_family.R")
plot

Make graph of pathogenic bacteria by bacteria order, with different colors by bacteria genus (no legend for genus)

source("R_graph_pathogen_order_genus.R")
plot

Make graph of pathogenic bacteria by bacteria order, with different colors by host order

source("R_graph_pathogen_order_host_order.R")
plot

Make graph of host order, with different colors by bacteria order

source("R_graph_host_order_pathogen_order.R")
plot

Use classification in taxize to classify to order all bacteria in the master list. Note this takes ~8 hours to do all. Input: bacteria_species.Rdata. Output: bacteria_species_out.Rdata. Commented out for now; it has been run on the workstation.

#source("R_classify_bacteria.R")

Assign pathogen status to bacteria_species_out. Make graph of frequency of bacteria by order, for pathogenic and non-pathogenic bacteria. Input: bacteria_species_out.Rdata; df_all.Rdata. Output: bacteria_species_out1.Rdata

source("R_graph_bacteria_order_pathogen_status.R")
source("R_graph_bacteria_order_pathogen_status1.R")
plot

Graph counts across mammalian orders. Use df_all.Rdata. Includes humans among primates

source("mammal_orders_graph.R")

## Saving 7 x 5 in image

plot

Graph counts across mammalian orders, with different colors for different bacterial orders.

source("mammal_orders_graph_stacked.R")

## Saving 7 x 5 in image

plot

Graph counts across bacterial species of how many mammals they associate with. Use df_all.Rdata

source("bacteria_host_species_hist_graph.R")

## Warning: Ignoring unknown parameters: binwidth, bins, pad

## Saving 7 x 5 in image

plot

Graph histogram of number of pathogens associated with each host species

source("host_pathogen_histogram.R")

## Warning: Ignoring unknown parameters: binwidth, bins, pad

## Saving 7 x 5 in image

plot
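
The counts behind a histogram like this one can be computed as in the sketch below. The toy data frame `pairs_toy` and its column names are hypothetical; host_pathogen_histogram.R itself is not reproduced here.

```r
# Hypothetical host-pathogen records, including one duplicated pair
pairs_toy <- data.frame(
  host = c("Bos taurus", "Bos taurus", "Bos taurus", "Ovis aries",
           "Rattus rattus", "Rattus rattus"),
  pathogen = c("Bacillus anthracis", "Brucella abortus", "Bacillus anthracis",
               "Bacillus anthracis", "Yersinia pestis", "Leptospira interrogans"),
  stringsAsFactors = FALSE
)

# Count each host-pathogen pair only once
pairs_toy <- unique(pairs_toy)

# Number of distinct pathogens per host species
n_pathogens <- tapply(pairs_toy$pathogen, pairs_toy$host,
                      function(p) length(unique(p)))

# Frequency of hosts by pathogen count, i.e. the histogram bar heights
host_freq <- table(n_pathogens)
host_freq
```

Passing `n_pathogens` to `hist()` or to a ggplot2 histogram would then draw the figure.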

3. Match spp. names (from GIDEON zdx and GMPD) to traits (in GMPD & other)

Get info in BacDive for all bacteria, including those with and without zdx associated

GMPD: assign GMPD traits to bacteria associated with zdx

Read in GMPD taxonomy and subset bacteria. Save gmpd_taxonomy_bacteria.Rdata

source("GMPD_bacteria.R")

Merge gmpd_taxonomy_bacteria.Rdata with GMPD parasite traits. Save gmpd.Rdata

source("GMPD_traits.R")

Assign GMPD traits to bacteria associated with zdx. Save df_parasite_gmpd.Rdata. Note: this version could be updated to use df_all, but better to integrate with traits in BacDive

source("bacteria_zdx_gmpd_traits.R")

6. Data visualization: summary “state of knowledge” on bacteria causing disease in mammals or humans

Graph counts of host-bacteria pairs by bacterial order

Note: this is imperfect because only GMPD-represented species are present. Comment out

# source("bacteria_order.R")
# plot

Graph counts of host-bacteria pairs by traits in GMPD

source("bacteria_traits_gmpd_graph.R")

## Saving 7 x 5 in image

plot

SCRATCH below here

Visualize data on number of bacterial diseases per host species, facet wrap by host order

Assemble all GMPD datasets

Data: Global Mammal Parasite Database

Data: bacterial traits

Bacterial traits: read in and save Brbic et al. 2016

Source: ProTraits: http://protraits.irb.hr/data.html. We are using the version of the data in which traits have been binarized. Read in data and save as p1.Rdata.

#source("read_data_pathogen_traits1.R")

Bacterial traits: correct taxonomy in Brbic et al. 2016 data

Output data to file (species.csv) to upload to NCBI website (https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi). After running taxonomy_ncbi.R, go to NCBI website, upload species.csv, and choose option to save to file from website. Save file to working directory.

#source("taxonomy_ncbi.R")

Read back in file outputted from NCBI website ("tax_report.txt"). Merge tax_report.txt with p1 (original Brbic et al. data), with new field "preferred.name"

#source("taxonomy_ncbi_out.R")

Also attempted, didn't work: R package taxize, function synonyms, with "col" (Catalog of Life) as the source. Note: the synonyms function in taxize does not offer NCBI as a source. This approach requires a lot of interaction for species with multiple matches.

# source("taxonomy1.R")

Also attempted, didn't work: Attempted with R package myTAI. This option requires interaction while code is looping through species, for species with more than one entry in NCBI.

# source("taxonomy_p1_myTAI.R")

Bacterial traits: determine correlations among variables in Brbic et al. 2016.

Note: commenting out corrplot for now because it throws an error.

# source("corrplot_bacteria.R")

Bacterial traits: Read in data in Barberan et al. 2017 and save as p2.Rdata

Source: https://figshare.com/articles/International_Journal_of_Systematic_and_Evolutionary_Microbiology_IJSEM_phenotypic_database/4272392

#source("read_data_pathogen_traits2.R")

Bacterial traits: correct taxonomy in Barberan et al. 2017 data

Output the file as species2.csv, with field Organism_name. Some species have multiple entries, and the source gives no explanation for this (https://figshare.com/articles/International_Journal_of_Systematic_and_Evolutionary_Microbiology_IJSEM_phenotypic_database/4272392). Executive decision: filter the data so that there are only records for species with one line of data, because these are the data in which we have most confidence. After running taxonomy_ncbi_2.R, go to the NCBI website, upload species2.csv, and choose the option to save to file from the website. Save the file to the working directory.

#source("taxonomy_ncbi_2.R")
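
Keeping only species with a single line of data could be done as in this sketch. Only the field name Organism_name comes from the text; the toy data frame `p2_toy` and its trait column are hypothetical.

```r
# Hypothetical trait table in which one species appears twice
p2_toy <- data.frame(
  Organism_name = c("Bacillus subtilis", "Escherichia coli",
                    "Escherichia coli", "Vibrio cholerae"),
  trait = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Count rows per species name
n_rows <- table(p2_toy$Organism_name)

# Keep species that occur exactly once
single_names <- names(n_rows)[n_rows == 1]
p2_single <- p2_toy[p2_toy$Organism_name %in% single_names, ]
p2_single
```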

Read back in the file output from the NCBI website ("tax_report2.txt"). Merge tax_report2.txt with p2 (original Barberan et al. data), with new field "preferred.name"

#source("taxonomy_ncbi_out2.R")

Bacterial traits: determine overlap between species in GMPD and each trait database

#source("GMPD_pathogen.R")

Find species in common between Brbic et al. (p1) and Barberan et al. (p2)

#source("common_p.R")
