-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathbacteria.Rmd
376 lines (287 loc) · 13.9 KB
/
bacteria.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
---
title: "Bacterial traits and human disease outcomes"
author: "Ilya"
date: "12/7/2018"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
###To do
Scrape GIDEON database for more up-to-date dataset of mammal hosts and diseases.
##Strategy
Integrate data on host species, bacterial species and traits, and human disease outcomes.
Visually summarize bacteria causing disease in mammals.
Apply generalized boosted models (GBM) to predict transmissibility and human disease outcomes based on bacterial traits.
###1. Get bacteria-caused dx in GIDEON
###2. Match dx to pathogen spp. names
###3. Match spp. names (from GIDEON zdx and GMPD) to traits (in GMPD & other)
###4. Compile master list of bacteria spp & traits
###5. Feature construction with bacterial traits
###6. Data visualization: summary “state of knowledge” on bacteria causing disease in mammals or humans
###7. Use traits to predict transmissibility and human disease outcomes
##Data sources
1. GIDEON dataset of bacterial zoonotic diseases and their mammalian hosts
2. Dataset relating mammals to pathogens (GMPD)
3. Dataset matching diseases to pathogens (to be collected)
4. Bacterial trait datasets (Brbic et al. 2016; Barberan et al. 2017; EID2; GMPD)
5. Human disease outcomes (GIDEON) and transmissibility (Han)
6. Mammalian host ranges (IUCN)
##Study design
####install and load required packages
```{r packages, echo=FALSE}
source("packages_bacteria.R")
```
###1. Get bacteria-caused dx in GIDEON
####Read in GIDEON data from scrape
Save as GIDEON.Rdata, including unique diseases associated with each mammal taxon
```{r}
source("GIDEON_read.R")
```
###2. Match bacterial dx to pathogen spp names
####GIDEON data: subset to include only non-carnivores and non-primates
Subset GIDEON data (on mammalian hosts and diseases) to then match up diseases with pathogens. This zdx-pathogen matching has already been done for carnivores and primates (in part), so exclude carnivores and primates.
This saves animal_dx_parasites.Rdata and outputs animal-dx-parasites.csv
```{r}
#Do this with Mammal Species of the World (
#http://www.departments.bucknell.edu/biology/resources/msw3/
#Note that this outputs animal-dx-parasites.csv, which includes only those mammal hosts that have been associated with a bacterial disease (Label = 1)
source("GIDEON_subset_exclude_carnivores_primates.R")
#comment out this version that excludes carnivores
# source("GIDEON_subset_non_carnivores.R")
#Commenting out this subset that only includes ungulates
#source("GIDEON_subset_ungulates.R")
#This next way of doing the subset is wrong because it assumes ungulates must be in GMPD; however, there could be records in GIDEON for ungulate / disease for which associated pathogen has not been recorded in GMPD.
# source("GIDEON_subset.R")
```
Construct dataset animal-dx-parasites (google sheet) matching bacteria to zdx. Follow protocol in mammal-zdx-parasites instructions (google doc).
####Bacterial diseases and bacteria: clean data in GIDEON_bacterium_dx.
This includes data on bacterial dx that affect people but not other animals. Make separate row for each bacteria species
Output = human_bacteria.Rdata
```{r}
source("parse_species_bacteria.R")
```
Parse vectors
```{r}
source("parse_vector.R")
plot1
plot2
```
####Assemble data on primates (prim-zdx-parasites.xls, google sheet), carnivores (carnivore-zdx-parasites.csv, dropbox), other mammals ("animal-dx-parasites - animal-dx-parasites.csv", exported from google sheet).
Output list of parasites for checking in NCBI, parasiteGMPD_tax_report.txt. Output df_parasite.Rdata (mammals with parasites), df_no_parasite.Rdata (no parasite), and df_all (mammals w/ and w/o parasite)
```{r mammal_zdx_assemble}
source("mammal_zdx_assemble.R")
```
####Get bacterial diseases in GIDEON
In GIDEON, filter: Disease --> Agent --> bacterium. Copy diseases into google sheet GIDEON_bacterium_dx. Add bacteria there.
Subset df_all.Rdata with GIDEON_bacterium_dx so that GIDEON contains only bacterial diseases
```{r GIDEON_subset_bacterial_1}
source("GIDEON_subset_bacterial.R")
```
####Assemble mammal (df_all) and human data (human_bacteria). Save as df_all. Note that human_bacteria also includes bacteria found in other mammals.
```{r human_mammal}
source("human_mammal.R")
```
Graph vectors associated with each host order. This assigns vectors from human data to non-human, but does not resaves this as new dataframe
```{r host_vector}
source("host_vector.R")
plot
```
####Get id and children of bacteria.
Version using taxizedb. This works. If this doesn't work, try restarting R.
```{r taxizedb_children}
source("taxizedb_children.R")
```
Version using Catalog of Life. Works but comment out because returns only ~9000 species, which seems like small number, and because not NCBI
```{r}
# source("taxize_children_col.R")
```
Version using taxize and NCBI with downstream_ncbi. Comment out because returns an error.
```{r}
#source("taxize_children.R")
```
Version using dev version of taxize. Need to restart R before doing this versiom if CRAN version of taxize is installed. This runs into errors.
```{r}
# source("taxize_dev.R")
```
Read in NCBI taxonomy
This uses parent and child relationships to build up species list. This code is incomplete, would need to use "while" instead to be comprehensive with respect to parent-child relationships.
```{r}
# source("ncbi_taxonomy_read.R")
```
Fix taxonomy in df_all.Rdata using ncbi_taxonomy.Rdata. Commenting out because ncbi_taxonomy_read.R didn't work.
```{r}
#source("taxonomy_correct.Rdata")
```
Get all species and classify
Get species in NCBI; then use "classification" in package "taxize" to get full classification of species. Add classification of each species to dataframe. This solution is not practical because it would take 44 hours even with API key.
```{r species_classify}
#create list of species
#source("species_classify.R")
#classify each species
#source("R_species_classify1.R")
```
Upload "parasiteGMPD.csv" to NCBI website (https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi). Choose option to save to file from website. Save file to working directory as "parasiteGMPD_tax_report.txt"
Use "parasiteGMPD_tax_report.txt" to correct pathogen species names by merging with df_parasite, with new field "preferred.name". Note that some of the preferred.names (e.g. Borelliela) do not match GMPD names (Borrelia). Save df_parasite.Rdata that includes records for mammals without any parasites.
Comment out, use instead full taxonomy from NCBI
```{r parasite_zdx_ncbi}
# source("parasite_zdx_ncbi.R")
```
Subset df_all by bacterial diseases (excluding mammals with no diseases). Save df_all.Rdata. This is repeated here from up above, comment out.
```{r GIDEON_subset_bacterial}
# source("GIDEON_subset_bacterial.R")
```
####Compare pathogenic bacteria to bacteria in NCBI
outputs: bacteria_species.Rdata (master list of bacteria);
out_synonym.Rdata (synonyms of species that were not found in master list but are in NCBI);
df_all.Rdata (mammals with and without bacteria, with bacteria names corrected and assigned to taxonomic level);
not_found.csv, bacteria not found in NCBI. This uses stri_detect_fixed, from stringi
```{r}
source("R_bacteria_lists_compare.R")
```
####Classify bacteria
Use classification in taxize to classify to order all bacteria in df_all. Input: df_all.Rdata. Output: df_all.Rdata
```{r R_classify_bacteria_observed}
#add taxonomy id from ncbi
source("R_name2taxid.R")
#use tax_id to get pathogen order, family, genus
source("R_classify_bacteria_observed.R")
```
Make graph of pathogenic bacteria by bacteria order, with different colors by bacteria family
```{r R_graph_pathogen_order_family}
source("R_graph_pathogen_order_family.R")
plot
```
Make graph of pathogenic bacteria by bacteria order, with different colors by bacteria genus (no legend for genus)
```{r R_graph_pathogen_order_genus}
source("R_graph_pathogen_order_genus.R")
plot
```
Make graph of pathogenic bacteria by bacteria order, with different colors by host order
```{r R_graph_pathogen_order_host_order}
source("R_graph_pathogen_order_host_order.R")
plot
```
Make graph of host order, with different colors by bacteria order
```{r R_graph_host_order_pathogen_order}
source("R_graph_host_order_pathogen_order.R")
plot
```
Use classification in taxize to classify to order all bacteria in master list. Note this takes ~8 hours to do all. Input: bacteria_species.Rdata. Output: bacteria_species_out.Rdata
Commenting this out for now, has been run on workstation
```{r R_classify_bacteria}
#source("R_classify_bacteria.R")
```
Assign pathogen status to bacteria_species_out. Make graph of frequency of bacteria by order, for pathogenic and non-pathogenic bacteria. Input: bacteria_species_out.Rdata; df_all.Rdata
Output: bacteria_species_out1.Rdata
```{r R_graph_bacteria_order_pathogen_status}
wd = getwd()
setwd(wd)
# file.exists("../../DATA/BIOLOGICAL/df_all.Rdata")#two dots and slash to go up two levels
load("df_all.Rdata")
save(df_all, file= "DATA/df_all.Rdata")
source("R_graph_bacteria_order_pathogen_status.R")
source("R_graph_bacteria_order_pathogen_status1.R")#make the graph
plot
```
Graph counts across mammalian orders. Use df_all.Rdata. Includes humans among primates
```{r mammal_orders_graph}
source("mammal_orders_graph.R")
plot
```
Graph counts across mammalian orders, with different colors for different bacterial orders.
```{r mammal_orders_graph_stacked}
source("mammal_orders_graph_stacked.R")
plot
```
Graph counts across bacterial species of how many mammals they associate with. Use df_all.Rdata
```{r bacteria_host_species_hist_graph}
source("bacteria_host_species_hist_graph.R")
plot
```
Graph histogram of number of pathogens associated with each host species
```{r host_pathogen_histogram}
source("host_pathogen_histogram.R")
plot
```
###3. Match spp. names (from GIDEON zdx and GMPD) to traits (in GMPD & other)
####Get info in BacDive for all bacteria, including those with and without zdx associated
####GMPD: assign GMPD traits to bacteria associated with zdx
Read in GMPD taxonomy and subset bacteria. Save gmpd_taxonomy_bacteria.Rdata
```{r GMPD_bacteria}
source("GMPD_bacteria.R")
```
Merge gmpd_taxonomy_bacteria.Rdata with GMPD parasite traits. Save gmpd.Rdata
```{r GMPD_traits}
source("GMPD_traits.R")
```
Assign GMPD traits to bacteria associated with zdx. Save df_parasite_gmpd.Rdata. Note: this version could be updated to use df_all, but better to integrate with traits in BacDive
```{r bacteria_zdx_gmpd_traits}
source("bacteria_zdx_gmpd_traits.R")
```
###6. Data visualization: summary “state of knowledge” on bacteria causing disease in mammals or humans
####Graph counts of host-bacteria pairs by bacterial order
Note: this is imperfect because only GMPD-represented species are present. Comment out
```{r bacteria_order}
# source("bacteria_order.R")
# plot
```
####Graph counts of host-bacteria pairs by traits in GMPD
```{r bacteria_traits_gmpd_graph}
source("bacteria_traits_gmpd_graph.R")
plot
```
###SCRATCH below here
###Data: bacterial traits
####Bacterial traits: read in and save Brbic et al. 2016
Source: ProTraits: http://protraits.irb.hr/data.html.
We are using version of data in which traits have been binarized.
Read in data and save as p1.Rdata.
```{r read_data_pathogen_traits1}
#source("read_data_pathogen_traits1.R")
```
####Bacterial traits: correct taxonomy in Brbic et al. 2016 data
Output data to file (species.csv) to upload to NCBI website (https://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi). After running taxonomy_ncbi.R, go to NCBI website, upload species.csv, and choose option to save to file from website. Save file to working directory.
```{r taxonomy_ncbi}
#source("taxonomy_ncbi.R")
```
Read back in file outputted from NCBI website ("tax_report.txt"). Merge tax_report.txt with p1 (original Brbic et al. data), with new field "preferred.name"
```{r taxonomy_ncbi_out}
#source("taxonomy_ncbi_out.R")
```
Also attempted, didn't work: Attempted with R package taxize fxn synonyms and "col" (Catalog of Life). Note: taxize does not include NCBI. This requires a lot of interaction for species with multiple matches.
```{r}
# source("taxonomy1.R")
```
Also attempted, didn't work: Attempted with R package myTAI. This option requires interaction while code is looping through species, for species with more than one entry in NCBI.
```{r}
# source("taxonomy_p1_myTAI.R")
```
####Bacterial traits: determine correlations among variables in Brbic et al. 2016.
Note: commenting out corrplot for now because it throws an error.
```{r}
# source("corrplot_bacteria.R")
```
####Bacterial traits: Read in data in Barberan et al. 2017 and save as p2.Rdata
Source: https://figshare.com/articles/International_Journal_of_Systematic_and_Evolutionary_Microbiology_IJSEM_phenotypic_database/4272392
```{r ijsem}
#source("read_data_pathogen_traits2.R")
```
####Bacterial traits: correct taxonomy in Barberan et al. 2017 data
output file as species2.csv. make field Organism_name. Some species have multiple entries. There is no explanation about this (https://figshare.com/articles/International_Journal_of_Systematic_and_Evolutionary_Microbiology_IJSEM_phenotypic_database/4272392). Executive decision: filter data so that there are only records for species with one line of data, because these are data in which we have most confidence.
After running taxonomy_ncbi_2.R, go to NCBI website, upload species.csv, and choose option to save to file from website. Save file to working directory.
```{r}
#source("taxonomy_ncbi_2.R")
```
Read back in file outputted from NCBI website ("tax_report2.txt"). Merge tax_report.txt with p2 (original Barberan et al. data), with new field "preferred.name"
```{r taxonomy_ncbi_out2}
#source("taxonomy_ncbi_out2.R")
```
####Bacterial traits: determine overlap between species in GMPD and each trait database
```{r}
#source("GMPD_pathogen.R")
```
####find species in common between Brbic et al. (p1) and Barberan et al.
```{r}
#source("common_p.R")
```