Commit b1348d9 (parent 928e538)
1 file changed: +240 lines added, 0 removed

#---------------------------------------------------------------------------------------------------
# Appendix1_hyperparameters.R: learning about BRT hyper-parameters to fit the model
#---------------------------------------------------------------------------------------------------

# Fit a simple model as practice, and consider all the information from Elith et al. (2008)
# given below when fitting your own model.

# Load libraries
library(dplyr)  # pipes and mutate()
library(gbm)    # boosted regression trees
library(dismo)  # gbm.step(), gbm.plot(), evaluate()

# Load data
wd <- paste0(output_data, "/folds_dataset")
setwd(wd)
train <- read.csv("folds_dataset.csv", sep = ";")
names(train)
head(train)

# Set categorical predictors as categories:
train <- train %>%
  mutate(fold = factor(fold),
         Cloud.cover = factor(Cloud.cover))

# Set continuous predictors as numerical:
train <- train %>%
  mutate(TL = as.numeric(gsub(",", ".", TL)), # convert comma decimals to dot decimals
         diff_at_sbt = as.numeric(gsub(",", ".", diff_at_sbt)),
         LN_MinsExposedtoAir = as.numeric(gsub(",", ".", LN_MinsExposedtoAir)),
         Average_speed = as.numeric(gsub(",", ".", Average_speed)),
         LN_TotalBiomassHaul = as.numeric(gsub(",", ".", LN_TotalBiomassHaul)),
         Trawl_duration = as.numeric(gsub(",", ".", Trawl_duration)),
         RN = as.numeric(gsub(",", ".", RN)),
         Alive_Dead = as.numeric(gsub(",", ".", Alive_Dead)),
         MinsExposedtoAir = as.numeric(gsub(",", ".", MinsExposedtoAir)),
         TotalBiomassHaul = as.numeric(gsub(",", ".", TotalBiomassHaul)))
head(train)

train <- train[, !names(train) %in% c("tripID", "organismID", "hour", "DW", "Sex", "Sea.state",
                                      "time", "time_hours", "at_celsius", "bathy", "sbt_merged")]

summary(train)

# List the names of the predictor variables
vars <- c("TL", "diff_at_sbt", "Cloud.cover", "LN_MinsExposedtoAir",
          "Average_speed", "Trawl_duration", "RN", "LN_TotalBiomassHaul")

# Understand the hyper-parameters based on Elith et al. (2008):
# number of trees (nt), learning rate (lr), tree complexity (tc) and bag fraction.

#####* 1. ‘bag fraction’:
#* Controls stochasticity by specifying the proportion of data to be selected at each step.
#* The default bag fraction is 0.5, meaning that, at each iteration, 50% of the data are drawn at random,
#* without replacement, from the full training set. Optimal bag fractions can be established by comparing predictive
#* performance and model-to-model variability under different bag fractions.
#* In our experience, stochasticity improves model performance, and
#* IMPORTANT: fractions in the range 0.5-0.75 have given the best results for presence-absence responses.

#####* 2. ‘learning rate and number of trees’:
#* lr is used to shrink the contribution of each tree as it is added to the model.

#* IMPORTANT: Decreasing (slowing) lr increases the number of trees (nt) required;
#* in general, a smaller lr (and larger nt) is preferable, although this is conditioned by the number of
#* observations and the time available for computation. It is recommended to fit models with at least 1000 trees.

#* TESTING: The usual approach is to estimate the optimal nt and lr with an independent test set or with CV,
#* using deviance reduction as the measure of success. GOAL: our aim here is to find the combination of parameters (lr, tc and nt)
#* that achieves minimum predictive error (minimum error for predictions to independent samples).
#* Slower lr values are generally preferable to faster ones, because they shrink the contribution of each tree more
#* and help the final model to reliably estimate the response.

#####* 3. ‘tree complexity’:
#* Tree complexity (the number of nodes in a tree) also affects the optimal nt.
#* For a given lr, fitting more complex trees leads to fewer trees being required for minimum error.
#* So, as tc is increased, lr must be decreased if sufficient trees are to be fitted.
#* Theoretically, the tc should reflect the true interaction order in the response being modelled,
#* but as this is almost always unknown, tc is best set with independent data.
#* Sample size influences the optimal settings for lr and tc.

#* IMPORTANT: As a general guide, lr needs to be decreased as tc increases to give approximately the same nt;
#* you will also have to consider the trade-off in computing time.

#####* IDENTIFYING THE OPTIMAL PARAMETERS: CROSS-VALIDATION:
#* Techniques such as cross-validation (CV) are used for model development and/or evaluation.
#* Cross-validation provides a means of testing the model on withheld portions of the data, while still using all data at some stage to fit the model.

#* METHOD (see the sketch after this list for how the gbm.step() arguments map onto these steps):
#* 1. Randomly divide the available data into n subsets (Elith used 10).

#* 2. Make n (= 10) different training sets, each comprising a unique combination
#* of 9 subsets. Therefore, for each training set there is a unique omitted subset that is used for testing.

#* 3. Starting with a selected number of trees (nt), say 50, develop 10 BRT models simultaneously,
#* one on each training set, and test predictive performance on their respective omitted data.
#* Both mean performance and standard errors are recorded.

#* 4. Step forward and increase the nt in each model by a selected and constant amount, and repeat step 3.

#* 5. Repeat step 4 and, after 10 steps, start comparing the predictive performance of the 6th
#* to 10th previous iterations against that of the current to 5th previous ones (the previous five vs. the most recent five).
#* Once the average deviance of the more recent set is higher than the average of the previous set,
#* the minimum has been passed.

#* 6. Stop and record the minimum; this is the optimal nt.

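#* The stepwise search described above is what dismo::gbm.step() carries out internally.
#* A minimal sketch of how its arguments map onto the steps; the specific values below are
#* illustrative assumptions, not the settings used later in this script:
#*  - n.folds / fold.vector control steps 1-2 (how the data are split),
#*  - n.trees is the starting number of trees in step 3,
#*  - step.size is the constant amount added at each forward step (step 4),
#*  - max.trees caps the forward stepping; gbm.step() stops earlier once the minimum is passed (steps 5-6).
brt_sketch <- gbm.step(data = train, gbm.x = vars, gbm.y = "Alive_Dead",
                       family = "bernoulli",
                       tree.complexity = 3, learning.rate = 0.005, bag.fraction = 0.5,
                       n.folds = 10,       # steps 1-2: ten CV subsets
                       n.trees = 50,       # step 3: starting nt
                       step.size = 50,     # step 4: trees added at each step
                       max.trees = 10000)  # upper limit for the forward steps
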
#---------------------------------------------------------------------------------------------------
#* IMPORTANT: the optimal values most usually fall in these ranges:
#---------------------------------------------------------------------------------------------------
#* tc - tree.complexity = 1 - 5 (usually try 1, 3 and 5)

#* lr - learning.rate = 0.001 - 0.01 (usually try 0.001, 0.005, 0.01 and 0.05)

#* bag.fraction = 0.5 - 0.75 (usually try 0.5, 0.6 and 0.7)

#* You will combine these candidate values, compare the results of all the potential
#* combinations, and select the best one (a grid-search sketch is given below).

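#* A minimal grid-search sketch for comparing the candidate settings above. This is illustrative
#* only: the grids, the reuse of the fold vector, and CV deviance / CV AUC as comparison criteria
#* are assumptions following the guidance above, not the exact procedure used in this study.
hyper_grid <- expand.grid(tc = c(1, 3, 5),
                          lr = c(0.001, 0.005, 0.01),
                          bf = c(0.5, 0.6, 0.7))
hyper_grid$nt <- NA
hyper_grid$cv_deviance <- NA
hyper_grid$cv_auc <- NA

for (i in seq_len(nrow(hyper_grid))) {
  m <- gbm.step(data = train, gbm.x = vars, gbm.y = "Alive_Dead",
                family = "bernoulli",
                tree.complexity = hyper_grid$tc[i],
                learning.rate = hyper_grid$lr[i],
                bag.fraction = hyper_grid$bf[i],
                fold.vector = train$fold,
                n.folds = length(unique(train$fold)),
                silent = TRUE, plot.main = FALSE)
  if (!is.null(m)) { # gbm.step() returns NULL when it cannot find a minimum
    hyper_grid$nt[i]          <- m$gbm.call$best.trees
    hyper_grid$cv_deviance[i] <- m$cv.statistics$deviance.mean
    hyper_grid$cv_auc[i]      <- m$cv.statistics$discrimination.mean
  }
}
hyper_grid[order(hyper_grid$cv_deviance), ] # lowest CV deviance first
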
# For now, start with just a simple model:

brt <- gbm.step(data = train, gbm.x = vars, gbm.y = "Alive_Dead",
                family = "bernoulli", tree.complexity = 5,
                learning.rate = 0.01, bag.fraction = 0.6,
                fold.vector = train$fold,
                n.folds = length(unique(train$fold)))

# ggPerformance() is not part of gbm/dismo; it is assumed to come from an additional
# BRT plotting package loaded elsewhere (e.g. ggBRT).
ggPerformance(brt)

#* Look at the graph and follow the Elith et al. (2008) recommendations to check the performance of the model fitting.
#* This particular model was built with the cross-validation folds defined in the dataset
#* (Elith et al.'s example used the default 10-fold cross-validation).
#* The solid black curve is the mean, and the dotted curves show about 1 standard error around it,
#* for the changes in predictive deviance (i.e., as measured on the excluded folds of the cross-validation).
#* The red line shows the minimum of the mean, and the green line the number of trees at which that occurs.
#* The final model returned in the model object is built on the full training data set,
#* using the number of trees identified as optimal.

#---------------------------------------------------------------------------------------------------
#########* IMPORTANT: the Elith et al. (2008) recommendations are:
#---------------------------------------------------------------------------------------------------

#* 1. Convergence between the red and green lines above 1000 trees (the number of trees where the minimum of the mean occurs).

#* 2. The lowest cv_deviance.

#* 3. Among models with comparable performance, prioritise those with a larger lr (quicker to fit) and a larger nt and tc.

#* 4. Visually check the graph: it is better if the cv_deviance curve shows a progressive but steep descent, with convergence above 1000 trees (never fewer), and
#* then stabilises, remaining flat (never rising again, which would indicate overfitting and a loss of predictive capacity).

#* 5. AUC: the higher the better (1 is the highest).

# IN THIS CASE, THE RESULT IS:
#* fitting final gbm model with a fixed number of 1150 trees for NA
#* mean total deviance = 1.255
#* mean residual deviance = 0.361
#* estimated cv deviance = 0.606 ; se = 0.02
#* training data correlation = 0.884
#* cv correlation = 0.755 ; se = 0.018
#* training data AUC score = 0.985
#* cv AUC score = 0.925 ; se = 0.009
#* elapsed time - 0.26 minutes

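#* The same summary numbers can also be pulled out of the fitted object programmatically.
#* A sketch only: the element names below follow dismo's gbm.step() output and are
#* assumptions, not values checked against this particular run.
brt$gbm.call$best.trees               # optimal number of trees
brt$self.statistics$mean.resid        # mean residual deviance on the training data
brt$cv.statistics$deviance.mean       # estimated cv deviance
brt$cv.statistics$correlation.mean    # cv correlation
brt$cv.statistics$discrimination.mean # cv AUC score
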
#### CHECK THE INFLUENCE OF VARIABLES:
# Check the relative influence of each predictor:
#* The measures are based on the number of times a variable is selected for splitting,
#* weighted by the squared improvement to the model as a result of each split, and averaged over all trees.
#* The relative influence (or contribution) of each variable is scaled so that the sum adds to 100, with higher
#* numbers indicating a stronger influence on the response.
summary(brt)

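#* The scaled contributions can also be read directly from the fitted object as a data frame
#* (a sketch; 'contributions' is the element name gbm.step() is assumed to use for this table):
brt$contributions # variable names with their relative influence (sums to 100)
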
#### PARTIAL DEPENDENCE PLOTS (variable response curves):
# Then plot the results, once you have your best model:
#* While these graphs are not a perfect representation of the effects of each variable, particularly if
#* there are strong interactions in the data or the predictors are strongly correlated, they provide a
#* useful basis for interpretation.

gbm.plot(brt, n.plots = 8, plot.layout = c(2, 4), write.title = FALSE)

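#* As an optional complement, dismo also provides gbm.plot.fits(), which plots the fitted values
#* against each predictor rather than the partial dependence curves:
gbm.plot.fits(brt)
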
#### IDENTIFYING IMPORTANT INTERACTIONS: IN OUR CASE WE WON'T USE THIS.
#* tc controls the maximum level of interaction that can be quantified, but no information is
#* provided automatically on the nature and magnitude of the fitted interaction effects.
#* To quantify these, we use a function that creates, for each possible pair of predictors, a temporary grid of
#* variables representing combinations of values at fixed intervals along each of their ranges.
# Once identified, interactions can be visualised with joint partial dependence plots.

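#* Although not used in our case, the corresponding dismo functions are sketched below.
#* The pair of predictors passed to gbm.perspec() is a placeholder chosen for illustration only.
ints <- gbm.interactions(brt) # quantifies pairwise interaction strength over a temporary grid
ints$interactions             # matrix of interaction sizes
ints$rank.list                # strongest pairs listed first
gbm.perspec(brt, 1, 2)        # joint partial dependence plot for the 1st and 2nd predictors in vars
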
### PREDICTIVE PERFORMANCE:
#* Prediction to any given site uses the final model, and consists of the sum of predictions
#* from all trees multiplied by the learning rate.
#* Predicting based on the testing dataset gives you a measure of the accuracy of your model.

## Model evaluation
# We will use the testing data to evaluate the models
# Load data
wd <- paste0(output_data, "/training_testing")
setwd(wd)
test <- read.csv("test_dataset.csv", sep = ";")
names(test)
head(test)

# Set categorical predictors as categories: none
# Set continuous predictors as numerical:
test <- test %>%
  mutate(TL = as.numeric(gsub(",", ".", TL)), # convert comma decimals to dot decimals
         diff_at_sbt = as.numeric(gsub(",", ".", diff_at_sbt)),
         LN_MinsExposedtoAir = as.numeric(gsub(",", ".", LN_MinsExposedtoAir)),
         Average_speed = as.numeric(gsub(",", ".", Average_speed)),
         LN_TotalBiomassHaul = as.numeric(gsub(",", ".", LN_TotalBiomassHaul)),
         Trawl_duration = as.numeric(gsub(",", ".", Trawl_duration)),
         RN = as.numeric(gsub(",", ".", RN)),
         Alive_Dead = as.numeric(gsub(",", ".", Alive_Dead)),
         Cloud.cover = as.numeric(gsub(",", ".", Cloud.cover)))
head(test)

test <- test[, !names(test) %in% c("tripID", "organismID", "hour", "TotalBiomassHaul", "DW", "Sex", "Sea.state", "MinsExposedtoAir",
                                   "time", "time_hours", "at_celsius", "bathy", "sbt_merged")]
summary(test)

## Predict on the testing dataset
test$brt_pred <- gbm::predict.gbm(object = brt, newdata = test,
                                  n.trees = brt$gbm.call$best.trees, type = "response")

# Evaluate (dismo::evaluate compares predictions at presences vs. absences)
e <- evaluate(p = as.numeric(test$brt_pred[test$Alive_Dead == 1]),
              a = as.numeric(test$brt_pred[test$Alive_Dead == 0]))
e

# RESULTS:
# class          : ModelEvaluation
# n presences    : 335
# n absences     : 61
# AUC            : 0.8651333
# cor            : 0.5420899
# max TPR+TNR at : 0.8106794

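#* The "max TPR+TNR" threshold reported above can also be extracted directly. A sketch using
#* dismo::threshold(); 'spec_sens' is assumed to be the statistic matching that criterion:
threshold(e, stat = "spec_sens")
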
# Plot evaluation results
par(mfrow = c(1, 3))
plot(e, 'ROC')
density(e)
boxplot(e, col = c('blue', 'red'))

# ROC plot: if the curve rises quickly towards 1, the AUC is very high, which is very good.
# Density plot: this is like what we saw before.
# Boxplot: predicted values for absences and presences in the tested dataset (note copied from the workshop):
# almost all values above 0.4 correspond to presence records, while for absences nearly all points fall at
# a very low predicted habitat-suitability value, which is very good.
