---
title: "lasso"
author: "liuc"
date: '2022-05-31'
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## lasso model
Reproducing a published LASSO clinical prediction model:
Screening the Influence of Biomarkers for Metabolic Syndrome in Occupational Population Based on the Lasso Algorithm
The reproduction covers: the LASSO coefficient path plot, the LASSO cross-validation plot, a waterfall plot (showing which factors are shrunk to zero and which are selected into the model), screening of independent factors, a logistic regression refit on the 10 selected factors to predict probabilities, a nomogram, a forest plot, a calibration curve, and a DCA curve.
Ridge regression shrinks all coefficients towards zero, but lasso regression has the potential to remove predictors from the model by shrinking the coefficients completely to zero.
LASSO (Least Absolute Shrinkage and Selection Operator) regression is a statistical method for feature selection and regression analysis. It combines the coefficient shrinkage of ridge regression with the variable selection of stepwise regression: by constraining the regression coefficients it both regularizes the model and selects variables.
The goal of LASSO regression is to find the best linear regression model for a given set of predictors. Unlike ordinary least squares, LASSO adds an L1 regularization penalty on the absolute values of the coefficients. This penalty drives some coefficients exactly to zero, which amounts to automatic feature selection: variables with a substantial effect on the outcome are retained, and variables with little effect are excluded.
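For concreteness, the lasso estimate solves the penalized least-squares problem (in `glmnet`'s parameterization):

$$
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\Big(y_i - \beta_0 - x_i^{\top}\beta\Big)^2 + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert
$$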
*Does LASSO require the variables to be standardized?* Manual standardization does not appear to be necessary: `glmnet` standardizes the predictors internally by default (`standardize = TRUE`) and reports the coefficients back on the original scale.
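A minimal sketch of the relevant argument (not run here; `X` and `Y` are built later in this document):

```{r, eval=FALSE}
# glmnet standardizes internally by default and reports coefficients on the
# original scale; set standardize = FALSE if you have already scaled X yourself
fit <- glmnet(X, Y, alpha = 1, standardize = TRUE)
```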
```{r, include=FALSE}
library(tidyverse)
library(tidymodels)
library(glmnet)
library(rms)
library(foreign)
library(visreg) # displaying the results of a fitted model in terms of how a predictor variable x affects an outcome y.
library(vip)
```
1. Data cleaning and preparation
Since the paper does not provide its data, we use the IMDB ratings dataset for The Office (a regression problem).
The bdiag.csv dataset can be used for the classification case; its variable `diagnosis` classifies the biopsied tissue as M = malignant or B = benign.
```{r}
ratings_raw <- readr::read_csv("./datasets/office_ratings.csv")
remove_regex <- "[:punct:]|[:digit:]|parts |part |the |and"
office_ratings <- ratings_raw %>%
transmute(
episode_name = str_to_lower(title),
episode_name = str_remove_all(episode_name, remove_regex),
episode_name = str_trim(episode_name),
imdb_rating
)
office_info <- schrute::theoffice %>%
mutate(
season = as.numeric(season),
episode = as.numeric(episode),
episode_name = str_to_lower(episode_name),
episode_name = str_remove_all(episode_name, remove_regex),
episode_name = str_trim(episode_name)
) %>%
select(season, episode, episode_name, director, writer, character)
characters <- office_info %>%
count(episode_name, character) %>%
add_count(character, wt = n, name = "character_count") %>%
filter(character_count > 800) %>%
select(-character_count) %>%
pivot_wider(
names_from = character,
values_from = n,
values_fill = list(n = 0)
)
creators <- office_info %>%
distinct(episode_name, director, writer) %>%
pivot_longer(director:writer, names_to = "role", values_to = "person") %>%
separate_rows(person, sep = ";") %>%
add_count(person) %>%
filter(n > 10) %>%
distinct(episode_name, person) %>%
mutate(person_value = 1) %>%
pivot_wider(
names_from = person,
values_from = person_value,
values_fill = list(person_value = 0)
)
office <- office_info %>%
distinct(season, episode, episode_name) %>%
inner_join(characters) %>%
inner_join(creators) %>%
inner_join(office_ratings %>%
select(episode_name, imdb_rating)) %>%
janitor::clean_names()
office
```
Feature engineering is straightforward with tidymodels; we skip that part for now.
*How does LASSO handle unordered multi-category variables?* LASSO cannot take an unordered multi-category variable directly. Continuous variables can be entered as they are, and ordinal variables can be entered as their ranks. An unordered multi-category variable, however, should neither be dummy-coded in the conventional sense nor be entered as if it were ordinal; it must be split into several standalone binary variables, which is not the same thing as dummy coding (see the sketch below). For example, blood type (A, B, O, AB) should enter a LASSO as four binary variables: A/non-A, B/non-B, O/non-O and AB/non-AB. In linear, logistic or Cox regression, dummy variables follow an "all in, all out" rule: if one dummy is statistically significant, the remaining dummies must stay in the model as well. In LASSO, by contrast, the four binary variables above are treated as separate, independent predictors; each one stays in the model if it earns its place and is dropped otherwise. One of LASSO's main purposes is factor screening: it does not care about the substantive coherence of the predictors entered, only about achieving the best predictive performance.
There are then two ways to proceed. First: use LASSO to screen variables, then fit a multivariable logistic (or Cox) regression on the selected set; at that stage the researcher may re-enter the whole multi-category variable, conventionally dummy-coded, whenever one of its split binaries was selected (e.g. if only A/non-A was selected for blood type, put blood type back into the multivariable model as dummies), build the model, and predict probabilities (P1). Second: use the LASSO coefficients directly to build the equation and predict the outcome probability (P2).
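A minimal sketch of this coding on a hypothetical `blood_type` factor (not part of the Office data):

```{r, eval=FALSE}
blood_type <- factor(c("A", "B", "O", "AB", "A", "O"))

# LASSO-style coding: one standalone indicator per level, each free to
# enter or leave the model on its own
lasso_bins <- data.frame(
  is_A  = as.integer(blood_type == "A"),   # A vs non-A
  is_B  = as.integer(blood_type == "B"),   # B vs non-B
  is_O  = as.integer(blood_type == "O"),   # O vs non-O
  is_AB = as.integer(blood_type == "AB")   # AB vs non-AB
)

# Conventional dummy coding for comparison: one reference level is dropped,
# and the remaining dummies must stay in or leave the model together
glm_dummies <- model.matrix(~ blood_type)[, -1]
```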
```{r}
dat <- office %>%
  select(-episode_name) %>%
  as.data.frame()

set.seed(42) # make the split reproducible
index <- sample(1:nrow(dat), 0.7 * nrow(dat))
train <- dat[index, ]  # training data
test  <- dat[-index, ] # test data
dim(train)
dim(test)

# scale selected columns: fit the scaling on train, apply it to both sets
cols <- c('andy', 'angela', 'darryl', 'dwight')
pre_proc_val <- caret::preProcess(train[, cols], method = c("center", "scale"))
train[, cols] <- predict(pre_proc_val, train[, cols])
test[, cols]  <- predict(pre_proc_val, test[, cols])
summary(train)

# glmnet takes a numeric matrix X and a response vector Y, not a formula.
# Note: X and Y are built from the full data, so the fits below use all of
# dat rather than the train/test split.
X <- model.matrix(imdb_rating ~ ., data = dat)[, -1]
Y <- dat[, "imdb_rating"]
```
First we need to find the amount of penalty, λ, by cross-validation.
`glmnet` expects a somewhat unusual input format: a numeric predictor matrix plus a response vector rather than a formula.
The regression families supported by `family` include 'gaussian', 'poisson', 'binomial', 'multinomial', 'cox', and others; it appears to support the same families that `glm` supports.
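For instance, a binomial lasso, closer to the logistic setting of the paper being reproduced, would look like this (a sketch; `y_binary` is a hypothetical 0/1 outcome vector):

```{r, eval=FALSE}
# lasso-penalized logistic regression; for binomial models cv.glmnet can
# also use AUC as the cross-validation metric
fit_logit <- glmnet::glmnet(x = X, y = y_binary, alpha = 1, family = "binomial")
cv_logit  <- glmnet::cv.glmnet(x = X, y = y_binary, alpha = 1,
                               family = "binomial", type.measure = "auc")
```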
```{r}
# Penalty type: alpha = 1 is the lasso, alpha = 0 is the ridge
set.seed(123) # make the CV folds reproducible
cv.lambda.lasso <- glmnet::cv.glmnet(x = X,
                                     y = Y,
                                     nfolds = 10,
                                     alpha = 1)
plot(cv.lambda.lasso) # MSE across the grid of lambdas

# Lasso path
plot(cv.lambda.lasso$glmnet.fit,
     "lambda", label = FALSE)
```
*Interpretation:* the left plot shows that the MSE is smallest with about 20 variables. The right plot traces the coefficients as lambda increases; the variables whose coefficients are driven to zero are the ones penalized out of the model.
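`cv.glmnet` also records `lambda.1se`, the largest lambda whose CV error is within one standard error of the minimum; choosing it gives a sparser, more conservative model than `lambda.min`:

```{r}
cv.lambda.lasso$lambda.min # lambda with the smallest cross-validated MSE
cv.lambda.lasso$lambda.1se # largest lambda within 1 SE of that minimum
```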
```{r}
# find the coefficients of the best model
best_lambda <- cv.lambda.lasso$lambda.min
cat("Best lambda selected by cross-validation:", best_lambda, "\n")

# refit the lasso at the selected lambda
best_model <- glmnet(X, Y,
                     alpha = 1, # alpha = 1 is the lasso penalty, alpha = 0 the ridge
                     lambda = best_lambda,
                     family = 'gaussian')

coef(best_model)  # coefficients, including the intercept
best_model$beta   # predictor coefficients only (no intercept)
```
```{r}
plot(best_model)
# lambda is a single value here, so there is no path to trace
plot(best_model, xvar = 'lambda', label = TRUE)
```
*Interpretation:* the numbers on the plot are the indices of the variables that remain.
```{r}
# list the surviving variables: tidy() reports the nonzero coefficients
broom::tidy(best_model)
```
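The same selection can be read off in base R by keeping the rows of the coefficient matrix that are nonzero:

```{r}
coefs <- as.matrix(coef(best_model))
rownames(coefs)[coefs != 0] # variables (and intercept) with nonzero coefficients
```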
Having obtained *best_model*, we can display variable importance:
```{r}
# vi_model() returns an Importance value for every variable;
# an Importance of 0 does not correspond one-to-one with the Sign column
vi_model(best_model)
vip::vip(best_model)
```
`visreg` does not appear to work with `glmnet` fits:
```{r, eval=FALSE}
# not run: visreg has no method for glmnet objects, so this errors
visreg::visreg(best_model, 'andy')
```
```{r}
# predictions at the fitted lambda for the first six rows of X
predict(best_model, newx = X |> head())
```
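As a quick sanity check, the RMSE can be computed from the predictions (note this is in-sample error, since `best_model` was fit on all of `dat`, not on the train/test split above):

```{r}
preds <- predict(best_model, newx = X)
sqrt(mean((Y - preds)^2)) # in-sample RMSE
```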
DCA curve: for the regression problem
```{r}
```
### lasso model by tidymodels
Build a LASSO prediction model with `tidymodels` and extract a lasso model similar to the one above.
```{r}
set.seed(456) # make the split reproducible
office_split <- initial_split(office, strata = season)
office_train <- training(office_split)
office_test <- testing(office_split)

office_rec <- recipe(imdb_rating ~ ., data = office_train) %>%
  update_role(episode_name, new_role = "ID") %>%
  step_zv(all_numeric(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes())

office_prep <- office_rec %>%
  prep(strings_as_factors = FALSE)
```
```{r}
lasso_spec <- linear_reg(penalty = 0.1, mixture = 1) %>%
  set_engine("glmnet")

wf <- workflow() %>%
  add_recipe(office_rec)

lasso_fit <- wf %>%
  add_model(lasso_spec) %>%
  fit(data = office_train)

lasso_fit %>%
  extract_fit_parsnip() %>% # pull_workflow_fit() is deprecated
  tidy()
```
Tune lasso parameters
```{r}
set.seed(1234)
office_boot <- bootstraps(office_train, strata = season)

tune_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

lambda_grid <- grid_regular(penalty(), levels = 50)

doParallel::registerDoParallel()

set.seed(2020)
lasso_grid <- tune_grid(
  wf %>% add_model(tune_spec),
  resamples = office_boot,
  grid = lambda_grid
)
```
```{r}
lasso_grid %>%
  collect_metrics()
```
We can visualize how performance varies with the regularization parameter:
```{r}
lasso_grid %>%
  collect_metrics() %>%
  ggplot(aes(penalty, mean, color = .metric)) +
  geom_errorbar(aes(
    ymin = mean - std_err,
    ymax = mean + std_err
  ),
  alpha = 0.5
  ) +
  geom_line(linewidth = 1.5) +
  facet_wrap(~.metric, scales = "free", nrow = 2) +
  scale_x_log10() +
  theme(legend.position = "none")
```
```{r}
lowest_rmse <- lasso_grid %>%
  select_best(metric = "rmse")

final_lasso <- finalize_workflow(
  wf %>% add_model(tune_spec),
  lowest_rmse
)
```
Let's see which variables are most important, using the vip package:
```{r}
final_lasso %>%
  fit(office_train) %>%
  extract_fit_parsnip() %>% # pull_workflow_fit() is deprecated
  vi(lambda = lowest_rmse$penalty) %>%
  mutate(
    Importance = abs(Importance),
    Variable = fct_reorder(Variable, Importance)
  ) %>%
  ggplot(aes(x = Importance, y = Variable, fill = Sign)) +
  geom_col() +
  scale_x_continuous(expand = c(0, 0)) +
  labs(y = NULL)
```
```{r}
last_fit(
  final_lasso,
  office_split
) %>%
  collect_metrics()
```