|
| 1 | +--- |
| 2 | +title: "What if?" |
| 3 | +subtitle: "Cap11: Why model?" |
| 4 | +author: "Grupo top de IC" |
| 5 | +institute: "UnB" |
| 6 | +format: |
| 7 | + revealjs: |
| 8 | + embed-resources: true |
| 9 | + multiplex: true |
| 10 | + incremental: true |
| 11 | + logo: logo.jpeg |
| 12 | + scrollable: true |
| 13 | + highlight-style: arrow |
| 14 | + transition: fade |
| 15 | +--- |
| 16 | + |
| 17 | +## Capítulo 11 |
| 18 | + |
| 19 | +- Part I conceitual, estimadores não paramétricos |
| 20 | +- Part II dados "reais", estimadores paramétricos |
| 21 | + |
| 22 | +- We define nonparametric estimators: those that produce estimates from the data without any a priori restrictions on the conditional mean function: Part I |
| 23 | + |
| 24 | +- Parametric estimation and other approaches to borrow information are our only hope when data are unable to speak for themselves. |
| 25 | + |
| 26 | + |
| 27 | +## Data cannot speak for themselves |
| 28 | + |
| 29 | +- 16 individuals infected with HIV randomly sampled from a larger target population. |
| 30 | + |
| 31 | +- Each individual receives a certain level of antiretroviral therapy. At the end CD4 cell count, in cells/mm3 is measured: We wish to consistently estimate the mean of cell counts for individuals with level A=a. |
| 32 | + |
| 33 | +. . . |
| 34 | + |
| 35 | +$\hat{E}=[Y|A=a]$ é um estimador consistente. |
| 36 | + |
| 37 | +```{r eval=F} |
| 38 | +#install.packages("gfoRmula") |
| 39 | +
|
| 40 | +library(tidyverse) |
| 41 | +dataurls <- list() |
| 42 | +stub <- "https://cdn1.sph.harvard.edu/wp-content/uploads/sites/1268/" |
| 43 | +dataurls[[1]] <- paste0(stub, "2012/10/nhefs_sas.zip") |
| 44 | +dataurls[[2]] <- paste0(stub, "2012/10/nhefs_stata.zip") |
| 45 | +dataurls[[3]] <- paste0(stub, "2017/01/nhefs_excel.zip") |
| 46 | +dataurls[[4]] <- paste0(stub, "1268/20/nhefs.csv") |
| 47 | +
|
| 48 | +temp <- tempfile() |
| 49 | +for (i in 1:3) { |
| 50 | + download.file(dataurls[[i]], temp) |
| 51 | + unzip(temp, exdir = "data") |
| 52 | +} |
| 53 | +
|
| 54 | +download.file(dataurls[[4]], here("data", "nhefs.csv")) |
| 55 | +``` |
| 56 | + |
| 57 | +## Tratamentos |
| 58 | + |
| 59 | +::: panel-tabset |
| 60 | +## 2 categorias |
| 61 | + |
| 62 | +```{r echo=F} |
| 63 | +library(tidyverse) |
| 64 | +A <- c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0) |
| 65 | +Y <- c(200, 150, 220, 110, 50, 180, 90, 170, 170, 30, |
| 66 | + 70, 110, 80, 50, 10, 20) |
| 67 | +
|
| 68 | +ggplot()+ |
| 69 | + geom_point(aes(A, Y))+ |
| 70 | + theme_classic(base_size = 14)+ |
| 71 | + scale_x_continuous(breaks = c(0,1))+ |
| 72 | + geom_label(aes(x=c(0,1), y=c(200, 30), label=c(67.5, |
| 73 | + 146.2))) |
| 74 | +``` |
| 75 | + |
| 76 | +```{r} |
| 77 | +
|
| 78 | +summary(Y[A == 0]) |
| 79 | +summary(Y[A == 1]) |
| 80 | +``` |
| 81 | + |
| 82 | +## 4 categorias |
| 83 | + |
| 84 | +```{r include=F} |
| 85 | +library(tidyverse) |
| 86 | +A2 <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4) |
| 87 | +Y2 <- c(110, 80, 50, 40, 170, 30, 70, 50, 110, 50, 180, |
| 88 | + |
| 89 | + 130, 200, 150, 220, 210) |
| 90 | +m1 <- summary(Y2[A2 == 1])[[4]] |
| 91 | +m2 <-summary(Y2[A2 == 2 ])[[4]] |
| 92 | +m3 <- summary(Y2[A2 == 3 ])[[4]] |
| 93 | +m4 <- summary(Y2[A2 == 4 ])[[4]] |
| 94 | +``` |
| 95 | + |
| 96 | +```{r} |
| 97 | +ggplot()+ |
| 98 | + geom_point(aes(A2, Y2))+ |
| 99 | + theme_classic(base_size = 14)+ |
| 100 | + scale_x_continuous(breaks = c(0,1))+ |
| 101 | + geom_label(aes(x=c(1,2,3, 4), y=c(195), label=c(m1, m2, m3, m4))) |
| 102 | +``` |
| 103 | + |
| 104 | +## 0-100mg |
| 105 | + |
| 106 | +- Quanto é a predição para A=90? |
| 107 | + |
| 108 | +```{r echo=F} |
| 109 | +
|
| 110 | +A3 <-c(3, 11, 17, 23, 29, 37, 41, 53, 67, 79, 83, 97, 60, 71, 15, 45) |
| 111 | +Y3 <-c(21, 54, 33, 101, 85, 65, 157, 120, 111, 200, 140, 220, 230, 217, |
| 112 | + 11, 190) |
| 113 | +
|
| 114 | +ggplot()+ |
| 115 | + geom_point(aes(A3, Y3))+ |
| 116 | + theme_classic(base_size = 14) |
| 117 | +``` |
| 118 | +::: |
| 119 | + |
| 120 | +## Parametric estimators of the conditional mean |
| 121 | + |
| 122 | +$E[Y|A]= \theta_0 + \theta_1A$ |
| 123 | + |
| 124 | +- modelo linear, $\theta$ são parâmetros. |
| 125 | + |
| 126 | +. . . |
| 127 | + |
| 128 | +```{r echo=F} |
| 129 | +#install.packages("ggtext") |
| 130 | +library(ggtext) |
| 131 | +library(glue) |
| 132 | +modelo <- lm(Y3~A3) |
| 133 | +
|
| 134 | +a <- format(round(modelo$coefficients[[1]], 1), decimal.mark = ",") |
| 135 | +b <- format(round(modelo$coefficients[[2]],1), decimal.mark = ",") |
| 136 | +
|
| 137 | +theta0 <- modelo$coefficients[[1]] |
| 138 | +theta1 <- modelo$coefficients[[2]] |
| 139 | +
|
| 140 | +ggplot()+ |
| 141 | + geom_point(aes(A3, Y3))+ |
| 142 | + geom_smooth(aes(A3, Y3), |
| 143 | + method="lm", se=F, color="red")+ |
| 144 | + theme_classic(base_size = 14)+ |
| 145 | + geom_richtext(aes(50, 150, label=glue("E[Y|A] = {a} + {b}A ")),angle = 28) |
| 146 | +``` |
| 147 | + |
| 148 | +```{r} |
| 149 | +(reta_90 <- theta0 + theta1*90) |
| 150 | +``` |
| 151 | + |
| 152 | + - A certain degree of model misspecification is almost always expected. |
| 153 | + |
| 154 | + |
| 155 | + |
| 156 | +## Mais parametros |
| 157 | + |
| 158 | +::: panel-tabset |
| 159 | + |
| 160 | +## Três parametros |
| 161 | +```{r echo=F, message=FALSE, warning=FALSE} |
| 162 | +library(geomtextpath) |
| 163 | +library(latex2exp) |
| 164 | +#> Loading required package: ggplot2 |
| 165 | +a <- format(round(lm(Y3 ~ A3 + I(A3^2) )$coefficients[[1]],2),decimal.mark = ",") |
| 166 | +b <- format(round(lm(Y3 ~ A3 + I(A3^2) )$coefficients[[2]],2),decimal.mark = ",") |
| 167 | +c <- format(round(lm(Y3 ~ A3 + I(A3^2) )$coefficients[[3]],2),decimal.mark = ",") |
| 168 | +
|
| 169 | +theta0 <- lm(Y3 ~ A3 + I(A3^2) )$coefficients[[1]] |
| 170 | +theta1 <- lm(Y3 ~ A3 + I(A3^2) )$coefficients[[2]] |
| 171 | +theta2 <- lm(Y3 ~ A3 + I(A3^2) )$coefficients[[3]] |
| 172 | +
|
| 173 | +dado <-data.frame(A3, Y3) |
| 174 | +label <- expression("~A^2") |
| 175 | +
|
| 176 | +
|
| 177 | +ggplot(dado, aes(A3, Y3))+ |
| 178 | + geom_point()+ |
| 179 | + geom_textsmooth(aes(label = glue("E[Y|A] = {a} + {b}A + {c}A2")), |
| 180 | + method = "lm",formula=y ~ poly(x, 3), rich=TRUE, |
| 181 | + size = 5, linetype = 3, |
| 182 | + fontface = 2, linewidth = 1, color="darkred") |
| 183 | +``` |
| 184 | + |
| 185 | +```{r} |
| 186 | +(parabola_90 <- theta0 + theta1*90 + theta2*90^2) |
| 187 | +``` |
| 188 | + |
| 189 | + |
| 190 | +## 12 parametros |
| 191 | + |
| 192 | +```{r} |
| 193 | +ggplot(dado, aes(A3, Y3))+ |
| 194 | + geom_point()+ |
| 195 | + geom_smooth(method = "lm",formula=y ~ poly(x, 12), |
| 196 | + size = 5, linetype = 3, se=F, |
| 197 | + linewidth = 1, color="darkred") |
| 198 | +``` |
| 199 | + |
| 200 | +- Often modeling can be viewed as a procedure to transform noisy data into more or less smooth curves. |
| 201 | +::: |
| 202 | + |
| 203 | + |
| 204 | + |
| 205 | +## The bias-variance trade-off |
| 206 | + |
| 207 | +Qual o certo? |
| 208 | + |
| 209 | +```{r} |
| 210 | +paste(round(reta_90,2), "IC = 172.1 - 261.6)") |
| 211 | +paste(round(parabola_90 ,2), "IC = 142.8 - 251.5)") |
| 212 | +
|
| 213 | +``` |
| 214 | + |
| 215 | +- Acabou! |
0 commit comments