diff --git a/causal-inference-for-the-brave-and-true/05-The-Unreasonable-Effectiveness-of-Linear-Regression.ipynb b/causal-inference-for-the-brave-and-true/05-The-Unreasonable-Effectiveness-of-Linear-Regression.ipynb index c0ba430..78bf224 100644 --- a/causal-inference-for-the-brave-and-true/05-The-Unreasonable-Effectiveness-of-Linear-Regression.ipynb +++ b/causal-inference-for-the-brave-and-true/05-The-Unreasonable-Effectiveness-of-Linear-Regression.ipynb @@ -106,7 +106,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "That's quite amazing. We are not only able to estimate the ATE, but we also get, for free, confidence intervals and P-Values out of it! More than that, we can see that regression is doing exactly what it supposed to do: comparing $E[Y|T=0]$ and $E[Y|T=1]$. The intercept is exactly the sample mean when $T=0$, $E[Y|T=0]$, and the coefficient of the online format is exactly the sample difference in means $E[Y|T=1] - E[Y|T=0]$. Don't trust me? No problem. You can see for yourself:" + "That's quite amazing. We are not only able to estimate the ATE, but we also get, for free, confidence intervals and P-Values out of it! More than that, we can see that regression is doing exactly what it is supposed to do: comparing $E[Y|T=0]$ and $E[Y|T=1]$. The intercept is exactly the sample mean when $T=0$, $E[Y|T=0]$, and the coefficient of the online format is exactly the sample difference in means $E[Y|T=1] - E[Y|T=0]$. Don't trust me? No problem. You can see for yourself:" ] }, { @@ -225,8 +225,8 @@ } ], "source": [ - "kapa = data[\"falsexam\"].cov(data[\"format_ol\"]) / data[\"format_ol\"].var()\n", - "kapa" + "kappa = data[\"falsexam\"].cov(data[\"format_ol\"]) / data[\"format_ol\"].var()\n", + "kappa" ] }, { @@ -417,7 +417,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Of course, it is not because we can estimate this simple model that it's correct. Notice how I was carefully with my words saying it **predicts** wage from education. 
I never said that this prediction was causal. In fact, by now, you probably have very serious reasons to believe this model is biased. Since our data didn't come from a random experiment, we don't know if those that got more education are comparable to those who got less. Going even further, from our understanding of how the world works, we are very certain that they are not comparable. Namely, we can argue that those with more years of education probably have richer parents, and that the increase we are seeing in wages as we increase education is just a reflection of how the family wealth is associated with more years of education. Putting it in math terms, we think that $E[Y_0|T=0] < E[Y_0|T=1]$, that is, those with more education would have higher income anyway, even without so many years of education. If you are really grim about education, you can argue that it can even *reduce* wages by keeping people out of the workforce and lowering their experience.\n", + "Of course, just because we can estimate this simple model doesn't mean it's correct. Notice how I was careful with my words saying it **predicts** wage from education. I never said that this prediction was causal. In fact, by now, you probably have very serious reasons to believe this model is biased. Since our data didn't come from a random experiment, we don't know if those that got more education are comparable to those who got less. Going even further, from our understanding of how the world works, we are very certain that they are not comparable. Namely, we can argue that those with more years of education probably have richer parents, and that the increase we are seeing in wages as we increase education is just a reflection of how the family wealth is associated with more years of education. Putting it in math terms, we think that $E[Y_0|T=0] < E[Y_0|T=1]$, that is, those with more education would have higher income anyway, even without so many years of education. 
If you are really grim about education, you can argue that it can even *reduce* wages by keeping people out of the workforce and lowering their experience.\n", "\n", "Fortunately, in our data, we have access to lots of other variables. We can see the parents' education `meduc`, `feduc`, the `IQ` score for that person, the number of years of experience `exper` and the tenure of the person in his or her current company `tenure`. We even have some dummy variables for marriage and black ethnicity. " ] @@ -1023,7 +1023,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Regression, on the other hand, does so by comparing the effect of $T$ while maintaining the confounder $W$ set to a fixed level. With regression, it is not the case that W cease to cause T and Y. It is just that it is held fixed, so it can't influence changes on T and Y." + "Regression, on the other hand, does so by comparing the effect of $T$ while maintaining the confounder $W$ set to a fixed level. With regression, it is not the case that W ceases to cause T and Y. It is just that it is held fixed, so it can't influence changes on T and Y." ] }, { @@ -1121,7 +1121,7 @@ "\n", "## Key Ideas\n", "\n", - "We've covered a lot of ground with regression. We saw how regression can be used to perform A/B testing and how it conveniently gives us confidence intervals. Then, we moved to study how regression solves a prediction problem and it is the best linear approximation to the conditional expectation function (CEF). We've also discussed how, in the bivariate case, the regression treatment coefficient is the covariance between the treatment and the outcome divided by the variance of the treatment. Expanding to the multivariate case, we figured out how regression gives us a partialling out interpretation of the treatment coefficient: it can be interpreted as how the outcome would change with the treatment while keeping all other included variables constant. 
This is what economists love to refer as *ceteris paribus*.\n", + "We've covered a lot of ground with regression. We saw how regression can be used to perform A/B testing and how it conveniently gives us confidence intervals. Then, we moved to study how regression solves a prediction problem and is the best linear approximation to the conditional expectation function (CEF). We've also discussed how, in the bivariate case, the regression treatment coefficient is the covariance between the treatment and the outcome divided by the variance of the treatment. Expanding to the multivariate case, we figured out how regression gives us a partialling out interpretation of the treatment coefficient: it can be interpreted as how the outcome would change with the treatment while keeping all other included variables constant. This is what economists love to refer to as *ceteris paribus*.\n", "\n", "Finally, we took a turn to understanding bias. We saw how `Short equals long plus the effect of omitted times the regression of omitted on included`. This shed some light to how bias comes to be. We discovered that the source of omitted variable bias is confounding: a variable that affects both the treatment and the outcome. Lastly, we used causal graphs to see how RCT and regression fixes confounding.\n", "\n", diff --git a/causal-inference-for-the-brave-and-true/06-Grouped-and-Dummy-Regression.ipynb b/causal-inference-for-the-brave-and-true/06-Grouped-and-Dummy-Regression.ipynb index 7862b84..7c8f2fc 100644 --- a/causal-inference-for-the-brave-and-true/06-Grouped-and-Dummy-Regression.ipynb +++ b/causal-inference-for-the-brave-and-true/06-Grouped-and-Dummy-Regression.ipynb @@ -9,7 +9,7 @@ "\n", "## Regression With Grouped Data\n", "\n", - "Not all data points are created equal. If we look again at our ENEM dataset, we trust the scores of big schools much more than the scores from small schools. This is not to say that big schools are better or anything. 
It is just due to the fact that their big size imply less variance." + "Not all data points are created equal. If we look again at our ENEM dataset, we trust the scores of big schools much more than the scores from small schools. This is not to say that big schools are better or anything. It is just due to the fact that their big size implies less variance." ] }, { @@ -968,7 +968,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "First of all, notice how this removes any assumption about the functional form of how education affects wages. We don't need to worry about logs anymore. In essence, this model is completely non-parametric. All it does is compute sample averages of wage for each year of education. This can be seen in the plot above, where the fitted line doesn't have a particular form. Instead, is the interpolation of the sample means for each year of education. We can also see that by reconstructing one parameter, for instance, that of 17 years of education. For this model, it's `9.5905`. Below, we can see how it is just the difference between the baseline years of education (9) and the individuals with 17 years\n", + "First of all, notice how this removes any assumption about the functional form of how education affects wages. We don't need to worry about logs anymore. In essence, this model is completely non-parametric. All it does is compute sample averages of wage for each year of education. This can be seen in the plot above, where the fitted line doesn't have a particular form. Instead, it is the interpolation of the sample means for each year of education. We can also see that by reconstructing one parameter, for instance, that of 17 years of education. For this model, it's `9.5905`. 
Below, we can see how it is just the difference between the sample mean for the baseline years of education (9) and that for individuals with 17 years:\n", "\n", "$\n", "\\beta_{17} = E[Y|T=17]-E[Y|T=9]\n", "$\n", @@ -1002,7 +1002,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "If we include more dummy covariates in the model, the parameter on education become a weighted average of the effect on each dummy group:\n", + "If we include more dummy covariates in the model, the parameter on education becomes a weighted average of the effect on each dummy group:\n", "\n", "$\n", "E\\{ \\ (E[Y_i|T=1, Group_i] - E[Y_i|T=0, Group_i])w(Group_i) \\ \\}\n", "$\n", diff --git a/causal-inference-for-the-brave-and-true/07-Beyond-Confounders.ipynb b/causal-inference-for-the-brave-and-true/07-Beyond-Confounders.ipynb index e513678..8b9b26f 100644 --- a/causal-inference-for-the-brave-and-true/07-Beyond-Confounders.ipynb +++ b/causal-inference-for-the-brave-and-true/07-Beyond-Confounders.ipynb @@ -13,7 +13,7 @@ "\n", "As a motivating example, let's suppose you are a data scientist in the collections team of a fintech. Your next task is to figure out the impact of sending an email asking people to negotiate their debt. Your response variable is the amount of payments from the late customers.\n", "\n", - "To answer this question, your team selects 5000 random customers from your late customers base to do a random test. For every customer, you flip a coin, if its heads, the customer receives the email; otherwise, it is left as a control. With this test, you hope to find out how much extra money the email generates." + "To answer this question, your team selects 5000 random customers from your late customers base to do a random test. For every customer, you flip a coin; if it's heads, the customer receives the email; otherwise, it is left as a control. With this test, you hope to find out how much extra money the email generates." 
] }, { @@ -1196,7 +1196,7 @@ "source": [ "## Bad Controls - Selection Bias\n", "\n", - "Let's go back to the collections email example. Remember that the email was randomly assigned to customers. We've already explained what `credit_limit` and `risk_score` is. Now, let's look at the remaining variables. `opened` is a dummy variable for the customer opening the email or not. `agreement` is another dummy marking if the customers contacted the collections department to negotiate their debt, after having received the email. Which of the following models do you think is more appropriate? The first is a model with the treatment variable plus `credit_limit` and `risk_score`; the second adds `opened` and `agreement` dummies." + "Let's go back to the collections email example. Remember that the email was randomly assigned to customers. We've already explained what `credit_limit` and `risk_score` are. Now, let's look at the remaining variables. `opened` is a dummy variable for the customer opening the email or not. `agreement` is another dummy marking if the customers contacted the collections department to negotiate their debt, after having received the email. Which of the following models do you think is more appropriate? The first is a model with the treatment variable plus `credit_limit` and `risk_score`; the second adds `opened` and `agreement` dummies." ] }, { @@ -1484,11 +1484,11 @@ "email -> opened -> agreement -> payment \n", "$\n", "\n", - "We also think that different levels of risk and line have different propensity of doing an agreement, so we will mark them as also causing agreement. As for email and agreement, we could make an argument that some people just read the subject of the email and that makes them more likely to make an agreement. 
The point is that email could also cause agreement without passing through open.\n", + "We also think that different levels of risk and limit have different propensities to make an agreement, so we will mark them as also causing agreement. As for email and agreement, we could make an argument that some people just read the subject of the email and that makes them more likely to make an agreement. The point is that email could also cause agreement without passing through open.\n", "\n", "What we notice with this graph is that opened and agreement are both in the causal path from email to payments. So, if we control for them with regression, we would be saying \"this is the effect of email while keeping `opened` and `agreement` fixed\". However, both are part of the causal effect of the email, so we don't want to hold them fixed. Instead, we could argue that email increases payments precisely because it boosts the agreement rate. If we fix those variables, we are removing some of the true effect from the email variable. \n", "\n", - "With potential outcome notation, we can say that, due to randomization $E[Y_0|T=0] = E[Y_0|T=1]$. However, even with randomization, when we control for agreement, treatment and control are no longer comparable. In fact, with some intuitive thinking, we can even guess how they are different:\n", + "With potential outcomes notation, we can say that, due to randomization, $E[Y_0|T=0] = E[Y_0|T=1]$. However, even with randomization, when we control for agreement, treatment and control are no longer comparable. In fact, with some intuitive thinking, we can even guess how they are different:\n", "\n", "\n", "$\n", @@ -1507,13 +1507,13 @@ "\n", "![img](./data/img/beyond-conf/selection.png)\n", "\n", - "Selection bias is so pervasive that not even randomization can fix it. Better yet, it is often introduced by the ill advised, even in random data! Spotting and avoiding selection bias requires more practice than skill. 
Often, they appear underneath some supposedly clever idea, making it even harder to uncover. Here are some examples of selection biased I've encountered:\n", + "Selection bias is so pervasive that not even randomization can fix it. Better yet, it is often introduced by the ill-advised, even in random data! Spotting and avoiding selection bias requires more practice than skill. Often, it appears underneath some supposedly clever idea, making it even harder to uncover. Here are some examples of selection biases I've encountered:\n", "\n", - " 1. Adding a dummy for paying the entire debt when trying to estimate the effect of a collections strategy on payments.\n", - " 2. Controlling for white vs blue collar jobs when trying to estimate the effect of schooling on earnings\n", - " 3. Controlling for conversion when estimating the impact of interest rates on loan duration\n", - " 4. Controlling for marital happiness when estimating the impact of children on extramarital affairs\n", - " 5. Breaking up payments modeling E[Payments] into one binary model that predict if payment will happen and another model that predict how much payment will happen given that some will: E[Payments|Payments>0]*P(Payments>0)\n", + "1. Adding a dummy for paying the entire debt when trying to estimate the effect of a collections strategy on payments.\n", + "2. Controlling for white vs blue collar jobs when trying to estimate the effect of schooling on earnings.\n", + "3. Controlling for conversion when estimating the impact of interest rates on loan duration.\n", + "4. Controlling for marital happiness when estimating the impact of children on extramarital affairs.\n", + "5. Breaking up payments modeling $E[Payments]$ into one binary model that predicts if payment will happen and another model that predicts how much payment will happen given that some will: $E[Payments|Payments>0]*P(Payments>0)$.\n", " \n", "What is notable about all these ideas is how reasonable they sound. 
Selection bias often does. Let this be a warning. As a matter of fact, I myself have fallen into the traps above many many times before I learned how bad they were. One in particular, the last one, deserves further explanation because it looks so clever and catches lots of data scientists off guard. It's so pervasive that it has its own name: **The Bad COP**!\n", "\n", @@ -1595,7 +1595,7 @@ "\\end{align*} \n", "$$\n", " \n", - "where the second equality comes after we add and subtract $E[Y_{i0}|Y_{i1}>0]$. When we break up the COP effect, we get first the causal effect on the participant subpopulation. In our example, this would be the causal effect on those that decide to spend something. Second, we get a bias term which is the difference in $Y_0$ for those that decide to participate when assigned to the treatment ($E[Y_{i0}|Y_{i1}>0]$) and those that that participate even without the treatment ($E[Y_{i0}|Y_{i0}>0]$). In our case, this bias is probably negative, since those that spend when assigned to the treatment, had they not received the treatment, would probably spend less than those that spend even without the treatment $E[Y_{i0}|Y_{i1}>0] < E[Y_{i0}|Y_{i0}>0]$.\n", + "where the second equality comes after we add and subtract $E[Y_{i0}|Y_{i1}>0]$. When we break up the COP effect, we get first the causal effect on the participant subpopulation. In our example, this would be the causal effect on those that decide to spend something. Second, we get a bias term which is the difference in $Y_0$ for those that decide to participate when assigned to the treatment ($E[Y_{i0}|Y_{i1}>0]$) and those that would participate even without the treatment ($E[Y_{i0}|Y_{i0}>0]$). 
In our case, this bias is probably negative, since those that spend when assigned to the treatment, had they not received the treatment, would probably spend less than those that spend even without the treatment: $E[Y_{i0}|Y_{i1}>0] < E[Y_{i0}|Y_{i0}>0]$.\n", " \n", "![img](./data/img/beyond-conf/cop.png)\n", " \n",