From 57d90da26d4289e299fe48d7c6e65f50a9bc7c62 Mon Sep 17 00:00:00 2001
From: Daniel Schwartz
Date: Wed, 1 Nov 2023 13:14:14 -0400
Subject: [PATCH 01/10] initial draft linear regression module

---
 .../python_linear_regression.md | 269 ++++++++++++++++++
 1 file changed, 269 insertions(+)
 create mode 100644 python_linear_regression/python_linear_regression.md

diff --git a/python_linear_regression/python_linear_regression.md b/python_linear_regression/python_linear_regression.md
new file mode 100644
index 000000000..8de89d902
--- /dev/null
+++ b/python_linear_regression/python_linear_regression.md
@@ -0,0 +1,269 @@


# Python Lesson on Regression for Machine Learning

@overview

## What is linear regression?
- Linear regression is a supervised machine learning algorithm that learns to predict a continuous target variable from one or more predictor variables. It models the relationship between the target variable and the predictor variables using a linear equation.
- In a supervised learning problem, the algorithm is given a set of training data, consisting of pairs of input and output variables, and asked to learn a function that maps the inputs to the output. Linear regression learns this function by finding the line of best fit through the data. Once the function has been learned, it can be used to make predictions on new data: the algorithm simply plugs the values of the input variables into the function.
- Linear regression is a popular supervised learning algorithm because it is simple to implement and understand, and versatile enough to be applied to a wide variety of problems.

Which of the following is NOT a characteristic of linear regression?


[( )] Linear regression models the relationship between the target variable and the predictor variables using a linear equation.
[( )] Linear regression is a supervised learning algorithm.
[( )] Linear regression is simple to implement and understand.
[(X)] Linear regression can be used to predict categorical variables.
[( )] Linear regression is a versatile algorithm that can be used to solve a variety of problems.
***
<div class = "answer">

Linear regression predicts a continuous target variable, so it cannot be used to predict categorical variables; predicting categories is a classification task, which calls for an algorithm such as logistic regression. The other answer choices all describe valid characteristics of linear regression.

</div>
***
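To make the idea of "learning a function from input–output pairs" concrete, here is a minimal sketch using scikit-learn. The data values are invented purely for illustration: each row of the input holds one predictor value, and the outputs follow an approximately linear pattern.

```python
from sklearn.linear_model import LinearRegression

# Training data: pairs of input (predictor) and output (target) values
X_train = [[1], [2], [3], [4]]   # one predictor variable per row
y_train = [3.1, 4.9, 7.2, 8.8]   # continuous target values

# Learn the best-fit line from the training pairs
model = LinearRegression()
model.fit(X_train, y_train)

# Predict a new, unseen input by plugging it into the learned function
print(model.predict([[5]]))  # roughly 11, continuing the linear trend
```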
+*** + + +### Applications of linear regression in machine learning +Linear Regression can be used for a variety of tasks, such as: + +- **Prediction:** Linear regression can be used to predict a wide range of continuous variables, such as house prices, stock prices, customer churn, and medical outcomes. +- **Recommendation:** Linear regression can be used to build recommender systems that recommend products, movies, or other items to users based on their past preferences. +- **Fraud detection:** Linear regression can be used to detect fraudulent transactions by identifying transactions that deviate from the expected behavior. +- **Medical diagnosis:** Linear regression can be used to help doctors diagnose diseases by identifying patterns in patient data. +- **Scientific research:** Linear regression can be used to identify relationships between variables in scientific data. +### Examples of linear regression in real-world applications +- **Predicting house prices:** Linear regression can be used to predict the price of a house based on its square footage, number of bedrooms, number of bathrooms, and other factors. +- **Predicting stock prices:** Linear regression can be used to predict the price of a stock based on its historical prices, financial data, and other factors. +- **Predicting customer churn:** Linear regression can be used to predict whether a customer is likely to churn based on their past purchase history, demographics, and other factors. +- **Predicting the risk of a customer defaulting on a loan:** Linear regression can be used to predict the risk of a customer defaulting on a loan based on their credit score, income, and other factors. +- **Predicting the likelihood of a patient having a particular disease:** Linear regression can be used to predict the likelihood of a patient having a particular disease based on their medical history, symptoms, and other factors. +- **Predicting the number of visitors to a website:** Linear regression can be used to predict the number of visitors to a website based on the website's past traffic data, marketing campaigns, and other factors. +- **Predicting the sales of a product:** Linear regression can be used to predict the sales of a product based on its price, marketing campaigns, and other factors. + +## Linear Regression Algorithm +Linear regression works by fitting a linear equation to the data. + +The linear equation is represented by the following formula: + +``` +y = b0 + b1 * x1 + b2 * x2 + ... + bn * xn +``` + +where: + +- `y` is the target variable +- `b0` is the bias term +- `bi` is the coefficient for the ith predictor variable +- `xi` is the ith predictor variable + +The coefficients of the linear equation are estimated using the ordinary least squares (OLS) method. The OLS method minimizes the sum of the squared residuals, which are the differences between the predicted values and the actual values of the target variable. Once the linear regression model is trained, it can be used to make predictions on new data. To make a prediction, we simply plug the values of the predictor variables into the linear equation. + +Which of the following is NOT a component of the linear regression formula? + + +[( )] Target variable +[( )] Bias term +[( )] Coefficient for the ith predictor variable +[( )] ith predictor variable +[(X)] Variance of the target variable +*** +
Which of the following is NOT a component of the linear regression formula?


[( )] Target variable
[( )] Bias term
[( )] Coefficient for the ith predictor variable
[( )] ith predictor variable
[(X)] Variance of the target variable
***
<div class = "answer">

The variance of the target variable is not a component of the linear regression formula. The formula combines the bias term, the coefficients, and the predictor variables to predict the expected (mean) value of the target variable; it says nothing about the target's variance.

</div>
+*** + + + +### Python Implementation of Linear Regression + +To implement linear regression in Python using Scikit-learn, we can follow these steps: + +1. Import the necessary libraries: +``` +import numpy as np +from sklearn.linear_model import LinearRegression +``` + +2. Load the data: +``` +# Load the data as a NumPy array +data = np.loadtxt("data.csv", delimiter=",") + +# Split the data into features and target variable +X = data[:, :-1] +y = data[:, -1] +``` + +3. Split the data into training and testing sets: +``` +from sklearn.model_selection import train_test_split + +# Split the data into 80% training and 20% testing sets +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) +``` + +4. Train the linear regression model: +``` +# Create a linear regression model +model = LinearRegression() + +# Fit the model to the training data +model.fit(X_train, y_train) +``` + +5. Evaluate the model on the testing set: +``` +# Make predictions on the testing set +y_pred = model.predict(X_test) + +# Evaluate the model using the mean squared error (MSE) +mse = np.mean((y_pred - y_test)**2) + +# Print the MSE +print("MSE:", mse) +``` + +6. Make predictions on new data: +``` +# New data point +new_data = np.array([[1000, 3, 2]]) + +# Make a prediction on the new data point +y_pred = model.predict(new_data) + +# Print the prediction +print("Prediction:", y_pred[0]) + +``` + +This is a basic example of how to implement linear regression in Python using Scikit-learn. There are many other ways to implement linear regression in Python, but this is a good starting point. + +Here are some additional tips for implementing linear regression in Python: + +- Make sure to scale the data before training the model. This will help to ensure that all features have equal importance in the model. +- Use a validation set to evaluate the model and tune the hyperparameters. This will help to prevent overfitting. +- Use regularization techniques, such as L1 or L2 regularization, to prevent overfitting. +- Interpret the coefficients of the linear regression model to understand the relationship between the predictor variables and the target variable. + + +### Applying Linear Regression to a Real-World Dataset +To apply linear regression to a real-world dataset, we can follow these steps: + +- **Choose a dataset:** The dataset should have at least one continuous target variable and one or more predictor variables. +- **Prepare the data:** This may involve cleaning the data, handling missing values, and scaling the data. +- **Split the data into training and testing sets:** This will help to prevent overfitting. +- **Train the linear regression model:** Use the training set to fit the model to the data. +- **Evaluate the model on the testing set:** This will give you an estimate of how well the model will generalize to new data. +- **Interpret the results:** Examine the coefficients of the model to understand the relationship between the predictor variables and the target variable. +- **Make predictions on new data:** Use the trained model to make predictions on new data points. + +### Important Notes +Linear regression is a powerful machine learning algorithm, but it has some limitations. Here are some of the most important limitations of linear regression: + +- **Linearity assumption:** Linear regression assumes that the relationship between the target variable and the predictor variables is linear. 
### Applying Linear Regression to a Real-World Dataset
To apply linear regression to a real-world dataset, we can follow these steps:

- **Choose a dataset:** The dataset should have at least one continuous target variable and one or more predictor variables.
- **Prepare the data:** This may involve cleaning the data, handling missing values, and scaling the data.
- **Split the data into training and testing sets:** This makes it possible to get an honest estimate of how the model performs on data it has not seen.
- **Train the linear regression model:** Use the training set to fit the model to the data.
- **Evaluate the model on the testing set:** This will give you an estimate of how well the model will generalize to new data.
- **Interpret the results:** Examine the coefficients of the model to understand the relationship between the predictor variables and the target variable.
- **Make predictions on new data:** Use the trained model to make predictions on new data points.

### Important Notes
Linear regression is a powerful machine learning algorithm, but it has some limitations. Here are some of the most important ones:

- **Linearity assumption:** Linear regression assumes that the relationship between the target variable and the predictor variables is linear. If the relationship is non-linear, linear regression will not be able to predict the target variable accurately.
- **Overfitting:** Linear regression is prone to overfitting, which occurs when the model learns the training data too well and is unable to generalize to new data. Overfitting can be mitigated by using regularization techniques such as L1 or L2 regularization.
- **Outliers:** Linear regression is sensitive to outliers, which are data points that differ markedly from the rest of the data. Outliers can have a large impact on the parameters of the linear regression model and can lead to inaccurate predictions.
- **Collinearity:** Linear regression is also sensitive to collinearity, which occurs when two or more predictor variables are highly correlated with each other. Collinearity makes the coefficient estimates unstable and difficult to interpret; a short sketch after the quiz below demonstrates this instability.

[True/False] Linear regression is sensitive to collinearity.


[(X)] True
[( )] False
***
<div class = "answer">

True. When two or more predictor variables are highly correlated (collinear), the model cannot cleanly separate their individual effects: the coefficient estimates become unstable and hard to interpret, and predictions on new data can suffer as a result.

</div>
***
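The following minimal sketch (with invented data) makes the collinearity problem visible: the model is refit on slightly perturbed copies of two nearly identical predictors, and the individual coefficients swing around from fit to fit even though their sum stays stable:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)  # x2 is almost identical to x1
y = 3 * x1 + rng.normal(scale=0.1, size=200)

for trial in range(3):
    # Refit on a slightly perturbed copy of the collinear predictor
    X = np.column_stack([x1, x2 + rng.normal(scale=0.01, size=200)])
    model = LinearRegression().fit(X, y)
    # The two coefficients vary between fits, but their sum stays near 3
    print(model.coef_, "sum:", model.coef_.sum())
```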
+*** + + +[True/False] Overfitting can be prevented by using regularization techniques. + + +[(X)] True +[( )] False +*** +

True. Regularization techniques such as L1 (lasso) and L2 (ridge) regularization add a penalty on large coefficients, which discourages the model from fitting noise in the training data and helps it generalize to new data.

</div>
+*** + +## Conclusion + +At the end of the lesson, students should have a good understanding of the concept of linear regression and how to implement the linear regression algorithm in Python. They should also be able to apply linear regression to real-world datasets to make predictions and insights. + +## Additional Resources + +## Feedback + +@feedback From 72a1a871258fff8989ff4ce8772649def8ca3abf Mon Sep 17 00:00:00 2001 From: Daniel Schwartz Date: Sat, 11 Nov 2023 14:03:16 -0500 Subject: [PATCH 02/10] Added data for code exercise --- ...althcare_investments_and_hospital_stay.csv | 519 ++++++++++++++++++ 1 file changed, 519 insertions(+) create mode 100644 python_linear_regression/data/healthcare_investments_and_hospital_stay.csv diff --git a/python_linear_regression/data/healthcare_investments_and_hospital_stay.csv b/python_linear_regression/data/healthcare_investments_and_hospital_stay.csv new file mode 100644 index 000000000..95d47f0b0 --- /dev/null +++ b/python_linear_regression/data/healthcare_investments_and_hospital_stay.csv @@ -0,0 +1,519 @@ +Location,Time,Hospital_Stay,MRI_Units,CT_Scanners,Hospital_Beds +AUS,1992,6.6,1.43,16.71,1.43 +AUS,1994,6.4,2.36,18.48,2.36 +AUS,1995,6.5,2.89,20.55,2.89 +AUS,1996,6.4,2.96,21.95,2.96 +AUS,1997,6.2,3.53,23.34,3.53 +AUS,1998,6.1,4.51,24.18,4.51 +AUS,1999,6.2,6.01,25.52,6.01 +AUS,2000,6.1,3.52,26.28,3.52 +AUS,2001,6.2,3.79,29.05,3.79 +AUS,2002,6.2,3.74,34.37,3.74 +AUS,2003,6.1,3.7,40.57,3.7 +AUS,2004,6.1,3.76,45.65,3.76 +AUS,2005,6.0,4.26,51.54,4.26 +AUS,2006,5.9,4.89,56.72,4.89 +AUS,2009,5.1,5.72,39.14,5.72 +AUS,2010,5.0,5.67,43.07,5.67 +AUS,2011,4.9,5.6,44.32,5.6 +AUS,2012,4.8,5.5,50.5,5.5 +AUS,2013,4.7,13.84,53.66,13.84 +AUS,2014,4.7,14.65,56.06,14.65 +AUS,2015,4.2,14.49,59.54,14.49 +AUS,2016,4.2,14.3,63.0,14.3 +AUS,2017,4.1,14.15,64.34,14.15 +AUT,1996,9.5,7.54,24.25,7.54 +AUT,1997,8.3,8.53,25.23,8.53 +AUT,1998,8.2,8.52,26.08,8.52 +AUT,1999,7.8,11.01,26.02,11.01 +AUT,2000,7.6,10.98,26.09,10.98 +AUT,2001,7.4,11.69,26.61,11.69 +AUT,2002,7.3,13.36,27.1,13.36 +AUT,2003,7.2,13.54,27.21,13.54 +AUT,2004,7.2,15.91,29.25,15.91 +AUT,2005,6.9,16.16,29.66,16.16 +AUT,2006,6.9,16.81,29.87,16.81 +AUT,2007,6.8,17.72,30.02,17.72 +AUT,2008,6.8,18.03,29.68,18.03 +AUT,2009,6.7,18.46,29.36,18.46 +AUT,2010,6.6,18.65,29.89,18.65 +AUT,2011,6.5,18.71,29.55,18.71 +AUT,2012,6.5,19.1,29.77,19.1 +AUT,2013,6.5,19.22,29.6,19.22 +AUT,2014,6.5,19.66,29.37,19.66 +AUT,2015,6.5,20.71,28.93,20.71 +AUT,2016,6.4,22.43,29.07,22.43 +AUT,2017,6.4,22.96,28.64,22.96 +AUT,2018,6.3,23.53,28.84,23.53 +BEL,2003,8.0,6.84,10.5,6.84 +BEL,2004,7.9,7.0,11.23,7.0 +BEL,2005,7.9,6.97,12.79,6.97 +BEL,2006,7.8,7.11,12.51,7.11 +BEL,2007,7.7,7.53,13.08,7.53 +BEL,2008,7.4,10.36,13.91,10.36 +BEL,2009,7.2,10.65,14.26,10.65 +BEL,2010,7.2,10.65,13.95,10.65 +BEL,2011,7.1,10.69,13.77,10.69 +BEL,2012,7.0,10.62,15.04,10.62 +BEL,2013,6.9,10.84,22.94,10.84 +BEL,2014,6.9,11.78,21.77,11.78 +BEL,2015,6.8,11.71,23.59,11.71 +BEL,2016,6.7,11.65,23.92,11.65 +BEL,2017,6.6,11.6,23.82,11.6 +BEL,2018,6.6,11.64,23.89,11.64 +CAN,1990,10.2,0.69,7.15,0.69 +CAN,1991,10.0,0.78,7.13,0.78 +CAN,1992,9.9,0.99,7.33,0.99 +CAN,1993,9.8,1.05,7.53,1.05 +CAN,1994,7.4,1.21,7.69,1.21 +CAN,1995,7.2,1.37,7.99,1.37 +CAN,1997,7.0,1.84,8.19,1.84 +CAN,2001,7.3,4.19,9.77,4.19 +CAN,2003,7.3,4.71,10.27,4.71 +CAN,2004,7.3,4.92,10.68,4.92 +CAN,2005,7.2,5.74,11.57,5.74 +CAN,2006,7.4,6.17,12.04,6.17 +CAN,2007,7.5,6.75,12.74,6.75 +CAN,2009,7.7,7.91,13.8,7.91 +CAN,2010,7.7,8.26,14.23,8.26 +CAN,2011,7.6,8.53,14.62,8.53 +CAN,2012,7.6,8.87,14.69,8.87 
+CAN,2013,7.5,8.89,14.77,8.89 +CAN,2015,7.4,9.52,15.07,9.52 +CAN,2017,7.4,10.02,15.35,10.02 +CZE,1991,11.9,0.19,2.13,0.19 +CZE,1992,11.6,0.39,4.65,0.39 +CZE,1993,11.2,0.58,5.71,0.58 +CZE,1994,10.8,0.68,6.19,0.68 +CZE,1995,10.2,0.97,6.68,0.97 +CZE,2000,7.9,1.66,9.65,1.66 +CZE,2005,7.9,3.13,12.34,3.13 +CZE,2007,7.0,4.37,12.91,4.37 +CZE,2008,6.7,5.01,13.39,5.01 +CZE,2009,6.7,5.74,14.17,5.74 +CZE,2010,6.6,6.3,14.51,6.3 +CZE,2011,6.4,6.86,14.77,6.86 +CZE,2012,6.2,6.95,15.03,6.95 +CZE,2013,6.0,7.42,15.03,7.42 +CZE,2014,6.0,7.41,15.11,7.41 +CZE,2015,5.9,8.34,16.12,8.34 +CZE,2016,5.9,8.52,15.52,8.52 +CZE,2017,5.8,9.44,15.76,9.44 +CZE,2018,5.8,10.35,16.09,10.35 +DNK,2000,3.8,5.43,11.42,5.43 +DNK,2002,3.7,8.56,13.77,8.56 +DNK,2003,3.6,9.09,14.47,9.09 +DNK,2004,3.4,10.18,14.43,10.18 +FIN,1990,7.0,1.8,9.83,1.8 +FIN,1991,7.0,2.19,10.17,2.19 +FIN,1992,6.1,2.38,10.51,2.38 +FIN,1993,5.7,2.76,11.25,2.76 +FIN,1994,5.6,3.34,11.79,3.34 +FIN,1995,5.5,4.31,11.75,4.31 +FIN,1996,6.0,5.66,12.49,5.66 +FIN,1997,5.9,6.61,12.45,6.61 +FIN,1998,5.9,8.34,12.22,8.34 +FIN,1999,5.8,9.1,12.78,9.1 +FIN,2000,6.9,9.85,13.52,9.85 +FIN,2001,7.0,10.99,13.69,10.99 +FIN,2002,7.1,12.5,13.27,12.5 +FIN,2003,7.1,13.04,14.0,13.04 +FIN,2004,7.1,13.96,14.15,13.96 +FIN,2005,7.1,14.68,14.68,14.68 +FIN,2006,7.2,15.19,14.81,15.19 +FIN,2007,7.2,15.32,16.45,15.32 +FIN,2009,7.0,15.73,20.42,15.73 +FIN,2010,7.0,18.65,21.07,18.65 +FIN,2011,6.9,20.23,21.34,20.23 +FIN,2012,6.9,21.61,21.8,21.61 +FIN,2013,6.8,22.06,21.7,22.06 +FIN,2014,6.7,23.25,21.42,23.25 +FIN,2015,6.6,25.91,21.53,25.91 +FIN,2016,6.5,25.48,24.2,25.48 +FIN,2017,6.4,27.05,24.51,27.05 +FIN,2018,6.4,27.38,16.5,27.38 +FRA,1998,5.8,1.18,6.64,1.18 +FRA,1999,5.5,1.51,7.24,1.51 +FRA,2000,5.6,1.65,7.01,1.65 +FRA,2001,5.7,1.83,7.37,1.83 +FRA,2002,5.7,2.4,7.62,2.4 +FRA,2003,6.1,3.17,8.07,3.17 +FRA,2004,6.0,3.85,8.78,3.85 +FRA,2005,5.9,4.78,10.02,4.78 +FRA,2006,5.9,5.19,10.37,5.19 +FRA,2007,5.9,5.48,10.32,5.48 +FRA,2008,5.8,6.06,10.84,6.06 +FRA,2009,5.7,6.43,11.08,6.43 +FRA,2010,5.8,6.96,11.82,6.96 +FRA,2011,5.7,7.51,12.53,7.51 +FRA,2012,5.7,8.65,13.49,8.65 +FRA,2013,5.6,9.4,14.49,9.4 +FRA,2014,5.6,10.86,15.32,10.86 +FRA,2015,5.6,12.56,16.57,12.56 +FRA,2016,5.5,13.55,16.95,13.55 +FRA,2017,5.4,14.21,17.36,14.21 +FRA,2018,5.4,14.77,17.68,14.77 +DEU,2000,10.1,14.32,24.61,14.32 +DEU,2001,9.8,15.96,25.19,15.96 +DEU,2002,9.6,17.51,27.06,17.51 +DEU,2003,9.3,18.48,27.55,18.48 +DEU,2004,8.9,18.97,28.71,18.97 +DEU,2005,8.8,19.89,29.51,19.89 +DEU,2006,8.7,21.39,29.12,21.39 +DEU,2007,8.5,22.43,29.73,22.43 +DEU,2008,8.3,23.6,31.15,23.6 +DEU,2009,8.2,25.15,31.24,25.15 +DEU,2010,8.1,27.04,32.32,27.04 +DEU,2011,7.9,28.86,33.48,28.86 +DEU,2012,7.8,28.66,34.01,28.66 +DEU,2013,7.7,28.92,33.72,28.92 +DEU,2014,7.6,30.5,35.34,30.5 +DEU,2015,7.6,33.63,35.09,33.63 +DEU,2016,7.5,34.49,35.17,34.49 +DEU,2017,7.5,34.71,35.13,34.71 +GRC,2005,5.6,13.38,25.48,13.38 +GRC,2006,5.8,16.51,26.68,16.51 +GRC,2007,5.4,18.1,29.33,18.1 +GRC,2008,5.4,19.86,31.05,19.86 +GRC,2009,5.3,22.06,31.24,22.06 +GRC,2010,5.3,22.93,32.73,22.93 +GRC,2011,5.4,22.42,33.14,22.42 +GRC,2012,5.2,21.91,33.41,21.91 +GRC,2013,5.6,22.07,33.65,22.07 +GRC,2014,5.6,22.86,34.61,22.86 +HUN,1990,9.9,0.1,1.93,0.1 +HUN,1991,9.7,0.29,2.99,0.29 +HUN,1992,9.5,0.29,3.09,0.29 +HUN,1993,9.5,0.39,3.86,0.39 +HUN,1994,9.8,0.77,4.16,0.77 +HUN,1995,9.2,0.97,4.55,0.97 +HUN,1996,8.6,1.36,4.95,1.36 +HUN,1997,8.2,1.36,4.57,1.36 +HUN,1998,7.8,1.46,4.97,1.46 +HUN,1999,7.5,1.47,5.08,1.47 +HUN,2000,7.1,1.76,5.68,1.76 +HUN,2001,7.0,1.96,5.99,1.96 +HUN,2002,6.9,2.26,6.3,2.26 
+HUN,2003,6.7,2.57,6.52,2.57 +HUN,2004,6.7,2.57,6.83,2.57 +HUN,2005,6.5,2.58,7.14,2.58 +HUN,2006,6.4,2.58,7.25,2.58 +HUN,2007,6.0,2.78,7.26,2.78 +HUN,2008,6.0,2.79,7.07,2.79 +HUN,2009,5.8,2.79,7.18,2.79 +HUN,2010,5.8,3.0,7.3,3.0 +HUN,2011,5.8,3.01,7.32,3.01 +HUN,2012,5.8,2.82,7.66,2.82 +HUN,2013,5.7,3.03,7.88,3.03 +HUN,2014,5.6,3.14,8.31,3.14 +HUN,2015,5.5,3.56,8.43,3.56 +HUN,2016,5.5,3.97,8.86,3.97 +HUN,2017,5.5,4.7,9.19,4.7 +HUN,2018,5.4,4.91,9.41,4.91 +IRL,2006,6.3,7.95,12.63,7.95 +IRL,2007,6.1,8.41,14.09,8.41 +IRL,2008,6.2,8.91,14.26,8.91 +IRL,2009,6.1,11.69,14.99,11.69 +IRL,2010,6.0,12.28,15.35,12.28 +IRL,2011,5.9,13.1,15.72,13.1 +IRL,2012,5.9,12.39,16.74,12.39 +IRL,2013,5.7,13.19,17.73,13.19 +IRL,2014,5.6,13.31,16.53,13.31 +IRL,2015,5.8,14.04,17.65,14.04 +IRL,2016,5.8,14.72,17.24,14.72 +IRL,2017,5.9,15.18,19.14,15.18 +IRL,2018,5.9,16.03,20.34,16.03 +ITA,1997,7.3,4.11,14.8,4.11 +ITA,1998,7.2,5.82,18.01,5.82 +ITA,1999,7.0,6.25,18.99,6.25 +ITA,2000,7.0,7.76,21.13,7.76 +ITA,2001,7.0,9.07,23.01,9.07 +ITA,2002,6.7,10.85,24.05,10.85 +ITA,2003,6.7,11.9,23.92,11.9 +ITA,2004,6.7,14.09,26.23,14.09 +ITA,2005,6.7,15.01,27.82,15.01 +ITA,2006,6.7,16.96,29.29,16.96 +ITA,2007,6.7,18.77,30.55,18.77 +ITA,2008,6.8,20.06,30.96,20.06 +ITA,2009,6.7,21.59,31.85,21.59 +ITA,2010,6.7,22.47,32.17,22.47 +ITA,2011,6.8,24.17,32.62,24.17 +ITA,2012,6.8,24.62,33.29,24.62 +ITA,2013,6.8,25.2,33.1,25.2 +ITA,2014,6.8,26.19,32.9,26.19 +ITA,2015,6.9,28.24,33.31,28.24 +ITA,2016,6.9,28.4,34.29,28.4 +ITA,2017,6.9,28.66,34.57,28.66 +ITA,2018,7.0,28.73,35.12,28.73 +JPN,1996,32.7,18.75,74.7,18.75 +JPN,1999,27.2,23.19,84.41,23.19 +JPN,2002,22.2,35.32,92.62,35.32 +JPN,2008,18.8,42.96,96.97,42.96 +JPN,2011,17.9,46.86,101.25,46.86 +JPN,2014,16.9,51.69,107.17,51.69 +JPN,2017,16.2,55.21,111.49,55.21 +KOR,1993,11.0,1.81,12.22,1.81 +KOR,1994,11.0,2.87,13.69,2.87 +KOR,1995,11.0,3.86,15.5,3.86 +KOR,1996,11.0,4.7,20.12,4.7 +KOR,1997,11.0,5.14,21.02,5.14 +KOR,2000,11.0,5.4,28.38,5.4 +KOR,2001,11.0,6.8,27.3,6.8 +KOR,2002,11.0,7.85,30.94,7.85 +KOR,2003,10.6,8.98,31.86,8.98 +KOR,2010,10.0,19.88,35.17,19.88 +KOR,2011,10.1,21.27,35.79,21.27 +KOR,2012,9.2,23.37,36.93,23.37 +KOR,2013,8.9,24.35,37.5,24.35 +KOR,2014,8.0,25.5,36.85,25.5 +KOR,2015,7.9,26.27,37.03,26.27 +KOR,2016,7.6,27.81,37.8,27.81 +KOR,2017,7.6,29.08,38.18,29.08 +KOR,2018,7.5,30.08,38.56,30.08 +LUX,2002,7.5,4.48,24.65,4.48 +LUX,2003,7.4,11.07,26.57,11.07 +LUX,2004,7.2,10.91,28.38,10.91 +LUX,2005,7.2,10.75,27.95,10.75 +LUX,2006,7.4,10.58,27.51,10.58 +LUX,2007,7.5,10.42,27.08,10.42 +LUX,2008,7.3,12.28,26.6,12.28 +LUX,2009,7.5,14.06,26.12,14.06 +LUX,2010,7.6,13.81,25.64,13.81 +LUX,2011,7.3,13.5,25.08,13.5 +LUX,2012,7.4,13.18,24.48,13.18 +LUX,2013,7.3,12.88,22.08,12.88 +LUX,2014,7.3,12.58,21.57,12.58 +LUX,2015,7.4,12.29,17.56,12.29 +LUX,2016,7.4,12.0,17.14,12.0 +LUX,2017,7.4,11.74,16.77,11.74 +LUX,2018,7.6,11.51,16.45,11.51 +NLD,1990,11.2,0.87,7.29,0.87 +NLD,1992,10.6,1.78,7.24,1.78 +NLD,1993,10.4,2.49,9.03,2.49 +NLD,2004,7.5,6.2,7.12,6.2 +NLD,2005,7.2,6.56,8.21,6.56 +NLD,2006,6.6,7.83,8.38,7.83 +NLD,2007,6.2,7.63,7.81,7.63 +NLD,2008,6.0,10.4,10.22,10.4 +NLD,2009,5.6,10.95,11.25,10.95 +NLD,2010,5.6,12.22,12.34,12.22 +NLD,2011,6.5,12.88,12.52,12.88 +NLD,2012,6.4,11.82,10.92,11.82 +NLD,2013,6.7,11.49,11.54,11.49 +NLD,2014,6.7,12.87,13.34,12.87 +NLD,2015,5.0,12.51,13.75,12.51 +NLD,2016,5.0,12.8,13.04,12.8 +NLD,2017,5.0,13.02,13.48,13.02 +NLD,2018,5.1,13.06,14.22,13.06 +NZL,2003,5.2,3.72,11.42,3.72 +NZL,2007,5.9,8.76,12.31,8.76 +NZL,2008,6.2,9.62,12.44,9.62 +NZL,2009,6.1,9.76,14.64,9.76 
+NZL,2010,6.1,10.57,15.63,10.57 +NZL,2011,6.2,11.18,15.51,11.18 +NZL,2012,6.0,11.12,15.43,11.12 +NZL,2013,5.6,11.26,16.66,11.26 +NZL,2015,5.4,13.3,17.88,13.3 +NZL,2016,5.1,13.89,17.96,13.89 +NZL,2017,5.0,13.64,16.79,13.64 +POL,2005,7.9,2.02,7.94,2.02 +POL,2006,7.6,1.94,9.23,1.94 +POL,2007,7.4,2.7,9.65,2.7 +POL,2008,7.5,2.94,10.86,2.94 +POL,2009,7.4,3.7,12.4,3.7 +POL,2010,7.3,4.71,14.38,4.71 +POL,2011,7.1,4.83,13.61,4.83 +POL,2012,6.8,5.49,15.4,5.49 +POL,2013,6.7,6.78,17.09,6.78 +POL,2014,6.6,6.6,15.63,6.6 +POL,2015,6.9,7.63,17.16,7.63 +POL,2016,6.7,7.87,17.33,7.87 +POL,2017,6.6,7.93,16.88,7.93 +POL,2018,6.5,9.22,18.14,9.22 +PRT,2006,8.6,5.8,25.94,5.8 +PRT,2007,8.4,8.92,26.18,8.92 +PRT,2008,8.3,9.28,27.56,9.28 +SVK,2003,7.4,2.05,9.12,2.05 +SVK,2004,7.3,3.72,10.24,3.72 +SVK,2005,7.3,4.28,11.35,4.28 +SVK,2006,7.2,4.47,12.28,4.47 +SVK,2007,7.0,5.77,13.77,5.77 +SVK,2008,6.9,6.13,13.76,6.13 +SVK,2009,6.7,6.13,13.37,6.13 +SVK,2010,6.6,6.86,14.1,6.86 +SVK,2011,6.3,7.04,15.0,7.04 +SVK,2012,6.2,6.29,15.53,6.29 +SVK,2013,6.2,6.65,15.33,6.65 +SVK,2014,7.0,8.3,17.35,8.3 +SVK,2015,6.9,8.85,17.88,8.85 +SVK,2016,6.8,9.02,17.31,9.02 +SVK,2017,6.8,9.56,17.28,9.56 +SVK,2018,6.7,9.55,18.36,9.55 +ESP,2010,6.4,11.98,15.95,11.98 +ESP,2011,6.2,13.76,16.64,13.76 +ESP,2012,6.1,14.77,17.19,14.77 +ESP,2013,6.1,15.34,17.59,15.34 +ESP,2014,6.0,15.51,17.6,15.51 +ESP,2015,6.1,15.85,18.02,15.85 +ESP,2016,6.0,16.09,18.31,16.09 +ESP,2017,6.0,16.38,18.65,16.38 +ESP,2018,6.0,17.2,19.12,17.2 +TUR,2002,5.8,0.88,4.89,0.88 +TUR,2003,5.7,1.48,5.63,1.48 +TUR,2004,5.6,2.2,6.6,2.2 +TUR,2005,5.3,2.91,7.44,2.91 +TUR,2006,5.1,4.47,8.56,4.47 +TUR,2007,4.4,5.84,9.62,5.84 +TUR,2008,4.1,7.91,10.68,7.91 +TUR,2009,4.1,8.68,11.63,8.68 +TUR,2010,4.0,9.27,12.36,9.27 +TUR,2011,3.9,9.55,13.12,9.55 +TUR,2012,3.9,9.58,13.53,9.58 +TUR,2013,3.9,9.86,13.89,9.86 +TUR,2014,4.0,9.81,13.88,9.81 +TUR,2015,3.9,10.15,14.31,10.15 +TUR,2016,4.0,10.55,14.53,10.55 +TUR,2017,4.1,11.01,14.77,11.01 +TUR,2018,4.1,11.24,14.88,11.24 +GBR,2001,7.7,6.21,6.88,6.21 +GBR,2002,7.5,4.99,7.29,4.99 +GBR,2003,7.3,4.54,6.91,4.54 +GBR,2004,7.1,5.0,7.02,5.0 +GBR,2005,6.9,5.4,7.45,5.4 +GBR,2006,6.6,5.62,7.53,5.62 +GBR,2008,6.3,5.5,7.26,5.5 +GBR,2010,6.1,6.55,7.92,6.55 +GBR,2011,6.0,6.96,8.48,6.96 +GBR,2012,6.0,7.16,9.09,7.16 +GBR,2013,6.0,7.2,9.3,7.2 +GBR,2014,6.0,7.23,9.46,7.23 +USA,1997,6.1,11.41,24.1,11.41 +USA,1999,5.9,13.19,25.09,13.19 +USA,2001,5.8,17.44,28.88,17.44 +USA,2003,5.7,19.32,29.26,19.32 +USA,2004,5.6,26.67,32.29,26.67 +USA,2006,5.6,26.58,34.02,26.58 +USA,2007,5.5,25.93,34.31,25.93 +USA,2012,5.4,34.46,43.89,34.46 +USA,2013,5.4,35.51,43.5,35.51 +USA,2014,5.5,38.12,41.05,38.12 +USA,2015,5.5,39.03,41.01,39.03 +USA,2016,5.5,36.74,41.88,36.74 +USA,2017,5.5,37.65,42.74,37.65 +EST,2005,6.0,2.21,7.38,2.21 +EST,2006,5.9,3.71,7.42,3.71 +EST,2007,5.9,5.22,11.19,5.22 +EST,2008,5.7,8.23,14.96,8.23 +EST,2009,5.6,7.49,14.99,7.49 +EST,2010,5.5,8.26,15.77,8.26 +EST,2011,5.5,9.79,16.57,9.79 +EST,2012,5.6,9.83,17.39,9.83 +EST,2013,6.0,11.38,18.97,11.38 +EST,2014,5.9,11.41,19.78,11.41 +EST,2015,6.0,12.16,16.72,12.16 +EST,2016,6.1,13.68,17.48,13.68 +EST,2017,6.1,13.66,18.22,13.66 +EST,2018,6.1,13.62,18.91,13.62 +ISR,2000,7.1,1.43,5.57,1.43 +ISR,2001,6.2,1.4,6.37,1.4 +ISR,2002,5.9,1.37,6.24,1.37 +ISR,2003,5.8,1.64,5.83,1.64 +ISR,2004,6.0,1.62,6.32,1.62 +ISR,2005,5.7,1.73,6.49,1.73 +ISR,2006,5.5,1.84,6.38,1.84 +ISR,2007,5.2,2.23,7.94,2.23 +ISR,2008,5.1,2.33,8.21,2.33 +ISR,2009,5.1,2.27,8.68,2.27 +ISR,2010,5.2,2.23,8.79,2.23 +ISR,2011,5.2,2.7,8.76,2.7 +ISR,2012,5.1,3.29,8.98,3.29 
+ISR,2013,5.2,3.47,8.93,3.47 +ISR,2014,5.1,4.02,9.49,4.02 +ISR,2015,5.2,4.06,9.67,4.06 +ISR,2016,5.2,4.91,9.6,4.91 +ISR,2017,5.1,5.16,9.53,5.16 +ISR,2018,5.0,5.18,9.57,5.18 +RUS,1993,13.6,0.92,1.58,0.92 +RUS,1994,13.6,0.77,1.48,0.77 +RUS,1995,13.6,0.61,1.82,0.61 +RUS,1996,13.6,0.7,2.1,0.7 +RUS,1997,14.3,0.85,2.21,0.85 +RUS,1998,14.0,0.74,2.32,0.74 +RUS,1999,13.7,0.88,2.39,0.88 +RUS,2000,13.5,1.13,2.58,1.13 +RUS,2001,13.2,1.11,2.66,1.11 +RUS,2002,12.9,1.31,2.77,1.31 +RUS,2004,12.2,1.36,3.32,1.36 +RUS,2005,11.9,1.54,3.77,1.54 +RUS,2006,11.5,2.12,4.04,2.12 +RUS,2007,11.4,2.02,4.42,2.02 +RUS,2008,11.3,2.27,5.02,2.27 +RUS,2009,11.0,2.52,6.02,2.52 +RUS,2010,10.8,2.51,6.9,2.51 +RUS,2011,11.3,2.62,7.72,2.62 +RUS,2012,10.8,4.17,9.09,4.17 +RUS,2013,10.3,3.99,11.28,3.99 +RUS,2014,9.9,4.44,12.2,4.44 +RUS,2015,9.7,4.64,12.56,4.64 +RUS,2016,9.4,4.52,12.76,4.52 +RUS,2017,9.3,4.6,13.0,4.6 +RUS,2018,9.1,4.84,13.37,4.84 +SVN,2006,5.8,6.48,10.46,6.48 +SVN,2008,5.7,6.43,12.37,6.43 +SVN,2009,5.6,7.35,11.77,7.35 +SVN,2010,5.5,7.81,12.69,7.81 +SVN,2011,6.8,8.77,12.67,8.77 +SVN,2012,6.9,8.75,12.64,8.75 +SVN,2013,6.6,9.22,12.14,9.22 +SVN,2014,6.6,9.21,13.09,9.21 +SVN,2015,6.5,9.21,13.08,9.21 +SVN,2016,6.5,11.14,14.04,11.14 +SVN,2017,6.6,11.61,15.0,11.61 +SVN,2018,6.7,12.05,15.91,12.05 +ISL,2007,5.6,19.26,32.1,19.26 +ISL,2008,5.5,18.9,31.5,18.9 +ISL,2009,5.5,21.98,34.54,21.98 +ISL,2010,5.4,22.01,37.73,22.01 +ISL,2011,5.3,21.94,40.75,21.94 +ISL,2012,5.5,21.83,40.53,21.83 +ISL,2013,5.6,21.62,40.15,21.62 +ISL,2014,5.8,21.38,39.71,21.38 +ISL,2015,5.9,21.16,39.3,21.16 +ISL,2016,5.9,20.87,38.76,20.87 +ISL,2017,5.7,20.38,43.68,20.38 +ISL,2018,5.6,19.85,48.2,19.85 +LVA,2003,7.9,1.31,13.55,1.31 +LVA,2004,7.8,0.88,15.02,0.88 +LVA,2005,7.4,2.68,18.31,2.68 +LVA,2006,7.2,2.7,18.48,2.7 +LVA,2007,7.1,5.0,21.81,5.0 +LVA,2008,7.1,6.89,23.88,6.89 +LVA,2009,6.1,7.47,25.68,7.47 +LVA,2010,6.2,8.1,29.08,8.1 +LVA,2011,6.0,9.22,31.07,9.22 +LVA,2012,5.8,9.83,32.44,9.83 +LVA,2013,5.8,10.43,34.78,10.43 +LVA,2014,5.9,12.54,36.11,12.54 +LVA,2015,6.0,12.64,36.91,12.64 +LVA,2016,5.9,13.78,36.23,13.78 +LVA,2017,6.0,13.9,39.13,13.9 +LVA,2018,6.0,13.49,38.4,13.49 +LTU,2000,9.2,0.29,6.57,0.29 +LTU,2001,9.0,0.86,7.2,0.86 +LTU,2002,8.7,0.87,8.71,0.87 +LTU,2003,8.3,0.88,9.08,0.88 +LTU,2004,8.2,1.18,11.55,1.18 +LTU,2005,8.1,1.5,12.04,1.5 +LTU,2006,8.0,3.06,12.84,3.06 +LTU,2007,7.7,3.4,10.52,3.4 +LTU,2008,7.5,4.38,13.76,4.38 +LTU,2009,7.2,5.37,16.12,5.37 +LTU,2010,7.1,4.84,18.73,4.84 +LTU,2011,7.0,5.94,20.14,5.94 +LTU,2012,6.9,10.04,23.76,10.04 +LTU,2013,6.9,10.48,23.67,10.48 +LTU,2014,6.8,10.57,22.17,10.57 +LTU,2015,6.6,11.02,21.0,11.02 +LTU,2016,6.6,12.2,23.01,12.2 +LTU,2017,6.5,12.37,23.33,12.37 +LTU,2018,6.5,12.49,24.27,12.49 From 2d5f77a0599a65d50a7e5a2b9a5292084a792c19 Mon Sep 17 00:00:00 2001 From: Daniel Schwartz Date: Sat, 11 Nov 2023 14:43:35 -0500 Subject: [PATCH 03/10] Added python code exercise to module --- .../python_linear_regression.md | 91 ++++++++++++++----- 1 file changed, 66 insertions(+), 25 deletions(-) diff --git a/python_linear_regression/python_linear_regression.md b/python_linear_regression/python_linear_regression.md index 8de89d902..de946ad20 100644 --- a/python_linear_regression/python_linear_regression.md +++ b/python_linear_regression/python_linear_regression.md @@ -58,7 +58,8 @@ Previous versions: @end import: https://raw.githubusercontent.com/arcus/education_modules/main/_module_templates/macros.md -import: https://raw.githubusercontent.com/arcus/education_modules/main/_module_templates/macros_python.md +import: 
https://raw.githubusercontent.com/arcus/education_modules/pyodide_testing/_module_templates/macros_python.md +import: https://raw.githubusercontent.com/LiaTemplates/Pyodide/master/README.md --> # Python Lesson on Regression for Machine Learning @@ -144,41 +145,89 @@ The variance of the target variable is not a component of the linear regression To implement linear regression in Python using Scikit-learn, we can follow these steps: + + 1. Import the necessary libraries: -``` +```python import numpy as np +import pandas as pd + +from sklearn.model_selection import train_test_split +from sklearn.preprocessing import StandardScaler from sklearn.linear_model import LinearRegression ``` +@Pyodide.eval + 2. Load the data: -``` -# Load the data as a NumPy array -data = np.loadtxt("data.csv", delimiter=",") +```python @Pyodide.exec -# Split the data into features and target variable -X = data[:, :-1] -y = data[:, -1] +import pandas as pd +import io +from pyodide.http import open_url + +url = "https://raw.githubusercontent.com/arcus/education_modules/linear_regression/python_linear_regression/data/healthcare_investments_and_hospital_stay.csv" + +url_contents = open_url(url) +text = url_contents.read() +file = io.StringIO(text) + +data = pd.read_csv(file) + +# Analyze data and features +data.info() ``` 3. Split the data into training and testing sets: -``` -from sklearn.model_selection import train_test_split +```python + +# Encode categorical data into numbers +def onehot_encode(df, column): + df = df.copy() + dummies = pd.get_dummies(df[column]) + df = pd.concat([df, dummies], axis=1) + df = df.drop(column, axis=1) + return df + +def preprocess_inputs(df): + df = df.copy() + + # One-hot encode Location column + df = onehot_encode(df, column='Location') + + # Split df into X and y + y = df['Hospital_Stay'].copy() + X = df.drop('Hospital_Stay', axis=1).copy() + + # Train-test split + X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=123) + + # Scale X with a standard scaler + scaler = StandardScaler() + scaler.fit(X_train) + + X_train = pd.DataFrame(scaler.transform(X_train), columns=X.columns) + X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns) + + return X_train, X_test, y_train, y_test + +X_train, X_test, y_train, y_test = preprocess_inputs(data) -# Split the data into 80% training and 20% testing sets -X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) ``` +@Pyodide.eval 4. Train the linear regression model: -``` +```python # Create a linear regression model model = LinearRegression() # Fit the model to the training data model.fit(X_train, y_train) ``` +@Pyodide.eval 5. Evaluate the model on the testing set: -``` +```python # Make predictions on the testing set y_pred = model.predict(X_test) @@ -187,20 +236,12 @@ mse = np.mean((y_pred - y_test)**2) # Print the MSE print("MSE:", mse) -``` -6. Make predictions on new data: +# Evaluate R^2 Score +print(" R^2 Score: {:.5f}".format(model.score(X_test, y_test))) ``` -# New data point -new_data = np.array([[1000, 3, 2]]) +@Pyodide.eval -# Make a prediction on the new data point -y_pred = model.predict(new_data) - -# Print the prediction -print("Prediction:", y_pred[0]) - -``` This is a basic example of how to implement linear regression in Python using Scikit-learn. There are many other ways to implement linear regression in Python, but this is a good starting point. 
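Once trained, the model can also score new, individual observations. As a minimal sketch reusing the `model`, `X_test`, and `y_test` objects defined in the steps above, we can predict a single held-out row and compare it with the true value:

```python
# Predict the length of hospital stay for one held-out observation
single_prediction = model.predict(X_test.iloc[[0]])

print("Predicted hospital stay:", single_prediction[0])
print("Actual hospital stay:", y_test.iloc[0])
```
@Pyodide.eval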
From 5954c8c00f70cbbfc4c8ff36e7817d09ad1ddb3d Mon Sep 17 00:00:00 2001 From: Schwartz Date: Thu, 21 Mar 2024 09:36:21 -0400 Subject: [PATCH 04/10] Added real world example for python linear regression --- python_linear_regression/data/strep_tb.csv | 108 +++++++++++++++++++++ 1 file changed, 108 insertions(+) create mode 100644 python_linear_regression/data/strep_tb.csv diff --git a/python_linear_regression/data/strep_tb.csv b/python_linear_regression/data/strep_tb.csv new file mode 100644 index 000000000..143d7b913 --- /dev/null +++ b/python_linear_regression/data/strep_tb.csv @@ -0,0 +1,108 @@ +"","patient_id","arm","dose_strep_g","dose_PAS_g","gender","baseline_condition","baseline_temp","baseline_esr","baseline_cavitation","strep_resistance","radiologic_6m","rad_num","improved" +"1",1,"Control",0,0,"M","1_Good","1_<=98.9F/37.2C","2_11-20","yes","1_sens_0-8","6_Considerable_improvement",6,TRUE +"2",2,"Control",0,0,"F","1_Good","3_100-100.9F/37.8-38.2C","2_11-20","no","1_sens_0-8","5_Moderate_improvement",5,TRUE +"3",3,"Control",0,0,"F","1_Good","1_<=98.9F/37.2C","3_21-50","no","1_sens_0-8","5_Moderate_improvement",5,TRUE +"4",4,"Control",0,0,"M","1_Good","1_<=98.9F/37.2C","3_21-50","no","1_sens_0-8","5_Moderate_improvement",5,TRUE +"5",5,"Control",0,0,"F","1_Good","2_99-99.9F/37.3-37.7C","3_21-50","no","1_sens_0-8","5_Moderate_improvement",5,TRUE +"6",6,"Control",0,0,"M","1_Good","3_100-100.9F/37.8-38.2C","3_21-50","no","1_sens_0-8","6_Considerable_improvement",6,TRUE +"7",7,"Control",0,0,"F","1_Good","2_99-99.9F/37.3-37.7C","3_21-50","yes","1_sens_0-8","5_Moderate_improvement",5,TRUE +"8",8,"Control",0,0,"M","1_Good","2_99-99.9F/37.3-37.7C","3_21-50","yes","1_sens_0-8","5_Moderate_improvement",5,TRUE +"9",9,"Control",0,0,"F","2_Fair","2_99-99.9F/37.3-37.7C","3_21-50","yes","1_sens_0-8","5_Moderate_improvement",5,TRUE +"10",10,"Control",0,0,"M","2_Fair","4_>=101F/38.3C","3_21-50","yes","1_sens_0-8","5_Moderate_improvement",5,TRUE +"11",11,"Control",0,0,"F","2_Fair","3_100-100.9F/37.8-38.2C","3_21-50","no","1_sens_0-8","6_Considerable_improvement",6,TRUE +"12",12,"Control",0,0,"M","2_Fair","2_99-99.9F/37.3-37.7C","3_21-50","yes","1_sens_0-8","5_Moderate_improvement",5,TRUE +"13",13,"Control",0,0,"F","2_Fair","2_99-99.9F/37.3-37.7C","3_21-50","yes","1_sens_0-8","5_Moderate_improvement",5,TRUE +"14",14,"Control",0,0,"M","2_Fair","4_>=101F/38.3C","3_21-50","yes","1_sens_0-8","5_Moderate_improvement",5,TRUE +"15",15,"Control",0,0,"F","2_Fair","2_99-99.9F/37.3-37.7C","3_21-50","yes","1_sens_0-8","5_Moderate_improvement",5,TRUE +"16",16,"Control",0,0,"M","2_Fair","3_100-100.9F/37.8-38.2C","3_21-50","no","1_sens_0-8","6_Considerable_improvement",6,TRUE +"17",17,"Control",0,0,"F","2_Fair","3_100-100.9F/37.8-38.2C","3_21-50","no","1_sens_0-8","5_Moderate_improvement",5,TRUE +"18",18,"Control",0,0,"M","2_Fair","1_<=98.9F/37.2C","3_21-50","no","1_sens_0-8","4_No_change",4,FALSE +"19",19,"Control",0,0,"F","2_Fair","3_100-100.9F/37.8-38.2C","3_21-50","no","1_sens_0-8","3_Moderate_deterioration",3,FALSE +"20",20,"Control",0,0,"M","2_Fair","3_100-100.9F/37.8-38.2C","3_21-50","no","1_sens_0-8","3_Moderate_deterioration",3,FALSE +"21",21,"Control",0,0,"F","2_Fair","3_100-100.9F/37.8-38.2C","3_21-50","no","1_sens_0-8","3_Moderate_deterioration",3,FALSE +"22",22,"Control",0,0,"M","2_Fair","3_100-100.9F/37.8-38.2C","3_21-50","no","1_sens_0-8","3_Moderate_deterioration",3,FALSE +"23",23,"Control",0,0,"F","2_Fair","2_99-99.9F/37.3-37.7C","4_51+","yes","1_sens_0-8","4_No_change",4,FALSE 
+"24",24,"Control",0,0,"M","2_Fair","2_99-99.9F/37.3-37.7C","4_51+","no","1_sens_0-8","3_Moderate_deterioration",3,FALSE +"25",25,"Control",0,0,"F","2_Fair","2_99-99.9F/37.3-37.7C","4_51+","no","1_sens_0-8","3_Moderate_deterioration",3,FALSE +"26",26,"Control",0,0,"M","2_Fair","3_100-100.9F/37.8-38.2C","4_51+","yes","1_sens_0-8","3_Moderate_deterioration",3,FALSE +"27",27,"Control",0,0,"F","2_Fair","3_100-100.9F/37.8-38.2C","4_51+","yes","1_sens_0-8","3_Moderate_deterioration",3,FALSE +"28",28,"Control",0,0,"M","2_Fair","3_100-100.9F/37.8-38.2C","4_51+","yes","1_sens_0-8","3_Moderate_deterioration",3,FALSE +"29",29,"Control",0,0,"F","3_Poor","2_99-99.9F/37.3-37.7C","4_51+","yes","1_sens_0-8","4_No_change",4,FALSE +"30",30,"Control",0,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","3_Moderate_deterioration",3,FALSE +"31",31,"Control",0,0,"M","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","3_Moderate_deterioration",3,FALSE +"32",32,"Control",0,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","3_Moderate_deterioration",3,FALSE +"33",33,"Control",0,0,"M","3_Poor","4_>=101F/38.3C","4_51+","no","1_sens_0-8","2_Considerable_deterioration",2,FALSE +"34",34,"Control",0,0,"F","3_Poor","4_>=101F/38.3C","4_51+","no","1_sens_0-8","2_Considerable_deterioration",2,FALSE +"35",35,"Control",0,0,"M","3_Poor","4_>=101F/38.3C","4_51+","no","1_sens_0-8","2_Considerable_deterioration",2,FALSE +"36",36,"Control",0,0,"F","3_Poor","4_>=101F/38.3C","4_51+","no","1_sens_0-8","2_Considerable_deterioration",2,FALSE +"37",37,"Control",0,0,"M","3_Poor","4_>=101F/38.3C","4_51+","no","1_sens_0-8","1_Death",1,FALSE +"38",38,"Control",0,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","2_Considerable_deterioration",2,FALSE +"39",39,"Control",0,0,"F","3_Poor","4_>=101F/38.3C","4_51+","no","1_sens_0-8","1_Death",1,FALSE +"40",40,"Control",0,0,"M","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","2_Considerable_deterioration",2,FALSE +"41",41,"Control",0,0,"F","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","no","1_sens_0-8","1_Death",1,FALSE +"42",42,"Control",0,0,"M","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","1_sens_0-8","1_Death",1,FALSE +"43",43,"Control",0,0,"F","3_Poor","3_100-100.9F/37.8-38.2C",NA,"yes","1_sens_0-8","1_Death",1,FALSE +"44",44,"Control",0,0,"M","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","1_sens_0-8","1_Death",1,FALSE +"45",45,"Control",0,0,"F","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","1_sens_0-8","1_Death",1,FALSE +"46",46,"Control",0,0,"F","3_Poor","2_99-99.9F/37.3-37.7C","4_51+","yes","1_sens_0-8","1_Death",1,FALSE +"47",47,"Control",0,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","1_Death",1,FALSE +"48",48,"Control",0,0,"M","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","1_Death",1,FALSE +"49",49,"Control",0,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","1_Death",1,FALSE +"50",50,"Control",0,0,"M","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","1_Death",1,FALSE +"51",51,"Control",0,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","1_Death",1,FALSE +"52",52,"Control",0,0,"M","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","1_Death",1,FALSE +"53",53,"Streptomycin",2,0,"M","1_Good","4_>=101F/38.3C","2_11-20","no","1_sens_0-8","6_Considerable_improvement",6,TRUE +"54",55,"Streptomycin",2,0,"F","1_Good","1_<=98.9F/37.2C","2_11-20","no","1_sens_0-8","6_Considerable_improvement",6,TRUE 
+"55",56,"Streptomycin",2,0,"M","1_Good","1_<=98.9F/37.2C","3_21-50","no","1_sens_0-8","6_Considerable_improvement",6,TRUE +"56",57,"Streptomycin",2,0,"F","1_Good","1_<=98.9F/37.2C","3_21-50","no","1_sens_0-8","6_Considerable_improvement",6,TRUE +"57",58,"Streptomycin",2,0,"M","1_Good","2_99-99.9F/37.3-37.7C","3_21-50","no","1_sens_0-8","6_Considerable_improvement",6,TRUE +"58",59,"Streptomycin",2,0,"F","1_Good","2_99-99.9F/37.3-37.7C","3_21-50","no","1_sens_0-8","6_Considerable_improvement",6,TRUE +"59",60,"Streptomycin",2,0,"M","1_Good","2_99-99.9F/37.3-37.7C","3_21-50","no","1_sens_0-8","6_Considerable_improvement",6,TRUE +"60",67,"Streptomycin",2,0,"F","1_Good","2_99-99.9F/37.3-37.7C","3_21-50","no","1_sens_0-8","6_Considerable_improvement",6,TRUE +"61",74,"Streptomycin",2,0,"M","2_Fair","3_100-100.9F/37.8-38.2C","3_21-50","no","1_sens_0-8","6_Considerable_improvement",6,TRUE +"62",54,"Streptomycin",2,0,"F","2_Fair","2_99-99.9F/37.3-37.7C","2_11-20","no","2_mod_8-99","6_Considerable_improvement",6,TRUE +"63",61,"Streptomycin",2,0,"F","2_Fair","2_99-99.9F/37.3-37.7C","3_21-50","no","2_mod_8-99","6_Considerable_improvement",6,TRUE +"64",68,"Streptomycin",2,0,"M","2_Fair","4_>=101F/38.3C","3_21-50","no","2_mod_8-99","6_Considerable_improvement",6,TRUE +"65",75,"Streptomycin",2,0,"F","2_Fair","3_100-100.9F/37.8-38.2C","4_51+","no","2_mod_8-99","5_Moderate_improvement",5,TRUE +"66",62,"Streptomycin",2,0,"M","2_Fair","2_99-99.9F/37.3-37.7C","3_21-50","no","3_resist_100+","2_Considerable_deterioration",2,FALSE +"67",63,"Streptomycin",2,0,"F","2_Fair","4_>=101F/38.3C","3_21-50","no","3_resist_100+","2_Considerable_deterioration",2,FALSE +"68",64,"Streptomycin",2,0,"M","2_Fair","2_99-99.9F/37.3-37.7C/37.3-37.7C","3_21-50","no","3_resist_100+","6_Considerable_improvement",6,TRUE +"69",65,"Streptomycin",2,0,"F","2_Fair","2_99-99.9F/37.3-37.7C","3_21-50","no","3_resist_100+","5_Moderate_improvement",5,TRUE +"70",66,"Streptomycin",2,0,"M","2_Fair","2_99-99.9F/37.3-37.7C","3_21-50","no","3_resist_100+","6_Considerable_improvement",6,TRUE +"71",69,"Streptomycin",2,0,"F","2_Fair","4_>=101F/38.3C","3_21-50","no","3_resist_100+","3_Moderate_deterioration",3,FALSE +"72",70,"Streptomycin",2,0,"M","2_Fair","4_>=101F/38.3C","3_21-50","no","3_resist_100+","6_Considerable_improvement",6,TRUE +"73",71,"Streptomycin",2,0,"F","2_Fair","4_>=101F/38.3C","4_51+","no","3_resist_100+","6_Considerable_improvement",6,TRUE +"74",72,"Streptomycin",2,0,"M","2_Fair","4_>=101F/38.3C","4_51+","no","3_resist_100+","5_Moderate_improvement",5,TRUE +"75",73,"Streptomycin",2,0,"F","2_Fair","3_100-100.9F/37.8-38.2C","4_51+","no","3_resist_100+","6_Considerable_improvement",6,TRUE +"76",81,"Streptomycin",2,0,"F","2_Fair","3_100-100.9F/37.8-38.2C","4_51+","yes","1_sens_0-8","6_Considerable_improvement",6,TRUE +"77",88,"Streptomycin",2,0,"F","2_Fair","3_100-100.9F/37.8-38.2C","4_51+","yes","1_sens_0-8","5_Moderate_improvement",5,TRUE +"78",82,"Streptomycin",2,0,"F","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","2_mod_8-99","6_Considerable_improvement",6,TRUE +"79",89,"Streptomycin",2,0,"M","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","2_mod_8-99","5_Moderate_improvement",5,TRUE +"80",76,"Streptomycin",2,0,"M","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","3_resist_100+","6_Considerable_improvement",6,TRUE +"81",77,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","1_Death",1,FALSE 
+"82",78,"Streptomycin",2,0,"M","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","3_resist_100+","5_Moderate_improvement",5,TRUE +"83",79,"Streptomycin",2,0,"F","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","3_resist_100+","6_Considerable_improvement",6,TRUE +"84",80,"Streptomycin",2,0,"M","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","3_resist_100+","3_Moderate_deterioration",3,FALSE +"85",83,"Streptomycin",2,0,"M","3_Poor","3_100-100.9F/37.8-38.2C/37.8-38.2C","4_51+","yes","3_resist_100+","6_Considerable_improvement",6,TRUE +"86",84,"Streptomycin",2,0,"F","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","3_resist_100+","6_Considerable_improvement",6,TRUE +"87",85,"Streptomycin",2,0,"M","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","3_resist_100+","6_Considerable_improvement",6,TRUE +"88",86,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","1_Death",1,FALSE +"89",87,"Streptomycin",2,0,"M","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","4_No_change",4,FALSE +"90",90,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","3_Moderate_deterioration",3,FALSE +"91",91,"Streptomycin",2,0,"F","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","3_resist_100+","3_Moderate_deterioration",3,FALSE +"92",95,"Streptomycin",2,0,"F","3_Poor","2_99-99.9F/37.3-37.7C","4_51+","yes","1_sens_0-8","2_Considerable_deterioration",2,FALSE +"93",102,"Streptomycin",2,0,"M","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","6_Considerable_improvement",6,TRUE +"94",96,"Streptomycin",2,0,"M","3_Poor","4_>=101F/38.3C","4_51+","yes","2_mod_8-99","6_Considerable_improvement",6,TRUE +"95",103,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","2_mod_8-99","5_Moderate_improvement",5,TRUE +"96",92,"Streptomycin",2,0,"M","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","6_Considerable_improvement",6,TRUE +"97",93,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","5_Moderate_improvement",5,TRUE +"98",94,"Streptomycin",2,0,"M","3_Poor","2_99-99.9F/37.3-37.7C","4_51+","yes","3_resist_100+","2_Considerable_deterioration",2,FALSE +"99",97,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","5_Moderate_improvement",5,TRUE +"100",98,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","3_Moderate_deterioration",3,FALSE +"101",99,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","1_Death",1,FALSE +"102",100,"Streptomycin",2,0,"M","3_Poor","2_99-99.9F/37.3-37.7C","4_51+","yes","3_resist_100+","4_No_change",4,FALSE +"103",101,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","2_Considerable_deterioration",2,FALSE +"104",104,"Streptomycin",2,0,"M","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","5_Moderate_improvement",5,TRUE +"105",105,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","2_Considerable_deterioration",2,FALSE +"106",106,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","1_Death",1,FALSE +"107",107,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","6_Considerable_improvement",6,TRUE From 0fa3e71092225cdce84a5b1125bc09495f47a28d Mon Sep 17 00:00:00 2001 From: Schwartz Date: Thu, 21 Mar 2024 09:41:34 -0400 Subject: [PATCH 05/10] Added real-world example to linear regression --- .../python_linear_regression.md | 90 +++++++++++++++++++ 1 file changed, 90 insertions(+) diff --git 
a/python_linear_regression/python_linear_regression.md b/python_linear_regression/python_linear_regression.md index de946ad20..7730be398 100644 --- a/python_linear_regression/python_linear_regression.md +++ b/python_linear_regression/python_linear_regression.md @@ -299,6 +299,96 @@ This question is designed to test the test-taker's understanding of the concept *** + + + +### Real World Code Example + +The Streptomycin for Tuberculosis dataset originates from a groundbreaking clinical trial published in 1948, often recognized as the first modern randomized clinical trial. It comprises data from a prospective, randomized, placebo-controlled study investigating the efficacy of streptomycin treatment for pulmonary tuberculosis. The dataset includes variables such as participant ID, study arm (Streptomycin or Control), doses of Streptomycin and Para-Amino-Salicylate in grams, gender, baseline conditions (categorized as good, fair, or poor), oral temperature at baseline, erythrocyte sedimentation rate at baseline, presence of lung cavitation on chest X-ray at baseline, streptomycin resistance at 6 months, radiologic outcomes at 6 months, numeric rating of chest X-ray at month 6, and a dichotomous outcome indicating improvement. These variables provide comprehensive information for analyzing the effectiveness of streptomycin treatment for tuberculosis, allowing for various statistical analyses such as logistic regression modeling. + + + +1. Install Packages: +```python @Pyodide.exec + +import pandas as pd +import io +from pyodide.http import open_url +from sklearn.preprocessing import LabelEncoder +from sklearn.linear_model import LinearRegression +from sklearn.metrics import r2_score, mean_squared_error + + + + + +``` + +2. Load the data: +```python +# Load dataset and read to pandas dataframe +url = "https://raw.githubusercontent.com/arcus/education_modules/linear_regression/python_linear_regression/data/strep_tb.csv" + +url_contents = open_url(url) +text = url_contents.read() +file = io.StringIO(text) +df = pd.read_csv(file) + +# Analyze data and features +df.info() + +# Encode Categorical Features +categorical_cols = [ + 'arm', 'gender', 'baseline_condition', 'baseline_temp', + 'baseline_esr', 'baseline_cavitation', 'strep_resistance', 'radiologic_6m' +] + # Create a LabelEncoder for transforming columns +le = LabelEncoder() + +# Apply label encoding to each categorical column +for col in categorical_cols: + df[col] = le.fit_transform(df[col]) +``` +@Pyodide.eval + + +3. Compute Regression: +```python +# Feature Selection and Target Definition +features = [ + 'patient_id', 'arm', 'dose_strep_g', 'dose_PAS_g', 'gender', + 'baseline_condition', 'baseline_temp', 'baseline_esr', + 'baseline_cavitation', 'strep_resistance', 'radiologic_6m', 'rad_num' +] +target = 'improved' + +# Separate inputs (features) and output (target variable) +inputs = df[features] +output = df[target] + +# SECTION 4: Linear Regression Modeling +model = LinearRegression() # Create the regression model +model.fit(inputs, output) # Train the model +``` +@Pyodide.eval + + +4. Evaluate Model: +```python +# Predict data +predictions = model.predict(inputs) + +# Analyze predictions +print('R-squared:', r2_score(output, predictions)) +print('Mean Squared Error:', mean_squared_error(output, predictions)) +``` +@Pyodide.eval + +The results obtained from the Streptomycin for Tuberculosis dataset reveal promising findings regarding the efficacy of streptomycin treatment for pulmonary tuberculosis. 
An R-squared value of roughly 0.83 indicates that the linear regression model explains about 83% of the variance in the outcome variable, and the low mean squared error (about 0.04) suggests the predictions track the observed outcomes closely. These figures should be read with caution, however: the feature set includes `patient_id` (an arbitrary identifier) as well as `rad_num` and `radiologic_6m`, the 6-month radiologic ratings that correspond directly to the `improved` outcome (every rating of 5 or higher is coded as improved), so much of the apparent fit comes from the outcome leaking into the predictors. More fundamentally, `improved` is a dichotomous variable, so linear regression serves only as an initial step in analyzing this dataset; a classification technique such as logistic regression modeling is the more appropriate tool for fully assessing the effectiveness of streptomycin treatment for tuberculosis.




## Conclusion

At the end of the lesson, students should have a good understanding of the concept of linear regression and how to implement the linear regression algorithm in Python. They should also be able to apply linear regression to real-world datasets to make predictions and insights.

From 013095896abe5903f8e50cff9bb77752e9379598 Mon Sep 17 00:00:00 2001
From: Schwartz
Date: Thu, 21 Mar 2024 11:15:48 -0400
Subject: [PATCH 06/10] Changed real world data to continuous diabetes example

---
 python_linear_regression/data/strep_tb.csv | 108 ------------------
 .../python_linear_regression.md | 67 ++++-------
 2 files changed, 21 insertions(+), 154 deletions(-)
 delete mode 100644 python_linear_regression/data/strep_tb.csv

diff --git a/python_linear_regression/data/strep_tb.csv b/python_linear_regression/data/strep_tb.csv
deleted file mode 100644
index 143d7b913..000000000
--- a/python_linear_regression/data/strep_tb.csv
+++ /dev/null
@@ -1,108 +0,0 @@
-"","patient_id","arm","dose_strep_g","dose_PAS_g","gender","baseline_condition","baseline_temp","baseline_esr","baseline_cavitation","strep_resistance","radiologic_6m","rad_num","improved"
-"1",1,"Control",0,0,"M","1_Good","1_<=98.9F/37.2C","2_11-20","yes","1_sens_0-8","6_Considerable_improvement",6,TRUE
-"2",2,"Control",0,0,"F","1_Good","3_100-100.9F/37.8-38.2C","2_11-20","no","1_sens_0-8","5_Moderate_improvement",5,TRUE
-"3",3,"Control",0,0,"F","1_Good","1_<=98.9F/37.2C","3_21-50","no","1_sens_0-8","5_Moderate_improvement",5,TRUE
-"4",4,"Control",0,0,"M","1_Good","1_<=98.9F/37.2C","3_21-50","no","1_sens_0-8","5_Moderate_improvement",5,TRUE
-"5",5,"Control",0,0,"F","1_Good","2_99-99.9F/37.3-37.7C","3_21-50","no","1_sens_0-8","5_Moderate_improvement",5,TRUE
-"6",6,"Control",0,0,"M","1_Good","3_100-100.9F/37.8-38.2C","3_21-50","no","1_sens_0-8","6_Considerable_improvement",6,TRUE
-"7",7,"Control",0,0,"F","1_Good","2_99-99.9F/37.3-37.7C","3_21-50","yes","1_sens_0-8","5_Moderate_improvement",5,TRUE
-"8",8,"Control",0,0,"M","1_Good","2_99-99.9F/37.3-37.7C","3_21-50","yes","1_sens_0-8","5_Moderate_improvement",5,TRUE
-"9",9,"Control",0,0,"F","2_Fair","2_99-99.9F/37.3-37.7C","3_21-50","yes","1_sens_0-8","5_Moderate_improvement",5,TRUE
-"10",10,"Control",0,0,"M","2_Fair","4_>=101F/38.3C","3_21-50","yes","1_sens_0-8","5_Moderate_improvement",5,TRUE
-"11",11,"Control",0,0,"F","2_Fair","3_100-100.9F/37.8-38.2C","3_21-50","no","1_sens_0-8","6_Considerable_improvement",6,TRUE
-"12",12,"Control",0,0,"M","2_Fair","2_99-99.9F/37.3-37.7C","3_21-50","yes","1_sens_0-8","5_Moderate_improvement",5,TRUE
-"13",13,"Control",0,0,"F","2_Fair","2_99-99.9F/37.3-37.7C","3_21-50","yes","1_sens_0-8","5_Moderate_improvement",5,TRUE -"14",14,"Control",0,0,"M","2_Fair","4_>=101F/38.3C","3_21-50","yes","1_sens_0-8","5_Moderate_improvement",5,TRUE -"15",15,"Control",0,0,"F","2_Fair","2_99-99.9F/37.3-37.7C","3_21-50","yes","1_sens_0-8","5_Moderate_improvement",5,TRUE -"16",16,"Control",0,0,"M","2_Fair","3_100-100.9F/37.8-38.2C","3_21-50","no","1_sens_0-8","6_Considerable_improvement",6,TRUE -"17",17,"Control",0,0,"F","2_Fair","3_100-100.9F/37.8-38.2C","3_21-50","no","1_sens_0-8","5_Moderate_improvement",5,TRUE -"18",18,"Control",0,0,"M","2_Fair","1_<=98.9F/37.2C","3_21-50","no","1_sens_0-8","4_No_change",4,FALSE -"19",19,"Control",0,0,"F","2_Fair","3_100-100.9F/37.8-38.2C","3_21-50","no","1_sens_0-8","3_Moderate_deterioration",3,FALSE -"20",20,"Control",0,0,"M","2_Fair","3_100-100.9F/37.8-38.2C","3_21-50","no","1_sens_0-8","3_Moderate_deterioration",3,FALSE -"21",21,"Control",0,0,"F","2_Fair","3_100-100.9F/37.8-38.2C","3_21-50","no","1_sens_0-8","3_Moderate_deterioration",3,FALSE -"22",22,"Control",0,0,"M","2_Fair","3_100-100.9F/37.8-38.2C","3_21-50","no","1_sens_0-8","3_Moderate_deterioration",3,FALSE -"23",23,"Control",0,0,"F","2_Fair","2_99-99.9F/37.3-37.7C","4_51+","yes","1_sens_0-8","4_No_change",4,FALSE -"24",24,"Control",0,0,"M","2_Fair","2_99-99.9F/37.3-37.7C","4_51+","no","1_sens_0-8","3_Moderate_deterioration",3,FALSE -"25",25,"Control",0,0,"F","2_Fair","2_99-99.9F/37.3-37.7C","4_51+","no","1_sens_0-8","3_Moderate_deterioration",3,FALSE -"26",26,"Control",0,0,"M","2_Fair","3_100-100.9F/37.8-38.2C","4_51+","yes","1_sens_0-8","3_Moderate_deterioration",3,FALSE -"27",27,"Control",0,0,"F","2_Fair","3_100-100.9F/37.8-38.2C","4_51+","yes","1_sens_0-8","3_Moderate_deterioration",3,FALSE -"28",28,"Control",0,0,"M","2_Fair","3_100-100.9F/37.8-38.2C","4_51+","yes","1_sens_0-8","3_Moderate_deterioration",3,FALSE -"29",29,"Control",0,0,"F","3_Poor","2_99-99.9F/37.3-37.7C","4_51+","yes","1_sens_0-8","4_No_change",4,FALSE -"30",30,"Control",0,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","3_Moderate_deterioration",3,FALSE -"31",31,"Control",0,0,"M","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","3_Moderate_deterioration",3,FALSE -"32",32,"Control",0,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","3_Moderate_deterioration",3,FALSE -"33",33,"Control",0,0,"M","3_Poor","4_>=101F/38.3C","4_51+","no","1_sens_0-8","2_Considerable_deterioration",2,FALSE -"34",34,"Control",0,0,"F","3_Poor","4_>=101F/38.3C","4_51+","no","1_sens_0-8","2_Considerable_deterioration",2,FALSE -"35",35,"Control",0,0,"M","3_Poor","4_>=101F/38.3C","4_51+","no","1_sens_0-8","2_Considerable_deterioration",2,FALSE -"36",36,"Control",0,0,"F","3_Poor","4_>=101F/38.3C","4_51+","no","1_sens_0-8","2_Considerable_deterioration",2,FALSE -"37",37,"Control",0,0,"M","3_Poor","4_>=101F/38.3C","4_51+","no","1_sens_0-8","1_Death",1,FALSE -"38",38,"Control",0,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","2_Considerable_deterioration",2,FALSE -"39",39,"Control",0,0,"F","3_Poor","4_>=101F/38.3C","4_51+","no","1_sens_0-8","1_Death",1,FALSE -"40",40,"Control",0,0,"M","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","2_Considerable_deterioration",2,FALSE -"41",41,"Control",0,0,"F","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","no","1_sens_0-8","1_Death",1,FALSE -"42",42,"Control",0,0,"M","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","1_sens_0-8","1_Death",1,FALSE 
-"43",43,"Control",0,0,"F","3_Poor","3_100-100.9F/37.8-38.2C",NA,"yes","1_sens_0-8","1_Death",1,FALSE -"44",44,"Control",0,0,"M","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","1_sens_0-8","1_Death",1,FALSE -"45",45,"Control",0,0,"F","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","1_sens_0-8","1_Death",1,FALSE -"46",46,"Control",0,0,"F","3_Poor","2_99-99.9F/37.3-37.7C","4_51+","yes","1_sens_0-8","1_Death",1,FALSE -"47",47,"Control",0,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","1_Death",1,FALSE -"48",48,"Control",0,0,"M","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","1_Death",1,FALSE -"49",49,"Control",0,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","1_Death",1,FALSE -"50",50,"Control",0,0,"M","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","1_Death",1,FALSE -"51",51,"Control",0,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","1_Death",1,FALSE -"52",52,"Control",0,0,"M","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","1_Death",1,FALSE -"53",53,"Streptomycin",2,0,"M","1_Good","4_>=101F/38.3C","2_11-20","no","1_sens_0-8","6_Considerable_improvement",6,TRUE -"54",55,"Streptomycin",2,0,"F","1_Good","1_<=98.9F/37.2C","2_11-20","no","1_sens_0-8","6_Considerable_improvement",6,TRUE -"55",56,"Streptomycin",2,0,"M","1_Good","1_<=98.9F/37.2C","3_21-50","no","1_sens_0-8","6_Considerable_improvement",6,TRUE -"56",57,"Streptomycin",2,0,"F","1_Good","1_<=98.9F/37.2C","3_21-50","no","1_sens_0-8","6_Considerable_improvement",6,TRUE -"57",58,"Streptomycin",2,0,"M","1_Good","2_99-99.9F/37.3-37.7C","3_21-50","no","1_sens_0-8","6_Considerable_improvement",6,TRUE -"58",59,"Streptomycin",2,0,"F","1_Good","2_99-99.9F/37.3-37.7C","3_21-50","no","1_sens_0-8","6_Considerable_improvement",6,TRUE -"59",60,"Streptomycin",2,0,"M","1_Good","2_99-99.9F/37.3-37.7C","3_21-50","no","1_sens_0-8","6_Considerable_improvement",6,TRUE -"60",67,"Streptomycin",2,0,"F","1_Good","2_99-99.9F/37.3-37.7C","3_21-50","no","1_sens_0-8","6_Considerable_improvement",6,TRUE -"61",74,"Streptomycin",2,0,"M","2_Fair","3_100-100.9F/37.8-38.2C","3_21-50","no","1_sens_0-8","6_Considerable_improvement",6,TRUE -"62",54,"Streptomycin",2,0,"F","2_Fair","2_99-99.9F/37.3-37.7C","2_11-20","no","2_mod_8-99","6_Considerable_improvement",6,TRUE -"63",61,"Streptomycin",2,0,"F","2_Fair","2_99-99.9F/37.3-37.7C","3_21-50","no","2_mod_8-99","6_Considerable_improvement",6,TRUE -"64",68,"Streptomycin",2,0,"M","2_Fair","4_>=101F/38.3C","3_21-50","no","2_mod_8-99","6_Considerable_improvement",6,TRUE -"65",75,"Streptomycin",2,0,"F","2_Fair","3_100-100.9F/37.8-38.2C","4_51+","no","2_mod_8-99","5_Moderate_improvement",5,TRUE -"66",62,"Streptomycin",2,0,"M","2_Fair","2_99-99.9F/37.3-37.7C","3_21-50","no","3_resist_100+","2_Considerable_deterioration",2,FALSE -"67",63,"Streptomycin",2,0,"F","2_Fair","4_>=101F/38.3C","3_21-50","no","3_resist_100+","2_Considerable_deterioration",2,FALSE -"68",64,"Streptomycin",2,0,"M","2_Fair","2_99-99.9F/37.3-37.7C/37.3-37.7C","3_21-50","no","3_resist_100+","6_Considerable_improvement",6,TRUE -"69",65,"Streptomycin",2,0,"F","2_Fair","2_99-99.9F/37.3-37.7C","3_21-50","no","3_resist_100+","5_Moderate_improvement",5,TRUE -"70",66,"Streptomycin",2,0,"M","2_Fair","2_99-99.9F/37.3-37.7C","3_21-50","no","3_resist_100+","6_Considerable_improvement",6,TRUE -"71",69,"Streptomycin",2,0,"F","2_Fair","4_>=101F/38.3C","3_21-50","no","3_resist_100+","3_Moderate_deterioration",3,FALSE 
-"72",70,"Streptomycin",2,0,"M","2_Fair","4_>=101F/38.3C","3_21-50","no","3_resist_100+","6_Considerable_improvement",6,TRUE -"73",71,"Streptomycin",2,0,"F","2_Fair","4_>=101F/38.3C","4_51+","no","3_resist_100+","6_Considerable_improvement",6,TRUE -"74",72,"Streptomycin",2,0,"M","2_Fair","4_>=101F/38.3C","4_51+","no","3_resist_100+","5_Moderate_improvement",5,TRUE -"75",73,"Streptomycin",2,0,"F","2_Fair","3_100-100.9F/37.8-38.2C","4_51+","no","3_resist_100+","6_Considerable_improvement",6,TRUE -"76",81,"Streptomycin",2,0,"F","2_Fair","3_100-100.9F/37.8-38.2C","4_51+","yes","1_sens_0-8","6_Considerable_improvement",6,TRUE -"77",88,"Streptomycin",2,0,"F","2_Fair","3_100-100.9F/37.8-38.2C","4_51+","yes","1_sens_0-8","5_Moderate_improvement",5,TRUE -"78",82,"Streptomycin",2,0,"F","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","2_mod_8-99","6_Considerable_improvement",6,TRUE -"79",89,"Streptomycin",2,0,"M","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","2_mod_8-99","5_Moderate_improvement",5,TRUE -"80",76,"Streptomycin",2,0,"M","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","3_resist_100+","6_Considerable_improvement",6,TRUE -"81",77,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","1_Death",1,FALSE -"82",78,"Streptomycin",2,0,"M","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","3_resist_100+","5_Moderate_improvement",5,TRUE -"83",79,"Streptomycin",2,0,"F","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","3_resist_100+","6_Considerable_improvement",6,TRUE -"84",80,"Streptomycin",2,0,"M","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","3_resist_100+","3_Moderate_deterioration",3,FALSE -"85",83,"Streptomycin",2,0,"M","3_Poor","3_100-100.9F/37.8-38.2C/37.8-38.2C","4_51+","yes","3_resist_100+","6_Considerable_improvement",6,TRUE -"86",84,"Streptomycin",2,0,"F","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","3_resist_100+","6_Considerable_improvement",6,TRUE -"87",85,"Streptomycin",2,0,"M","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","3_resist_100+","6_Considerable_improvement",6,TRUE -"88",86,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","1_Death",1,FALSE -"89",87,"Streptomycin",2,0,"M","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","4_No_change",4,FALSE -"90",90,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","3_Moderate_deterioration",3,FALSE -"91",91,"Streptomycin",2,0,"F","3_Poor","3_100-100.9F/37.8-38.2C","4_51+","yes","3_resist_100+","3_Moderate_deterioration",3,FALSE -"92",95,"Streptomycin",2,0,"F","3_Poor","2_99-99.9F/37.3-37.7C","4_51+","yes","1_sens_0-8","2_Considerable_deterioration",2,FALSE -"93",102,"Streptomycin",2,0,"M","3_Poor","4_>=101F/38.3C","4_51+","yes","1_sens_0-8","6_Considerable_improvement",6,TRUE -"94",96,"Streptomycin",2,0,"M","3_Poor","4_>=101F/38.3C","4_51+","yes","2_mod_8-99","6_Considerable_improvement",6,TRUE -"95",103,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","2_mod_8-99","5_Moderate_improvement",5,TRUE -"96",92,"Streptomycin",2,0,"M","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","6_Considerable_improvement",6,TRUE -"97",93,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","5_Moderate_improvement",5,TRUE -"98",94,"Streptomycin",2,0,"M","3_Poor","2_99-99.9F/37.3-37.7C","4_51+","yes","3_resist_100+","2_Considerable_deterioration",2,FALSE -"99",97,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","5_Moderate_improvement",5,TRUE 
-"100",98,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","3_Moderate_deterioration",3,FALSE -"101",99,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","1_Death",1,FALSE -"102",100,"Streptomycin",2,0,"M","3_Poor","2_99-99.9F/37.3-37.7C","4_51+","yes","3_resist_100+","4_No_change",4,FALSE -"103",101,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","2_Considerable_deterioration",2,FALSE -"104",104,"Streptomycin",2,0,"M","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","5_Moderate_improvement",5,TRUE -"105",105,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","2_Considerable_deterioration",2,FALSE -"106",106,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","1_Death",1,FALSE -"107",107,"Streptomycin",2,0,"F","3_Poor","4_>=101F/38.3C","4_51+","yes","3_resist_100+","6_Considerable_improvement",6,TRUE diff --git a/python_linear_regression/python_linear_regression.md b/python_linear_regression/python_linear_regression.md index 7730be398..d5f3c28b4 100644 --- a/python_linear_regression/python_linear_regression.md +++ b/python_linear_regression/python_linear_regression.md @@ -304,7 +304,7 @@ This question is designed to test the test-taker's understanding of the concept ### Real World Code Example -The Streptomycin for Tuberculosis dataset originates from a groundbreaking clinical trial published in 1948, often recognized as the first modern randomized clinical trial. It comprises data from a prospective, randomized, placebo-controlled study investigating the efficacy of streptomycin treatment for pulmonary tuberculosis. The dataset includes variables such as participant ID, study arm (Streptomycin or Control), doses of Streptomycin and Para-Amino-Salicylate in grams, gender, baseline conditions (categorized as good, fair, or poor), oral temperature at baseline, erythrocyte sedimentation rate at baseline, presence of lung cavitation on chest X-ray at baseline, streptomycin resistance at 6 months, radiologic outcomes at 6 months, numeric rating of chest X-ray at month 6, and a dichotomous outcome indicating improvement. These variables provide comprehensive information for analyzing the effectiveness of streptomycin treatment for tuberculosis, allowing for various statistical analyses such as logistic regression modeling. +The dataset comprises information on 442 diabetes patients, including their age, sex, body mass index (BMI), average blood pressure, and six blood serum measurements. Each patient's data includes ten baseline variables, with the first ten columns representing numeric predictive values. The eleventh column contains a quantitative measure of disease progression one year after baseline, serving as the target variable. Attributes include age in years, sex, BMI, average blood pressure, and measurements such as total serum cholesterol, low-density lipoproteins, high-density lipoproteins, total cholesterol/HDL ratio, possibly log of serum triglycerides level, and blood sugar level. Notably, each feature variable has been mean-centered and scaled by the standard deviation times the square root of the number of samples. This dataset is commonly used for predictive modeling and statistical analysis in the field of diabetes research. 
For more details, reference can be made to the original paper by Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani titled "Least Angle Regression," published in the Annals of Statistics in 2004. @@ -312,63 +312,34 @@ The Streptomycin for Tuberculosis dataset originates from a groundbreaking clini ```python @Pyodide.exec import pandas as pd -import io -from pyodide.http import open_url -from sklearn.preprocessing import LabelEncoder +import numpy as np +import matplotlib.pyplot as plt +from sklearn import datasets from sklearn.linear_model import LinearRegression -from sklearn.metrics import r2_score, mean_squared_error - - - - - +from sklearn.metrics import mean_squared_error, r2_score ``` 2. Load the data: ```python # Load dataset and read to pandas dataframe -url = "https://raw.githubusercontent.com/arcus/education_modules/linear_regression/python_linear_regression/data/strep_tb.csv" +diabetes = datasets.load_diabetes() -url_contents = open_url(url) -text = url_contents.read() -file = io.StringIO(text) -df = pd.read_csv(file) # Analyze data and features -df.info() - -# Encode Categorical Features -categorical_cols = [ - 'arm', 'gender', 'baseline_condition', 'baseline_temp', - 'baseline_esr', 'baseline_cavitation', 'strep_resistance', 'radiologic_6m' -] - # Create a LabelEncoder for transforming columns -le = LabelEncoder() - -# Apply label encoding to each categorical column -for col in categorical_cols: - df[col] = le.fit_transform(df[col]) +print(diabetes) +print(diabetes.DESCR) + +# Now we will split the data into the independent and independent variable +X = diabetes.data +Y = diabetes.target ``` @Pyodide.eval 3. Compute Regression: ```python -# Feature Selection and Target Definition -features = [ - 'patient_id', 'arm', 'dose_strep_g', 'dose_PAS_g', 'gender', - 'baseline_condition', 'baseline_temp', 'baseline_esr', - 'baseline_cavitation', 'strep_resistance', 'radiologic_6m', 'rad_num' -] -target = 'improved' - -# Separate inputs (features) and output (target variable) -inputs = df[features] -output = df[target] - -# SECTION 4: Linear Regression Modeling model = LinearRegression() # Create the regression model -model.fit(inputs, output) # Train the model +model.fit(X, Y) # Train the model ``` @Pyodide.eval @@ -376,15 +347,19 @@ model.fit(inputs, output) # Train the model 4. Evaluate Model: ```python # Predict data -predictions = model.predict(inputs) +predictions = model.predict(X) + +# Check equation +print('Coefficient', model.coef_) +print('Intercept', model.intercept_) # Analyze predictions -print('R-squared:', r2_score(output, predictions)) -print('Mean Squared Error:', mean_squared_error(output, predictions)) +print('R-squared:', r2_score(Y, predictions)) +print('Mean Squared Error:', mean_squared_error(Y, predictions)) ``` @Pyodide.eval -The results obtained from the Streptomycin for Tuberculosis dataset reveal promising findings regarding the efficacy of streptomycin treatment for pulmonary tuberculosis. Originating from a groundbreaking clinical trial in 1948, widely acknowledged as the first modern randomized clinical trial, this dataset offers a rich array of variables encompassing participant demographics, treatment dosages, baseline conditions, and clinical outcomes. The R-squared value of 0.834306790075451 indicates that the linear regression model explains approximately 83.4% of the variance in the outcome variable, showcasing a strong fit of the model to the data. 
Additionally, the low Mean Squared Error of 0.04139073983616123 suggests that the model's predictions are relatively accurate. However, it's essential to recognize that linear regression serves as an initial step in analyzing this dataset. While these results provide valuable insights, further analyses employing advanced statistical techniques such as logistic regression modeling are warranted to fully comprehend the effectiveness of streptomycin treatment for tuberculosis management. +While linear regression provides valuable insights into the relationship between the predictor variables and the target variable, it represents just the initial step in data analysis, particularly in the context of this diabetes dataset. The R-squared value of 0.518 indicates that approximately 51.8% of the variance in the response variable (disease progression) can be explained by the linear relationship with the predictor variables. Additionally, the mean squared error of 2859.70 suggests that the model's predictions deviate from the actual values by this amount, on average. However, it's essential to recognize that linear regression assumes a linear relationship between the predictors and the response, which may not always hold true. Further analysis is warranted to explore potential nonlinear relationships, assess the model's assumptions and limitations, evaluate the significance of each predictor variable, and possibly employ more sophisticated techniques such as feature selection, regularization, or non-linear regression methods to improve predictive accuracy and better understand the underlying patterns in the data. Additionally, validation techniques such as cross-validation should be employed to assess the model's generalizability and robustness. Therefore, while linear regression provides a foundational understanding, it is crucial to conduct comprehensive analyses to ensure robust and accurate modeling in the context of diabetes progression prediction. From a8bc950e5b24cdc4cb0b1ba24ea3410ae8da68c9 Mon Sep 17 00:00:00 2001 From: Schwartz Date: Thu, 18 Apr 2024 11:43:21 -0400 Subject: [PATCH 07/10] Updated module given Rose's comments in PR --- .../python_linear_regression.md | 173 ++++++++++++------ 1 file changed, 121 insertions(+), 52 deletions(-) diff --git a/python_linear_regression/python_linear_regression.md b/python_linear_regression/python_linear_regression.md index d5f3c28b4..86c24a444 100644 --- a/python_linear_regression/python_linear_regression.md +++ b/python_linear_regression/python_linear_regression.md @@ -71,6 +71,21 @@ import: https://raw.githubusercontent.com/LiaTemplates/Pyodide/master/README.md - In the case of linear regression, the target variable is a continuous variable. In a supervised learning problem, the machine learning algorithm is given a set of training data and asked to learn a function that can map the input variables to the output variable. The training data consists of pairs of input and output variables. The algorithm learns the function by finding the best fit line to the data. Once the algorithm has learned the function, it can be used to make predictions on new data. To make a prediction, the algorithm simply plugs the values of the input variables into the function. - Linear regression is a popular supervised learning algorithm because it is simple to implement and understand. It is also a versatile algorithm that can be used to solve a variety of problems. +
+A little encouragement...
+ +As in many fields, machine learning involves a lot of technical language, some of which is unclear, redundant, or downright confusing. +For example: + +**Outcome** variables are also called **response variables**, **dependent variables**, or **labels**. + +**Input** variables are also called **predictors**, **features**, **independent variables**, or even just **variables**. + +To make matters worse, sometimes the same words are used to mean different things in different subfields. +If you find yourself stumbling on vocabulary as you read about machine learning, know you're not alone! + +
+ Which of the following is NOT a characteristic of linear regression? @@ -82,28 +97,34 @@ Which of the following is NOT a characteristic of linear regression? ***
-This question is more difficult than the previous one because it requires the test-taker to have a deeper understanding of the characteristics of linear regression. The test-taker must be able to identify which of the answer choices is not a characteristic of linear regression, even though all of the other answer choices are valid characteristics. +This question presents a deeper challenge as it requires a solid understanding of linear regression's characteristics. To answer correctly, you need to identify the feature that doesn't align with linear regression. The incorrect option, "Linear regression can be used to predict categorical variables," deviates from the typical usage of linear regression, which is primarily for continuous variables. Understanding this distinction enhances your comprehension of linear regression's scope and limitations.
*** -### Applications of linear regression in machine learning -Linear Regression can be used for a variety of tasks, such as: +### Applications of linear regression in biomedical research +Linear regression finds extensive application in biomedical research, offering insights into various domains, such as: + +- **Disease prognosis:** Linear regression aids in predicting the progression of diseases based on patient demographics, biomarkers, and clinical data. For instance, it can forecast the advancement of cancer stages or the deterioration of chronic conditions like diabetes. + - A specific example of this in research can be found in ["A longitudinal study defined circulating microRNAs as reliable biomarkers for disease prognosis and progression in ALS human patients"](https://www.nature.com/articles/s41420-020-00397-6) In the realm of disease prognosis, longitudinal research has illuminated the potential of circulating microRNAs as dependable biomarkers for assessing disease progression and prognosis in ALS patients. By integrating patient demographics, biomarkers, and clinical data, linear regression models can be leveraged to forecast the trajectory of diseases, akin to predicting cancer stages or the progression of chronic ailments like diabetes. + +- **Treatment efficacy:** Linear regression assists in evaluating the effectiveness of medical treatments by analyzing patient response data. Researchers can utilize it to assess the impact of medications, therapies, or interventions on disease outcomes and patient well-being. + - ["Meta-analysis of the Age-Dependent Efficacy of Multiple Sclerosis Treatments"](https://www.frontiersin.org/journals/neurology/articles/10.3389/fneur.2017.00577/full) demonstrates a specific application of linear regression. This study uses linear regression to determine how the effectiveness of Multiple Sclerosis treatments changes as patients age. -- **Prediction:** Linear regression can be used to predict a wide range of continuous variables, such as house prices, stock prices, customer churn, and medical outcomes. -- **Recommendation:** Linear regression can be used to build recommender systems that recommend products, movies, or other items to users based on their past preferences. -- **Fraud detection:** Linear regression can be used to detect fraudulent transactions by identifying transactions that deviate from the expected behavior. -- **Medical diagnosis:** Linear regression can be used to help doctors diagnose diseases by identifying patterns in patient data. -- **Scientific research:** Linear regression can be used to identify relationships between variables in scientific data. -### Examples of linear regression in real-world applications -- **Predicting house prices:** Linear regression can be used to predict the price of a house based on its square footage, number of bedrooms, number of bathrooms, and other factors. -- **Predicting stock prices:** Linear regression can be used to predict the price of a stock based on its historical prices, financial data, and other factors. -- **Predicting customer churn:** Linear regression can be used to predict whether a customer is likely to churn based on their past purchase history, demographics, and other factors. -- **Predicting the risk of a customer defaulting on a loan:** Linear regression can be used to predict the risk of a customer defaulting on a loan based on their credit score, income, and other factors. 
-- **Predicting the likelihood of a patient having a particular disease:** Linear regression can be used to predict the likelihood of a patient having a particular disease based on their medical history, symptoms, and other factors. -- **Predicting the number of visitors to a website:** Linear regression can be used to predict the number of visitors to a website based on the website's past traffic data, marketing campaigns, and other factors. -- **Predicting the sales of a product:** Linear regression can be used to predict the sales of a product based on its price, marketing campaigns, and other factors. + +- **Genetic studies:** Linear regression plays a pivotal role in genetic research by exploring associations between genetic variants and phenotypic traits. It helps identify genetic markers linked to disease susceptibility, treatment response, and disease progression, contributing to personalized medicine approaches. + - The article ["Prediction of Gene Expression Patterns With Generalized Linear Regression Model"](https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2019.00120/full) describes a method using generalized linear regression to predict how gene expression levels change in response to the binding intensity of the Oct4 transcription factor. This model aids researchers in understanding the complex regulatory mechanisms behind cell reprogramming and development. + +- **Public health analysis:** Linear regression facilitates the analysis of population-level health trends, aiding in the identification of risk factors, disease clusters, and health disparities. It enables researchers to model the impact of interventions, policies, and socio-economic factors on public health outcomes. + - The article ["Regression Analysis for COVID-19 Infections and Deaths Based on Food Access and Health Issues"](https://www.mdpi.com/2227-9032/10/2/324) investigates the relationship between food access, pre-existing health conditions, and the severity of COVID-19 outcomes. Researchers used regression models to discover potential correlations that could inform future pandemic preparedness efforts. + +- **Epidemiological modeling:** Linear regression serves as a fundamental tool in epidemiology for modeling disease spread and understanding risk factors. It assists in forecasting disease outbreaks, estimating transmission rates, and evaluating interventions' effectiveness in controlling infectious diseases. + - The article ["SEIR and Regression Model based COVID-19 outbreak predictions in India"](https://arxiv.org/abs/2004.00958) utilizes a combination of SEIR modeling and regression analysis to forecast COVID-19 outbreaks in India, providing valuable insights into disease spread dynamics. This approach contributes to epidemiological modeling by showcasing how linear regression, alongside SEIR models, aids in predicting disease outbreaks, estimating transmission rates, and assessing the effectiveness of interventions, thereby informing proactive measures to control infectious diseases. + + + +By leveraging linear regression in these contexts, biomedical researchers can glean valuable insights into disease mechanisms, treatment strategies, and public health interventions, ultimately advancing healthcare practices and improving patient outcomes. ## Linear Regression Algorithm Linear regression works by fitting a linear equation to the data. @@ -123,6 +144,13 @@ where: The coefficients of the linear equation are estimated using the ordinary least squares (OLS) method. 
The OLS method minimizes the sum of the squared residuals, which are the differences between the predicted values and the actual values of the target variable. Once the linear regression model is trained, it can be used to make predictions on new data. To make a prediction, we simply plug the values of the predictor variables into the linear equation. +
+Learning connection
+ +To learn more about Linear Regression and for a visual explanation, watch [Linear Regression, Clearly Explained!!!](https://www.youtube.com/watch?v=nk2CQITm_eo). + +
+ Which of the following is NOT a component of the linear regression formula? @@ -139,7 +167,35 @@ The variance of the target variable is not a component of the linear regression *** +### Understanding Machine Learning Techniques + +Before diving into the example, it's valuable to understand some key concepts used in machine learning. These techniques help us build more accurate and reliable models for prediction. + +- **Splitting Data (Training and Testing):** Machine learning models 'learn' from data. We divide our dataset into two parts: + + - **Training set:** This part is used to train the model, allowing it to find patterns. + +- **Testing set:** This is held-out data used to evaluate how well our model performs on unseen examples. This prevents overfitting, where the model becomes too specific to the training data and performs poorly on new data. + +- **Recoding Categorical Predictors:** Many machine learning models work best with numerical data. Categorical features (like 'gender' or 'treatment group') need to be converted into numbers. Label encoding is a common technique, where each category is assigned a unique numerical label. +- **Scaling Continuous Predictors:** When features have vastly different scales (e.g., age vs. body temperature), some models might be biased towards features with larger ranges. Scaling brings features into a similar range, often between 0 and 1, or standardizing them to have a mean of 0 and a standard deviation of 1. This ensures all features are treated fairly during training. +- **Evaluating Model Predictions (MSE):** To assess the performance of a model, we use these metrics: + + - **Mean Squared Error (MSE):** This calculates the average of the squared differences between the model's predictions and the actual true values. A lower MSE indicates that a model's predictions are generally closer to the real targets. + + - MSE and Outliers: MSE is sensitive to outliers because squaring the errors emphasizes larger deviations. + +- **R-squared (R²):** This represents the proportion of the variance in the dependent variable that can be explained by the independent variables in the model. It ranges from 0 to 1. A higher R-squared value suggests a better model fit, meaning your model is doing a better job of explaining the variation in the data. + + - R-squared and Additional Variables: R-squared tends to increase as you add more variables to your model, even if those variables don't actually improve the model's explanatory power. To address this, you can use the Adjusted R-squared, which takes the number of variables into account. + +#### Why do we use these techniques? +- **Improved Accuracy:** These steps help our model identify true patterns and relationships within the data and not just memorize specific examples from the training set. +- **Preventing Overfitting:** By testing the model on unseen data, we ensure it generalizes well to new situations. +- **Fair Feature Influence:** Scaling makes sure no single feature dominates the model's predictions due to differences in measurement ranges. + +Let's continue with our example, keeping these concepts in mind. ### Python Implementation of Linear Regression @@ -178,69 +234,82 @@ data = pd.read_csv(file) data.info() ``` -3. Split the data into training and testing sets: -```python -# Encode categorical data into numbers +3. This function performs one-hot encoding on a specified column within a Pandas DataFrame. 
One-hot encoding is a technique for transforming categorical data into a numerical format suitable for machine learning algorithms. +```python def onehot_encode(df, column): df = df.copy() dummies = pd.get_dummies(df[column]) df = pd.concat([df, dummies], axis=1) df = df.drop(column, axis=1) return df +``` +@Pyodide.eval -def preprocess_inputs(df): - df = df.copy() - - # One-hot encode Location column - df = onehot_encode(df, column='Location') - - # Split df into X and y - y = df['Hospital_Stay'].copy() - X = df.drop('Hospital_Stay', axis=1).copy() - - # Train-test split - X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=123) - - # Scale X with a standard scaler - scaler = StandardScaler() - scaler.fit(X_train) - - X_train = pd.DataFrame(scaler.transform(X_train), columns=X.columns) - X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns) - - return X_train, X_test, y_train, y_test +4. Make a copy of the dataframe to avoid modifying the original data +```python +df = df.copy() +``` + +5. One-hot encode the 'Location' column to convert categorical data into numerical form +```python +df = onehot_encode(df, column='Location') +``` -X_train, X_test, y_train, y_test = preprocess_inputs(data) +6. Separate the target variable 'Hospital_Stay' from the features +```python +y = df['Hospital_Stay'].copy() +X = df.drop('Hospital_Stay', axis=1).copy() ``` -@Pyodide.eval -4. Train the linear regression model: + +7. Split the dataset into training and testing sets. 70% of the data will be used for training, and the remaining 30% for testing +```python +X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=123) +``` + +8. Standardize the features by scaling them using a StandardScaler. This helps in bringing all the feature values onto the same scale +```python +scaler = StandardScaler() +scaler.fit(X_train) +``` + +9. Transform both the training and testing features using the fitted scaler. This ensures that both sets of data are scaled in the same way +```python +X_train = pd.DataFrame(scaler.transform(X_train), columns=X.columns) +X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns) +``` + +10. Create a linear regression model instance ```python -# Create a linear regression model model = LinearRegression() +``` +@Pyodide.eval -# Fit the model to the training data +11. Fit the linear regression model to the training data +```python model.fit(X_train, y_train) ``` @Pyodide.eval -5. Evaluate the model on the testing set: +12. Make predictions on the testing set ```python -# Make predictions on the testing set y_pred = model.predict(X_test) +``` -# Evaluate the model using the mean squared error (MSE) +13. Evaluate the model using the mean squared error (MSE) +```python mse = np.mean((y_pred - y_test)**2) # Print the MSE print("MSE:", mse) +``` -# Evaluate R^2 Score +14. Evaluate R^2 Score +```python print(" R^2 Score: {:.5f}".format(model.score(X_test, y_test))) ``` -@Pyodide.eval This is a basic example of how to implement linear regression in Python using Scikit-learn. There are many other ways to implement linear regression in Python, but this is a good starting point. 
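One caveat worth flagging in the walkthrough above: step 2 loads the CSV into a DataFrame named `data`, while step 4 onward operates on a DataFrame named `df` that is never defined, so the code will raise a `NameError` as written. A minimal fix is to bridge the two names with `df = data.copy()`. The sketch below restates the preprocessing steps with that fix applied; the `Location` and `Hospital_Stay` column names and the `onehot_encode` helper come from the steps above, and everything else is standard pandas/scikit-learn:

```python
# A consolidated sketch of steps 4-9, assuming `data` was loaded in step 2
# and `onehot_encode` was defined in step 3.
df = data.copy()                              # bridge: the later steps expect `df`
df = onehot_encode(df, column='Location')     # one-hot encode the categorical column

y = df['Hospital_Stay'].copy()                # target variable
X = df.drop('Hospital_Stay', axis=1).copy()   # predictor variables

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=123)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train), columns=X.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns)
```

Fitting the scaler on the training split alone matters: computing the scaling statistics on the full dataset would leak information about the test set into training.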
@@ -329,7 +398,7 @@ diabetes = datasets.load_diabetes() print(diabetes) print(diabetes.DESCR) -# Now we will split the data into the independent and independent variable +# Now we will split the data into the independent and dependent variable X = diabetes.data Y = diabetes.target ``` @@ -366,7 +435,7 @@ While linear regression provides valuable insights into the relationship between ## Conclusion -At the end of the lesson, students should have a good understanding of the concept of linear regression and how to implement the linear regression algorithm in Python. They should also be able to apply linear regression to real-world datasets to make predictions and insights. +By the end of this module, you'll have gained a solid grasp of linear regression and its practical implementation in Python. You'll be equipped to apply linear regression techniques to real-world datasets, enabling you to make predictions and uncover valuable insights. With this knowledge, you'll be well-prepared to embark on your journey into the world of data analysis and machine learning. ## Additional Resources From c3d0c93b2744721ed57d6e8ee733433355958a6d Mon Sep 17 00:00:00 2001 From: Schwartz Date: Thu, 18 Apr 2024 12:20:19 -0400 Subject: [PATCH 08/10] Updated answers to quiz questions to be more contextualized --- .../python_linear_regression.md | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/python_linear_regression/python_linear_regression.md b/python_linear_regression/python_linear_regression.md index 86c24a444..c6d5629a7 100644 --- a/python_linear_regression/python_linear_regression.md +++ b/python_linear_regression/python_linear_regression.md @@ -349,7 +349,12 @@ Linear regression is a powerful machine learning algorithm, but it has some limi ***
-This question is designed to test the test-taker's understanding of the concept of collinearity and its impact on linear regression models. Collinearity is a serious problem in linear regression because it can make it difficult to interpret the results of the model and can lead to inaccurate predictions. +This is because of a condition called collinearity. Here's why it matters: +- Understanding the impact: When two or more of your predictor variables are highly correlated (meaning they change in similar ways), it becomes difficult for linear regression to figure out which variable is really affecting the outcome. +- Less reliable results: Collinearity can make the estimates for your model's coefficients less stable. A small change in your data might lead to big changes in how the model interprets the relationship between the variables. +- Tricky interpretation: It becomes harder to say with confidence how much a change in one specific predictor variable will impact the outcome you're trying to predict. + +Key Takeaway: It's a good idea to be aware of collinearity and check for it before building a linear regression model.
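One quick way to do that check (a minimal sketch on made-up data, not part of this module's exercises) is to look at the correlation matrix of the predictors before fitting, and treat off-diagonal values near +1 or -1 as warning signs:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"age": rng.normal(50, 10, 200)})
df["age_months"] = df["age"] * 12 + rng.normal(0, 1, 200)  # nearly collinear with age
df["bmi"] = rng.normal(25, 4, 200)

# Off-diagonal correlations near +/-1 flag collinear predictor pairs
print(df.corr().round(2))
```

Here `age` and `age_months` encode the same information twice, so one of them would typically be dropped before fitting.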
*** @@ -363,8 +368,12 @@ This question is designed to test the test-taker's understanding of the concept ***
-This question is designed to test the test-taker's understanding of the concept of overfitting and how to prevent it. Overfitting is a common problem in machine learning, and it is important to be able to identify and prevent it. Regularization techniques such as L1 and L2 regularization can be used to prevent overfitting in linear regression models. +Regularization techniques are designed to combat overfitting. Let's break down why: + +- Overfitting: Occurs when a machine learning model learns the training data too well, including the noise. This makes it perform well on the training set but poorly on new, unseen data. It's like memorizing answers instead of truly understanding a subject. +- Regularization: It adds a penalty term to the model's loss function. This penalty discourages overly complex models, forcing them to generalize better to new data. +Key takeaway: Regularization techniques like L1 and L2 regularization help create models that are better at understanding the underlying patterns in data, not just the specific examples they were trained on.
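For a concrete picture (a minimal sketch on synthetic data, not part of this module's exercises), scikit-learn's `Ridge` and `Lasso` can be used as drop-in replacements for `LinearRegression`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=60)  # only the first feature matters

for name, model in [("OLS", LinearRegression()),
                    ("Ridge (L2)", Ridge(alpha=1.0)),
                    ("Lasso (L1)", Lasso(alpha=0.1))]:
    model.fit(X, y)
    print(name, np.round(model.coef_, 3))
```

The `alpha` parameter controls the penalty strength; with the L1 penalty, coefficients on uninformative features tend to shrink all the way to zero.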
*** From 74e2707cb5c23858318a73e056df6c99284010b1 Mon Sep 17 00:00:00 2001 From: Schwartz Date: Mon, 27 May 2024 15:34:58 -0400 Subject: [PATCH 09/10] Updated python implementation of linear regression --- .../python_linear_regression.md | 326 ++++-------------- .../python_linear_regression_exercise.ipynb | 206 +++++++++++ 2 files changed, 266 insertions(+), 266 deletions(-) create mode 100644 python_linear_regression/python_linear_regression_exercise.ipynb diff --git a/python_linear_regression/python_linear_regression.md b/python_linear_regression/python_linear_regression.md index c6d5629a7..4a5fd3252 100644 --- a/python_linear_regression/python_linear_regression.md +++ b/python_linear_regression/python_linear_regression.md @@ -66,144 +66,24 @@ import: https://raw.githubusercontent.com/LiaTemplates/Pyodide/master/README.md @overview -## What is linear regression? -- Linear regression is a supervised machine learning algorithm that learns to predict a continuous target variable based on one or more predictor variables. Linear regression models the relationship between the target variable and the predictor variables using a linear equation. -- In the case of linear regression, the target variable is a continuous variable. In a supervised learning problem, the machine learning algorithm is given a set of training data and asked to learn a function that can map the input variables to the output variable. The training data consists of pairs of input and output variables. The algorithm learns the function by finding the best fit line to the data. Once the algorithm has learned the function, it can be used to make predictions on new data. To make a prediction, the algorithm simply plugs the values of the input variables into the function. -- Linear regression is a popular supervised learning algorithm because it is simple to implement and understand. It is also a versatile algorithm that can be used to solve a variety of problems. -
-A little encouragement...
- -As in many fields, machine learning involves a lot of technical language, some of which is unclear, redundant, or downright confusing. -For example: - -**Outcome** variables are also called **response variables**, **dependent variables**, or **labels**. - -**Input** variables are also called **predictors**, **features**, **independent variables**, or even just **variables**. - -To make matters worse, sometimes the same words are used to mean different things in different subfields. -If you find yourself stumbling on vocabulary as you read about machine learning, know you're not alone! - -
- -Which of the following is NOT a characteristic of linear regression? - - -[( )] Linear regression models the relationship between the target variable and the predictor variables using a linear equation. -[( )] Linear regression is a supervised learning algorithm. -[( )] Linear regression is a simple to implement and understand algorithm. -[(X)] Linear regression can be used to predict categorical variables. -[( )] Linear regression is a versatile algorithm that can be used to solve a variety of problems. -*** -
- -This question presents a deeper challenge as it requires a solid understanding of linear regression's characteristics. To answer correctly, you need to identify the feature that doesn't align with linear regression. The incorrect option, "Linear regression can be used to predict categorical variables," deviates from the typical usage of linear regression, which is primarily for continuous variables. Understanding this distinction enhances your comprehension of linear regression's scope and limitations. - -
-*** - - -### Applications of linear regression in biomedical research -Linear regression finds extensive application in biomedical research, offering insights into various domains, such as: - -- **Disease prognosis:** Linear regression aids in predicting the progression of diseases based on patient demographics, biomarkers, and clinical data. For instance, it can forecast the advancement of cancer stages or the deterioration of chronic conditions like diabetes. - - A specific example of this in research can be found in ["A longitudinal study defined circulating microRNAs as reliable biomarkers for disease prognosis and progression in ALS human patients"](https://www.nature.com/articles/s41420-020-00397-6) In the realm of disease prognosis, longitudinal research has illuminated the potential of circulating microRNAs as dependable biomarkers for assessing disease progression and prognosis in ALS patients. By integrating patient demographics, biomarkers, and clinical data, linear regression models can be leveraged to forecast the trajectory of diseases, akin to predicting cancer stages or the progression of chronic ailments like diabetes. - -- **Treatment efficacy:** Linear regression assists in evaluating the effectiveness of medical treatments by analyzing patient response data. Researchers can utilize it to assess the impact of medications, therapies, or interventions on disease outcomes and patient well-being. - - ["Meta-analysis of the Age-Dependent Efficacy of Multiple Sclerosis Treatments"](https://www.frontiersin.org/journals/neurology/articles/10.3389/fneur.2017.00577/full) demonstrates a specific application of linear regression. This study uses linear regression to determine how the effectiveness of Multiple Sclerosis treatments changes as patients age. - - -- **Genetic studies:** Linear regression plays a pivotal role in genetic research by exploring associations between genetic variants and phenotypic traits. It helps identify genetic markers linked to disease susceptibility, treatment response, and disease progression, contributing to personalized medicine approaches. - - The article ["Prediction of Gene Expression Patterns With Generalized Linear Regression Model"](https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2019.00120/full) describes a method using generalized linear regression to predict how gene expression levels change in response to the binding intensity of the Oct4 transcription factor. This model aids researchers in understanding the complex regulatory mechanisms behind cell reprogramming and development. - -- **Public health analysis:** Linear regression facilitates the analysis of population-level health trends, aiding in the identification of risk factors, disease clusters, and health disparities. It enables researchers to model the impact of interventions, policies, and socio-economic factors on public health outcomes. - - The article ["Regression Analysis for COVID-19 Infections and Deaths Based on Food Access and Health Issues"](https://www.mdpi.com/2227-9032/10/2/324) investigates the relationship between food access, pre-existing health conditions, and the severity of COVID-19 outcomes. Researchers used regression models to discover potential correlations that could inform future pandemic preparedness efforts. - -- **Epidemiological modeling:** Linear regression serves as a fundamental tool in epidemiology for modeling disease spread and understanding risk factors. 
It assists in forecasting disease outbreaks, estimating transmission rates, and evaluating interventions' effectiveness in controlling infectious diseases. - - The article ["SEIR and Regression Model based COVID-19 outbreak predictions in India"](https://arxiv.org/abs/2004.00958) utilizes a combination of SEIR modeling and regression analysis to forecast COVID-19 outbreaks in India, providing valuable insights into disease spread dynamics. This approach contributes to epidemiological modeling by showcasing how linear regression, alongside SEIR models, aids in predicting disease outbreaks, estimating transmission rates, and assessing the effectiveness of interventions, thereby informing proactive measures to control infectious diseases. - - - -By leveraging linear regression in these contexts, biomedical researchers can glean valuable insights into disease mechanisms, treatment strategies, and public health interventions, ultimately advancing healthcare practices and improving patient outcomes. - -## Linear Regression Algorithm -Linear regression works by fitting a linear equation to the data. - -The linear equation is represented by the following formula: - -``` -y = b0 + b1 * x1 + b2 * x2 + ... + bn * xn -``` - -where: - -- `y` is the target variable -- `b0` is the bias term -- `bi` is the coefficient for the ith predictor variable -- `xi` is the ith predictor variable - -The coefficients of the linear equation are estimated using the ordinary least squares (OLS) method. The OLS method minimizes the sum of the squared residuals, which are the differences between the predicted values and the actual values of the target variable. Once the linear regression model is trained, it can be used to make predictions on new data. To make a prediction, we simply plug the values of the predictor variables into the linear equation. - -
-Learning connection
- -To learn more about Linear Regression and for a visual explanation, watch [Linear Regression, Clearly Explained!!!](https://www.youtube.com/watch?v=nk2CQITm_eo). - -
- -Which of the following is NOT a component of the linear regression formula? - - -[( )] Target variable -[( )] Bias term -[( )] Coefficient for the ith predictor variable -[( )] ith predictor variable -[(X)] Variance of the target variable -*** -
- -The variance of the target variable is not a component of the linear regression formula. The linear regression formula is used to predict the mean value of the target variable, not the variance. - -
-*** - -### Understanding Machine Learning Techniques - -Before diving into the example, it's valuable to understand some key concepts used in machine learning. These techniques help us build more accurate and reliable models for prediction. - -- **Splitting Data (Training and Testing):** Machine learning models 'learn' from data. We divide our dataset into two parts: - - - **Training set:** This part is used to train the model, allowing it to find patterns. - -- **Testing set:** This is held-out data used to evaluate how well our model performs on unseen examples. This prevents overfitting, where the model becomes too specific to the training data and performs poorly on new data. - -- **Recoding Categorical Predictors:** Many machine learning models work best with numerical data. Categorical features (like 'gender' or 'treatment group') need to be converted into numbers. Label encoding is a common technique, where each category is assigned a unique numerical label. -- **Scaling Continuous Predictors:** When features have vastly different scales (e.g., age vs. body temperature), some models might be biased towards features with larger ranges. Scaling brings features into a similar range, often between 0 and 1, or standardizing them to have a mean of 0 and a standard deviation of 1. This ensures all features are treated fairly during training. -- **Evaluating Model Predictions (MSE):** To assess the performance of a model, we use these metrics: - - - **Mean Squared Error (MSE):** This calculates the average of the squared differences between the model's predictions and the actual true values. A lower MSE indicates that a model's predictions are generally closer to the real targets. - - - MSE and Outliers: MSE is sensitive to outliers because squaring the errors emphasizes larger deviations. -- **R-squared (R²):** This represents the proportion of the variance in the dependent variable that can be explained by the independent variables in the model. It ranges from 0 to 1. A higher R-squared value suggests a better model fit, meaning your model is doing a better job of explaining the variation in the data. - - - R-squared and Additional Variables: R-squared tends to increase as you add more variables to your model, even if those variables don't actually improve the model's explanatory power. To address this, you can use the Adjusted R-squared, which takes the number of variables into account. +### Python Implementation of Linear Regression -#### Why do we use these techniques? +To implement linear regression in Python using Scikit-learn, we can follow these steps: -- **Improved Accuracy:** These steps help our model identify true patterns and relationships within the data and not just memorize specific examples from the training set. -- **Preventing Overfitting:** By testing the model on unseen data, we ensure it generalizes well to new situations. -- **Fair Feature Influence:** Scaling makes sure no single feature dominates the model's predictions due to differences in measurement ranges. -Let's continue with our example, keeping these concepts in mind. - -### Python Implementation of Linear Regression -To implement linear regression in Python using Scikit-learn, we can follow these steps: +1. Import Libraries +* **numpy (np):** Provides tools for working with numerical arrays and mathematical operations. +* **pandas (pd):** Enables data manipulation and analysis with data structures like DataFrames. +* **sklearn:** A powerful machine learning library. 
We specifically use: + * `train_test_split`: Splits data into training (model building) and testing (model evaluation) sets. + * `StandardScaler`: Standardizes features to have zero mean and unit variance (often important for linear regression). + * `LinearRegression`: The core linear regression model. -1. Import the necessary libraries: ```python import numpy as np import pandas as pd @@ -216,6 +96,10 @@ from sklearn.linear_model import LinearRegression 2. Load the data: + +* `pd.read_csv("file")`: Reads data from a CSV file into a pandas DataFrame. +* `data.info()`: Gives a summary of the data such as column names, data types, and any missing values. + ```python @Pyodide.exec import pandas as pd @@ -235,7 +119,11 @@ data.info() ``` -3. This function performs one-hot encoding on a specified column within a Pandas DataFrame. One-hot encoding is a technique for transforming categorical data into a numerical format suitable for machine learning algorithms. + +3. **The `onehot_encode` Function** + + * This function handles categorical features (like "Location" in your data) by creating new columns where each column represents a unique category. The values are 1 if the data point belongs to that category and 0 otherwise. + ```python def onehot_encode(df, column): df = df.copy() @@ -246,17 +134,21 @@ def onehot_encode(df, column): ``` @Pyodide.eval -4. Make a copy of the dataframe to avoid modifying the original data -```python -df = df.copy() -``` +4. Make Data Copy and One-Hot Encode + +* Creates a copy so we don't change the original data by accident. +* Applies one-hot encoding to the `Location` column. -5. One-hot encode the 'Location' column to convert categorical data into numerical form ```python +df = df.copy() df = onehot_encode(df, column='Location') ``` -6. Separate the target variable 'Hospital_Stay' from the features +5. Separate Target and Features +* **y**: This is our target variable – what we want to predict (Hospital Stay). +* **X**: These are our features – the information we'll use to make the prediction. + + ```python y = df['Hospital_Stay'].copy() X = df.drop('Hospital_Stay', axis=1).copy() @@ -264,50 +156,52 @@ X = df.drop('Hospital_Stay', axis=1).copy() ``` -7. Split the dataset into training and testing sets. 70% of the data will be used for training, and the remaining 30% for testing +6. Split into Training and Testing Sets +* 70% of data is used for training (`X_train`, `y_train`). +* 30% is held back for testing (`X_test`, `y_test`). +* `random_state=123` ensures we get the same split each time for reproducibility. + + ```python X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=123) ``` -8. Standardize the features by scaling them using a StandardScaler. This helps in bringing all the feature values onto the same scale +7. Standardize Features +* Calculates the mean and standard deviation of each feature in the training set. +* Scales both training and testing data to have zero mean and unit variance. This is often necessary for linear regression to work well. + ```python scaler = StandardScaler() scaler.fit(X_train) -``` - -9. Transform both the training and testing features using the fitted scaler. This ensures that both sets of data are scaled in the same way -```python X_train = pd.DataFrame(scaler.transform(X_train), columns=X.columns) X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns) ``` -10. Create a linear regression model instance -```python -model = LinearRegression() -``` -@Pyodide.eval +8. 
Create and Train the Model +* Creates a linear regression object. +* Finds the best-fit line (or plane, in higher dimensions) by minimizing the difference between predicted and actual values in the training data. -11. Fit the linear regression model to the training data ```python +model = LinearRegression() model.fit(X_train, y_train) ``` @Pyodide.eval -12. Make predictions on the testing set +9. Make Predictions +* Applies the model to the testing data to predict hospital stay. ```python y_pred = model.predict(X_test) ``` +@Pyodide.eval -13. Evaluate the model using the mean squared error (MSE) -```python -mse = np.mean((y_pred - y_test)**2) +10. Evaluate the Model +* **Mean Squared Error (MSE):** A measure of how close the predictions are to the actual values on average. Lower is better. +* **R² Score:** Indicates the proportion of variance in the target variable that is explained by the model. Ranges from 0 to 1, with 1 being the best possible score. -# Print the MSE -print("MSE:", mse) -``` -14. Evaluate R^2 Score ```python +mse = np.mean((y_pred - y_test)**2) +print("MSE:", mse) print(" R^2 Score: {:.5f}".format(model.score(X_test, y_test))) ``` @@ -322,129 +216,29 @@ Here are some additional tips for implementing linear regression in Python: - Interpret the coefficients of the linear regression model to understand the relationship between the predictor variables and the target variable. -### Applying Linear Regression to a Real-World Dataset -To apply linear regression to a real-world dataset, we can follow these steps: - -- **Choose a dataset:** The dataset should have at least one continuous target variable and one or more predictor variables. -- **Prepare the data:** This may involve cleaning the data, handling missing values, and scaling the data. -- **Split the data into training and testing sets:** This will help to prevent overfitting. -- **Train the linear regression model:** Use the training set to fit the model to the data. -- **Evaluate the model on the testing set:** This will give you an estimate of how well the model will generalize to new data. -- **Interpret the results:** Examine the coefficients of the model to understand the relationship between the predictor variables and the target variable. -- **Make predictions on new data:** Use the trained model to make predictions on new data points. - -### Important Notes -Linear regression is a powerful machine learning algorithm, but it has some limitations. Here are some of the most important limitations of linear regression: - -- **Linearity assumption:** Linear regression assumes that the relationship between the target variable and the predictor variables is linear. If the relationship is non-linear, then linear regression will not be able to accurately predict the target variable. -- **Overfitting:** Linear regression is prone to overfitting, which occurs when the model learns the training data too well and is unable to generalize to new data. Overfitting can be prevented by using regularization techniques such as L1 or L2 regularization. -- **Outliers:** Linear regression is sensitive to outliers, which are data points that are significantly different from the rest of the data. Outliers can have a large impact on the parameters of the linear regression model and can lead to inaccurate predictions. -- **Collinearity:** Linear regression is also sensitive to collinearity, which occurs when two or more predictor variables are highly correlated with each other. 
## Conclusion

By the end of this module, you'll have gained a solid grasp of linear regression as it is used in machine learning. You've learned how to implement and evaluate linear regression models using popular libraries like Scikit-learn. You've also seen how to apply these techniques to real-world datasets, both synthetic (healthcare investments) and established (diabetes dataset).

While the linear regression model for the diabetes dataset explains a reasonable amount of variance (51.8%), it's important to remember that real-world data analysis rarely ends with a single model. It's crucial to recognize the assumptions and limitations of linear regression. In the case of the diabetes dataset, further analysis is warranted to:

* **Explore potential nonlinear relationships:** The relationship between diabetes progression and the predictor variables might not be strictly linear.
* **Evaluate model assumptions:** Linear regression assumes specific relationships between variables (e.g., linearity, independence, homoscedasticity) that may not hold in the data.
* **Feature selection and engineering:** Some predictors might be more important than others. Feature engineering techniques could create new, more informative features.
* **Regularization:** Techniques like Ridge or Lasso regression could help prevent overfitting and improve model generalization (see the sketch after this conclusion).
* **Advanced models:** Non-linear regression or machine learning models like Random Forests or Gradient Boosting might offer better predictive performance.

This module is just the beginning of your journey into data analysis and machine learning. With the foundation you've built here, you're well-prepared to explore more advanced techniques and tackle complex data-driven challenges.

Remember, the key to successful data analysis is not just about applying algorithms, but also about understanding your data, asking the right questions, and critically evaluating your results. As you continue learning, keep exploring, experimenting, and refining your skills to become a proficient data scientist.
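To make the regularization bullet concrete, here is a minimal sketch (not part of the module's graded code) using scikit-learn's bundled diabetes data; `alpha=1.0` is a made-up starting value you would normally tune:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# The same diabetes dataset referenced in this module
X, y = load_diabetes(return_X_y=True)

# Ridge adds an L2 penalty (controlled by alpha) that shrinks coefficients,
# which can reduce overfitting compared with plain LinearRegression
ridge = Ridge(alpha=1.0)

# 5-fold cross-validated R^2 is a more honest estimate than a single fit
scores = cross_val_score(ridge, X, y, cv=5, scoring="r2")
print("Mean cross-validated R^2: {:.3f}".format(scores.mean()))
```

Lasso (`sklearn.linear_model.Lasso`) works the same way but uses an L1 penalty, which can drive some coefficients exactly to zero and thereby perform a form of feature selection.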
## Additional Resources diff --git a/python_linear_regression/python_linear_regression_exercise.ipynb b/python_linear_regression/python_linear_regression_exercise.ipynb new file mode 100644 index 000000000..e1c6e9407 --- /dev/null +++ b/python_linear_regression/python_linear_regression_exercise.ipynb @@ -0,0 +1,206 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "qJomtu5Ddh1h" + }, + "source": [ + "# Introduction" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XO4ZsjHac4MD" + }, + "source": [ + "**Real World Code Example: Diabetes Progression Prediction**\n", + "\n", + "\n", + "This notebook demonstrates a basic linear regression analysis on a diabetes dataset to predict disease progression. The dataset includes information on 442 patients, their medical attributes, and a quantitative measure of disease advancement after one year." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vnNsDB28djkH" + }, + "source": [ + "# Data Description" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gyZ61b75c7iQ" + }, + "source": [ + "The data includes:\n", + "\n", + "* **Predictor Variables:**\n", + " * Age (years)\n", + " * Sex\n", + " * Body Mass Index (BMI)\n", + " * Average Blood Pressure\n", + " * Six Blood Serum Measurements (normalized)\n", + "* **Target Variable:**\n", + " * Quantitative measure of disease progression one year after baseline\n", + "\n", + "\n", + "Each feature variable has been mean-centered and scaled by the standard deviation times the square root of the number of samples.\n", + "\n", + "**Citation:**\n", + "\n", + "This dataset is sourced from the research paper \"Least Angle Regression\" by Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani (Annals of Statistics, 2004)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P-ZP5ZLWdTaE" + }, + "source": [ + "# Install and Import:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "HKf3oUgwc7I1" + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from sklearn import datasets\n", + "from sklearn.linear_model import LinearRegression\n", + "from sklearn.metrics import mean_squared_error, r2_score" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cY3m-DOddRFc" + }, + "source": [ + "# Load and Explore Data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JP4lt_lvdNwH" + }, + "outputs": [], + "source": [ + "# Load the diabetes dataset\n", + "diabetes = datasets.load_diabetes()\n", + "\n", + "# Print dataset description\n", + "print(diabetes.DESCR)\n", + "\n", + "# Separate features (X) and target variable (Y)\n", + "X = diabetes.data\n", + "Y = diabetes.target" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r2Nyyk_FdYdg" + }, + "source": [ + "# Build and Train Model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qgNlTeu1dPKg" + }, + "outputs": [], + "source": [ + "# Create Linear Regression model\n", + "model = LinearRegression()\n", + "\n", + "# Train the model on the data\n", + "model.fit(X, Y)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TZUOu5k6db7T" + }, + "source": [ + "# Predict and Evaluate" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "LE-F0aOpdeaS" + }, + "outputs": [], + "source": [ + "# Make predictions\n", + "predictions = model.predict(X)\n", + "\n", + "# Model Coefficients and Intercept\n", + "print('Coefficients:', model.coef_)\n", + "print('Intercept:', model.intercept_)\n", + "\n", + "# Evaluate performance using R-squared and Mean Squared Error\n", + "print('R-squared:', r2_score(Y, predictions))\n", + "print('Mean Squared Error:', mean_squared_error(Y, predictions))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BmBkosEIdfzt" + }, + "source": [ + "# Interpretation and Next Steps" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "41PlZvIgdpRR" + }, + "source": [ + "This basic linear regression model explains approximately 51.8% of the variance in disease progression. However, the mean squared error indicates room for improvement.\n", + "\n", + "\n", + "**Future Directions:**\n", + "\n", + "* **Explore non-linear relationships:** Consider non-linear models (e.g., polynomial regression).\n", + "* **Feature selection/engineering:** Identify the most relevant predictors.\n", + "* **Regularization:** Prevent overfitting by adding penalty terms to the model.\n", + "* **Cross-validation:** Assess the model's performance on unseen data.\n", + "* **Advanced techniques:** Explore machine learning algorithms like Random Forests or Gradient Boosting." 
+ ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} From 2a1dc6abc271bb0276e8b61d79952af7fce48002 Mon Sep 17 00:00:00 2001 From: Schwartz Date: Sun, 23 Jun 2024 20:43:23 -0400 Subject: [PATCH 10/10] Updated changes based off Elizabeth's comments --- .../python_linear_regression.md | 362 ++++++++++++++++-- 1 file changed, 336 insertions(+), 26 deletions(-) diff --git a/python_linear_regression/python_linear_regression.md b/python_linear_regression/python_linear_regression.md index 4a5fd3252..aa18d8dc0 100644 --- a/python_linear_regression/python_linear_regression.md +++ b/python_linear_regression/python_linear_regression.md @@ -2,7 +2,7 @@ author: Daniel Schwartz email: des338@drexel.edu -version: 0.0.0 +version: 1.0.0 current_version_description: Initial version module_type: standard docs_version: 3.0.0 @@ -67,14 +67,43 @@ import: https://raw.githubusercontent.com/LiaTemplates/Pyodide/master/README.md @overview +## Summary of Key Concepts in Linear Regression + +- **Definition**: Linear regression is a statistical method used to model and analyze the relationships between a dependent variable and one or more independent variables. + +- **Applications**: Commonly used in machine learning to predict continuous outcomes. + +- **Practical Application**: + - Applying linear regression to real-world datasets, such as synthetic healthcare investments and the diabetes dataset. + +- **Evaluation and Beyond**: + - Recognize model limitations and explore further analysis, such as: + - Nonlinear relationships. + - Model assumptions. + - Feature selection and engineering. + - Regularization techniques (Ridge, Lasso). + - Advanced models (Random Forests, Gradient Boosting). + + +- Linear regression is a starting point for data analysis and machine learning. +- The foundation built here prepares you for advanced techniques and complex challenges. +- Success in data analysis involves understanding data, asking the right questions, and critically evaluating results. + + + -### Python Implementation of Linear Regression +## Python Implementation of Linear Regression To implement linear regression in Python using Scikit-learn, we can follow these steps: -1. Import Libraries +### 1. Import Libraries +**Description**: +This code block imports necessary libraries for data manipulation and machine learning tasks. Specifically, it imports NumPy for numerical operations, Pandas for data manipulation, and scikit-learn (sklearn) for machine learning functionalities. + +**Why this is important:**1 +Importing libraries is the first step in any data analysis or machine learning project. These libraries provide tools and functions to efficiently handle data, perform mathematical operations, and build machine learning models. * **numpy (np):** Provides tools for working with numerical arrays and mathematical operations. * **pandas (pd):** Enables data manipulation and analysis with data structures like DataFrames. @@ -95,10 +124,20 @@ from sklearn.linear_model import LinearRegression @Pyodide.eval -2. Load the data: +**Output:** +There's no output generated from this code block. It simply imports the required libraries for subsequent steps in the machine learning workflow. + + -* `pd.read_csv("file")`: Reads data from a CSV file into a pandas DataFrame. 
### 2. Load the data:

**Description:**

* `pd.read_csv("file")`: Reads data from a CSV file into a pandas DataFrame.
* `data.info()`: Gives a summary of the data, such as column names, data types, and any missing values.

**Why this is important:**
Loading the data is the initial step in any data analysis or machine learning task. It's essential to understand the structure of the data, such as the number of features and their data types, before proceeding with further analysis.

```python @Pyodide.exec
import pandas as pd
import io
from pyodide.http import open_url

# URL of the CSV file
url = "https://raw.githubusercontent.com/arcus/education_modules/linear_regression/python_linear_regression/data/healthcare_investments_and_hospital_stay.csv"

# Open and read the contents of the URL
url_contents = open_url(url)
text = url_contents.read()

# Create a file-like object from the text content
file = io.StringIO(text)

# Read the CSV data into a pandas DataFrame
data = pd.read_csv(file)

# Analyze data and features
data.info()
```

**Output:**
After executing this code block, you will see a summary of the loaded data, including column names, data types, and non-null counts. This helps you understand the dataset you will be working with.

### 3. **The `onehot_encode` Function**

**Description:**

* This function handles categorical features (like "Location" in your data) by creating new columns, where each column represents a unique category. The values are 1 if the data point belongs to that category and 0 otherwise.

**Why this is important:**
One-hot encoding is crucial when dealing with categorical data in machine learning models. Many machine learning algorithms cannot directly handle categorical data, so encoding it into numerical values allows algorithms to operate on the data effectively. By creating binary columns for each category, we ensure that each category is treated equally, without imposing any ordinality or magnitude among them.

```python
def onehot_encode(df, column):
    # Make a copy of the DataFrame to avoid modifying the original data
    df = df.copy()

    # Use pandas get_dummies function to one-hot encode the specified column
    dummies = pd.get_dummies(df[column])

    # Concatenate the one-hot encoded columns with the original DataFrame
    df = pd.concat([df, dummies], axis=1)

    # Drop the original categorical column since it's no longer needed
    df = df.drop(column, axis=1)

    return df
```
@Pyodide.eval

### 4. Make Data Copy and One-Hot Encode

**Description:**
The code creates a copy of the DataFrame `data` so that the original data remains unchanged. It then applies one-hot encoding to the `Location` column using the `onehot_encode` function.

**Why this is important:**
Creating a copy of the DataFrame is essential to prevent unintentional modifications to the original data, which could lead to unexpected results or loss of information. One-hot encoding is necessary to convert categorical variables, such as the `Location` column, into a numerical format, which many machine learning algorithms require to operate effectively.

* Creates a copy so we don't change the original data by accident.
* Applies one-hot encoding to the `Location` column.
```python
# Make a copy of the DataFrame to avoid modifying the original data accidentally
df = data.copy()

# Apply one-hot encoding to the 'Location' column
df = onehot_encode(df, column='Location')

# Print the resulting DataFrame to observe the effect of one-hot encoding
print(df.head())
```

**Output:**
The `print(df.head())` call displays the first few rows of the resulting DataFrame, so you can observe the new one-hot encoded columns.

### 5. Separate Target and Features

**Description:**

* This code snippet separates the target variable (`Hospital_Stay`) from the features in the DataFrame `df`.
* The target variable (`y`) is what we want to predict, while the features (`X`) are the information we'll use to make the prediction.

**Why this is important:**

* Separating the target variable from the features is a crucial step in machine learning model training.
* The target variable is the variable we aim to predict, while the features are the input variables that influence the prediction.
* By separating them, we ensure that the model trains on the features to predict the target accurately.

```python
# Separate the target variable 'Hospital_Stay' from the features
y = df['Hospital_Stay'].copy()
X = df.drop('Hospital_Stay', axis=1).copy()

# Print the target variable and features to verify the separation
print("Target variable (y):")
print(y.head())
print("\nFeatures (X):")
print(X.head())
```

**Output:**
The print statements display the first few rows of `y` (the target variable) and `X` (the features), so you can verify the separation.

### 6. Split into Training and Testing Sets

**Description:**
The `train_test_split` function divides the dataset into training and testing sets. Here, 70% of the data is used for training (`X_train`, `y_train`), and the remaining 30% is held back for testing (`X_test`, `y_test`).

**Why this is important:**
Splitting the data into training and testing sets is crucial in machine learning to assess the performance of the model. The training set is used to train the model, while the testing set is used to evaluate its performance on unseen data. This helps to detect overfitting and ensures that the model generalizes well to new data.

* `random_state=123` ensures we get the same split each time, for reproducibility.

```python
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=123)

# Print the shapes of the resulting training and testing sets
print("Training set - X shape:", X_train.shape, "y shape:", y_train.shape)
print("Testing set - X shape:", X_test.shape, "y shape:", y_test.shape)
```

**Output:**
The shape printout confirms the 70/30 split between the training and testing sets. A toy illustration of the split follows this step.
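Here is a small optional sketch of the same split on ten made-up samples, so the 70/30 proportions are easy to count (the values are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples and targets
X_toy = np.arange(10).reshape(10, 1)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, train_size=0.7, random_state=123)
print(len(X_tr), len(X_te))  # 7 3
```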
### 7. Standardize Features

**Description:**

* The code initializes a `StandardScaler` object, which will be used to standardize (z-score normalize) the features.
* It then fits the scaler to the training data (`X_train`), calculating the mean and standard deviation of each feature in the training set.
* Finally, it scales both the training and testing data to have zero mean and unit variance using the fitted scaler. This ensures that both datasets are scaled in the same way.

**Why this is important:**
Standardizing features is crucial, especially when working with algorithms that rely on distance metrics or gradient descent optimization, such as KNN, SVM, or logistic regression. By standardizing the features, we remove the mean and scale the data to unit variance, which can improve the convergence rate of optimization algorithms and prevent features with larger scales from dominating those with smaller scales.

```python
# Initialize a StandardScaler object
scaler = StandardScaler()

# Fit the scaler to the training data, calculating the mean and standard deviation of each feature
scaler.fit(X_train)

# Scale both training and testing data to have zero mean and unit variance
X_train = pd.DataFrame(scaler.transform(X_train), columns=X.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns)

# Print the scaled training and testing data to observe the effect of standardization
print("Scaled Training Data:")
print(X_train.head())
print("\nScaled Testing Data:")
print(X_test.head())
```

**Output:**
The printed heads of the scaled `X_train` and `X_test` let you observe the effect of standardization: each feature now has (approximately) zero mean and unit variance on the training set.

### 8. Create and Train the Model

**Description:**

* This code segment creates a linear regression object using the `LinearRegression` class from the scikit-learn library.
* It then fits the model to the training data, finding the best-fit line (or plane, in higher dimensions) by minimizing the difference between predicted and actual values in the training data.

**Why this is important:**
Creating and training a model is the core of supervised machine learning. In this step, we instantiate a regression model and train it on our training data to learn the underlying patterns and relationships between the input features (`X`) and the target variable (`y`). This trained model will later be used to make predictions on new, unseen data.

```python
# Create a Linear Regression model object
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)
```
@Pyodide.eval

### 9. Make Predictions

**Description:**
This line applies the trained machine learning model (`model`) to the testing data (`X_test`) to make predictions about hospital stay durations.

**Why this is important:**
Making predictions is the ultimate goal of any machine learning model. By applying the trained model to new, unseen data, we can obtain predictions that can be used for decision-making or further analysis.
```python
# Make predictions using the trained model and the testing data
y_pred = model.predict(X_test)

# Print the predicted hospital stay durations
print(y_pred)
```
@Pyodide.eval

**Output:**
The print statement displays the array of predicted hospital stay durations generated by the model.

### 10. Evaluate the Model

**Description:**

The code calculates and prints two evaluation metrics for the regression model:

* **Mean Squared Error (MSE):** A measure of how close the predictions are to the actual values on average. Lower values indicate better performance.
* **R² Score:** Indicates the proportion of variance in the target variable that is explained by the model. Ranges from 0 to 1, with 1 being the best possible score.

**Why this is important:**
Evaluating the model's performance is crucial to understanding how well it generalizes to unseen data. The Mean Squared Error provides a quantitative measure of the model's prediction accuracy, while the R² Score gives insight into the goodness of fit of the model.

```python
mse = np.mean((y_pred - y_test)**2)
print("MSE:", mse)
print("R^2 Score: {:.5f}".format(model.score(X_test, y_test)))
```

**Output:**
This code snippet prints the calculated MSE and R² Score, providing insight into the model's performance.

### Code Overview and Tips

This is a basic example of how to implement linear regression in Python using Scikit-learn. There are many other ways to implement linear regression in Python, but this is a good starting point.

Here are some additional tips for implementing linear regression in Python:

- Interpret the coefficients of the linear regression model to understand the relationship between the predictor variables and the target variable.

## Review your knowledge

Which function from Scikit-learn is used to split the dataset into training and testing sets?

[( )] `data_splitter`
[(X)] `train_test_split`
[( )] `train_validate_split`
[( )] `model_splitter`
***
<div class = "answer">

The `train_test_split` function from Scikit-learn splits a dataset into training and testing sets. This split is essential for evaluating the performance of a machine learning model: the model is trained on one subset of the data and tested on the other.

</div>
***

## Conclusion

### Key Takeaways
By the end of this module, you'll have gained a solid grasp of linear regression as it is used in machine learning. You've learned how to implement and evaluate linear regression models using popular libraries like Scikit-learn. You've also seen how to apply these techniques to real-world datasets, both synthetic (healthcare investments) and established (diabetes dataset).

While the linear regression model for the diabetes dataset explains a reasonable amount of variance (51.8%), it's important to remember that real-world data analysis rarely ends with a single model.

### Beyond Linear Regression
**Further Analysis Needed:**

* **Explore potential nonlinear relationships:** The relationship between diabetes progression and the predictor variables might not be strictly linear.
* **Evaluate model assumptions:** Linear regression assumes specific relationships between variables (e.g., linearity, independence, homoscedasticity) that may not hold in the data.
* **Feature selection and engineering:** Some predictors might be more important than others. Feature engineering techniques could create new, more informative features.
* **Regularization:** Techniques like Ridge or Lasso regression could help prevent overfitting and improve model generalization.
* **Advanced models:** Non-linear regression or machine learning models like Random Forests or Gradient Boosting might offer better predictive performance.

This module is just the beginning of your journey into data analysis and machine learning. With the foundation you've built here, you're well-prepared to explore more advanced techniques and tackle complex data-driven challenges.

Remember, the key to successful data analysis is not just about applying algorithms, but also about understanding your data, asking the right questions, and critically evaluating your results. As you continue learning, keep exploring, experimenting, and refining your skills to become a proficient data scientist.

## Additional Resources

### Full Code Implementation

Here, at the end of the module, all of the code above is consolidated into a single block for easy copying and pasting. This single block isn't designed as a step-by-step teaching tool; it serves as a convenient reference for anyone already familiar with the concepts who wants to run the entire process quickly.
Below is the complete code implementation:

```python
# io and open_url are needed to fetch the CSV inside the Pyodide environment
import io

import numpy as np
import pandas as pd
from pyodide.http import open_url

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

def onehot_encode(df, column):
    # Make a copy of the DataFrame to avoid modifying the original data
    df = df.copy()

    # Use pandas get_dummies function to one-hot encode the specified column
    dummies = pd.get_dummies(df[column])

    # Concatenate the one-hot encoded columns with the original DataFrame
    df = pd.concat([df, dummies], axis=1)

    # Drop the original categorical column since it's no longer needed
    df = df.drop(column, axis=1)

    return df

# URL of the CSV file
url = "https://raw.githubusercontent.com/arcus/education_modules/linear_regression/python_linear_regression/data/healthcare_investments_and_hospital_stay.csv"

# Open and read the contents of the URL
url_contents = open_url(url)
text = url_contents.read()

# Create a file-like object from the text content
file = io.StringIO(text)

# Read the CSV data into a pandas DataFrame
data = pd.read_csv(file)

# Analyze data and features
data.info()

# Make a copy of the DataFrame to avoid modifying the original data accidentally
df = data.copy()

# Apply one-hot encoding to the 'Location' column
df = onehot_encode(df, column='Location')

# Print the resulting DataFrame to observe the effect of one-hot encoding
print(df.head())

# Separate the target variable 'Hospital_Stay' from the features
y = df['Hospital_Stay'].copy()
X = df.drop('Hospital_Stay', axis=1).copy()

# Print the target variable and features to verify the separation
print("Target variable (y):")
print(y.head())
print("\nFeatures (X):")
print(X.head())

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=123)

# Print the shapes of the resulting training and testing sets
print("Training set - X shape:", X_train.shape, "y shape:", y_train.shape)
print("Testing set - X shape:", X_test.shape, "y shape:", y_test.shape)

# Initialize a StandardScaler object
scaler = StandardScaler()

# Fit the scaler to the training data, calculating the mean and standard deviation of each feature
scaler.fit(X_train)

# Scale both training and testing data to have zero mean and unit variance
X_train = pd.DataFrame(scaler.transform(X_train), columns=X.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns)

# Print the scaled training and testing data to observe the effect of standardization
print("Scaled Training Data:")
print(X_train.head())
print("\nScaled Testing Data:")
print(X_test.head())

# Create a Linear Regression model object
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions using the trained model and the testing data
y_pred = model.predict(X_test)

# Print the predicted hospital stay durations
print(y_pred)

# Calculate Mean Squared Error (MSE)
mse = np.mean((y_pred - y_test)**2)

# Print MSE
print("MSE:", mse)

# Calculate and print R² Score
print("R² Score:", model.score(X_test, y_test))
```

## Feedback

@feedback