diff --git a/lab-intro-to-ml.ipynb b/lab-intro-to-ml.ipynb
new file mode 100644
index 0000000..40b00a2
--- /dev/null
+++ b/lab-intro-to-ml.ipynb
@@ -0,0 +1,2598 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Before you start:\n",
+ "- Read the README.md file\n",
+ "- Comment as much as you can and use the resources in the README.md file\n",
+ "- Happy learning!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#Import your libraries\n",
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "\n",
+ "import matplotlib.pyplot as plt\n",
+ "import seaborn as sns"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "# Challenge 1 - Import and Describe the Dataset"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### In this challenge we will use the `austin_weather` data. \n",
+ "\n",
+ "#### First, import it into a data frame called `austin`. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here\n",
+ "Austin = pd.read_csv('austin_weather.csv')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Next, describe the dataset you have loaded: \n",
+ "- Look at the variables and their types\n",
+ "- Examine the descriptive statistics of the numeric variables \n",
+ "- Look at the first five rows of all variables to evaluate the categorical variables as well"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " Date TempHighF TempAvgF TempLowF DewPointHighF DewPointAvgF \\\n",
+ "0 2013-12-21 74 60 45 67 49 \n",
+ "1 2013-12-22 56 48 39 43 36 \n",
+ "2 2013-12-23 58 45 32 31 27 \n",
+ "3 2013-12-24 61 46 31 36 28 \n",
+ "4 2013-12-25 58 50 41 44 40 \n",
+ "\n",
+ " DewPointLowF HumidityHighPercent HumidityAvgPercent HumidityLowPercent ... \\\n",
+ "0 43 93 75 57 ... \n",
+ "1 28 93 68 43 ... \n",
+ "2 23 76 52 27 ... \n",
+ "3 21 89 56 22 ... \n",
+ "4 36 86 71 56 ... \n",
+ "\n",
+ " SeaLevelPressureAvgInches SeaLevelPressureLowInches VisibilityHighMiles \\\n",
+ "0 29.68 29.59 10 \n",
+ "1 30.13 29.87 10 \n",
+ "2 30.49 30.41 10 \n",
+ "3 30.45 30.3 10 \n",
+ "4 30.33 30.27 10 \n",
+ "\n",
+ " VisibilityAvgMiles VisibilityLowMiles WindHighMPH WindAvgMPH WindGustMPH \\\n",
+ "0 7 2 20 4 31 \n",
+ "1 10 5 16 6 25 \n",
+ "2 10 10 8 3 12 \n",
+ "3 10 7 12 4 20 \n",
+ "4 10 7 10 2 16 \n",
+ "\n",
+ " PrecipitationSumInches Events \n",
+ "0 0.46 Rain , Thunderstorm \n",
+ "1 0 \n",
+ "2 0 \n",
+ "3 0 \n",
+ "4 T \n",
+ "\n",
+ "[5 rows x 21 columns]"
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Your code here\n",
+ "Austin.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(1319, 21)"
+ ]
+ },
+ "execution_count": 8,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Your code here\n",
+ "Austin.shape"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " TempHighF TempAvgF TempLowF\n",
+ "count 1319.000000 1319.000000 1319.000000\n",
+ "mean 80.862775 70.642911 59.902957\n",
+ "std 14.766523 14.045904 14.190648\n",
+ "min 32.000000 29.000000 19.000000\n",
+ "25% 72.000000 62.000000 49.000000\n",
+ "50% 83.000000 73.000000 63.000000\n",
+ "75% 92.000000 83.000000 73.000000\n",
+ "max 107.000000 93.000000 81.000000"
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Your code here\n",
+ "Austin.describe()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Given the information you have learned from examining the dataset, write down three insights about the data in a markdown cell below"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Your Insights:\n",
+ "\n",
+ "1. There are 21 variables in the dataset. Three of them are read as numeric (`int64`) and the remaining 18 are read as `object`.\n",
+ "\n",
+ "2. The daily average temperature (`TempAvgF`) ranged from 29 degrees F to 93 degrees F, with a mean of about 70.6 degrees F. The highest temperature observed during this period was 107 degrees F and the lowest was 19 degrees F.\n",
+ "\n",
+ "3. The output of `head()` shows that many columns contain numeric data even though they are of `object` type. This means we will have to do some data cleaning.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Let's examine the `DewPointAvgF` variable by using the `unique()` function to list all unique values in this column.\n",
+ "\n",
+ "Describe what you find in a markdown cell below the code. What did you notice? What do you think made pandas treat this column as *object* instead of *int64*? "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Date 1319\n",
+ "TempHighF 74\n",
+ "TempAvgF 64\n",
+ "TempLowF 61\n",
+ "DewPointHighF 64\n",
+ "DewPointAvgF 66\n",
+ "DewPointLowF 73\n",
+ "HumidityHighPercent 58\n",
+ "HumidityAvgPercent 69\n",
+ "HumidityLowPercent 82\n",
+ "SeaLevelPressureHighInches 105\n",
+ "SeaLevelPressureAvgInches 101\n",
+ "SeaLevelPressureLowInches 105\n",
+ "VisibilityHighMiles 5\n",
+ "VisibilityAvgMiles 10\n",
+ "VisibilityLowMiles 12\n",
+ "WindHighMPH 22\n",
+ "WindAvgMPH 13\n",
+ "WindGustMPH 37\n",
+ "PrecipitationSumInches 114\n",
+ "Events 9\n",
+ "dtype: int64"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Your code here\n",
+ "Austin.nunique()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Date object\n",
+ "TempHighF int64\n",
+ "TempAvgF int64\n",
+ "TempLowF int64\n",
+ "DewPointHighF object\n",
+ "DewPointAvgF object\n",
+ "DewPointLowF object\n",
+ "HumidityHighPercent object\n",
+ "HumidityAvgPercent object\n",
+ "HumidityLowPercent object\n",
+ "SeaLevelPressureHighInches object\n",
+ "SeaLevelPressureAvgInches object\n",
+ "SeaLevelPressureLowInches object\n",
+ "VisibilityHighMiles object\n",
+ "VisibilityAvgMiles object\n",
+ "VisibilityLowMiles object\n",
+ "WindHighMPH object\n",
+ "WindAvgMPH object\n",
+ "WindGustMPH object\n",
+ "PrecipitationSumInches object\n",
+ "Events object\n",
+ "dtype: object"
+ ]
+ },
+ "execution_count": 14,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "Austin.dtypes"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0 49\n",
+ "1 36\n",
+ "2 27\n",
+ "3 28\n",
+ "4 40\n",
+ "Name: DewPointAvgF, dtype: object"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "Austin['DewPointAvgF'].head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "##### Your observation here\n",
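+ "###### Used `nunique()` and the `dtypes` attribute. `DewPointAvgF` is stored as `object`, which suggests that at least one entry is a non-numeric string; pandas falls back to `object` when a column mixes numbers and text. "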
+ ]
+ },
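+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A minimal sketch of how the offending entries could be located. It uses a small hypothetical Series called `sample` rather than the real column, and the placeholder string `'-'` is an assumption for illustration; the actual non-numeric values in `austin_weather.csv` may differ."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "# Hypothetical stand-in for a column that pandas read as object\n",
+ "sample = pd.Series(['49', '36', '-', '27'])\n",
+ "\n",
+ "# unique() lists every distinct value, including placeholder strings\n",
+ "print(sample.unique())\n",
+ "\n",
+ "# str.isdigit() is True only for strings made up entirely of digits,\n",
+ "# so negating it flags the entries that are not plain integers\n",
+ "# (it would also flag decimals and negative numbers)\n",
+ "not_plain_integers = sample[~sample.str.isdigit()]\n",
+ "print(not_plain_integers.tolist())"
+ ]
+ },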
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The following is a list of columns misrepresented as `object`. Use this list to convert the columns to numeric using the `pandas.to_numeric` function in the next cell. If you encounter errors when converting strings to numeric values, you can force the conversion by passing `errors='coerce'` to `pandas.to_numeric`. Coercing replaces non-convertible elements with `NaN`, which represents an undefined numeric value. This makes it possible to handle missing values conveniently in subsequent data processing.\n",
+ "\n",
+ "*Hint: you may use a loop to change one column at a time but it is more efficient to use `apply`.*"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "wrong_type_columns = ['DewPointHighF', 'DewPointAvgF', 'DewPointLowF', 'HumidityHighPercent', \n",
+ " 'HumidityAvgPercent', 'HumidityLowPercent', 'SeaLevelPressureHighInches', \n",
+ " 'SeaLevelPressureAvgInches' ,'SeaLevelPressureLowInches', 'VisibilityHighMiles',\n",
+ " 'VisibilityAvgMiles', 'VisibilityLowMiles', 'WindHighMPH', 'WindAvgMPH', \n",
+ " 'WindGustMPH', 'PrecipitationSumInches']"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "Austin[wrong_type_columns] = Austin[wrong_type_columns].apply(lambda x: pd.to_numeric(x, errors='coerce') if x.dtype == 'O' else x)"
+ ]
+ },
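+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The effect of coercion can be illustrated on a toy frame. The frame `toy`, its column names, and its values are hypothetical; only `pandas.to_numeric` with `errors='coerce'` comes from the text above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "# Hypothetical columns that mix numeric strings with text placeholders\n",
+ "toy = pd.DataFrame({'a': ['1', '2', 'T'], 'b': ['0.5', '-', '1.5']})\n",
+ "\n",
+ "# apply() passes each column to to_numeric; strings that cannot be\n",
+ "# parsed are replaced with NaN instead of raising an error\n",
+ "converted = toy.apply(pd.to_numeric, errors='coerce')\n",
+ "\n",
+ "print(converted.dtypes)\n",
+ "print(converted)"
+ ]
+ },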
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Check if your code has worked by printing the data types again. You should see only two `object` columns (`Date` and `Events`) now. All other columns should be `int64` or `float64`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Date object\n",
+ "TempHighF int64\n",
+ "TempAvgF int64\n",
+ "TempLowF int64\n",
+ "DewPointHighF float64\n",
+ "DewPointAvgF float64\n",
+ "DewPointLowF float64\n",
+ "HumidityHighPercent float64\n",
+ "HumidityAvgPercent float64\n",
+ "HumidityLowPercent float64\n",
+ "SeaLevelPressureHighInches float64\n",
+ "SeaLevelPressureAvgInches float64\n",
+ "SeaLevelPressureLowInches float64\n",
+ "VisibilityHighMiles float64\n",
+ "VisibilityAvgMiles float64\n",
+ "VisibilityLowMiles float64\n",
+ "WindHighMPH float64\n",
+ "WindAvgMPH float64\n",
+ "WindGustMPH float64\n",
+ "PrecipitationSumInches float64\n",
+ "Events object\n",
+ "dtype: object"
+ ]
+ },
+ "execution_count": 21,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "Austin.dtypes"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " Date TempHighF TempAvgF TempLowF DewPointHighF DewPointAvgF \\\n",
+ "0 2013-12-21 74 60 45 67.0 49.0 \n",
+ "1 2013-12-22 56 48 39 43.0 36.0 \n",
+ "2 2013-12-23 58 45 32 31.0 27.0 \n",
+ "3 2013-12-24 61 46 31 36.0 28.0 \n",
+ "4 2013-12-25 58 50 41 44.0 40.0 \n",
+ "\n",
+ " DewPointLowF HumidityHighPercent HumidityAvgPercent HumidityLowPercent \\\n",
+ "0 43.0 93.0 75.0 57.0 \n",
+ "1 28.0 93.0 68.0 43.0 \n",
+ "2 23.0 76.0 52.0 27.0 \n",
+ "3 21.0 89.0 56.0 22.0 \n",
+ "4 36.0 86.0 71.0 56.0 \n",
+ "\n",
+ " ... SeaLevelPressureAvgInches SeaLevelPressureLowInches \\\n",
+ "0 ... 29.68 29.59 \n",
+ "1 ... 30.13 29.87 \n",
+ "2 ... 30.49 30.41 \n",
+ "3 ... 30.45 30.30 \n",
+ "4 ... 30.33 30.27 \n",
+ "\n",
+ " VisibilityHighMiles VisibilityAvgMiles VisibilityLowMiles WindHighMPH \\\n",
+ "0 10.0 7.0 2.0 20.0 \n",
+ "1 10.0 10.0 5.0 16.0 \n",
+ "2 10.0 10.0 10.0 8.0 \n",
+ "3 10.0 10.0 7.0 12.0 \n",
+ "4 10.0 10.0 7.0 10.0 \n",
+ "\n",
+ " WindAvgMPH WindGustMPH PrecipitationSumInches Events \n",
+ "0 4.0 31.0 0.46 Rain , Thunderstorm \n",
+ "1 6.0 25.0 0.00 \n",
+ "2 3.0 12.0 0.00 \n",
+ "3 4.0 20.0 0.00 \n",
+ "4 2.0 16.0 NaN \n",
+ "\n",
+ "[5 rows x 21 columns]"
+ ]
+ },
+ "execution_count": 22,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "Austin.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Challenge 2 - Handle the Missing Data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Now that we have fixed the type mismatch, let's address the missing data.\n",
+ "\n",
+ "By coercing the columns to numeric, we have created `NaN` for each cell containing non-numeric characters. We should choose a strategy to address these missing data.\n",
+ "\n",
+ "The first step is to examine how many rows contain missing data.\n",
+ "\n",
+ "We check how much missing data we have by applying the `.isnull()` function to our dataset. To find the rows with a missing value in any cell, we chain `.any(axis=1)` onto the result. `austin.isnull().any(axis=1)` returns a boolean column that is `True` if the row contains at least one missing value and `False` otherwise. Subsetting the dataframe with this column gives us all rows with at least one missing value. \n",
+ "\n",
+ "#### In the next cell, identify all rows containing at least one missing value. Assign the dataframe of rows with missing values to a variable called `missing_values`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Date 0\n",
+ "TempHighF 0\n",
+ "TempAvgF 0\n",
+ "TempLowF 0\n",
+ "DewPointHighF 7\n",
+ "DewPointAvgF 7\n",
+ "DewPointLowF 7\n",
+ "HumidityHighPercent 2\n",
+ "HumidityAvgPercent 2\n",
+ "HumidityLowPercent 2\n",
+ "SeaLevelPressureHighInches 3\n",
+ "SeaLevelPressureAvgInches 3\n",
+ "SeaLevelPressureLowInches 3\n",
+ "VisibilityHighMiles 12\n",
+ "VisibilityAvgMiles 12\n",
+ "VisibilityLowMiles 12\n",
+ "WindHighMPH 2\n",
+ "WindAvgMPH 2\n",
+ "WindGustMPH 4\n",
+ "PrecipitationSumInches 124\n",
+ "Events 0\n",
+ "dtype: int64"
+ ]
+ },
+ "execution_count": 25,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Your code here\n",
+ "Austin.isna().sum()"
+ ]
+ },
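+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The row-level check described above can be sketched on a small hypothetical frame (`toy` and its values are illustrative only). Note that this counts rows, whereas `.isnull().sum().sum()` counts individual missing cells."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Hypothetical data: two of the three rows contain a NaN\n",
+ "toy = pd.DataFrame({'x': [1.0, np.nan, 3.0],\n",
+ "                    'y': [4.0, 5.0, np.nan],\n",
+ "                    'z': [7.0, 8.0, 9.0]})\n",
+ "\n",
+ "# True for every row that has at least one missing value\n",
+ "row_has_missing = toy.isnull().any(axis=1)\n",
+ "\n",
+ "# Boolean indexing keeps only those rows\n",
+ "missing_rows = toy[row_has_missing]\n",
+ "\n",
+ "# Ratio of rows with missing data to total rows\n",
+ "print(len(missing_rows), len(toy), len(missing_rows) / len(toy))"
+ ]
+ },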
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "There are multiple strategies to handle missing data. The most common ones are listed below:\n",
+ "\n",
+ "* Removing all rows or all columns containing missing data. This is the simplest strategy. It may work in some cases but not others.\n",
+ "\n",
+ "* Filling all missing values with a placeholder value. \n",
+ " * For categorical data, `0`, `-1`, and `9999` are some commonly used placeholder values. \n",
+ " * For continuous data, some may opt to fill all missing data with the mean. This strategy is not optimal since it shrinks the variance of the column and can make the model appear to fit better than it really does.\n",
+ "\n",
+ "* Filling the values using some algorithm. \n",
+ "\n",
+ "#### In our case, we will use a hybrid approach: first remove the column that contains the most missing values, then fill in the remaining missing values with the *linear interpolation* algorithm."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Next, count the number of rows of `austin` and `missing_values`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1319"
+ ]
+ },
+ "execution_count": 28,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(Austin)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "204"
+ ]
+ },
+ "execution_count": 29,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "Austin.isnull().sum().sum()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Calculate the ratio of missing rows to total rows"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "total_rows = len(Austin)\n",
+ "rows_with_missing_values = Austin.isnull().any(axis=1).sum()\n",
+ "ratio_missing_rows = rows_with_missing_values / total_rows"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As you can see, there is a large proportion of missing data (over 10%). Perhaps we should evaluate which columns have the most missing data and remove those columns. For the remaining columns, we will perform a linear approximation of the missing data.\n",
+ "\n",
+ "We can find the number of missing values in each column using the `.isna()` function. Chaining `.sum()` onto `.isna()` gives the number of missing values per column."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Date 0\n",
+ "TempHighF 0\n",
+ "TempAvgF 0\n",
+ "TempLowF 0\n",
+ "DewPointHighF 7\n",
+ "DewPointAvgF 7\n",
+ "DewPointLowF 7\n",
+ "HumidityHighPercent 2\n",
+ "HumidityAvgPercent 2\n",
+ "HumidityLowPercent 2\n",
+ "SeaLevelPressureHighInches 3\n",
+ "SeaLevelPressureAvgInches 3\n",
+ "SeaLevelPressureLowInches 3\n",
+ "VisibilityHighMiles 12\n",
+ "VisibilityAvgMiles 12\n",
+ "VisibilityLowMiles 12\n",
+ "WindHighMPH 2\n",
+ "WindAvgMPH 2\n",
+ "WindGustMPH 4\n",
+ "PrecipitationSumInches 124\n",
+ "Events 0\n",
+ "dtype: int64"
+ ]
+ },
+ "execution_count": 33,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Your code here\n",
+ "Austin.isna().sum()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### As you can see from the output, the majority of missing data is in one column called `PrecipitationSumInches`. What is the ratio of missing values in this column to its total number of rows?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here\n",
+ "total_rows = len(Austin)\n",
+ "missing_values_precipitation = Austin['PrecipitationSumInches'].isnull().sum()\n",
+ "ratio_missing_precipitation = missing_values_precipitation / total_rows"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Almost 10% of the data in this column is missing! Therefore, we prefer to remove this column instead of filling its missing values. \n",
+ "\n",
+ "#### Remove this column from `austin` using the `.drop()` function. Use the `inplace=True` argument.\n",
+ "\n",
+ "*Hints:*\n",
+ "\n",
+ "* By supplying `inplace=True` to `drop()`, the original dataframe object will be changed in place and the function will return `None`. In contrast, if you don't supply `inplace=True`, which is equivalent to supplying `inplace=False` because `False` is the default value, the original dataframe object will be kept and the function returns a copy of the transformed dataframe object. In the latter case, you'll have to assign the returned object back to your variable.\n",
+ "\n",
+ "* Also, since you are dropping a column instead of a row, you'll need to supply `axis=1` to `drop()`.\n",
+ "\n",
+ "[Reference for `pandas.DataFrame.drop`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 37,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " Date TempHighF TempAvgF TempLowF DewPointHighF DewPointAvgF \\\n",
+ "0 2013-12-21 74 60 45 67.0 49.0 \n",
+ "1 2013-12-22 56 48 39 43.0 36.0 \n",
+ "2 2013-12-23 58 45 32 31.0 27.0 \n",
+ "3 2013-12-24 61 46 31 36.0 28.0 \n",
+ "4 2013-12-25 58 50 41 44.0 40.0 \n",
+ "\n",
+ " DewPointLowF HumidityHighPercent HumidityAvgPercent HumidityLowPercent \\\n",
+ "0 43.0 93.0 75.0 57.0 \n",
+ "1 28.0 93.0 68.0 43.0 \n",
+ "2 23.0 76.0 52.0 27.0 \n",
+ "3 21.0 89.0 56.0 22.0 \n",
+ "4 36.0 86.0 71.0 56.0 \n",
+ "\n",
+ " SeaLevelPressureHighInches SeaLevelPressureAvgInches \\\n",
+ "0 29.86 29.68 \n",
+ "1 30.41 30.13 \n",
+ "2 30.56 30.49 \n",
+ "3 30.56 30.45 \n",
+ "4 30.41 30.33 \n",
+ "\n",
+ " SeaLevelPressureLowInches VisibilityHighMiles VisibilityAvgMiles \\\n",
+ "0 29.59 10.0 7.0 \n",
+ "1 29.87 10.0 10.0 \n",
+ "2 30.41 10.0 10.0 \n",
+ "3 30.30 10.0 10.0 \n",
+ "4 30.27 10.0 10.0 \n",
+ "\n",
+ " VisibilityLowMiles WindHighMPH WindAvgMPH WindGustMPH \\\n",
+ "0 2.0 20.0 4.0 31.0 \n",
+ "1 5.0 16.0 6.0 25.0 \n",
+ "2 10.0 8.0 3.0 12.0 \n",
+ "3 7.0 12.0 4.0 20.0 \n",
+ "4 7.0 10.0 2.0 16.0 \n",
+ "\n",
+ " Events \n",
+ "0 Rain , Thunderstorm \n",
+ "1 \n",
+ "2 \n",
+ "3 \n",
+ "4 "
+ ]
+ },
+ "execution_count": 37,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Your code here \n",
+ "Austin.drop(columns='PrecipitationSumInches', inplace=True)\n",
+ "\n",
+ "# Print `austin` to confirm the column is indeed removed\n",
+ "Austin.head()"
+ ]
+ },
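+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The difference between the two `inplace` settings described in the hints can be checked on a hypothetical two-column frame (`toy` and its column names are illustrative only)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "\n",
+ "toy = pd.DataFrame({'keep': [1, 2], 'remove': [3, 4]})\n",
+ "\n",
+ "# Without inplace=True, drop() returns a new frame and leaves toy unchanged\n",
+ "trimmed = toy.drop('remove', axis=1)\n",
+ "print(list(toy.columns), list(trimmed.columns))\n",
+ "\n",
+ "# With inplace=True, toy itself is modified and the call returns None\n",
+ "result = toy.drop('remove', axis=1, inplace=True)\n",
+ "print(result, list(toy.columns))"
+ ]
+ },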
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Next we will perform linear interpolation of the missing data.\n",
+ "\n",
+ "This means that we will use a linear algorithm to estimate the missing data. Linear interpolation assumes that there is a straight line between the known points and that the missing point falls on that line. This is a good enough approximation for weather-related data. Weather-related data is typically a time series, so we do not want to drop rows from our data if possible. It is preferable to estimate the missing values rather than remove the rows. However, if you have data from a single point in time, perhaps a better solution would be to remove the rows. \n",
+ "\n",
+ "If you would like to read more about linear interpolation, you can do so [here](https://en.wikipedia.org/wiki/Linear_interpolation).\n",
+ "\n",
+ "In the following cell, use the `.interpolate()` function on the entire dataframe. This time pass the `inplace=False` argument to the function and assign the interpolated dataframe to a new variable called `austin_fixed` so that we can compare with `austin`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 39,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "C:\\Users\\julyj\\AppData\\Local\\Temp\\ipykernel_25324\\2434206715.py:4: FutureWarning: DataFrame.interpolate with object dtype is deprecated and will raise in a future version. Call obj.infer_objects(copy=False) before interpolating instead.\n",
+ " Austin_fixed = Austin_fixed.interpolate(method='linear', axis=0)\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Keep rows with at least half of their columns non-missing, then interpolate the remaining gaps\n",
+ "threshold = 0.5\n",
+ "Austin_fixed = Austin.dropna(thresh=int((1-threshold)*len(Austin.columns)), axis=0)\n",
+ "Austin_fixed = Austin_fixed.interpolate(method='linear', axis=0)"
+ ]
+ },
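+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To see what the straight-line assumption does in practice, here is a small sketch on a hypothetical Series of readings (the name `readings` and the values are made up for illustration)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Two consecutive missing readings between 10 and 16\n",
+ "readings = pd.Series([10.0, np.nan, np.nan, 16.0])\n",
+ "\n",
+ "# Linear interpolation spaces the gap evenly between the known\n",
+ "# neighbours, giving 10 -> 12 -> 14 -> 16\n",
+ "filled = readings.interpolate(method='linear')\n",
+ "print(filled.tolist())"
+ ]
+ },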
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Check to make sure `austin_fixed` contains no missing data. Also check `austin` - it still contains missing data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 41,
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Date 0\n",
+ "TempHighF 0\n",
+ "TempAvgF 0\n",
+ "TempLowF 0\n",
+ "DewPointHighF 0\n",
+ "DewPointAvgF 0\n",
+ "DewPointLowF 0\n",
+ "HumidityHighPercent 0\n",
+ "HumidityAvgPercent 0\n",
+ "HumidityLowPercent 0\n",
+ "SeaLevelPressureHighInches 0\n",
+ "SeaLevelPressureAvgInches 0\n",
+ "SeaLevelPressureLowInches 0\n",
+ "VisibilityHighMiles 0\n",
+ "VisibilityAvgMiles 0\n",
+ "VisibilityLowMiles 0\n",
+ "WindHighMPH 0\n",
+ "WindAvgMPH 0\n",
+ "WindGustMPH 0\n",
+ "Events 0\n",
+ "dtype: int64"
+ ]
+ },
+ "execution_count": 41,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Your code here\n",
+ "Austin_fixed.isnull().sum()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 42,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Date 0\n",
+ "TempHighF 0\n",
+ "TempAvgF 0\n",
+ "TempLowF 0\n",
+ "DewPointHighF 7\n",
+ "DewPointAvgF 7\n",
+ "DewPointLowF 7\n",
+ "HumidityHighPercent 2\n",
+ "HumidityAvgPercent 2\n",
+ "HumidityLowPercent 2\n",
+ "SeaLevelPressureHighInches 3\n",
+ "SeaLevelPressureAvgInches 3\n",
+ "SeaLevelPressureLowInches 3\n",
+ "VisibilityHighMiles 12\n",
+ "VisibilityAvgMiles 12\n",
+ "VisibilityLowMiles 12\n",
+ "WindHighMPH 2\n",
+ "WindAvgMPH 2\n",
+ "WindGustMPH 4\n",
+ "Events 0\n",
+ "dtype: int64"
+ ]
+ },
+ "execution_count": 42,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "Austin.isnull().sum()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Challenge 3 - Processing the `Events` Column"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Our dataframe contains one true text column - the Events column. We should evaluate this column to determine how to process it.\n",
+ "\n",
+ "Use the `value_counts()` function to evaluate the contents of this column."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 45,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Events\n",
+ " 903\n",
+ "Rain 192\n",
+ "Rain , Thunderstorm 137\n",
+ "Fog , Rain , Thunderstorm 33\n",
+ "Fog 21\n",
+ "Thunderstorm 17\n",
+ "Fog , Rain 14\n",
+ "Rain , Snow 1\n",
+ "Fog , Thunderstorm 1\n",
+ "Name: count, dtype: int64"
+ ]
+ },
+ "execution_count": 45,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Your code here:\n",
+ "Austin['Events'].value_counts()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 46,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array(['Rain , Thunderstorm', ' ', 'Rain', 'Fog', 'Rain , Snow',\n",
+ " 'Fog , Rain', 'Thunderstorm', 'Fog , Rain , Thunderstorm',\n",
+ " 'Fog , Thunderstorm'], dtype=object)"
+ ]
+ },
+ "execution_count": 46,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "Austin_fixed.Events.unique()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 47,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "9"
+ ]
+ },
+ "execution_count": 47,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "Austin_fixed.Events.nunique()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 48,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "count 1317\n",
+ "unique 9\n",
+ "top \n",
+ "freq 901\n",
+ "Name: Events, dtype: object"
+ ]
+ },
+ "execution_count": 48,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "Austin_fixed.Events.describe()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Reading the values of `Events` and reflecting on what they mean in the context of the data, you realize that this column indicates which weather events happened on a particular day.\n",
+ "\n",
+ "#### What is the largest number of events that happened in a single day? Enter your answer in the next cell."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 50,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "The largest number of events that happened in a single day: 3\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Your answer:\n",
+ "# Count only the non-empty, comma-separated event names; days without events are stored as a blank string\n",
+ "Austin['NumEvents'] = Austin['Events'].apply(lambda x: len([e for e in str(x).split(',') if e.strip()]) if pd.notnull(x) else 0)\n",
+ "max_events = Austin['NumEvents'].max()\n",
+ "print(f\"The largest number of events that happened in a single day: {max_events}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### We want to transform the string-type `Events` values to numbers. This will allow us to apply machine learning algorithms easily.\n",
+ "\n",
+ "How? We will create a new column for each type of event (i.e. *Rain*, *Snow*, *Fog*, *Thunderstorm*). In each column, we use `1` to indicate that the corresponding event happened on that day and `0` otherwise.\n",
+ "\n",
+ "Below we provide you with a list of all event types. Loop over the list and create a dummy column with `0` values for each event in `austin_fixed`. To create a new dummy column with `0` values, simply use `austin_fixed[event] = 0`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 52,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " Date TempHighF TempAvgF TempLowF DewPointHighF DewPointAvgF \\\n",
+ "0 2013-12-21 74 60 45 67.0 49.0 \n",
+ "1 2013-12-22 56 48 39 43.0 36.0 \n",
+ "2 2013-12-23 58 45 32 31.0 27.0 \n",
+ "3 2013-12-24 61 46 31 36.0 28.0 \n",
+ "4 2013-12-25 58 50 41 44.0 40.0 \n",
+ "\n",
+ " DewPointLowF HumidityHighPercent HumidityAvgPercent HumidityLowPercent \\\n",
+ "0 43.0 93.0 75.0 57.0 \n",
+ "1 28.0 93.0 68.0 43.0 \n",
+ "2 23.0 76.0 52.0 27.0 \n",
+ "3 21.0 89.0 56.0 22.0 \n",
+ "4 36.0 86.0 71.0 56.0 \n",
+ "\n",
+ " ... VisibilityAvgMiles VisibilityLowMiles WindHighMPH WindAvgMPH \\\n",
+ "0 ... 7.0 2.0 20.0 4.0 \n",
+ "1 ... 10.0 5.0 16.0 6.0 \n",
+ "2 ... 10.0 10.0 8.0 3.0 \n",
+ "3 ... 10.0 7.0 12.0 4.0 \n",
+ "4 ... 10.0 7.0 10.0 2.0 \n",
+ "\n",
+ " WindGustMPH Events Snow Fog Rain Thunderstorm \n",
+ "0 31.0 Rain , Thunderstorm 0 0 0 0 \n",
+ "1 25.0 0 0 0 0 \n",
+ "2 12.0 0 0 0 0 \n",
+ "3 20.0 0 0 0 0 \n",
+ "4 16.0 0 0 0 0 \n",
+ "\n",
+ "[5 rows x 24 columns]"
+ ]
+ },
+ "execution_count": 52,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "event_list = ['Snow', 'Fog', 'Rain', 'Thunderstorm']\n",
+ "\n",
+ "# Your code here\n",
+ "for event in event_list:\n",
+ " Austin_fixed[event] = 0\n",
+ "\n",
+ "\n",
+ "# Print your new dataframe to check whether new columns have been created:\n",
+ "\n",
+ "Austin_fixed.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Next, populate the actual values in the dummy columns of `austin_fixed`.\n",
+ "\n",
+ "You will check the *Events* column. If its string value contains `Rain`, then the *Rain* column should be `1`. The same for `Snow`, `Fog`, and `Thunderstorm`.\n",
+ "\n",
+ "*Hints:*\n",
+ "\n",
+ "* Use [`pandas.Series.str.contains()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html) to create the value series of each new column.\n",
+ "\n",
+ "* What if the values you populated are booleans instead of numbers? You can cast the boolean values to numbers by using `.astype(int)`. For instance, `pd.Series([True, True, False]).astype(int)` will return a new series with values of `[1, 1, 0]`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 65,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here\n",
+ "Austin_fixed['Rain'] = Austin_fixed['Events'].str.contains('Rain').astype(int)\n",
+ "Austin_fixed['Snow'] = Austin_fixed['Events'].str.contains('Snow').astype(int)\n",
+ "Austin_fixed['Fog'] = Austin_fixed['Events'].str.contains('Fog').astype(int)\n",
+ "Austin_fixed['Thunderstorm'] = Austin_fixed['Events'].str.contains('Thunderstorm').astype(int)"
+ ]
+ },
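+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The populate step can also be written as a loop over the event list. The sketch below is a minimal illustration on a small made-up series rather than the lab data: the three sample strings are assumptions that imitate the format of `Events`, and `regex=False` asks `str.contains()` for a literal substring match.\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Toy values imitating the Events column (hypothetical, not the lab data)\n",
+ "events = pd.Series(['Rain , Thunderstorm', ' ', 'Fog , Rain'])\n",
+ "event_list = ['Snow', 'Fog', 'Rain', 'Thunderstorm']\n",
+ "\n",
+ "# One 0/1 indicator column per event type\n",
+ "dummies = pd.DataFrame({\n",
+ "    event: events.str.contains(event, regex=False).astype(int)\n",
+ "    for event in event_list\n",
+ "})\n",
+ "print(dummies)\n",
+ "```\n",
+ "\n",
+ "Because the check is a substring match, it is unaffected by the spaces around the commas in values such as `Rain , Thunderstorm`."
+ ]
+ },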
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Print out `austin_fixed` to check if the event columns are populated with the intended values"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 67,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " Date TempHighF TempAvgF TempLowF DewPointHighF DewPointAvgF \\\n",
+ "0 2013-12-21 74 60 45 67.0 49.0 \n",
+ "1 2013-12-22 56 48 39 43.0 36.0 \n",
+ "2 2013-12-23 58 45 32 31.0 27.0 \n",
+ "3 2013-12-24 61 46 31 36.0 28.0 \n",
+ "4 2013-12-25 58 50 41 44.0 40.0 \n",
+ "\n",
+ " DewPointLowF HumidityHighPercent HumidityAvgPercent HumidityLowPercent \\\n",
+ "0 43.0 93.0 75.0 57.0 \n",
+ "1 28.0 93.0 68.0 43.0 \n",
+ "2 23.0 76.0 52.0 27.0 \n",
+ "3 21.0 89.0 56.0 22.0 \n",
+ "4 36.0 86.0 71.0 56.0 \n",
+ "\n",
+ " ... VisibilityAvgMiles VisibilityLowMiles WindHighMPH WindAvgMPH \\\n",
+ "0 ... 7.0 2.0 20.0 4.0 \n",
+ "1 ... 10.0 5.0 16.0 6.0 \n",
+ "2 ... 10.0 10.0 8.0 3.0 \n",
+ "3 ... 10.0 7.0 12.0 4.0 \n",
+ "4 ... 10.0 7.0 10.0 2.0 \n",
+ "\n",
+ " WindGustMPH Events Snow Fog Rain Thunderstorm \n",
+ "0 31.0 Rain , Thunderstorm 0 0 1 1 \n",
+ "1 25.0 0 0 0 0 \n",
+ "2 12.0 0 0 0 0 \n",
+ "3 20.0 0 0 0 0 \n",
+ "4 16.0 0 0 0 0 \n",
+ "\n",
+ "[5 rows x 24 columns]"
+ ]
+ },
+ "execution_count": 67,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Your code here\n",
+ "Austin_fixed.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### If your code worked correctly, you can now drop the `Events` column, as we no longer need it."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 71,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here\n",
+ "Austin_fixed.drop('Events', axis=1, inplace=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Challenge 4 - Processing The `Date` Column\n",
+ "\n",
+ "The `Date` column is another non-numeric field in our dataset. A value in that field looks like `'2014-01-06'`, which consists of the year, month, and day connected with hyphens. One way to convert the date string to a number is to use an approach similar to the one we used for `Events`, namely splitting the column into numerical `Year`, `Month`, and `Day` columns. In this challenge we'll show you another way, which is to use the `toordinal()` method from Python's `datetime` library. Depending on the machine learning analysis you will conduct, each approach has its pros and cons. Our goal today is to practice data preparation, so we'll skip that discussion here.\n",
+ "\n",
+ "Here you can find the [reference](https://docs.python.org/3/library/datetime.html) and an [example](https://stackoverflow.com/questions/39846918/convert-date-to-ordinal-python) for `toordinal`. The basic process is to first convert the string to a `datetime` object using `datetime.datetime.strptime`, then convert the `datetime` object to an integer using `toordinal`.\n",
+ "\n",
+ "#### In the cell below, convert the `Date` column values from string to numeric values using `toordinal()`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 83,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import datetime\n",
+ "Austin_fixed['Date'] = Austin_fixed['Date'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d').toordinal())\n"
+ ]
+ },
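+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To see what the conversion does to a single value, here is a minimal sketch using only the standard library. The variable names are illustrative; the date string is the first `Date` value in the dataset, and the ordinal is the number of days counted from 0001-01-01 (which is day 1).\n",
+ "\n",
+ "```python\n",
+ "import datetime\n",
+ "\n",
+ "date_string = '2013-12-21'\n",
+ "\n",
+ "# Parse the string, then convert the datetime to its proleptic Gregorian ordinal\n",
+ "parsed = datetime.datetime.strptime(date_string, '%Y-%m-%d')\n",
+ "ordinal = parsed.toordinal()\n",
+ "print(ordinal)  # 735223\n",
+ "\n",
+ "# The conversion is reversible, so no date information is lost\n",
+ "restored = datetime.date.fromordinal(ordinal)\n",
+ "print(restored.isoformat())  # 2013-12-21\n",
+ "```\n",
+ "\n",
+ "Consecutive days map to consecutive integers, so the ordinal preserves both the order of the dates and the distance between them."
+ ]
+ },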
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Print `austin_fixed` to check your `Date` column."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 85,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " Date TempHighF TempAvgF TempLowF DewPointHighF DewPointAvgF \\\n",
+ "0 735223 74 60 45 67.0 49.0 \n",
+ "1 735224 56 48 39 43.0 36.0 \n",
+ "2 735225 58 45 32 31.0 27.0 \n",
+ "3 735226 61 46 31 36.0 28.0 \n",
+ "4 735227 58 50 41 44.0 40.0 \n",
+ "\n",
+ " DewPointLowF HumidityHighPercent HumidityAvgPercent HumidityLowPercent \\\n",
+ "0 43.0 93.0 75.0 57.0 \n",
+ "1 28.0 93.0 68.0 43.0 \n",
+ "2 23.0 76.0 52.0 27.0 \n",
+ "3 21.0 89.0 56.0 22.0 \n",
+ "4 36.0 86.0 71.0 56.0 \n",
+ "\n",
+ " ... VisibilityHighMiles VisibilityAvgMiles VisibilityLowMiles \\\n",
+ "0 ... 10.0 7.0 2.0 \n",
+ "1 ... 10.0 10.0 5.0 \n",
+ "2 ... 10.0 10.0 10.0 \n",
+ "3 ... 10.0 10.0 7.0 \n",
+ "4 ... 10.0 10.0 7.0 \n",
+ "\n",
+ " WindHighMPH WindAvgMPH WindGustMPH Snow Fog Rain Thunderstorm \n",
+ "0 20.0 4.0 31.0 0 0 1 1 \n",
+ "1 16.0 6.0 25.0 0 0 0 0 \n",
+ "2 8.0 3.0 12.0 0 0 0 0 \n",
+ "3 12.0 4.0 20.0 0 0 0 0 \n",
+ "4 10.0 2.0 16.0 0 0 0 0 \n",
+ "\n",
+ "[5 rows x 23 columns]"
+ ]
+ },
+ "execution_count": 85,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "Austin_fixed.head(5)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Challenge 5 - Sampling and Holdout Sets"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Now that we have processed the data for machine learning, we will separate the data into training and test sets.\n",
+ "\n",
+ "We first train the model using only the training set and check our metrics on it. We then apply the model to the test set and check the same metrics there. If the metrics are significantly better on the training set than on the test set, we know we have overfit our model, and we will need to revise it so that it generalizes better to data outside the training set."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### In the next cells we will separate the data into a training set and a test set using the `train_test_split()` function in scikit-learn.\n",
+ "\n",
+ "When using `scikit-learn` for machine learning, we first separate the data into predictor and response variables. This is the standard way of passing datasets into a model in `scikit-learn`, which then fits the model to the predictors and the response.\n",
+ "\n",
+ "In the next cell, assign the `TempAvgF` column to `y` and the remaining columns to `X`. Your `X` should be a subset of `austin_fixed` containing the following columns: \n",
+ "\n",
+ "```['Date',\n",
+ " 'TempHighF',\n",
+ " 'TempLowF',\n",
+ " 'DewPointHighF',\n",
+ " 'DewPointAvgF',\n",
+ " 'DewPointLowF',\n",
+ " 'HumidityHighPercent',\n",
+ " 'HumidityAvgPercent',\n",
+ " 'HumidityLowPercent',\n",
+ " 'SeaLevelPressureHighInches',\n",
+ " 'SeaLevelPressureAvgInches',\n",
+ " 'SeaLevelPressureLowInches',\n",
+ " 'VisibilityHighMiles',\n",
+ " 'VisibilityAvgMiles',\n",
+ " 'VisibilityLowMiles',\n",
+ " 'WindHighMPH',\n",
+ " 'WindAvgMPH',\n",
+ " 'WindGustMPH',\n",
+ " 'Snow',\n",
+ " 'Fog',\n",
+ " 'Rain',\n",
+ " 'Thunderstorm']```\n",
+ " \n",
+ " Your `y` should be a subset of `austin_fixed` containing one column `TempAvgF`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Predictors: every column except the response `TempAvgF`, including the ordinal `Date`\n",
+ "X = Austin_fixed[['Date', 'TempHighF', 'TempLowF', 'DewPointHighF', 'DewPointAvgF', 'DewPointLowF',\n",
+ "                  'HumidityHighPercent', 'HumidityAvgPercent', 'HumidityLowPercent',\n",
+ "                  'SeaLevelPressureHighInches', 'SeaLevelPressureAvgInches', 'SeaLevelPressureLowInches',\n",
+ "                  'VisibilityHighMiles', 'VisibilityAvgMiles', 'VisibilityLowMiles',\n",
+ "                  'WindHighMPH', 'WindAvgMPH', 'WindGustMPH', 'Snow', 'Fog', 'Rain', 'Thunderstorm']]\n",
+ "\n",
+ "# Response: the average daily temperature\n",
+ "y = Austin_fixed['TempAvgF']\n",
+ "\n",
+ "\n",
+ "print(\"X (predictor variables):\\n\", X.head())\n",
+ "print(\"\\ny (response variable):\\n\", y.head())\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In the next cell, import `train_test_split` from `sklearn.model_selection`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 91,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#Your code here:\n",
+ "from sklearn.model_selection import train_test_split"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now that we have split the data into predictor and response variables and imported the `train_test_split()` function, split `X` and `y` into `X_train`, `X_test`, `y_train`, and `y_test`. 80% of the data should be in the training set and 20% in the test set. The `train_test_split()` reference can be accessed [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).\n",
+ "\n",
+ "\n",
+ "Enter your code in the cell below:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 93,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "X_train shape: (1053, 22)\n",
+ "X_test shape: (264, 22)\n",
+ "y_train shape: (1053,)\n",
+ "y_test shape: (264,)\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Hold out 20% of the rows for testing; a fixed random_state makes the split reproducible\n",
+ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
+ "\n",
+ "\n",
+ "print(\"X_train shape:\", X_train.shape)\n",
+ "print(\"X_test shape:\", X_test.shape)\n",
+ "print(\"y_train shape:\", y_train.shape)\n",
+ "print(\"y_test shape:\", y_test.shape)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "#### Congratulations! Now you have finished the preparation of the dataset!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Bonus Challenge 1\n",
+ "\n",
+ "#### While the above is common practice for preparing most datasets, with time series data we often do not want to select rows at random.\n",
+ "\n",
+ "This is because many time series algorithms rely on the observations being in chronological order and equally spaced in time. In such cases, we typically use the first, larger portion of the rows as the training data and the last few rows as the test data. We don't use `train_test_split()` here because, by default, it shuffles the rows before splitting.\n",
+ "\n",
+ "In the following cell, compute the number of rows that account for 80% of our data and round it up to the next integer. Assign this number to `ts_rows`."
+ ]
+ },
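+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a rough illustration of a chronological split, the sketch below uses a short made-up list in place of the lab data. The ten values and the names `observations`, `ts_train`, and `ts_test` are assumptions for the example; the same slicing idea applies to the rows of `X` and `y`.\n",
+ "\n",
+ "```python\n",
+ "import math\n",
+ "\n",
+ "# Ten observations in chronological order (hypothetical values)\n",
+ "observations = [60, 48, 45, 46, 50, 53, 51, 47, 62, 65]\n",
+ "\n",
+ "# 80% of the rows, rounded up to the next integer\n",
+ "ts_rows = math.ceil(0.8 * len(observations))\n",
+ "\n",
+ "# The earlier rows train the model; the most recent rows are held out for testing\n",
+ "ts_train = observations[:ts_rows]\n",
+ "ts_test = observations[ts_rows:]\n",
+ "print(ts_rows, len(ts_train), len(ts_test))\n",
+ "```\n",
+ "\n",
+ "Slicing by position keeps the original order, so every test observation comes after every training observation in time."
+ ]
+ },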
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here:\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Assign the first `ts_rows` rows of `X` to `X_ts_train` and the remaining rows to `X_ts_test`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here:\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Assign the first `ts_rows` rows of `y` to `y_ts_train` and the remaining rows to `y_ts_test`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Your code here:\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.12.4"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}