diff --git a/.DS_Store b/.DS_Store
new file mode 100644
index 00000000..e7697274
Binary files /dev/null and b/.DS_Store differ
diff --git a/module1-join-and-reshape-data/.DS_Store b/module1-join-and-reshape-data/.DS_Store
new file mode 100644
index 00000000..5008ddfc
Binary files /dev/null and b/module1-join-and-reshape-data/.DS_Store differ
diff --git a/module1-join-and-reshape-data/LS_DS_121_Join_and_Reshape_Data.ipynb b/module1-join-and-reshape-data/LS_DS_121_Join_and_Reshape_Data.ipynb
new file mode 100644
index 00000000..3849f919
--- /dev/null
+++ b/module1-join-and-reshape-data/LS_DS_121_Join_and_Reshape_Data.ipynb
@@ -0,0 +1,1272 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "LS_DS_121_Join_and_Reshape_Data.ipynb",
+ "version": "0.3.2",
+ "provenance": [],
+ "collapsed_sections": []
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ }
+ },
+ "cells": [
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "pmU5YUal1eTZ"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "_Lambda School Data Science_\n",
+ "\n",
+ "# Join and Reshape datasets\n",
+ "\n",
+ "Objectives\n",
+ "- concatenate data with pandas\n",
+ "- merge data with pandas\n",
+ "- understand tidy data formatting\n",
+ "- melt and pivot data with pandas\n",
+ "\n",
+ "Links\n",
+ "- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)\n",
+ " - Combine Data Sets: Standard Joins\n",
+ " - Tidy Data\n",
+ " - Reshaping Data\n",
+ "- [Tidy Data](https://en.wikipedia.org/wiki/Tidy_data)\n",
+ "- Python Data Science Handbook\n",
+ " - [Chapter 3.6](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html), Combining Datasets: Concat and Append\n",
+ " - [Chapter 3.7](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html), Combining Datasets: Merge and Join\n",
+ " - [Chapter 3.8](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html), Aggregation and Grouping\n",
+ " - [Chapter 3.9](https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html), Pivot Tables\n",
+ " \n",
+ "Reference\n",
+ "- Pandas Documentation: [Reshaping and Pivot Tables](https://pandas.pydata.org/pandas-docs/stable/reshaping.html)\n",
+ "- Modern Pandas, Part 5: [Tidy Data](https://tomaugspurger.github.io/modern-5-tidy.html)"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "Mmi3J5fXrwZ3"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Download data\n",
+ "\n",
+ "We’ll work with a dataset of [3 Million Instacart Orders, Open Sourced](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2)!"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "K2kcrJVybjrW",
+ "outputId": "d6483326-62c0-41ae-db9c-8a366ee60495",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 202
+ }
+ },
+ "cell_type": "code",
+ "source": [
+ "!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz"
+ ],
+ "execution_count": 0,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "--2019-04-28 02:31:08-- https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz\n",
+ "Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.85.125\n",
+ "Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.85.125|:443... connected.\n",
+ "HTTP request sent, awaiting response... 200 OK\n",
+ "Length: 205548478 (196M) [application/x-gzip]\n",
+ "Saving to: ‘instacart_online_grocery_shopping_2017_05_01.tar.gz’\n",
+ "\n",
+ "instacart_online_gr 100%[===================>] 196.03M 43.6MB/s in 4.2s \n",
+ "\n",
+ "2019-04-28 02:31:12 (46.5 MB/s) - ‘instacart_online_grocery_shopping_2017_05_01.tar.gz’ saved [205548478/205548478]\n",
+ "\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "kqX40b2kdgAb",
+ "outputId": "4516a4b1-3873-456e-a0e0-bdf31c1bc2e9",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 235
+ }
+ },
+ "cell_type": "code",
+ "source": [
+ "!tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz"
+ ],
+ "execution_count": 0,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "instacart_2017_05_01/\n",
+ "instacart_2017_05_01/._aisles.csv\n",
+ "instacart_2017_05_01/aisles.csv\n",
+ "instacart_2017_05_01/._departments.csv\n",
+ "instacart_2017_05_01/departments.csv\n",
+ "instacart_2017_05_01/._order_products__prior.csv\n",
+ "instacart_2017_05_01/order_products__prior.csv\n",
+ "instacart_2017_05_01/._order_products__train.csv\n",
+ "instacart_2017_05_01/order_products__train.csv\n",
+ "instacart_2017_05_01/._orders.csv\n",
+ "instacart_2017_05_01/orders.csv\n",
+ "instacart_2017_05_01/._products.csv\n",
+ "instacart_2017_05_01/products.csv\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "YbCvZZCBfHCI",
+ "outputId": "42278f71-aba5-4720-d3c5-18293d768ce8",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ }
+ },
+ "cell_type": "code",
+ "source": [
+ "%cd instacart_2017_05_01"
+ ],
+ "execution_count": 0,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "/content/instacart_2017_05_01\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "id": "etshR5kpvWOj",
+ "colab_type": "code",
+ "outputId": "f0154d6d-02e0-4763-df72-371d4ba69ffb",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 118
+ }
+ },
+ "cell_type": "code",
+ "source": [
+ "!ls -lh *.csv"
+ ],
+ "execution_count": 0,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "-rw-r--r-- 1 502 staff 2.6K May 2 2017 aisles.csv\n",
+ "-rw-r--r-- 1 502 staff 270 May 2 2017 departments.csv\n",
+ "-rw-r--r-- 1 502 staff 551M May 2 2017 order_products__prior.csv\n",
+ "-rw-r--r-- 1 502 staff 24M May 2 2017 order_products__train.csv\n",
+ "-rw-r--r-- 1 502 staff 104M May 2 2017 orders.csv\n",
+ "-rw-r--r-- 1 502 staff 2.1M May 2 2017 products.csv\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "id": "RcCu3Tlgv6J2",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "# Join Datasets"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "RsA14wiKr03j"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Goal: Reproduce this example\n",
+ "\n",
+ "The first two orders for user id 1:"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "vLqOTMcfjprg",
+ "outputId": "6f9c662b-86c2-4bca-afad-9031b36de8ba",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 313
+ }
+ },
+ "cell_type": "code",
+ "source": [
+ "from IPython.display import display, Image\n",
+ "url = 'https://cdn-images-1.medium.com/max/1600/1*vYGFQCafJtGBBX5mbl0xyw.png'\n",
+ "example = Image(url=url, width=600)\n",
+ "\n",
+ "display(example)"
+ ],
+ "execution_count": 0,
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/html": [
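+ "<img src=\"https://cdn-images-1.medium.com/max/1600/1*vYGFQCafJtGBBX5mbl0xyw.png\" width=\"600\"/>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.Image object>"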
+ ]
+ },
+ "metadata": {
+ "tags": []
+ }
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "nPwG8aM_txl4"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Load data\n",
+ "\n",
+ "Here's a list of all six CSV filenames"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "Ksah0cOrfdJQ",
+ "outputId": "90510aa4-82d6-43fe-dbd8-176123c03602",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 121
+ }
+ },
+ "cell_type": "code",
+ "source": [
+ "!ls -lh *.csv"
+ ],
+ "execution_count": 0,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "-rw-r--r-- 1 502 staff 2.6K May 2 2017 aisles.csv\n",
+ "-rw-r--r-- 1 502 staff 270 May 2 2017 departments.csv\n",
+ "-rw-r--r-- 1 502 staff 551M May 2 2017 order_products__prior.csv\n",
+ "-rw-r--r-- 1 502 staff 24M May 2 2017 order_products__train.csv\n",
+ "-rw-r--r-- 1 502 staff 104M May 2 2017 orders.csv\n",
+ "-rw-r--r-- 1 502 staff 2.1M May 2 2017 products.csv\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "AHT7fKuxvPgV"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "For each CSV\n",
+ "- Load it with pandas\n",
+ "- Look at the dataframe's shape\n",
+ "- Look at its head (first rows)\n",
+ "- `display(example)`\n",
+ "- Which columns does it have in common with the example we want to reproduce?"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "cB_5T6TprcUH"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "### aisles"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "JB3bvwSDK6v3",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "9-GrkqM6rfXr"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "### departments"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "yxFd5n20yOVn",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "VhhVcn9kK-nG"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "### order_products__prior"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "86rIMNFSzKaG",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "HVYJEKJcLBut"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "### order_products__train"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "xgwSUCBk6Ciy",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "LYPrWUJnrp7G"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "### orders"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "UfPRTW5w128P",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "nIX3SYXersao"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "### products"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "3BKG5dxy2IOA",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "cbHumXOiJfy2"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Concatenate order_products__prior and order_products__train"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "TJ23kqpAY8Vv",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
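+ {
+ "metadata": {
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "A minimal sketch of this step, not the lesson's own solution. It uses two tiny made-up frames in place of the real CSVs (the values are invented; `order_id` and `product_id` are two of the columns in the `order_products__*` files). `pd.concat` stacks the rows, and `ignore_index=True` renumbers them:\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Toy stand-ins for the prior and train files (values are made up)\n",
+ "prior = pd.DataFrame({'order_id': [1, 1], 'product_id': [10, 11]})\n",
+ "train = pd.DataFrame({'order_id': [2], 'product_id': [10]})\n",
+ "\n",
+ "# Stack the rows; ignore_index=True renumbers the index 0..n-1\n",
+ "order_products = pd.concat([prior, train], ignore_index=True)\n",
+ "\n",
+ "# Row counts add up, and the columns are unchanged\n",
+ "assert len(order_products) == len(prior) + len(train)\n",
+ "print(order_products)\n",
+ "```\n",
+ "\n",
+ "With the real files, the same call would take the two dataframes returned by `pd.read_csv`."
+ ]
+ },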
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "Z1YRw5ypJuv2"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Get a subset of orders — the first two orders for user id 1"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "eJ9EixWs6K64",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "From `orders` dataframe:\n",
+ "- user_id\n",
+ "- order_id\n",
+ "- order_number\n",
+ "- order_dow\n",
+ "- order_hour_of_day"
+ ]
+ },
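+ {
+ "metadata": {
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "One possible way to take such a subset is a boolean mask plus a column list. The sketch below assumes a small made-up `orders` frame with the column names listed above; the real file has many more rows and some extra columns.\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Made-up rows using the column names listed above\n",
+ "orders = pd.DataFrame({\n",
+ "    'user_id': [1, 1, 1, 2],\n",
+ "    'order_id': [101, 102, 103, 201],\n",
+ "    'order_number': [1, 2, 3, 1],\n",
+ "    'order_dow': [2, 3, 3, 5],\n",
+ "    'order_hour_of_day': [8, 7, 12, 20],\n",
+ "})\n",
+ "\n",
+ "columns = ['user_id', 'order_id', 'order_number',\n",
+ "           'order_dow', 'order_hour_of_day']\n",
+ "\n",
+ "# Keep user 1's first two orders, then only the columns we need\n",
+ "condition = (orders['user_id'] == 1) & (orders['order_number'] <= 2)\n",
+ "subset = orders.loc[condition, columns]\n",
+ "print(subset)\n",
+ "```\n",
+ "\n",
+ "Each comparison needs its own parentheses, because `&` binds more tightly than `==` and `<=` in Python."
+ ]
+ },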
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "3K1p0QHuKPnt"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Merge dataframes"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "4MVZ9vb1BuO0",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Merge the subset from `orders` with columns from `order_products`"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "3lajwEE86iKc",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "i1uLO1bxByfz",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Merge with columns from `products`"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "D3Hfo2dkJlmh",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
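+ {
+ "metadata": {
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "The two merges above can be sketched on toy data as follows. Only the key column names (`order_id`, `product_id`) are taken from the real files; the rows and product names are invented for illustration.\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Toy frames; values and product names are made up\n",
+ "orders = pd.DataFrame({'order_id': [101, 102],\n",
+ "                       'order_hour_of_day': [8, 7]})\n",
+ "order_products = pd.DataFrame({'order_id': [101, 101, 102],\n",
+ "                               'product_id': [10, 11, 10]})\n",
+ "products = pd.DataFrame({'product_id': [10, 11],\n",
+ "                         'product_name': ['Soda', 'Milk']})\n",
+ "\n",
+ "# First merge on order_id, then bring in product names via product_id\n",
+ "merged = (orders\n",
+ "          .merge(order_products, on='order_id', how='inner')\n",
+ "          .merge(products, on='product_id', how='inner'))\n",
+ "print(merged)\n",
+ "```\n",
+ "\n",
+ "An inner merge keeps only rows whose key appears in both frames, which is usually what we want when starting from a small subset of orders."
+ ]
+ },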
+ {
+ "metadata": {
+ "id": "dDfzKXJdwApV",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "# Reshape Datasets"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "4stCppWhwIx0",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Why reshape data?\n",
+ "\n",
+ "#### Some libraries prefer data in different formats\n",
+ "\n",
+ "For example, the Seaborn data visualization library often (but not always) prefers data in \"Tidy\" format.\n",
+ "\n",
+ "> \"[Seaborn will be most powerful when your datasets have a particular organization.](https://seaborn.pydata.org/introduction.html#organizing-datasets) This format is alternately called “long-form” or “tidy” data and is described in detail by Hadley Wickham. The rules can be simply stated:\n",
+ "\n",
+ "> - Each variable is a column\n",
+ "> - Each observation is a row\n",
+ "\n",
+ "> A helpful mindset for determining whether your data are tidy is to think backwards from the plot you want to draw. From this perspective, a “variable” is something that will be assigned a role in the plot.\"\n",
+ "\n",
+ "#### Data science is often about putting square pegs in round holes\n",
+ "\n",
+ "Here's an inspiring [video clip from _Apollo 13_](https://www.youtube.com/watch?v=ry55--J4_VQ): “Invent a way to put a square peg in a round hole.” It's a good metaphor for data wrangling!"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "79KITszBwXp7",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Hadley Wickham's Examples\n",
+ "\n",
+ "From his paper, [Tidy Data](http://vita.had.co.nz/papers/tidy-data.html)"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "Jna5sk5FwYHr",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "%matplotlib inline\n",
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "import seaborn as sns\n",
+ "\n",
+ "table1 = pd.DataFrame(\n",
+ " [[np.nan, 2],\n",
+ " [16, 11], \n",
+ " [3, 1]],\n",
+ " index=['John Smith', 'Jane Doe', 'Mary Johnson'], \n",
+ " columns=['treatmenta', 'treatmentb'])\n",
+ "\n",
+ "table2 = table1.T"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "eWe5rpI9wdvT",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "\"Table 1 provides some data about an imaginary experiment in a format commonly seen in the wild. \n",
+ "\n",
+ "The table has two columns and three rows, and both rows and columns are labelled.\""
+ ]
+ },
+ {
+ "metadata": {
+ "id": "SdUp5LbcwgNK",
+ "colab_type": "code",
+ "outputId": "176dd9b1-a8f4-49ea-9b0f-2ae4421d935f",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 136
+ }
+ },
+ "cell_type": "code",
+ "source": [
+ "table1"
+ ],
+ "execution_count": 0,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
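+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>treatmenta</th>\n",
+ "      <th>treatmentb</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>John Smith</th>\n",
+ "      <td>NaN</td>\n",
+ "      <td>2</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>Jane Doe</th>\n",
+ "      <td>16.0</td>\n",
+ "      <td>11</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>Mary Johnson</th>\n",
+ "      <td>3.0</td>\n",
+ "      <td>1</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"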
+ ],
+ "text/plain": [
+ " treatmenta treatmentb\n",
+ "John Smith NaN 2\n",
+ "Jane Doe 16.0 11\n",
+ "Mary Johnson 3.0 1"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 3
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "id": "SaEcDmZhwmon",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "\"There are many ways to structure the same underlying data. \n",
+ "\n",
+ "Table 2 shows the same data as Table 1, but the rows and columns have been transposed. The data is the same, but the layout is different.\""
+ ]
+ },
+ {
+ "metadata": {
+ "id": "SwDVoCj5woAn",
+ "colab_type": "code",
+ "outputId": "8390f63c-c5a0-433e-9f5e-fdb928ff30a9",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 106
+ }
+ },
+ "cell_type": "code",
+ "source": [
+ "table2"
+ ],
+ "execution_count": 0,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/html": [
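+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\">\n",
+ "      <th></th>\n",
+ "      <th>John Smith</th>\n",
+ "      <th>Jane Doe</th>\n",
+ "      <th>Mary Johnson</th>\n",
+ "    </tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr>\n",
+ "      <th>treatmenta</th>\n",
+ "      <td>NaN</td>\n",
+ "      <td>16.0</td>\n",
+ "      <td>3.0</td>\n",
+ "    </tr>\n",
+ "    <tr>\n",
+ "      <th>treatmentb</th>\n",
+ "      <td>2.0</td>\n",
+ "      <td>11.0</td>\n",
+ "      <td>1.0</td>\n",
+ "    </tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"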
+ ],
+ "text/plain": [
+ " John Smith Jane Doe Mary Johnson\n",
+ "treatmenta NaN 16.0 3.0\n",
+ "treatmentb 2.0 11.0 1.0"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 4
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "id": "k3ratDNbwsyN",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "\"Table 3 reorganises Table 1 to make the values, variables and observations more clear.\n",
+ "\n",
+ "Table 3 is the tidy version of Table 1. Each row represents an observation, the result of one treatment on one person, and each column is a variable.\"\n",
+ "\n",
+ "| name | trt | result |\n",
+ "|--------------|-----|--------|\n",
+ "| John Smith | a | - |\n",
+ "| Jane Doe | a | 16 |\n",
+ "| Mary Johnson | a | 3 |\n",
+ "| John Smith | b | 2 |\n",
+ "| Jane Doe | b | 11 |\n",
+ "| Mary Johnson | b | 1 |"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "WsvD1I3TwwnI",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Table 1 --> Tidy\n",
+ "\n",
+ "We can use the pandas `melt` function to reshape Table 1 into Tidy format."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "S48tKmC46veF",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
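+ {
+ "metadata": {
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "One way this reshape might look, as a sketch rather than the lesson's definitive answer. It rebuilds `table1` so the snippet runs on its own; the names `tidy`, `trt` and `result` are chosen here to match Table 3 and the Seaborn call later in the notebook.\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "\n",
+ "table1 = pd.DataFrame(\n",
+ "    [[np.nan, 2], [16, 11], [3, 1]],\n",
+ "    index=['John Smith', 'Jane Doe', 'Mary Johnson'],\n",
+ "    columns=['treatmenta', 'treatmentb'])\n",
+ "\n",
+ "# melt needs the names as a regular column, so move the index out first\n",
+ "tidy = (table1\n",
+ "        .rename_axis('name')\n",
+ "        .reset_index()\n",
+ "        .melt(id_vars='name', var_name='trt', value_name='result'))\n",
+ "\n",
+ "# Shorten 'treatmenta' / 'treatmentb' to 'a' / 'b'\n",
+ "tidy['trt'] = tidy['trt'].str.replace('treatment', '', regex=False)\n",
+ "print(tidy)\n",
+ "```\n",
+ "\n",
+ "The result has one row per (name, treatment) pair, so three names and two treatments give six rows."
+ ]
+ },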
+ {
+ "metadata": {
+ "id": "Ck15sXaJxPrd",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Table 2 --> Tidy"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "k2Qn94RIxQhV",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "##### LEAVE BLANK --an assignment exercise #####"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "As0W7PWLxea3",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Tidy --> Table 1\n",
+ "\n",
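+ "The `pivot_table` function is roughly the inverse of `melt`: it spreads a tidy table back out into wide format (aggregating duplicate entries, with the mean by default)."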
+ ]
+ },
+ {
+ "metadata": {
+ "id": "CdZZiLYoxfJC",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
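+ {
+ "metadata": {
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "A hedged sketch of going back from tidy to wide format. The `tidy` frame is typed in by hand from Table 3 so the snippet is self-contained; the variable name `wide` is just an illustration.\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Tidy version of Table 1 (same values as Table 3)\n",
+ "tidy = pd.DataFrame({\n",
+ "    'name': ['John Smith', 'Jane Doe', 'Mary Johnson'] * 2,\n",
+ "    'trt': ['a', 'a', 'a', 'b', 'b', 'b'],\n",
+ "    'result': [np.nan, 16, 3, 2, 11, 1],\n",
+ "})\n",
+ "\n",
+ "# One row per name, one column per treatment\n",
+ "wide = tidy.pivot_table(index='name', columns='trt', values='result')\n",
+ "print(wide)\n",
+ "```\n",
+ "\n",
+ "Note that `pivot_table` sorts the index, so the rows come back in alphabetical order rather than the original order of Table 1."
+ ]
+ },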
+ {
+ "metadata": {
+ "id": "3GeAKoSZxoPS",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Tidy --> Table 2"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "W2jjciN2xk9r",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "##### LEAVE BLANK --an assignment exercise #####"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "jr0jQy6Oxqi7",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "# Seaborn example\n",
+ "\n",
+ "The rules can be simply stated:\n",
+ "\n",
+ "- Each variable is a column\n",
+ "- Each observation is a row\n",
+ "\n",
+ "A helpful mindset for determining whether your data are tidy is to think backwards from the plot you want to draw. From this perspective, a “variable” is something that will be assigned a role in the plot."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "kWo3FIP9xuKo",
+ "colab_type": "code",
+ "outputId": "25a90cb8-2bfc-4858-851f-d2fec3aa652d",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 153
+ }
+ },
+ "cell_type": "code",
+ "source": [
+ "sns.catplot(x='trt', y='result', col='name', \n",
+ " kind='bar', data=tidy, height=2);"
+ ],
+ "execution_count": 0,
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAagAAACICAYAAACyaX9CAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAADq9JREFUeJzt3X2wXHV9x/H3BxIgIAQxDJMIaTQG\nM2l46CRTQTNtxBZCkabjAxZBiFWZFhVqSwodFVKKMwzR6iA2DKUhIshD1IlpaIMYjGgUCRfzQIBY\nSwiCKRAMhBLIA3z7x/kt2dzcu7v3ZnfP7+5+XjM79+x5+J3vOed77/f8zjm7VxGBmZlZbvYrOwAz\nM7O+uECZmVmWXKDMzCxLLlBmZpYlFygzM8uSC5SZmWXJBarNJI2T9HCdeWZJum4f1nGUpCWSVkt6\nRNJ/DnD5v5Z0XlUsY6qmPSFp1GBjs/41khtNWMccSU9LWiXpvyV9T9KkVq7T3ji2IemqqnGjJO3c\nl9/1GuubLmlJnXnmSLqk2etuJheoznQlcE9EnBARk4DLBrJwRFwfETent7OAMTVmt6HnqxFxYkRM\nAO4A7pV0ZNlBdYENwBlV7z8MrBtIA5KGNTWizHVEgUpnJ49K+jdJ6yT9QNKINO1Tklam3sR3JR2c\nxi+QNE/S/ZIeT2cc81M7C6raPlXSzyU9JGmhpDc1Me6DJN0kaa2kX0p6b9XkMZKWprPca6qW+T9J\nX0rbc7+ko/poejTwVOVNRKxJy06X9GNJ30/bfLWkcyQ9kGIYn+abI+kSSR8CpgK3pjPuEanJz6b9\nsVbSxGbtj1YYwrkxTtJPUtsPSXp3Gj9d0nJJ35H0mKRbJSlNm5KOb4+kuyWNrreeiLgD+AHw0dTG\n+1Iurk3bfOBg2261IXhstwGPSpqa3n8EuLNqnWdK+kXa/z+s/G6n38dvSVoBfEvSfZJOrFrup5JO\nqLGfjpC0SNKatN3HV02elPLpcUkXpflr7deLVFyVWSPp9lrtp7jn925/QCJiyL+AccAu4MT0/k7g\n3DT8lqr5rgI+m4YXALcDAmYCW4HjKIp2D3AiMAq4DzgkLXMpcHkf658NrOrjdW0/sT6chv8emJ+G\nJwJPAgdR9FoeB0am9xuBY9J8AZyZhq8BvtDHOk4DXgB+BHweGJPGT0/jRwMHAk8D/5SmXQx8LQ3P\nAS5Jw8uBqVVtP1G1Dy8Ebiz7+HdobhwMHJSGJwAPVh3DF4GjUzw/B6YBw4GfAUem+T5Sya1e63jj\n2FaN+1tgXsq13wDHpvE3p2kNte1jW//YAn8OfBk4BlhG8bt+XZrnzYDS8CeBr1Qdsx5gRHp/Prt/\nV4+t5Eav9U0HlqThrwNXpOFTgFVV7f6M4m/BKOD5dKxr7dffAgem4cMH0/5AjnEndRc3RMSqNNxD\nsZMBJqu47ns48Cbg7qpl/iMiQtJa4JmIWAsgaV1a/mhgErAinaQeQPEHYQ8RMReYO4iYp1EcXCLi\nMUkbKRIOYFlEvJjieQT4PYo/HjuAyrXlHuBP+4jnbklvB2YApwO/lDQ5TV4ZEZtSu/9DcfYMsBZ4\nb++2+vG9qvV/oMFlyjQUc2M4cF06U36N3XkB8EBEPJXiWZXieQGYDNyT4tkf2NTgupR+vpNiX/0q\nvf8m8Gngh/vQdqsNtWO7FPhn4BmKy6vVjgbuSL3TAyguCVYsjohX0vBC4IuSZgN/RVF0a5kGfDDF\nfK+kt0g6LE27KyK2A9slPQtUrsj0t1/XUFxRWQQsGmT7b1zdqaeTCtT2quHXgMrlqAXAX0TEakmz\nKM4sei/zeq/lX6fYN69R3Ms5u9aKU6Kc08ek+yJi4N3aPWMjxVE5VjsjnZ70Gr+HiPgd8G3g2ypu\nlv4RxRlM7+2s3geN5kNlmX7Xn5mhmBufo/
gjdgLF2f2rfcQGu4+BgHURcXKtePrxB8CDNabvS9ut\nNqSObUTskNRDcfVkEkWPquLrwL9ExGJJ0yl6IBUvV7WxTdI9FD3As4ApteKso7+/M/3t1zMo/pac\nCXxe0nGDbL8hHXEPqo5DgU2ShtN3MtVyP/AeSe8AkHSIpGN7zxQRc6O46dz7Va84/aQSU2p3LLB+\ngDHuRdIpVdfcDwXGU1w+HIyXKPZhJ8o5N0YCmyLideBjFL2WWtYDR0o6OcUzXNLv19sISR8ETgVu\nS22Mq2xTWu+PB9t2yXI+tl8BLk0nkdVGUlx2h+IyXi03AtdSXBHZUmfe6r8z04HNEbG1zjJ7kbQf\nxa2GH1Fc9hxJ0TttSvt9GQpnv/vqi8AvgOfSz4b/2EbEc+ns6zalm8XAF4Bf9b9UXcPYfVbxr8C8\ndKlhFzArIranywr7YgrF5aFdFCchN0bEypQ8A7UAuF7SK0COZ9D7Ivfc+K6Kx/2XUnUG3U88O1Q8\n1HKtpJGpra/R91Nin5N0LnAIxX2RUyLiOQBJHwcWqnhabCVw/QDbzkVux7a6/XX0ve/mUOz7LcC9\nwNtqtNEjaStwUz+zVOfSHGC+pDUUD2rUK3792R+4JeWAKO61vSCpWe3vpXJDztpE0kzgnIg4q+xY\nLC/ODWuUis8mLgcmpl527+kXA2+NiH9od2zN1A09qGxIupLiuvGskkOxzDg3rFGpV/0l4O/6KU7/\nTvFQy5A/0XEPyszMstQND0mYmdkQ5AJlZmZZGhIFasaMGUHxDQp+Dc1XUzgPOuK1z5wHHfFqyJAo\nUJs3by47BMuA88DAedBNhkSBMjOz7uMCZWZmWfLnoGzImzL75vozZaZn7nllh2CWPfegzMwsSy5Q\nZmaWJRcoMzPLkguUmZllyQXKzMyy5AJlZmZZcoEyM7MsuUCZmVmWXKDMzCxLLlBmZpYlFygzM8uS\nC5SZmWXJBcrMzLLkAmVmZllygTIzsyy5QJmZWZZcoMzMLEsuUGZmliUXKDMzy5ILlJmZZckFyszM\nsuQCZWZmWRrWyEySPhwRC+uNM7Pme/LK48oOYcDGXr627BCsAzTag/rHBseZmZk1Rc0elKTTgT8D\n3irp2qpJhwG76iw7H3g/8GxETE7jjgDuAMYBTwBnRcSWwQZvZmadq14P6rdAD/Bq+ll5LQZOq7Ps\nAmBGr3GXAcsiYgKwLL03MzPbS80eVESsBlZLuiUiavaY+lj2Pknjeo2eCUxPw98ElgOXDqRdMzPr\nDvUu8a0FIg3vNT0ijh/g+o6KiE1p+H+Bo2qs+wLgAoCxY8cOcDXWKZwHBs6DblXvKb73t2rFERGS\nosb0G4AbAKZOndrvfNbZnAcGzoNuVe8S38Ymr+8ZSaMjYpOk0cCzTW7fzMw6REOPmUt6SdLW9HpV\n0muStg5ifYuB89Pw+cD3B9GGmZl1gYY+qBsRh1aGVdyMmgmcVGsZSbdRPBAxStJTwBXA1cCdkj4B\nbATOGlzYZmbW6RoqUNUiIoBFkq6gxmPiEXF2P5PeN9B1mplZ92n0q44+UPV2P2AqxWejzMzMWqLR\nHtSZVcO7KL4FYmbTozEzM0savQf18VYHYmZmVq3Rp/iukXSYpOGSlkl6TtK5rQ7OzMy6V6PfZn5q\nRGyl+ODuE8A7gNmtCsrMzKzRAlW5FHgGsDAiXmxRPGZmZkDjD0kskfQY8ArwN5KOxE/xmZlZCzXU\ng4qIy4B3A1MjYiewDT/FZ2ZmLdToQxIHAxcC89KoMRSfhTIzM2uJRu9B3QTsoOhFATwNXNWSiMzM\nzGi8QI2PiGuAnQARsQ3Y+x9EmZmZNUmjBWqHpBHs/ueF44HtLYvKzMy6Xt2n+NK3l18PLAWOkXQr\n8B5gVmtDMzOzbla3QKX/fDub4l9nnERxae/iiNjc4tjMzKyLNfo5qIeAt0fEXa0MxszMrKLRAvUu\n4BxJG4
GXKXpRERHHtywyMzPrao0WqNNaGoWZmVkvjf67jY2tDsTMzKxao4+Zm5mZtZULlJmZZckF\nyszMsuQCZWZmWXKBMjOzLLlAmZlZllygzMwsSy5QZmaWpUa/ScLMLGtTZt9cdggD1jP3vLJDyJp7\nUGZmliUXKDMzy5ILlJmZZckFyszMsuSHJDL25JXHlR3CgI29fG3ZIZhZh3APyszMsuQCZWZmWXKB\nMjOzLPkelJnZENCN96TdgzIzsyyVUqAkzZC0XtKvJV1WRgxmZpa3thcoSfsD3wBOByYBZ0ua1O44\nzMwsb2X0oP4Q+HVEPB4RO4DbgZklxGFmZhlTRLR3hdKHgBkR8cn0/mPAuyLiM73muwC4IL19J7C+\njWGOAja3cX3t1u7t2xwRMwazoPOg5YZELpScB9D5uZBlHmRboMok6cGImFp2HK3S6dvXLN2wn7ph\nG5uh0/dTrttXxiW+p4Fjqt4fncaZmZm9oYwCtRKYIOltkg4A/hJYXEIcZmaWsbZ/UDcidkn6DHA3\nsD8wPyLWtTuOOm4oO4AW6/Tta5Zu2E/dsI3N0On7Kcvta/s9KDMzs0b4myTMzCxLLlBmZpYlF6gu\nImmcpIfLjsPK5TywitxzwQXKzMyy5ALVi6RFknokrUufXu80wyTdKulRSd+RdHDZAeXIeWDQFXkA\nGeeCn+LrRdIREfE7SSMoPrP1xxHxfNlxNYOkccAGYFpErJA0H3gkIr5camAZch4YdHYeQP654B7U\n3i6StBq4n+IbLyaUHE+z/SYiVqThW4BpZQaTMeeBQefnAWScC/6PulUkTQf+BDg5IrZJWg4cVGpQ\nzde7y+wudC/OA4OuyQPIOBfcg9rTSGBLSsaJwEllB9QCYyWdnIY/Cvy0zGAy5Tww6I48gIxzwQVq\nT0spbhg+ClxN0a3vNOuBT6dtfDMwr+R4cuQ8MOiOPICMc8EPSZiZWZbcgzIzsyy5QJmZWZZcoMzM\nLEsuUGZmliUXKDMzy5ILVBtJOlzShQOdZp3HuWDgPKjHBaq9Dgf2SjhJw/qbZh3LuWDgPKjJX3XU\nXlcD4yWtAnYCrwJbgInAQ1XT7omI2eWFaW3gXDBwHtTkD+q2Ufrm4CURMTl9z9ddwOSI2FA9rbQA\nrW2cCwbOg3p8ia9cD0TEhrKDsCw4FwycB3twgSrXy2UHYNlwLhg4D/bgAtVeLwGHDmKadR7ngoHz\noCY/JNFGEfG8pBWSHgZeAZ7pZ9p/deMN0W7iXDBwHtTjhyTMzCxLvsRnZmZZcoEyM7MsuUCZmVmW\nXKDMzCxLLlBmZpYlFygzM8uSC5SZmWXp/wHQDYFXEZ9ZlwAAAABJRU5ErkJggg==\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "tags": []
+ }
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "id": "cIgT41Rxx4oj",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Now with Instacart data"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "Oydw0VvGxyDJ",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "products = pd.read_csv('products.csv')\n",
+ "\n",
+ "order_products = pd.concat([pd.read_csv('order_products__prior.csv'), \n",
+ " pd.read_csv('order_products__train.csv')])\n",
+ "\n",
+ "orders = pd.read_csv('orders.csv')"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "6p-IsG0jyXQj",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Goal: Reproduce part of this example\n",
+ "\n",
+ "Instead of a plot with 50 products, we'll just do two — the first products from each list\n",
+ "- Half And Half Ultra Pasteurized\n",
+ "- Half Baked Frozen Yogurt"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "Rs-_n9yjyZ15",
+ "colab_type": "code",
+ "outputId": "87a7427c-41e5-48af-e5c4-e6fb99c81519",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 383
+ }
+ },
+ "cell_type": "code",
+ "source": [
+ "from IPython.display import display, Image\n",
+ "url = 'https://cdn-images-1.medium.com/max/1600/1*wKfV6OV-_1Ipwrl7AjjSuw.png'\n",
+ "example = Image(url=url, width=600)\n",
+ "\n",
+ "display(example)"
+ ],
+ "execution_count": 0,
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/html": [
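+ "<img src=\"https://cdn-images-1.medium.com/max/1600/1*wKfV6OV-_1Ipwrl7AjjSuw.png\" width=\"600\"/>"
+ ],
+ "text/plain": [
+ "<IPython.core.display.Image object>"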
+ ]
+ },
+ "metadata": {
+ "tags": []
+ }
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "id": "Vj5GR7I4ydBg",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "So, given a `product_name` we need to calculate its `order_hour_of_day` pattern."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "Vc9_s7-LyhBI",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Subset and Merge\n",
+ "\n",
+ "One challenge of merging this data is that the `products` and `orders` datasets have no column in common. The `order_products` dataset bridges them: it shares `order_id` with `orders` and `product_id` with `products`, so we merge through it."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "W1yHMS-OyUTH",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "UvhcadjFzx0Q",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## 4 ways to reshape and plot"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "aEE_nCWjzz7f",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "### 1. value_counts"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "vTL3Cko87VL-",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
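+ {
+ "metadata": {
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "A small sketch of the `value_counts` approach. The `merged` frame and the product names `Cream` and `Yogurt` are made-up placeholders, standing in for the result of the merge above.\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Made-up merged rows: one row per product per order\n",
+ "merged = pd.DataFrame({\n",
+ "    'product_name': ['Cream', 'Cream', 'Cream', 'Yogurt', 'Yogurt'],\n",
+ "    'order_hour_of_day': [8, 8, 9, 20, 21],\n",
+ "})\n",
+ "\n",
+ "cream = merged[merged['product_name'] == 'Cream']\n",
+ "\n",
+ "# Share of this product's orders in each hour, ordered by hour\n",
+ "pattern = (cream['order_hour_of_day']\n",
+ "           .value_counts(normalize=True)\n",
+ "           .sort_index())\n",
+ "print(pattern)\n",
+ "```\n",
+ "\n",
+ "Sorting by index matters here: `value_counts` orders by frequency by default, which would scramble the hours on a line plot."
+ ]
+ },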
+ {
+ "metadata": {
+ "id": "tMSd6YDj0BjE",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "### 2. crosstab"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "Slu2bWYK0CZD",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
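+ {
+ "metadata": {
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "A sketch of the `crosstab` approach on the same kind of made-up data (the `merged` frame and product names are placeholders, not rows from the real dataset). It handles both products in one call:\n",
+ "\n",
+ "```python\n",
+ "import pandas as pd\n",
+ "\n",
+ "# Made-up merged rows: one row per product per order\n",
+ "merged = pd.DataFrame({\n",
+ "    'product_name': ['Cream', 'Cream', 'Cream', 'Yogurt', 'Yogurt'],\n",
+ "    'order_hour_of_day': [8, 8, 9, 20, 21],\n",
+ "})\n",
+ "\n",
+ "# Rows are hours, columns are products; normalize='columns' turns\n",
+ "# each product's counts into shares that sum to 1\n",
+ "table = pd.crosstab(merged['order_hour_of_day'],\n",
+ "                    merged['product_name'],\n",
+ "                    normalize='columns')\n",
+ "print(table)\n",
+ "```\n",
+ "\n",
+ "Because the hours end up in the index and each product in its own column, calling `table.plot()` would draw one line per product."
+ ]
+ },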
+ {
+ "metadata": {
+ "id": "ICjPVqO70Hv8",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "### 3. Pivot Table"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "LQtMNVa10I_S",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "7A9jfBVv0M7e",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "### 4. melt"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "2AmbAKm20PAg",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "kAMtvSQWPUcj"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "# Assignment\n",
+ "\n",
+ "## Join Data Section\n",
+ "\n",
+ "These are the top 10 most frequently ordered products. How many times was each ordered? \n",
+ "\n",
+ "1. Banana\n",
+ "2. Bag of Organic Bananas\n",
+ "3. Organic Strawberries\n",
+ "4. Organic Baby Spinach \n",
+ "5. Organic Hass Avocado\n",
+ "6. Organic Avocado\n",
+ "7. Large Lemon \n",
+ "8. Strawberries\n",
+ "9. Limes \n",
+ "10. Organic Whole Milk\n",
+ "\n",
+ "First, write down which columns you need and which dataframes have them.\n",
+ "\n",
+ "Next, merge these into a single dataframe.\n",
+ "\n",
+ "Then, use pandas functions from the previous lesson to get the counts of the top 10 most frequently ordered products.\n",
+ "\n",
+ "## Reshape Data Section\n",
+ "\n",
+ "- Replicate the lesson code\n",
+ "- Complete the code cells we skipped near the beginning of the notebook\n",
+ "- Table 2 --> Tidy\n",
+ "- Tidy --> Table 2\n",
+ "- Load seaborn's `flights` dataset by running the cell below. Then create a pivot table showing the number of passengers by month and year. Use year for the index and month for the columns. You've done it right if you get 112 passengers for January 1949 and 432 passengers for December 1960."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "fgxulJQq0uLw",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "flights = sns.load_dataset('flights')"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "1qKc88WI0up-",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "##### YOUR CODE HERE #####"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "mnOuqL9K0dqh",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Join Data Stretch Challenge\n",
+ "\n",
+ "The [Instacart blog post](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2) has a visualization of \"**Popular products** purchased earliest in the day (green) and latest in the day (red).\" \n",
+ "\n",
+ "The post says,\n",
+ "\n",
+ "> \"We can also see the time of day that users purchase specific products.\n",
+ "\n",
+ "> Healthier snacks and staples tend to be purchased earlier in the day, whereas ice cream (especially Half Baked and The Tonight Dough) are far more popular when customers are ordering in the evening.\n",
+ "\n",
+ "> **In fact, of the top 25 latest ordered products, the first 24 are ice cream! The last one, of course, is a frozen pizza.**\"\n",
+ "\n",
+ "Your challenge is to reproduce the list of the top 25 latest ordered popular products.\n",
+ "\n",
+ "We'll define \"popular products\" as products with more than 2,900 orders.\n",
+ "\n",
+ "## Reshape Data Stretch Challenge\n",
+ "\n",
+ "_Try whatever sounds most interesting to you!_\n",
+ "\n",
+ "- Replicate more of Instacart's visualization showing \"Hour of Day Ordered\" vs \"Percent of Orders by Product\"\n",
+ "- Replicate parts of the other visualization from [Instacart's blog post](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2), showing \"Number of Purchases\" vs \"Percent Reorder Purchases\"\n",
+ "- Get the most recent order for each user in Instacart's dataset. This is a useful baseline when [predicting a user's next order](https://www.kaggle.com/c/instacart-market-basket-analysis)\n",
+ "- Replicate parts of the blog post linked at the top of this notebook: [Modern Pandas, Part 5: Tidy Data](https://tomaugspurger.github.io/modern-5-tidy.html)"
+ ]
+ }
+ ]
+}
\ No newline at end of file
diff --git a/module1-scrape-and-process-data/LS_DS_121_Scrape_and_process_data.ipynb b/module1-scrape-and-process-data/LS_DS_121_Scrape_and_process_data.ipynb
deleted file mode 100644
index 49559cea..00000000
--- a/module1-scrape-and-process-data/LS_DS_121_Scrape_and_process_data.ipynb
+++ /dev/null
@@ -1,593 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "oR4Yeg3P07gu"
- },
- "source": [
- "_Lambda School Data Science_\n",
- "\n",
- "# Scrape and process data\n",
- "\n",
- "Objectives\n",
- "- scrape and parse web pages\n",
- "- use list comprehensions\n",
- "- select rows and columns with pandas\n",
- "\n",
- "Links\n",
- "- [Automate the Boring Stuff with Python, Chapter 11](https://automatetheboringstuff.com/chapter11/)\n",
- " - Requests\n",
- " - Beautiful Soup\n",
- "- [Python List Comprehensions: Explained Visually](https://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/)\n",
- "- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)\n",
- " - Subset Observations (Rows)\n",
- " - Subset Variables (Columns)\n",
- "- Python Data Science Handbook\n",
- " - [Chapter 3.1](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html), Introducing Pandas Objects\n",
- " - [Chapter 3.2](https://jakevdp.github.io/PythonDataScienceHandbook/03.02-data-indexing-and-selection.html), Data Indexing and Selection\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "I_NRVchqgGvM"
- },
- "source": [
- "## Scrape the titles of PyCon 2019 talks"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "3elw_8Nc7Tpe"
- },
- "outputs": [],
- "source": [
- "url = 'https://us.pycon.org/2019/schedule/talks/list/'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "SFNsyjVsTU4b"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "vqkNgAzYpeK7"
- },
- "source": [
- "## 5 ways to look at long titles\n",
- "\n",
- "Let's define a long title as greater than 80 characters"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "N7tqeZh14Fws"
- },
- "source": [
- "### 1. For Loop"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "kKxs5tqDApuZ"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "I21jcEnK4IN7"
- },
- "source": [
- "### 2. List Comprehension"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "qaXe9UldAs3H"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "2kn8pxL-4yMG"
- },
- "source": [
- "### 3. Filter with named function"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "ywLqqFJNAvFm"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "IPIT6oXz40Q3"
- },
- "source": [
- "### 4. Filter with anonymous function"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "giIcFYkiAwiR"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "qj8Yod8_45z4"
- },
- "source": [
- "### 5. Pandas\n",
- "\n",
- "pandas documentation: [Working with Text Data](https://pandas.pydata.org/pandas-docs/stable/text.html)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "yRwPEHNcAzc_"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "8YaUZJvRp681"
- },
- "source": [
- "## Make new dataframe columns\n",
- "\n",
- "pandas documentation: [apply](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "DR_WZ-olA4-v"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "Ua74pMrGrsZR"
- },
- "source": [
- "### title length"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "p-Euz7tgA8Fd"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "OgsKArXPrz5n"
- },
- "source": [
- "### long title"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "b_WCRvvKA-IP"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "TonCXYPesUsT"
- },
- "source": [
- "### first letter"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "fhO4aABpBBgA"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "Etz1XeLKs6DL"
- },
- "source": [
- "### word count\n",
- "\n",
- "Using [`textstat`](https://github.com/shivam5992/textstat)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "GVIkRWchs4zR"
- },
- "outputs": [],
- "source": [
- "!pip install textstat"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "mY_M_MuaBFrF"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "UN_7FABhwDqc"
- },
- "source": [
- "## Rename column\n",
- "\n",
- "`title length` --> `title character count`\n",
- "\n",
- "pandas documentation: [rename](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "jvTif7sBBMpN"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "ca2pDtytr5tR"
- },
- "source": [
- "## Analyze the dataframe"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "AitNVDCFwWwc"
- },
- "source": [
- "### Describe\n",
- "\n",
- "pandas documentation: [describe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "yPo9RdxYBQ64"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "T0lc_o-xyjZU"
- },
- "source": [
- "### Sort values\n",
- "\n",
- "pandas documentation: [sort_values](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "kxE2swJ9-cG_"
- },
- "source": [
- "Five shortest titles, by character count"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "7t8DlpLhBVQa"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "NOEH4Ef5-kvo"
- },
- "source": [
- "Titles sorted reverse alphabetically"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "WkymeWDjBV8X"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "e4wr42FB0GV-"
- },
- "source": [
- "### Get value counts\n",
- "\n",
- "pandas documentation: [value_counts](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "D81LNGaI-6ya"
- },
- "source": [
- "Frequency counts of first letters"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "AdTQYsRKBZio"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "04NVokvTAwqK"
- },
- "source": [
- "Percentage of talks with long titles"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "uS8qp4hrBat6"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "mmYZL2QL0lgd"
- },
- "source": [
- "### Plot\n",
- "\n",
- "pandas documentation: [Visualization](https://pandas.pydata.org/pandas-docs/stable/visualization.html)\n",
- "\n",
- "\n",
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "c6gCotA9_B68"
- },
- "source": [
- "Top 5 most frequent first letters"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "DUmcVcdXBdkw"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "_Ngegk0bASty"
- },
- "source": [
- "Histogram of title lengths, in characters"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "y5oLu2D4BeKw"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "DiylH7LQw44u"
- },
- "source": [
- "# Assignment\n",
- "\n",
- "**Scrape** the talk descriptions. Hint: `soup.select('.presentation-description')`\n",
- "\n",
- "**Make** new columns in the dataframe:\n",
- "- description\n",
- "- description character count\n",
- "- description word count\n",
- "\n",
- "**Describe** all the dataframe's columns. What's the average description word count? The minimum? The maximum?\n",
- "\n",
- "**Answer** the question: Which descriptions could fit in a tweet?\n",
- "\n",
- "\n",
- "# Stretch Challenge\n",
- "\n",
- "**Make** another new column in the dataframe:\n",
- "- description grade level (you can use [this `textstat` function](https://github.com/shivam5992/textstat#the-flesch-kincaid-grade-level) to get the Flesh-Kincaid grade level)\n",
- "\n",
- "**Answer** the question: What's the distribution of grade levels? Plot a histogram.\n",
- "\n",
- "**Be aware** that [Textstat has issues when sentences aren't separated by spaces](https://github.com/shivam5992/textstat/issues/77#issuecomment-453734048). (A Lambda School Data Science student helped identify this issue, and emailed with the developer.) \n",
- "\n",
- "Also, [BeautifulSoup doesn't separate paragraph tags with spaces](https://bugs.launchpad.net/beautifulsoup/+bug/1768330).\n",
- "\n",
- "So, you may get some inaccurate or surprising grade level estimates here. Don't worry, that's ok — but optionally, can you do anything to try improving the grade level estimates?"
- ]
- }
- ],
- "metadata": {
- "colab": {
- "collapsed_sections": [],
- "name": "LS_DS_121_Scrape_and_process_data.ipynb",
- "provenance": [],
- "version": "0.3.2"
- },
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.1"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 1
-}
diff --git a/module2-find-portfolio-ideas/.DS_Store b/module2-find-portfolio-ideas/.DS_Store
new file mode 100644
index 00000000..5008ddfc
Binary files /dev/null and b/module2-find-portfolio-ideas/.DS_Store differ
diff --git a/module2-find-portfolio-ideas/template.md b/module2-find-portfolio-ideas/template.md
new file mode 100644
index 00000000..8d1cedb8
--- /dev/null
+++ b/module2-find-portfolio-ideas/template.md
@@ -0,0 +1,70 @@
+# Ideas for data storytelling
+
+## You
+
+What do you care about?
+
+
+What do you know about?
+
+
+What decisions do you face?
+
+
+## Seven templates
+
+Training Kit (https://learn.lambdaschool.com/ds/module/recedjanlbpqxic2r) explains the seven templates from Priceonomics.
+
+Can you apply the templates to your topics?
+
+1. Geographic Variation
+
+
+2. Trend related to the news
+
+
+3. Who does that?
+
+
+4. Answering a question people care about
+
+
+5. Valuable to businesses
+
+
+6. What's the most popular?
+
+
+7. Cost/Money rankings
+
+
+## Misconceptions
+
+What misconceptions do people have about your topic?
+
+
+## Examples
+
+What data storytelling example inspires you?
+
+
+Could you do a new hypothesis, for the same question?
+
+
+Could you do a new question, for the same topic?
+
+
+Could you do a new topic, with the same "style"?
+
+
+## Data
+
+Where could you search for data about your topic?
+
+
+# Assignment!
+
+Fill out the above template *twice*, for two different ideas.
+
+Then compare and contrast the two, and select the one you're leaning towards
+working on for your project week.
diff --git a/module2-join-datasets/LS_DS_122_Join_datasets.ipynb b/module2-join-datasets/LS_DS_122_Join_datasets.ipynb
deleted file mode 100644
index 85381492..00000000
--- a/module2-join-datasets/LS_DS_122_Join_datasets.ipynb
+++ /dev/null
@@ -1,485 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "pmU5YUal1eTZ"
- },
- "source": [
- "_Lambda School Data Science_\n",
- "\n",
- "# Join datasets\n",
- "\n",
- "Objectives\n",
- "- concatenate data with pandas\n",
- "- merge data with pandas\n",
- "\n",
- "Links\n",
- "- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)\n",
- " - Combine Data Sets: Standard Joins\n",
- "- Python Data Science Handbook\n",
- " - [Chapter 3.6](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html), Combining Datasets: Concat and Append\n",
- " - [Chapter 3.7](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html), Combining Datasets: Merge and Join"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "Mmi3J5fXrwZ3"
- },
- "source": [
- "## Download data\n",
- "\n",
- "We’ll work with a dataset of [3 Million Instacart Orders, Open Sourced](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2)!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 208
- },
- "colab_type": "code",
- "id": "K2kcrJVybjrW",
- "outputId": "3506f1c8-b01e-47b8-9c04-66c40c18ba2b"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "--2019-01-14 02:36:03-- https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz\n",
- "Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.109.181\n",
- "Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.109.181|:443... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 205548478 (196M) [application/x-gzip]\n",
- "Saving to: ‘instacart_online_grocery_shopping_2017_05_01.tar.gz’\n",
- "\n",
- "instacart_online_gr 100%[===================>] 196.03M 32.2MB/s in 6.2s \n",
- "\n",
- "2019-01-14 02:36:10 (31.8 MB/s) - ‘instacart_online_grocery_shopping_2017_05_01.tar.gz’ saved [205548478/205548478]\n",
- "\n"
- ]
- }
- ],
- "source": [
- "!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 242
- },
- "colab_type": "code",
- "id": "kqX40b2kdgAb",
- "outputId": "d61fd29f-378b-49a1-9720-63b9509f0f87"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "instacart_2017_05_01/\n",
- "instacart_2017_05_01/._aisles.csv\n",
- "instacart_2017_05_01/aisles.csv\n",
- "instacart_2017_05_01/._departments.csv\n",
- "instacart_2017_05_01/departments.csv\n",
- "instacart_2017_05_01/._order_products__prior.csv\n",
- "instacart_2017_05_01/order_products__prior.csv\n",
- "instacart_2017_05_01/._order_products__train.csv\n",
- "instacart_2017_05_01/order_products__train.csv\n",
- "instacart_2017_05_01/._orders.csv\n",
- "instacart_2017_05_01/orders.csv\n",
- "instacart_2017_05_01/._products.csv\n",
- "instacart_2017_05_01/products.csv\n"
- ]
- }
- ],
- "source": [
- "!tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 34
- },
- "colab_type": "code",
- "id": "YbCvZZCBfHCI",
- "outputId": "953334d1-6d82-430b-c39d-50c9a1240eee"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "/content/instacart_2017_05_01\n"
- ]
- }
- ],
- "source": [
- "%cd instacart_2017_05_01"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "RsA14wiKr03j"
- },
- "source": [
- "## Goal: Reproduce this example\n",
- "\n",
- "The first two orders for user id 1:"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 312
- },
- "colab_type": "code",
- "id": "vLqOTMcfjprg",
- "outputId": "808c9f60-f337-4719-ce62-c4622815c4db"
- },
- "outputs": [
- {
- "data": {
- "text/html": [
- "
"
- ],
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "tags": []
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "from IPython.display import display, Image\n",
- "url = 'https://cdn-images-1.medium.com/max/1600/1*vYGFQCafJtGBBX5mbl0xyw.png'\n",
- "example = Image(url=url, width=600)\n",
- "\n",
- "display(example)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "nPwG8aM_txl4"
- },
- "source": [
- "## Load data\n",
- "\n",
- "Here's a list of all six CSV filenames"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "Ksah0cOrfdJQ"
- },
- "outputs": [],
- "source": [
- "!ls -lh"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "AHT7fKuxvPgV"
- },
- "source": [
- "For each CSV\n",
- "- Load it with pandas\n",
- "- Look at the dataframe's shape\n",
- "- Look at its head (first rows)\n",
- "- `display(example)`\n",
- "- Which columns does it have in common with the example we want to reproduce?"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "cB_5T6TprcUH"
- },
- "source": [
- "### aisles"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "JB3bvwSDK6v3"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "9-GrkqM6rfXr"
- },
- "source": [
- "### departments"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "NYIcif0dK9_5"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "VhhVcn9kK-nG"
- },
- "source": [
- "### order_products__prior"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "-49qTkPlLBT_"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "HVYJEKJcLBut"
- },
- "source": [
- "### order_products__train"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "gbPKUMb3LDxb"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "LYPrWUJnrp7G"
- },
- "source": [
- "### orders"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "aFyl_7vyLJxS"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "nIX3SYXersao"
- },
- "source": [
- "### products"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "9icvQgRfLLU1"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "cbHumXOiJfy2"
- },
- "source": [
- "## Concatenate order_products__prior and order_products__train"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "TJ23kqpAY8Vv"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "Z1YRw5ypJuv2"
- },
- "source": [
- "## Get a subset of orders — the first two orders for user id 1"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "YFBTseoyZAbj"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "3K1p0QHuKPnt"
- },
- "source": [
- "## Merge dataframes"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "DaDcnygCLZvM"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "kAMtvSQWPUcj"
- },
- "source": [
- "# Assignment\n",
- "\n",
- "These are the top 10 most frequently ordered products. How many times was each ordered? \n",
- "\n",
- "1. Banana\n",
- "2. Bag of Organic Bananas\n",
- "3. Organic Strawberries\n",
- "4. Organic Baby Spinach \n",
- "5. Organic Hass Avocado\n",
- "6. Organic Avocado\n",
- "7. Large Lemon \n",
- "8. Strawberries\n",
- "9. Limes \n",
- "10. Organic Whole Milk\n",
- "\n",
- "First, write down which columns you need and which dataframes have them.\n",
- "\n",
- "Next, merge these into a single dataframe.\n",
- "\n",
- "Then, use pandas functions from the previous lesson to get the counts of the top 10 most frequently ordered products.\n",
- "\n",
- "## Stretch challenge\n",
- "\n",
- "The [Instacart blog post](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2) has a visualization of \"**Popular products** purchased earliest in the day (green) and latest in the day (red).\" \n",
- "\n",
- "The post says,\n",
- "\n",
- "> \"We can also see the time of day that users purchase specific products.\n",
- "\n",
- "> Healthier snacks and staples tend to be purchased earlier in the day, whereas ice cream (especially Half Baked and The Tonight Dough) are far more popular when customers are ordering in the evening.\n",
- "\n",
- "> **In fact, of the top 25 latest ordered products, the first 24 are ice cream! The last one, of course, is a frozen pizza.**\"\n",
- "\n",
- "Your challenge is to reproduce the list of the top 25 latest ordered popular products.\n",
- "\n",
- "We'll define \"popular products\" as products with more than 2,900 orders."
- ]
- }
- ],
- "metadata": {
- "colab": {
- "collapsed_sections": [],
- "name": "LS_DS_122_Join_datasets.ipynb",
- "provenance": [],
- "version": "0.3.2"
- },
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.1"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 1
-}
diff --git a/module3-make-explanatory-visualizations/.DS_Store b/module3-make-explanatory-visualizations/.DS_Store
new file mode 100644
index 00000000..5008ddfc
Binary files /dev/null and b/module3-make-explanatory-visualizations/.DS_Store differ
diff --git a/module3-make-explanatory-visualizations/LS_DS_123_Make_Explanatory_Visualizations.ipynb b/module3-make-explanatory-visualizations/LS_DS_123_Make_Explanatory_Visualizations.ipynb
new file mode 100644
index 00000000..e4705546
--- /dev/null
+++ b/module3-make-explanatory-visualizations/LS_DS_123_Make_Explanatory_Visualizations.ipynb
@@ -0,0 +1,536 @@
+{
+ "nbformat": 4,
+ "nbformat_minor": 0,
+ "metadata": {
+ "colab": {
+ "name": "LS_DS_123_Make_Explanatory_Visualizations.ipynb",
+ "version": "0.3.2",
+ "provenance": [],
+ "collapsed_sections": []
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.1"
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ }
+ },
+ "cells": [
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "-8-trVo__vRE"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "_Lambda School Data Science_\n",
+ "\n",
+ "# Make Explanatory Visualizations\n",
+ "\n",
+ "### Objectives\n",
+ "\n",
+ "- identify misleading visualizations and how to fix them\n",
+ "- use Seaborn to visualize distributions and relationships with continuous and discrete variables\n",
+ "- add emphasis and annotations to transform visualizations from exploratory to explanatory\n",
+ "- remove clutter from visualizations\n",
+ "\n",
+ "### Links\n",
+ "\n",
+ "- [How to Spot Visualization Lies](https://flowingdata.com/2017/02/09/how-to-spot-visualization-lies/)\n",
+ "- [Visual Vocabulary - Vega Edition](http://ft.com/vocabulary)\n",
+ "- [Choosing a Python Visualization Tool flowchart](http://pbpython.com/python-vis-flowchart.html)\n",
+ "- [Seaborn example gallery](http://seaborn.pydata.org/examples/index.html) & [tutorial](http://seaborn.pydata.org/tutorial.html)\n",
+ "- [Strong Titles Are The Biggest Bang for Your Buck](http://stephanieevergreen.com/strong-titles/)\n",
+ "- [Remove to improve (the data-ink ratio)](https://www.darkhorseanalytics.com/blog/data-looks-better-naked)\n",
+ "- [How to Generate FiveThirtyEight Graphs in Python](https://www.dataquest.io/blog/making-538-plots/)"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "s-24T844-8qv",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "# Avoid Misleading Visualizations\n",
+ "\n",
+ "Did you find/discuss any interesting misleading visualizations in your Walkie Talkie?"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "Qzxt9ntsNjs0",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## What makes a visualization misleading?\n",
+ "\n",
+ "[5 Ways Writers Use Misleading Graphs To Manipulate You](https://venngage.com/blog/misleading-graphs/)"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "q7_DUiENNvxk",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Two y-axes\n",
+ "\n",
+ "
\n",
+ " \n",
+ " Other Examples: \n",
+ " - [Spurious Correlations](https://tylervigen.com/spurious-correlations)\n",
+ " - \n",
+ " - \n",
+ " - "
+ ]
+ },
+ {
+ "metadata": {
+ "id": "oIijNBDMNv2k",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Y-axis doesn't start at zero.\n",
+ "\n",
+ "
"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "ISB2p8vZNv6r",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Pie Charts are bad\n",
+ "\n",
+ "
"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "67CsAzu1NwBJ",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Pie charts that omit data are extra bad\n",
+ " \n",
+ "- A guy makes a misleading chart that goes viral\n",
+ "\n",
+ " What does this chart imply at first glance? You don't want your user to have to do a lot of work to interpret your graph correctly. You want the first-glance conclusions to be the correct ones.\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ "- It gets picked up by overworked journalists (assuming incompetence before malice)\n",
+ " \n",
+ " \n",
+ " \n",
+ "- Even after the chart's implications have been refuted, it's hard to stop a bad (although compelling) visualization from being passed around.\n",
+ "\n",
+ " \n",
+ "\n",
+ "**[\"yea I understand a pie chart was probably not the best choice to present this data.\"](https://twitter.com/michaelbatnick/status/1037036440494985216)**"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "FYXmlToEOOTC",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Pie Charts that compare unrelated things are next-level extra bad\n",
+ "\n",
+ "
\n"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "IwtMQpY_QFUw",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Be careful about how you use volume to represent quantities:\n",
+ "\n",
+ "radius vs diameter vs volume\n",
+ "\n",
+ "
"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "tTuAWjSBRsc7",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Don't cherry-pick timelines or specific subsets of your data:\n",
+ "\n",
+ "
\n",
+ "\n",
+ "Look at how selectively the writer has chosen which years to show in the legend on the right side.\n",
+ "\n",
+ "\n",
+ "\n",
+ "Try the tool that was used to make the graphic for yourself\n",
+ "\n",
+ "\n",
+ " "
+ ]
+ },
+ {
+ "metadata": {
+ "id": "Xs13S7p4Srme",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Use Relative units rather than Absolute Units\n",
+ "\n",
+ "
"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "CIMt5OiuTlrr",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Avoid 3D graphs unless having the extra dimension is effective\n",
+ "\n",
+ "Usually you can split 3D graphs into multiple 2D graphs\n",
+ "\n",
+ "3D graphs that are interactive can be very cool. (See Plotly and Bokeh)\n",
+ "\n",
+ "
"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "GATMu9IqUlIj",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Don't go against typical conventions\n",
+ "\n",
+ "
"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "g6bKgZ0m_ynS",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "# Tips for choosing an appropriate visualization:"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "WtBsVnO4VHiJ",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Use Appropriate \"Visual Vocabulary\"\n",
+ "\n",
+ "[Visual Vocabulary - Vega Edition](http://ft.com/vocabulary)"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "H_QM9FHqVT7T",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## What are the properties of your data?\n",
+ "- Is your primary variable of interest continuous or discrete?\n",
+ "- Is it in wide or long (tidy) format?\n",
+ "- Does your visualization involve multiple variables?\n",
+ "- How many dimensions do you need to include on your plot?\n",
+ "\n",
+ "Can you express the main idea of your visualization in a single sentence?\n",
+ "\n",
+ "How hard does your visualization make the user work in order to draw the intended conclusion?"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "5EqXxnJeB89_",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Which Visualization tool is most appropriate? \n",
+ "\n",
+ "[Choosing a Python Visualization Tool flowchart](http://pbpython.com/python-vis-flowchart.html)"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "5_na7Oy3NGKA",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "# Making Explanatory Visualizations with Seaborn"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "ORUwQD6F-VYg",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Today we will reproduce this [example by FiveThirtyEight](https://fivethirtyeight.com/features/al-gores-new-movie-exposes-the-big-flaw-in-online-movie-ratings/):\n",
+ "\n"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "ya_w5WORGs-n",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 355
+ },
+ "outputId": "0dbf77af-aa69-4d25-cdb7-bf0e7f8058ad"
+ },
+ "cell_type": "code",
+ "source": [
+ "from IPython.display import display, Image\n",
+ "\n",
+ "url = 'https://fivethirtyeight.com/wp-content/uploads/2017/09/mehtahickey-inconvenient-0830-1.png'\n",
+ "example = Image(url=url, width=400)\n",
+ "\n",
+ "display(example)"
+ ],
+ "execution_count": 1,
+ "outputs": [
+ {
+ "output_type": "display_data",
+ "data": {
+ "text/html": [
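+ "<img src=\"https://fivethirtyeight.com/wp-content/uploads/2017/09/mehtahickey-inconvenient-0830-1.png\" width=\"400\"/>"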
+ ],
+ "text/plain": [
+ "<IPython.core.display.Image object>"
+ ]
+ },
+ "metadata": {
+ "tags": []
+ }
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "HP4DALiRG3sC"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Using this data: https://github.com/fivethirtyeight/data/tree/master/inconvenient-sequel"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "HioPkYtUG03B"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Links\n",
+ "- [Strong Titles Are The Biggest Bang for Your Buck](http://stephanieevergreen.com/strong-titles/)\n",
+ "- [Remove to improve (the data-ink ratio)](https://www.darkhorseanalytics.com/blog/data-looks-better-naked)\n",
+ "- [How to Generate FiveThirtyEight Graphs in Python](https://www.dataquest.io/blog/making-538-plots/)"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "0w_iMnQ6-VoQ"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Make prototypes\n",
+ "\n",
+ "This helps us understand the problem"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "5uz0eEaEN-GO",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "%matplotlib inline\n",
+ "import matplotlib.pyplot as plt\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "\n",
+ "\n",
+ "plt.style.use('fivethirtyeight')\n",
+ "\n",
+ "fake = pd.Series([38, 3, 2, 1, 2, 4, 6, 5, 5, 33], \n",
+ " index=range(1,11))\n",
+ "\n",
+ "fake.plot.bar(color='C1', width=0.9);"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "KZ0VLOV8OyRr",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "fake2 = pd.Series(\n",
+ " [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
+ " 2, 2, 2, \n",
+ " 3, 3, 3,\n",
+ " 4, 4,\n",
+ " 5, 5, 5,\n",
+ " 6, 6, 6, 6,\n",
+ " 7, 7, 7, 7, 7,\n",
+ " 8, 8, 8, 8,\n",
+ " 9, 9, 9, 9, \n",
+ " 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10])\n",
+ "\n",
+ "fake2.value_counts().sort_index().plot.bar(color='C1', width=0.9);"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "mZb3UZWO-q05"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Annotate with text"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "f6U1vswr_uWp",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "x8jRZkpB_MJ6"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "## Reproduce with real data"
+ ]
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "3SOHJckDUPI8",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "df = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/inconvenient-sequel/ratings.csv')"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "code",
+ "id": "cDltXxhC_yG-",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ ""
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "colab_type": "text",
+ "id": "NMEswXWh9mqw"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "# ASSIGNMENT\n",
+ "\n",
+ "Replicate the lesson code. I recommend that you [do not copy-paste](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit).\n",
+ "\n",
+ "# STRETCH OPTIONS\n",
+ "\n",
+ "#### Reproduce another example from [FiveThirtyEight's shared data repository](https://data.fivethirtyeight.com/).\n",
+ "\n",
+ "For example:\n",
+ "- [thanksgiving-2015](https://fivethirtyeight.com/features/heres-what-your-part-of-america-eats-on-thanksgiving/) (try the [`altair`](https://altair-viz.github.io/gallery/index.html#maps) library)\n",
+ "- [candy-power-ranking](https://fivethirtyeight.com/features/the-ultimate-halloween-candy-power-ranking/) (try the [`statsmodels`](https://www.statsmodels.org/stable/index.html) library)\n",
+ "- or another example of your choice!\n",
+ "\n",
+ "#### Make more charts!\n",
+ "\n",
+ "Choose a chart you want to make, from [Visual Vocabulary - Vega Edition](http://ft.com/vocabulary).\n",
+ "\n",
+ "Find the chart in an example gallery of a Python data visualization library:\n",
+ "- [Seaborn](http://seaborn.pydata.org/examples/index.html)\n",
+ "- [Altair](https://altair-viz.github.io/gallery/index.html)\n",
+ "- [Matplotlib](https://matplotlib.org/gallery.html)\n",
+ "- [Pandas](https://pandas.pydata.org/pandas-docs/stable/visualization.html)\n",
+ "\n",
+ "Reproduce the chart. [Optionally, try the \"Ben Franklin Method.\"](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit) If you want, experiment and make changes.\n",
+ "\n",
+ "Take notes. Consider sharing your work with your cohort!\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n",
+ "\n"
+ ]
+ }
+ ]
+}
\ No newline at end of file
diff --git a/module3-reshape-data/LS_DS_123_Reshape_data.ipynb b/module3-reshape-data/LS_DS_123_Reshape_data.ipynb
deleted file mode 100644
index 2cc98a7a..00000000
--- a/module3-reshape-data/LS_DS_123_Reshape_data.ipynb
+++ /dev/null
@@ -1,756 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "oeWq7mGFZm6L"
- },
- "source": [
- "_Lambda School Data Science_\n",
- "\n",
- "# Reshape data\n",
- "\n",
- "Objectives\n",
- "- understand tidy data formatting\n",
- "- melt and pivot data with pandas\n",
- "\n",
- "Links\n",
- "- [Tidy Data](https://en.wikipedia.org/wiki/Tidy_data)\n",
- "- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)\n",
- " - Tidy Data\n",
- " - Reshaping Data\n",
- "- Python Data Science Handbook\n",
- " - [Chapter 3.8](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html), Aggregation and Grouping\n",
- " - [Chapter 3.9](https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html), Pivot Tables\n",
- " \n",
- "Reference\n",
- "- pandas documentation: [Reshaping and Pivot Tables](https://pandas.pydata.org/pandas-docs/stable/reshaping.html)\n",
- "- Modern Pandas, Part 5: [Tidy Data](https://tomaugspurger.github.io/modern-5-tidy.html)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "u2-7QkU3eR_e"
- },
- "source": [
- "## Why reshape data?\n",
- "\n",
- "#### Some libraries prefer data in different formats\n",
- "\n",
- "For example, the Seaborn data visualization library often (but not always) prefers data in \"Tidy\" format.\n",
- "\n",
- "> \"[Seaborn will be most powerful when your datasets have a particular organization.](https://seaborn.pydata.org/introduction.html#organizing-datasets) This format is alternately called “long-form” or “tidy” data and is described in detail by Hadley Wickham. The rules can be simply stated:\n",
- "\n",
- "> - Each variable is a column\n",
- "> - Each observation is a row\n",
- "\n",
- "> A helpful mindset for determining whether your data are tidy is to think backwards from the plot you want to draw. From this perspective, a “variable” is something that will be assigned a role in the plot.\"\n",
- "\n",
- "#### Data science is often about putting square pegs in round holes\n",
- "\n",
- "Here's an inspiring [video clip from _Apollo 13_](https://www.youtube.com/watch?v=ry55--J4_VQ): “Invent a way to put a square peg in a round hole.” It's a good metaphor for data wrangling!"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "3av1dYbRZ4k2"
- },
- "source": [
- "## Upgrade Seaborn\n",
- "\n",
- "Run the cell below which upgrades Seaborn and automatically restarts your Google Colab Runtime."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "AOLhnquFxao7"
- },
- "outputs": [],
- "source": [
- "!pip install seaborn --upgrade\n",
- "import os\n",
- "os.kill(os.getpid(), 9)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "tE_BXOAjaWB_"
- },
- "source": [
- "## Hadley Wickham's Examples\n",
- "\n",
- "From his paper, [Tidy Data](http://vita.had.co.nz/papers/tidy-data.html)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "PL6hzS3yYsNt"
- },
- "outputs": [],
- "source": [
- "%matplotlib inline\n",
- "import pandas as pd\n",
- "import numpy as np\n",
- "import seaborn as sns\n",
- "\n",
- "table1 = pd.DataFrame(\n",
- " [[np.nan, 2],\n",
- " [16, 11], \n",
- " [3, 1]],\n",
- " index=['John Smith', 'Jane Doe', 'Mary Johnson'], \n",
- " columns=['treatmenta', 'treatmentb'])\n",
- "\n",
- "table2 = table1.T"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "YvfghLi3bu6S"
- },
- "source": [
- "\"Table 1 provides some data about an imaginary experiment in a format commonly seen in the wild. \n",
- "\n",
- "The table has two columns and three rows, and both rows and columns are labelled.\""
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 141
- },
- "colab_type": "code",
- "id": "5ZidjYdNikwF",
- "outputId": "70c926ab-df34-4432-d518-9029390aacbf"
- },
- "outputs": [
- {
- "data": {
- "text/html": [
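- "<div>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>treatmenta</th>\n",
- " <th>treatmentb</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>John Smith</th>\n",
- " <td>NaN</td>\n",
- " <td>2</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>Jane Doe</th>\n",
- " <td>16.0</td>\n",
- " <td>11</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>Mary Johnson</th>\n",
- " <td>3.0</td>\n",
- " <td>1</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"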
- ],
- "text/plain": [
- " treatmenta treatmentb\n",
- "John Smith NaN 2\n",
- "Jane Doe 16.0 11\n",
- "Mary Johnson 3.0 1"
- ]
- },
- "execution_count": 2,
- "metadata": {
- "tags": []
- },
- "output_type": "execute_result"
- }
- ],
- "source": [
- "table1"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "wIfPYP4rcDbO"
- },
- "source": [
- "\"There are many ways to structure the same underlying data. \n",
- "\n",
- "Table 2 shows the same data as Table 1, but the rows and columns have been transposed. The data is the same, but the layout is different.\""
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 110
- },
- "colab_type": "code",
- "id": "mYBLbVTVKR2h",
- "outputId": "53cc8be1-c7b3-4964-8aaa-e6712f8d14de"
- },
- "outputs": [
- {
- "data": {
- "text/html": [
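- "<div>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\">\n",
- " <th></th>\n",
- " <th>John Smith</th>\n",
- " <th>Jane Doe</th>\n",
- " <th>Mary Johnson</th>\n",
- " </tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr>\n",
- " <th>treatmenta</th>\n",
- " <td>NaN</td>\n",
- " <td>16.0</td>\n",
- " <td>3.0</td>\n",
- " </tr>\n",
- " <tr>\n",
- " <th>treatmentb</th>\n",
- " <td>2.0</td>\n",
- " <td>11.0</td>\n",
- " <td>1.0</td>\n",
- " </tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"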
- ],
- "text/plain": [
- " John Smith Jane Doe Mary Johnson\n",
- "treatmenta NaN 16.0 3.0\n",
- "treatmentb 2.0 11.0 1.0"
- ]
- },
- "execution_count": 3,
- "metadata": {
- "tags": []
- },
- "output_type": "execute_result"
- }
- ],
- "source": [
- "table2"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "RaZuIwqNcRpr"
- },
- "source": [
- "\"Table 3 reorganises Table 1 to make the values, variables and observations more clear.\n",
- "\n",
- "Table 3 is the tidy version of Table 1. Each row represents an observation, the result of one treatment on one person, and each column is a variable.\"\n",
- "\n",
- "| name | trt | result |\n",
- "|--------------|-----|--------|\n",
- "| John Smith | a | - |\n",
- "| Jane Doe | a | 16 |\n",
- "| Mary Johnson | a | 3 |\n",
- "| John Smith | b | 2 |\n",
- "| Jane Doe | b | 11 |\n",
- "| Mary Johnson | b | 1 |"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "8P88YyUvaxAV"
- },
- "source": [
- "## Table 1 --> Tidy\n",
- "\n",
- "We can use the pandas `melt` function to reshape Table 1 into Tidy format."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "vOUzvON0t8El"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "uYb2vG44az2m"
- },
- "source": [
- "## Table 2 --> Tidy"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "yP_oYbGsazdU"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "kRwnCeDYa27n"
- },
- "source": [
- "## Tidy --> Table 1\n",
- "\n",
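- "The `pivot_table` function is roughly the inverse of `melt`: it reshapes long (tidy) data back into wide format, aggregating any duplicate entries (by mean, by default)."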
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "BxcwXHS9H7RB"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "nR4dlpFQa5Pw"
- },
- "source": [
- "## Tidy --> Table 2"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "flcwLnVdJ-TD"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "7OwdtbQqgG4j"
- },
- "source": [
- "## Load Instacart data\n",
- "\n",
- "Let's return to the dataset of [3 Million Instacart Orders](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "RdXhRmSbgbBc"
- },
- "source": [
- "If necessary, uncomment and run the cells below to re-download and extract the data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "SoX-00UugVZD"
- },
- "outputs": [],
- "source": [
- "# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "tDGkv5vngXTw"
- },
- "outputs": [],
- "source": [
- "# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "covQKAHggl80"
- },
- "source": [
- "Run these cells to load the data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "dsbev9Gi0JYo"
- },
- "outputs": [],
- "source": [
- "%cd instacart_2017_05_01"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "1AHEpFPcMTn1"
- },
- "outputs": [],
- "source": [
- "products = pd.read_csv('products.csv')\n",
- "\n",
- "order_products = pd.concat([pd.read_csv('order_products__prior.csv'), \n",
- " pd.read_csv('order_products__train.csv')])\n",
- "\n",
- "orders = pd.read_csv('orders.csv')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "bmgW_DxohBV5"
- },
- "source": [
- "## Goal: Reproduce part of this example\n",
- "\n",
- "Instead of a plot with 50 products, we'll just do two — the first products from each list\n",
- "- Half And Half Ultra Pasteurized\n",
- "- Half Baked Frozen Yogurt"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 382
- },
- "colab_type": "code",
- "id": "p4CdH8hkg5RJ",
- "outputId": "62f6104b-064d-4624-f1e9-5da250e2ea80"
- },
- "outputs": [
- {
- "data": {
- "text/html": [
- "<img src=\"https://cdn-images-1.medium.com/max/1600/1*wKfV6OV-_1Ipwrl7AjjSuw.png\" width=\"600\"/>"
- ],
- "text/plain": [
- "<IPython.core.display.Image object>"
- ]
- },
- "metadata": {
- "tags": []
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "from IPython.display import display, Image\n",
- "url = 'https://cdn-images-1.medium.com/max/1600/1*wKfV6OV-_1Ipwrl7AjjSuw.png'\n",
- "example = Image(url=url, width=600)\n",
- "\n",
- "display(example)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "VgXHJM-mhvuo"
- },
- "source": [
- "So, given a `product_name` we need to calculate its `order_hour_of_day` pattern."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "PZxgqPU7h8cj"
- },
- "source": [
- "## Subset and Merge"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "6IymsG0BRYQY"
- },
- "outputs": [],
- "source": [
- "product_names = ['Half Baked Frozen Yogurt', 'Half And Half Ultra Pasteurized']"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "LUoNA7_UTNkp"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "lOw6aZ3oiPLf"
- },
- "source": [
- "## 4 ways to reshape and plot"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "5W-vHcWZiFKv"
- },
- "source": [
- "### 1. value_counts"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "QApT8TeRTsgh"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "CiB9xmZ4iIqt"
- },
- "source": [
- "### 2. crosstab"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "aCzF5spQWd_f"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "wCp-qjbriUze"
- },
- "source": [
- "### 3. pivot_table"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "O8d6_TDKNsxB"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "48wCoJowigCf"
- },
- "source": [
- "### 4. melt"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "VnslvFfvYSIk"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# ASSIGNMENT\n",
- "- Replicate the lesson code\n",
- "- Complete the code cells we skipped near the beginning of the notebook\n",
- " - Table 2 --> Tidy\n",
- " - Tidy --> Table 2"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "- Load seaborn's `flights` dataset by running the cell below. Then create a pivot table showing the number of passengers by month and year. Use year for the index and month for the columns. You've done it right if you get 112 passengers for January 1949 and 432 passengers for December 1960."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "flights = sns.load_dataset('flights')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# STRETCH OPTIONS\n",
- "\n",
- "_Try whatever sounds most interesting to you!_\n",
- "\n",
- "- Replicate more of Instacart's visualization showing \"Hour of Day Ordered\" vs \"Percent of Orders by Product\"\n",
- "- Replicate parts of the other visualization from [Instacart's blog post](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2), showing \"Number of Purchases\" vs \"Percent Reorder Purchases\"\n",
- "- Get the most recent order for each user in Instacart's dataset. This is a useful baseline when [predicting a user's next order](https://www.kaggle.com/c/instacart-market-basket-analysis)\n",
- "- Replicate parts of the blog post linked at the top of this notebook: [Modern Pandas, Part 5: Tidy Data](https://tomaugspurger.github.io/modern-5-tidy.html)"
- ]
- }
- ],
- "metadata": {
- "colab": {
- "collapsed_sections": [],
- "name": "LS_DS_123_Reshape_data.ipynb",
- "provenance": [],
- "version": "0.3.2"
- },
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.1"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 1
-}
diff --git a/module4-make-features/LS_DS_124_Make_features.ipynb b/module4-make-features/LS_DS_124_Make_features.ipynb
deleted file mode 100644
index ce2f9a0d..00000000
--- a/module4-make-features/LS_DS_124_Make_features.ipynb
+++ /dev/null
@@ -1,354 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "qacqiXogluN_"
- },
- "source": [
- "_Lambda School Data Science_\n",
- "\n",
- "# Make features\n",
- "\n",
- "Objectives\n",
- "- understand the purpose of feature engineering\n",
- "- work with strings in pandas\n",
- "- work with dates and times in pandas\n",
- "\n",
- "Links\n",
- "- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)\n",
- "- Python Data Science Handbook\n",
- " - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations\n",
- " - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "iSGiSktAoWIx"
- },
- "source": [
- "## Get LendingClub data\n",
- "\n",
- "[Source](https://www.lendingclub.com/info/download-data.action)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "2ugxlWXimoHn"
- },
- "outputs": [],
- "source": [
- "!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "-4sk6qPgmpmN"
- },
- "outputs": [],
- "source": [
- "!unzip LoanStats_2018Q4.csv.zip"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "seh5oNE1nD0X"
- },
- "outputs": [],
- "source": [
- "!head LoanStats_2018Q4.csv"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "3nAIRCZdofrY"
- },
- "source": [
- "## Load LendingClub data\n",
- "\n",
- "pandas documentation\n",
- "- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)\n",
- "- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "3-8Vn3y6ooBC"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "1b5_hMTio2Ly"
- },
- "source": [
- "## Work with strings"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For machine learning, we usually want to replace strings with numbers.\n",
- "\n",
- "We can get info about which columns have a datatype of \"object\" (strings)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "WOL7QPVNo3F4"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Convert `int_rate`\n",
- "\n",
- "Define a function to remove percent signs from strings and convert to floats"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Apply the function to the `int_rate` column"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Clean `emp_title`\n",
- "\n",
- "Look at top 20 titles"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "How often is `emp_title` null?"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Clean the title and handle missing values"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Create `emp_title_manager`\n",
- "\n",
- "pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "s8BcCY6so3by"
- },
- "source": [
- "## Work with dates"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "pandas documentation\n",
- "- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)\n",
- "- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) \"You can access these properties via the `.dt` accessor\""
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 0,
- "metadata": {
- "colab": {},
- "colab_type": "code",
- "id": "kNrKxOTeo4W3"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# ASSIGNMENT\n",
- "\n",
- "- Replicate the lesson code.\n",
- "\n",
- "- Convert the `term` column from string to integer.\n",
- "\n",
- "- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is \"Current\" or \"Fully Paid.\" Else it should contain the integer 0.\n",
- "\n",
- "- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "L8k0LiHmo5EU"
- },
- "source": [
- "# STRETCH OPTIONS\n",
- "\n",
- "You can do more with the LendingClub or Instacart datasets.\n",
- "\n",
- "LendingClub options:\n",
- "- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.\n",
- "- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. \n",
- "- Take initiative and work on your own ideas!\n",
- "\n",
- "Instacart options:\n",
- "- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)\n",
- "- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)\n",
- "- Take initiative and work on your own ideas!"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab_type": "text",
- "id": "0_7PXF7lpEXg"
- },
- "source": [
- "You can uncomment and run the cells below to re-download and extract the Instacart data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# %cd instacart_2017_05_01"
- ]
- }
- ],
- "metadata": {
- "colab": {
- "collapsed_sections": [],
- "name": "LS_DS_124_Make_features.ipynb",
- "provenance": [],
- "version": "0.3.2"
- },
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.1"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 1
-}
diff --git a/module4-sequence-your-narrative/.DS_Store b/module4-sequence-your-narrative/.DS_Store
new file mode 100644
index 00000000..5008ddfc
Binary files /dev/null and b/module4-sequence-your-narrative/.DS_Store differ
diff --git a/module4-sequence-your-narrative/LS_DS_124_Sequence_your_narrative.ipynb b/module4-sequence-your-narrative/LS_DS_124_Sequence_your_narrative.ipynb
new file mode 100644
index 00000000..bcc54a82
--- /dev/null
+++ b/module4-sequence-your-narrative/LS_DS_124_Sequence_your_narrative.ipynb
@@ -0,0 +1,479 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "JbDHnhet8CWy"
+ },
+ "source": [
+ "_Lambda School Data Science_\n",
+ "\n",
+ "# Sequence your narrative\n",
+ "\n",
+ "Today we will create a sequence of visualizations inspired by [Hans Rosling's 200 Countries, 200 Years, 4 Minutes](https://www.youtube.com/watch?v=jbkSRLYSojo).\n",
+ "\n",
+ "Using this [data from Gapminder](https://github.com/open-numbers/ddf--gapminder--systema_globalis/):\n",
+ "- [Income Per Person (GDP Per Capita, Inflation Adjusted) by Geo & Time](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--income_per_person_gdppercapita_ppp_inflation_adjusted--by--geo--time.csv)\n",
+ "- [Life Expectancy (in Years) by Geo & Time](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--life_expectancy_years--by--geo--time.csv)\n",
+ "- [Population Totals, by Geo & Time](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv)\n",
+ "- [Entities](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv)\n",
+ "- [Concepts](https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--concepts.csv)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "zyPYtsY6HtIK"
+ },
+ "source": [
+ "Objectives\n",
+ "- sequence multiple visualizations\n",
+ "- combine qualitative anecdotes with quantitative aggregates\n",
+ "\n",
+ "Links\n",
+ "- [Hans Rosling’s TED talks](https://www.ted.com/speakers/hans_rosling)\n",
+ "- [Spiralling global temperatures from 1850-2016](https://twitter.com/ed_hawkins/status/729753441459945474)\n",
+ "- \"[The Pudding](https://pudding.cool/) explains ideas debated in culture with visual essays.\"\n",
+ "- [A Data Point Walks Into a Bar](https://lisacharlotterost.github.io/2016/12/27/datapoint-in-bar/): a thoughtful blog post about emotion and empathy in data storytelling"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "SxTJBgRAW3jD"
+ },
+ "source": [
+ "## Make a plan\n",
+ "\n",
+ "#### How to present the data?\n",
+ "\n",
+ "Variables --> Visual Encodings\n",
+ "- Income --> x\n",
+ "- Lifespan --> y\n",
+ "- Region --> color\n",
+ "- Population --> size\n",
+ "- Year --> animation frame (alternative: small multiple)\n",
+ "- Country --> annotation\n",
+ "\n",
+ "Qualitative --> Verbal\n",
+ "- Editorial / contextual explanation --> audio narration (alternative: text)\n",
+ "\n",
+ "\n",
+ "#### How to structure the data?\n",
+ "\n",
+ "| Year | Country | Region | Income | Lifespan | Population |\n",
+ "|------|---------|----------|--------|----------|------------|\n",
+ "| 1818 | USA | Americas | ### | ## | # |\n",
+ "| 1918 | USA | Americas | #### | ### | ## |\n",
+ "| 2018 | USA | Americas | ##### | ### | ### |\n",
+ "| 1818 | China | Asia | # | # | # |\n",
+ "| 1918 | China | Asia | ## | ## | ### |\n",
+ "| 2018 | China | Asia | ### | ### | ##### |\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "3ebEjShbWsIy"
+ },
+ "source": [
+ "## Upgrade Seaborn\n",
+ "\n",
+ "Make sure you have at least version 0.9.0.\n",
+ "\n",
+ "In Colab, select **Restart runtime** after you run the `pip` command so that the upgraded version is the one imported."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "4RSxbu7rWr1p"
+ },
+ "outputs": [],
+ "source": [
+ "!pip install --upgrade seaborn"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "5sQ0-7JUWyN4"
+ },
+ "outputs": [],
+ "source": [
+ "import seaborn as sns\n",
+ "sns.__version__"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "S2dXWRTFTsgd"
+ },
+ "source": [
+ "## More imports"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "y-TgL_mA8OkF"
+ },
+ "outputs": [],
+ "source": [
+ "%matplotlib inline\n",
+ "import matplotlib.pyplot as plt\n",
+ "import numpy as np\n",
+ "import pandas as pd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "CZGG5prcTxrQ"
+ },
+ "source": [
+ "## Load & look at data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "-uE25LHD8CW0"
+ },
+ "outputs": [],
+ "source": [
+ "income = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--income_per_person_gdppercapita_ppp_inflation_adjusted--by--geo--time.csv')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "gg_pJslMY2bq"
+ },
+ "outputs": [],
+ "source": [
+ "lifespan = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--life_expectancy_years--by--geo--time.csv')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "F6knDUevY-xR"
+ },
+ "outputs": [],
+ "source": [
+ "population = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--datapoints--population_total--by--geo--time.csv')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "hX6abI-iZGLl"
+ },
+ "outputs": [],
+ "source": [
+ "entities = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--entities--geo--country.csv')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "AI-zcaDkZHXm"
+ },
+ "outputs": [],
+ "source": [
+ "concepts = pd.read_csv('https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/ddf--concepts.csv')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "EgFw-g0nZLJy"
+ },
+ "outputs": [],
+ "source": [
+ "income.shape, lifespan.shape, population.shape, entities.shape, concepts.shape"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "I-T62v7FZQu5"
+ },
+ "outputs": [],
+ "source": [
+ "income.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "2zIdtDESZYG5"
+ },
+ "outputs": [],
+ "source": [
+ "lifespan.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "58AXNVMKZj3T"
+ },
+ "outputs": [],
+ "source": [
+ "population.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "0ywWDL2MZqlF"
+ },
+ "outputs": [],
+ "source": [
+ "pd.options.display.max_columns = 500\n",
+ "entities.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "mk_R0eFZZ0G5"
+ },
+ "outputs": [],
+ "source": [
+ "concepts.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "6HYUytvLT8Kf"
+ },
+ "source": [
+ "## Merge data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "dhALZDsh9n9L"
+ },
+ "source": [
+ "[Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "A-tnI-hK6yDG"
+ },
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "4OdEr5IFVdF5"
+ },
+ "source": [
+ "## Explore data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "4IzXea0T64x4"
+ },
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "hecscpimY6Oz"
+ },
+ "source": [
+ "## Plot visualization"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "_o8RmX2M67ai"
+ },
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "8OFxenCdhocj"
+ },
+ "source": [
+ "## Analyze outliers"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "D59bn-7k6-Io"
+ },
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "DNTMMBkVhrGk"
+ },
+ "source": [
+ "## Plot multiple years"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "JkTUmYGF7BQt"
+ },
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "BB1Ki0v6hxCA"
+ },
+ "source": [
+ "## Point out a story"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 0,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "eSgZhD3v7HIe"
+ },
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# ASSIGNMENT\n",
+ "Replicate the lesson code.\n",
+ "\n",
+ "# STRETCH OPTIONS\n",
+ "\n",
+ "## 1. Animate!\n",
+ "- [Making animations work in Google Colaboratory](https://medium.com/lambda-school-machine-learning/making-animations-work-in-google-colaboratory-new-home-for-ml-prototyping-c6147186ae75)\n",
+ "- [How to Create Animated Graphs in Python](https://towardsdatascience.com/how-to-create-animated-graphs-in-python-bb619cc2dec1)\n",
+ "- [The Ultimate Day of Chicago Bikeshare](https://chrisluedtke.github.io/divvy-data.html) (Lambda School Data Science student)\n",
+ "\n",
+ "## 2. Work on anything related to your portfolio site / project"
+ ]
+ }
+ ],
+ "metadata": {
+ "colab": {
+ "collapsed_sections": [],
+ "name": "LS_DS_224_Sequence_your_narrative.ipynb",
+ "provenance": [],
+ "version": "0.3.2"
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.1"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}