|
47 | 47 | "cell_type": "markdown",
|
48 | 48 | "metadata": {},
|
49 | 49 | "source": [
|
50 |
| - "**Goal:** In this notebook, we review various ways to read (load) and write (save) data from NYC Open Data. Specifically, we will focus on reading our data into a pandas dataframe.\n", |
| 50 | + "**Goal:** In this notebook, we review various ways to read (load) and write (save) data from NYC Open Data. Specifically, we focus on reading our data into a pandas dataframe.\n", |
51 | 51 | "\n",
|
52 | 52 | "**Main Library:** [pandas](https://pandas.pydata.org/) is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language."
|
53 | 53 | ]
|
|
112 | 112 | }
|
113 | 113 | ],
|
114 | 114 | "source": [
|
| 115 | + "# watermark\n", |
115 | 116 | "%reload_ext watermark\n",
|
116 | 117 | "%watermark -v -p numpy,pandas,geopandas,matplotlib,json,requests,sodapy"
|
117 | 118 | ]
|
|
399 | 400 | "\n",
|
400 | 401 | "2) We use the pandas `.shape` attribute to print the dimensions of the dataframe (i.e., number of rows, number of columns).\n",
|
401 | 402 | "\n",
|
402 |
| - "We will use these two methods throughout the examples." |
| 403 | + "We will use these two methods extensively throughout the examples." |
403 | 404 | ]
|
404 | 405 | },
|
405 | 406 | {
|
|
833 | 834 | "metadata": {},
|
834 | 835 | "source": [
|
835 | 836 | "## 1.4 Reading in a Shapefile\n",
|
836 |
| - "Information about [Shapefiles](https://desktop.arcgis.com/en/arcmap/latest/manage-data/shapefiles/what-is-a-shapefile.htm#:~:text=A%20shapefile%20is%20a%20simple,%2C%20or%20polygons%20(areas).)." |
| 837 | + "Information about [Shapefiles](https://desktop.arcgis.com/en/arcmap/latest/manage-data/shapefiles/what-is-a-shapefile.htm#:~:text=A%20shapefile%20is%20a%20simple,%2C%20or%20polygons%20(areas).). Shapefiles are a popular geospatial file format." |
837 | 838 | ]
|
838 | 839 | },
|
839 | 840 | {
|
|
1310 | 1311 | "cell_type": "markdown",
|
1311 | 1312 | "metadata": {},
|
1312 | 1313 | "source": [
|
1313 |
| - "## 2.1 Unzipping and reading in data as CSV to local folder\n", |
| 1314 | + "## 2.1 Unzipping and Reading In Data as CSV to Local Folder\n", |
1314 | 1315 | "\n",
|
1315 | 1316 | "We will retrieve, unzip and read data in our downloads folder. Note: I'm using a Mac.\n",
|
1316 | 1317 | "\n",
|
|
1614 | 1615 | "cell_type": "markdown",
|
1615 | 1616 | "metadata": {},
|
1616 | 1617 | "source": [
|
1617 |
| - "## 2.2 Unzipping and reading in data as CSV from local folder" |
| 1618 | + "## 2.2 Unzipping and Reading In Data as CSV from Local Folder" |
1618 | 1619 | ]
|
1619 | 1620 | },
|
1620 | 1621 | {
|
|
1648 | 1649 | }
|
1649 | 1650 | ],
|
1650 | 1651 | "source": [
|
1651 |
| - "# list files in this file path\n", |
| 1652 | + "# list files in this folder path\n", |
1652 | 1653 | "%ls data/unzipped-data/"
|
1653 | 1654 | ]
|
1654 | 1655 | },
|
|
1862 | 1863 | "cell_type": "markdown",
|
1863 | 1864 | "metadata": {},
|
1864 | 1865 | "source": [
|
1865 |
| - "## 2.3 Unzipping and reading in data as a CSV in-memory" |
| 1866 | + "## 2.3 Unzipping and Reading In Data as a CSV In-Memory" |
1866 | 1867 | ]
|
1867 | 1868 | },
|
1868 | 1869 | {
|
|
2106 | 2107 | }
|
2107 | 2108 | ],
|
2108 | 2109 | "source": [
|
2109 |
| - "# read our csv data into a dataframe from our ZIP file\n", |
| 2110 | + "# read our CSV data into a dataframe from our ZIP file\n", |
2110 | 2111 | "file = zf.open('pluto_20v1.csv')\n",
|
2111 | 2112 | "pluto_data = pd.read_csv(file, low_memory=False)\n",
|
2112 | 2113 | "\n",
|
|
2136 | 2137 | "cell_type": "markdown",
|
2137 | 2138 | "metadata": {},
|
2138 | 2139 | "source": [
|
2139 |
| - "# 3. Reading in data from NYC Open Data" |
| 2140 | + "# 3. Reading In Data from NYC Open Data" |
2140 | 2141 | ]
|
2141 | 2142 | },
|
2142 | 2143 | {
|
2143 | 2144 | "cell_type": "markdown",
|
2144 | 2145 | "metadata": {},
|
2145 | 2146 | "source": [
|
2146 |
| - "## 3.1 Reading in data as CSV in static form" |
| 2147 | + "## 3.1 Reading In Data as CSV in Static Form" |
2147 | 2148 | ]
|
2148 | 2149 | },
|
2149 | 2150 | {
|
|
2365 | 2366 | "cell_type": "markdown",
|
2366 | 2367 | "metadata": {},
|
2367 | 2368 | "source": [
|
2368 |
| - "## 3.2 Reading in data as JSON in static form\n", |
2369 |
| - "Note: I do not read data on NYC Open Data this way, but reading JSON in static form does come up, especially when you're working with JSON data from the web. Here, I demonstrate a sample workflow.\n", |
| 2369 | + "## 3.2 Reading In Data as JSON in Static Form\n", |
| 2370 | + "Note: I do not read data on NYC Open Data this way, but reading JSON in static form does come up, especially when you're working with JSON data from the web. Understanding the structure of JSON data is a good skill. Here, I demonstrate a sample workflow.\n", |
2370 | 2371 | ""
|
2371 | 2372 | ]
|
2372 | 2373 | },
|
|
2636 | 2637 | "\n",
|
2637 | 2638 | "# loop through columns and save to list\n",
|
2638 | 2639 | "for col in cols:\n",
|
| 2640 | + " \n", |
| 2641 | + " # sanity check\n", |
2639 | 2642 | " print(col['fieldName'])\n",
|
2640 | 2643 | " \n",
|
2641 | 2644 | " # append column name to list\n",
|
|
3072 | 3075 | }
|
3073 | 3076 | ],
|
3074 | 3077 | "source": [
|
3075 |
| - "# sanity check\n", |
| 3078 | + "# column info sanity check\n", |
3076 | 3079 | "df_json.info()"
|
3077 | 3080 | ]
|
3078 | 3081 | },
|
|
3100 | 3103 | "cell_type": "markdown",
|
3101 | 3104 | "metadata": {},
|
3102 | 3105 | "source": [
|
3103 |
| - "## 3.3 Reading in Shapefile data \n", |
| 3106 | + "## 3.3 Reading In Shapefile Data \n", |
3104 | 3107 | ""
|
3105 | 3108 | ]
|
3106 | 3109 | },
|
|
3236 | 3239 | "cell_type": "code",
|
3237 | 3240 | "execution_count": 38,
|
3238 | 3241 | "metadata": {},
|
| 3242 | + "outputs": [ |
| 3243 | + { |
| 3244 | + "name": "stdout", |
| 3245 | + "output_type": "stream", |
| 3246 | + "text": [ |
| 3247 | + "<class 'geopandas.geodataframe.GeoDataFrame'>\n" |
| 3248 | + ] |
| 3249 | + } |
| 3250 | + ], |
| 3251 | + "source": [ |
| 3252 | + "# preview object type\n", |
| 3253 | + "print(type(gdf))" |
| 3254 | + ] |
| 3255 | + }, |
| 3256 | + { |
| 3257 | + "cell_type": "code", |
| 3258 | + "execution_count": 39, |
| 3259 | + "metadata": {}, |
3239 | 3260 | "outputs": [
|
3240 | 3261 | {
|
3241 | 3262 | "data": {
|
3242 | 3263 | "text/plain": [
|
3243 | 3264 | "<AxesSubplot: >"
|
3244 | 3265 | ]
|
3245 | 3266 | },
|
3246 |
| - "execution_count": 38, |
| 3267 | + "execution_count": 39, |
3247 | 3268 | "metadata": {},
|
3248 | 3269 | "output_type": "execute_result"
|
3249 | 3270 | },
|
|
3296 | 3317 | },
|
3297 | 3318 | {
|
3298 | 3319 | "cell_type": "code",
|
3299 |
| - "execution_count": 39, |
| 3320 | + "execution_count": 40, |
3300 | 3321 | "metadata": {},
|
3301 | 3322 | "outputs": [
|
3302 | 3323 | {
|
|
3311 | 3332 | "output_type": "stream",
|
3312 | 3333 | "text": [
|
3313 | 3334 | "client.__dict__:\n",
|
3314 |
| - "{'domain': 'data.cityofnewyork.us', 'session': <requests.sessions.Session object at 0x1675f24d0>, 'uri_prefix': 'https://', 'timeout': 10}\n" |
| 3335 | + "{'domain': 'data.cityofnewyork.us', 'session': <requests.sessions.Session object at 0x160c734d0>, 'uri_prefix': 'https://', 'timeout': 10}\n" |
3315 | 3336 | ]
|
3316 | 3337 | }
|
3317 | 3338 | ],
|
|
3331 | 3352 | },
|
3332 | 3353 | {
|
3333 | 3354 | "cell_type": "code",
|
3334 |
| - "execution_count": 40, |
| 3355 | + "execution_count": 41, |
3335 | 3356 | "metadata": {},
|
3336 | 3357 | "outputs": [
|
3337 | 3358 | {
|
|
3497 | 3518 | "4 {525F2C24-616B-4F29-98A3-8FEA5D4B1A7D} "
|
3498 | 3519 | ]
|
3499 | 3520 | },
|
3500 |
| - "execution_count": 40, |
| 3521 | + "execution_count": 41, |
3501 | 3522 | "metadata": {},
|
3502 | 3523 | "output_type": "execute_result"
|
3503 | 3524 | }
|
3504 | 3525 | ],
|
3505 | 3526 | "source": [
|
3506 |
| - "# we set the limit at 1,000 rows\n", |
3507 |
| - "# increase limit if you'd like, but there might be a limit if you haven't signed up for an app token\n", |
| 3527 | + "# we manually set the limit at 1,000 rows\n", |
| 3528 | + "# increase limit if you'd like, but there *might* be a limit if you haven't signed up for an app token\n", |
3508 | 3529 | "limit = 1_000\n",
|
3509 | 3530 | "\n",
|
3510 | 3531 | "# get data by passing dataset id and limit number\n",
|
|
3522 | 3543 | },
|
3523 | 3544 | {
|
3524 | 3545 | "cell_type": "code",
|
3525 |
| - "execution_count": 41, |
| 3546 | + "execution_count": 42, |
3526 | 3547 | "metadata": {},
|
3527 | 3548 | "outputs": [
|
3528 | 3549 | {
|
|
3542 | 3563 | "cell_type": "markdown",
|
3543 | 3564 | "metadata": {},
|
3544 | 3565 | "source": [
|
3545 |
| - "# 4. Writing out data\n", |
| 3566 | + "# 4. Writing Out Data\n", |
3546 | 3567 | "Save data to specified destination."
|
3547 | 3568 | ]
|
3548 | 3569 | },
|
3549 | 3570 | {
|
3550 | 3571 | "cell_type": "code",
|
3551 |
| - "execution_count": 42, |
| 3572 | + "execution_count": 43, |
3552 | 3573 | "metadata": {},
|
3553 | 3574 | "outputs": [
|
3554 | 3575 | {
|
|
3720 | 3741 | "4 Other (Man {BB58FD7B-CC22-4896-901D-F8BAFF4AC129} "
|
3721 | 3742 | ]
|
3722 | 3743 | },
|
3723 |
| - "execution_count": 42, |
| 3744 | + "execution_count": 43, |
3724 | 3745 | "metadata": {},
|
3725 | 3746 | "output_type": "execute_result"
|
3726 | 3747 | }
|
|
3738 | 3759 | "cell_type": "markdown",
|
3739 | 3760 | "metadata": {},
|
3740 | 3761 | "source": [
|
3741 |
| - "## 4.1 Writing to a CSV file" |
| 3762 | + "## 4.1 Writing to a CSV File" |
3742 | 3763 | ]
|
3743 | 3764 | },
|
3744 | 3765 | {
|
3745 | 3766 | "cell_type": "code",
|
3746 |
| - "execution_count": 43, |
| 3767 | + "execution_count": 44, |
3747 | 3768 | "metadata": {},
|
3748 | 3769 | "outputs": [
|
3749 | 3770 | {
|
3750 | 3771 | "name": "stdout",
|
3751 | 3772 | "output_type": "stream",
|
3752 | 3773 | "text": [
|
3753 |
| - "README.md sample-data.geojson\r\n", |
3754 |
| - "building-footprints-pluto.csv sample-data.gpkg\r\n", |
| 3774 | + "README.md sample-data.csv\r\n", |
| 3775 | + "building-footprints-pluto.csv sample-data.geojson\r\n", |
| 3776 | + "nta-shape.geojson sample-data.gpkg\r\n", |
3755 | 3777 | "output.csv sample-data.json\r\n",
|
3756 | 3778 | "output.json sample-data.xlsx\r\n",
|
3757 | 3779 | "output.xlsx \u001b[34mshapefile\u001b[m\u001b[m/\r\n",
|
3758 |
| - "sample-buildings.zip \u001b[34munzipped-data\u001b[m\u001b[m/\r\n", |
3759 |
| - "sample-data.csv\r\n" |
| 3780 | + "sample-buildings.zip \u001b[34munzipped-data\u001b[m\u001b[m/\r\n" |
3760 | 3781 | ]
|
3761 | 3782 | }
|
3762 | 3783 | ],
|
|
3773 | 3794 | "cell_type": "markdown",
|
3774 | 3795 | "metadata": {},
|
3775 | 3796 | "source": [
|
3776 |
| - "## 4.2 Writing to an Excel (xlsx) file" |
| 3797 | + "## 4.2 Writing to an Excel (xlsx) File" |
3777 | 3798 | ]
|
3778 | 3799 | },
|
3779 | 3800 | {
|
3780 | 3801 | "cell_type": "code",
|
3781 |
| - "execution_count": 44, |
| 3802 | + "execution_count": 45, |
3782 | 3803 | "metadata": {},
|
3783 | 3804 | "outputs": [
|
3784 | 3805 | {
|
3785 | 3806 | "name": "stdout",
|
3786 | 3807 | "output_type": "stream",
|
3787 | 3808 | "text": [
|
3788 |
| - "README.md sample-data.geojson\r\n", |
3789 |
| - "building-footprints-pluto.csv sample-data.gpkg\r\n", |
| 3809 | + "README.md sample-data.csv\r\n", |
| 3810 | + "building-footprints-pluto.csv sample-data.geojson\r\n", |
| 3811 | + "nta-shape.geojson sample-data.gpkg\r\n", |
3790 | 3812 | "output.csv sample-data.json\r\n",
|
3791 | 3813 | "output.json sample-data.xlsx\r\n",
|
3792 | 3814 | "output.xlsx \u001b[34mshapefile\u001b[m\u001b[m/\r\n",
|
3793 |
| - "sample-buildings.zip \u001b[34munzipped-data\u001b[m\u001b[m/\r\n", |
3794 |
| - "sample-data.csv\r\n" |
| 3815 | + "sample-buildings.zip \u001b[34munzipped-data\u001b[m\u001b[m/\r\n" |
3795 | 3816 | ]
|
3796 | 3817 | }
|
3797 | 3818 | ],
|
|
3813 | 3834 | },
|
3814 | 3835 | {
|
3815 | 3836 | "cell_type": "code",
|
3816 |
| - "execution_count": 45, |
| 3837 | + "execution_count": 46, |
3817 | 3838 | "metadata": {},
|
3818 | 3839 | "outputs": [
|
3819 | 3840 | {
|
3820 | 3841 | "name": "stdout",
|
3821 | 3842 | "output_type": "stream",
|
3822 | 3843 | "text": [
|
3823 |
| - "README.md sample-data.geojson\r\n", |
3824 |
| - "building-footprints-pluto.csv sample-data.gpkg\r\n", |
| 3844 | + "README.md sample-data.csv\r\n", |
| 3845 | + "building-footprints-pluto.csv sample-data.geojson\r\n", |
| 3846 | + "nta-shape.geojson sample-data.gpkg\r\n", |
3825 | 3847 | "output.csv sample-data.json\r\n",
|
3826 | 3848 | "output.json sample-data.xlsx\r\n",
|
3827 | 3849 | "output.xlsx \u001b[34mshapefile\u001b[m\u001b[m/\r\n",
|
3828 |
| - "sample-buildings.zip \u001b[34munzipped-data\u001b[m\u001b[m/\r\n", |
3829 |
| - "sample-data.csv\r\n" |
| 3850 | + "sample-buildings.zip \u001b[34munzipped-data\u001b[m\u001b[m/\r\n" |
3830 | 3851 | ]
|
3831 | 3852 | }
|
3832 | 3853 | ],
|
|
3848 | 3869 | },
|
3849 | 3870 | {
|
3850 | 3871 | "cell_type": "code",
|
3851 |
| - "execution_count": 46, |
| 3872 | + "execution_count": 47, |
3852 | 3873 | "metadata": {},
|
3853 | 3874 | "outputs": [
|
3854 | 3875 | {
|
|
4033 | 4054 | "4 POLYGON ((-73.84769 40.87912, -73.84784 40.879... "
|
4034 | 4055 | ]
|
4035 | 4056 | },
|
4036 |
| - "execution_count": 46, |
| 4057 | + "execution_count": 47, |
4037 | 4058 | "metadata": {},
|
4038 | 4059 | "output_type": "execute_result"
|
4039 | 4060 | }
|
|
4049 | 4070 | },
|
4050 | 4071 | {
|
4051 | 4072 | "cell_type": "code",
|
4052 |
| - "execution_count": 47, |
| 4073 | + "execution_count": 48, |
4053 | 4074 | "metadata": {},
|
4054 | 4075 | "outputs": [
|
4055 | 4076 | {
|
4056 | 4077 | "name": "stdout",
|
4057 | 4078 | "output_type": "stream",
|
4058 | 4079 | "text": [
|
4059 |
| - "output.cpg output.shp sample-data.dbf sample-data.shx\r\n", |
4060 |
| - "output.dbf output.shx sample-data.prj\r\n", |
4061 |
| - "output.prj sample-data.cpg sample-data.shp\r\n" |
| 4080 | + "nta-shape.cpg nta-shape.shx output.shp sample-data.prj\r\n", |
| 4081 | + "nta-shape.dbf output.cpg output.shx sample-data.shp\r\n", |
| 4082 | + "nta-shape.prj output.dbf sample-data.cpg sample-data.shx\r\n", |
| 4083 | + "nta-shape.shp output.prj sample-data.dbf\r\n" |
4062 | 4084 | ]
|
4063 | 4085 | }
|
4064 | 4086 | ],
|
|