---
title: "Introduction to Data Science (EYH)"
author:
- "@@@"
- "@@@"
date: "March 4, 2017"
output: html_document
---
# Basic Computations in R
We start by demonstrating how to do some basic computations in R.
## Example 1 (basic arithmetic)
You may notice that the "code chunk" below looks a little different - it has a gray background and begins and ends with 3 copies of the ` symbol. These symbols help to differentiate computer code from normal text.
We can type code in this part of the file and have the computer execute it. Begin by typing "1 + 2" into the code chunk. Then click on the green triangle to run the code and see the output.
```{r}
@@@
```
## Example 2 (variable assignment)
If we were baking a pie, we would usually prepare the crust and filling separately and then combine them together. We can do the same for a complex set of instructions for the computer, by saving the results and combining them later.
We can use the "<-" symbol to assign our results to **variables** that we can use later. The first line of the following code chunk saves the output of "10 * 10" into a variable named "a". Later, when we want to use the result, we only need to type "a" instead of the whole calculation.
Try saving the output of "3 ^ 5" into a variable named "b".
```{r}
a <- 10 * 10
@@@
```
After you run the code, do you notice the variables "a" and "b" appearing in your environment? Try typing "a" and "b" in the console.
Now we can use the variables "a" and "b" for further calculations:
```{r}
a - b
a * a + b
sqrt((a + b) / 7) # pseudo Fermat-triple: 10^2 + 3^5 = 7^3
```
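The pie analogy from Example 2 can also be written out as a small sketch. The variable names and the minute values below are made up for illustration; they are not part of the workshop data.

```{r}
# Hypothetical preparation times, in minutes (made-up values for illustration)
crust_minutes <- 15
filling_minutes <- 25

# Combine the two saved results into a new variable
total_minutes <- crust_minutes + filling_minutes
total_minutes
```

Each intermediate result is stored once and reused by name, just as "a" and "b" were reused above.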
# Loading and Viewing Data
## Read in raw csv data (Carbon Dioxide Measurements)
We can load large amounts of data into R by reading files directly.
First, we read in our file, which contains monthly measurements of carbon dioxide (CO2) from Mauna Loa, Hawaii, and global temperature anomalies computed by the NASA Goddard Institute for Space Studies.
```{r}
co2_temperature <- read.csv("co2_temperature.csv")
```
This creates a new object containing the data read in from the file.
We can look at the entire dataset by simply evaluating the object as its own line of code:
```{r}
co2_temperature
```
Or portions of the dataset, such as the first 5 rows (but all 8 columns):
```{r}
co2_temperature[1:5, ]
```
Or the first 2 columns (but all 707 rows):
```{r}
co2_temperature[, 1:2]
```
Or only the first 5 rows and first 2 columns:
```{r}
co2_temperature[1:5, 1:2]
```
There are also convenient functions for viewing datasets:
```{r}
summary(co2_temperature)
```
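A few other base R functions are often handy for a first look at a dataset. The sketch below uses a tiny made-up data frame named `toy` (an assumption for illustration, not the workshop data) so that it runs on its own.

```{r}
# A tiny made-up data frame (hypothetical values, for illustration only)
toy <- data.frame(
  year = c(2001, 2002, 2003, 2004),
  value = c(1.5, 2.0, 2.5, 3.0)
)

head(toy, 2)   # first 2 rows
dim(toy)       # number of rows, then number of columns
names(toy)     # column names
str(toy)       # type and first few values of each column
```

The same functions should work on `co2_temperature` once it has been read in.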
# Make Plots
We can explore patterns in the data by producing graphs, which let us visualize information that may otherwise be too big or complex to see.
To do this, we are going to load the `ggplot2` package.
```{r}
library(ggplot2)
```
## Basic lineplot (Carbon Dioxide)
First, let's make a line plot, where the data points are joined together in order. Let's try this with the `co2_temperature` dataset.
Here, the first variable to be replaced will be the name of the dataset. The x-axis variable has already been selected as `decimal_date`, which spaces out the months of the year accordingly, and the y-axis variable is `co2_interpolated`, which is the average monthly carbon dioxide measurement, with missing values filled in through estimation.
```{r}
ggplot(@@@,
       aes(x = decimal_date, y = co2_interpolated)) +
  geom_line() +
  theme_bw()
```
Now try picking your own y-axis variable - select any of the columns of the dataset to try:
```{r}
ggplot(@@@,
       aes(x = decimal_date, y = @@@)) +
  geom_line() +
  theme_bw()
```
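For reference, here is the same plotting pattern on a small made-up data frame, so the role of each piece is visible without the CO2 file. The names `toy_series`, `time`, and `reading` are hypothetical and chosen only for this sketch.

```{r}
library(ggplot2)

# Made-up data for illustration only (not the CO2 measurements)
toy_series <- data.frame(
  time = 1:6,
  reading = c(2.0, 2.4, 2.1, 2.9, 3.3, 3.0)
)

# Dataset first, then aes() maps columns to the x- and y-axes
toy_plot <- ggplot(toy_series,
                   aes(x = time, y = reading)) +
  geom_line() +
  theme_bw()
toy_plot
```

The column names inside `aes()` are written without quotes and must match columns of the dataset given as the first input.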
## Load data from package (Gapminder)
Some datasets are also part of packages, which are extensions of R that can be installed and then loaded.
Here, we load the `gapminder` package, which contains information about life expectancy, population size, and gross domestic product (GDP) for 142 countries from 1952 to 2007:
```{r}
library(gapminder)
```
Try viewing this dataset:
```{r}
summary(@@@)
```
## Basic scatterplot (Gapminder)
First, we want to see if there is a relationship between GDP and life expectancy - whether people in richer countries live longer. Since the `gapminder` data contains measurements from multiple years, let's first restrict it to data from 2007.
```{r}
gapminder_2007 <- gapminder[gapminder$year == @@@,]
```
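This kind of row filtering works in two stages: the comparison produces one TRUE/FALSE value per row, and the square brackets keep the rows marked TRUE. A minimal sketch on a made-up data frame (the names `toy_years` and `toy_2002` are hypothetical):

```{r}
# Made-up data for illustration only
toy_years <- data.frame(
  country = c("A", "B", "A", "B"),
  year = c(2002, 2002, 2007, 2007),
  score = c(10, 20, 30, 40)
)

# Step 1: the comparison gives one TRUE/FALSE per row
toy_years$year == 2002

# Step 2: keep only the TRUE rows; the blank after the comma keeps all columns
toy_2002 <- toy_years[toy_years$year == 2002, ]
toy_2002
```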
Now make a plot with the average income (variable `gdpPercap`) on the x-axis, and life expectancy (variable `lifeExp`) on the y-axis.
Recall that the first variable for `ggplot` will be the name of the dataset:
```{r}
ggplot(@@@,
       aes(x = @@@, y = @@@)) +
  geom_point() +
  theme_bw()
```
Try changing one of the axis variables to be "pop" (population). Is there still a clear pattern?
```{r}
ggplot(@@@,
       aes(x = @@@, y = @@@)) +
  geom_point() +
  theme_bw()
```
## Basic barplot (Thanksgiving pies)
The `thanksgiving-2015-pie-data.csv` file contains survey results about the kinds of pie that people have at Thanksgiving.
Try reading this into a new variable.
```{r}
@@@ <- read.csv("thanksgiving-2015-pie-data.csv")
```
Let's use the `summary` function to see the contents of the dataset:
```{r}
summary(@@@)
```
To create a barplot, we're going to use a new geom, and this time let the fill color change according to the column that records the kind of pie.
As before, the first input to `ggplot` will be the name of the dataset,
and the `fill` variable will be the column of the dataset with the data we want to plot:
```{r}
ggplot(@@@,
       aes(x = "", fill = @@@)) +
  geom_bar() +
  theme_bw()
```
If we want to turn this into a "pie" chart, we can use the same code as before, but add one more line to change the coordinate system.
```{r}
ggplot(@@@,
       aes(x = "", fill = @@@)) +
  geom_bar() +
  theme_bw() +
  coord_polar("y")
```
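By default, `geom_bar()` counts how many rows fall into each category, so the size of each colored segment reflects the number of survey responses for that kind of pie. The same counts can be computed directly with `table()`. The sketch below uses a made-up vector of answers (`toy_answers`, a hypothetical name) rather than the survey file:

```{r}
# Made-up survey answers for illustration only
toy_answers <- c("Apple", "Pumpkin", "Apple", "Pecan", "Pumpkin", "Pumpkin")

# Count how often each answer occurs
toy_counts <- table(toy_answers)
toy_counts
```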
## Challenge Tasks
If we want to change the default colors, we can do that by adding additional lines to define the color to use for the individual kinds of pie. Note that we can use `colors()` to see the list of available colors in R:
```{r}
colors()
```
Custom-colored pie chart of pie types:
```{r}
ggplot(@@@,
       aes(x = "", fill = @@@)) +
  geom_bar() +
  scale_fill_manual(values = c("Apple" = "@@@",
                               "Buttermilk" = "@@@",
                               "Cherry" = "@@@",
                               "Chocolate" = "@@@",
                               "Coconut cream" = "@@@",
                               "Key lime" = "@@@",
                               "None" = "@@@",
                               "Peach" = "@@@",
                               "Pecan" = "@@@",
                               "Pumpkin" = "@@@",
                               "Sweet Potato" = "@@@")) +
  theme_bw() +
  coord_polar("y")
```
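As a filled-in illustration of the same pattern, here is a sketch on made-up data with only three categories. The data frame `toy_pies` and the color choices are assumptions for illustration. Each name on the left of `=` should match a value in the `fill` column, and each color used here is one of the names listed by `colors()`.

```{r}
library(ggplot2)

# Made-up data for illustration only (not the survey results)
toy_pies <- data.frame(
  pie = c("Apple", "Apple", "Pecan", "Pumpkin", "Pumpkin", "Pumpkin")
)

# Named vector: category on the left, color name on the right
toy_colors <- c("Apple" = "tomato", "Pecan" = "gold", "Pumpkin" = "skyblue")

ggplot(toy_pies,
       aes(x = "", fill = pie)) +
  geom_bar() +
  scale_fill_manual(values = toy_colors) +
  theme_bw() +
  coord_polar("y")
```

Storing the colors in a named vector first keeps the plotting code short and makes it easy to change one color at a time.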
# Generate Report
The specific kind of file we used is known as R Markdown, which allows us to create reports easily. You can try this now by clicking on the "Knit" button above.