-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathLab01RnDataExpRead.Rmd
204 lines (148 loc) · 14 KB
/
Lab01RnDataExpRead.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
---
title: "Lab01 Preparation. R Introduction, Data Exploration and Data Summaries"
author: "Emilio A. Laca"
date: "`r date()`"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction to R and RStudio
### Installing R and RStudio
R is the current state-of-the-art Open Source software for statistical analysis and reproducible research. Reproducible research requires that scientific articles be published with the data and the code so anyone can very the results. When working on your computer, you can download and install R from <a href="https://cran.r-project.org" target="new">here</a>.
We will use a friendly user interface to work with R: RStudio. When working on your computer, you can download and install RStudio <a href="https://www.rstudio.com/products/rstudio/download3/" target="new">here</a>. RStudio is actually an IDE (Interactive Development Environment) that is also free Open Source. Take a look at the intro video <a href="https://www.rstudio.com/products/rstudio/" target="new">here</a>.
Both installers have instructions to guide you. If you run into problems, bring your computer to office hours and we will help you.
R consists of a programming environment with a "line" interface that interprets and runs the code and produces results. Results are not sent to permanent files or output unless you specifically ask for it. This is a main point in the philosophy of R that differs dramatically from SAS. R force you to specify what you want to see or produce in specific.
We will use the "knitr" package to produce the lab reports. You have to install and the load knitr. Start RStudio and type *options(repos = c(CRAN = "http://cran.rstudio.com"))* <ENTER> to specify where your computer will find the packages. Then, type *install.packages(knitr)* <ENTER>. You can also click on the Packages tab, then click on Install at the upper left and type "knitr" in the space provided. Then, click on the Install button.
The reports are actually written in Markdown, so we need to install *rmarkdown*. As an exercise, repeat the procedure above to install this package. For an intro video about rmarkdown, go <a href="http://rmarkdown.rstudio.com/lesson-1.html" target="new">here</a>. The installation instructions, as well as more links for R Markdown are also found on that page.
Load the packages by using the library() function:
library(rmarkdown)
library(knitr)
**Do not worry about the details of markdown and knitr!** You will be given all code you need to use in the course, and most reports will be formatted for you so you can learn reproducible research, but you will not have to be an expert in programming or markdown!
###Exploring RStudio###
RStudio have 4 panes in the window. The arrangement of the panes and tabs can be customized in the Preferences of RStudio. These windows and many other features of RStudio are explained <a href="https://www.rstudio.com/resources/webinars/rstudio-essentials-webinar-series-part-1/" target="new">here</a>. Make sure to watch the video as part of your homework and send questions to your instructors.
Once you have installed the necessary packages, you can render this Rmd file and produce the html file that can be viewed in any browser. **For full credit you will submit both the Rmd and html files you complete.**
###R Basics###
For the basic features of R Programming we will use *"A (very) short introduction to R"* <a href="https://cran.r-project.org/doc/contrib/Torfs+Brauer-Short-R-Intro.pdf" target="new"> found here</a>.
In the lab we will create a lab report based on some sections of this document.
First, insert your flash drive and set the working directory to be in your drive. This way you will have all of your files with you in your drive. Files that you save to the lab computers are wiped every night. The following script chunk allows you to manually select your working directory by selecting a file within it.
```{r setworkd, echo=TRUE}
getwd() # First, find out what the current working directory is
# All file are saved to your working directory
# Note the use of the hash symbol to create comments within R scripts
# In order to set the working directory "by hand" you need to know the path used by the operating system for your drive. It is probably something like "E:/". Select a file within your flash drive when prompted.
# copy the rest of this line to the console and run it: setwd(dirname(file.choose()))
getwd() # Check that you actually set the desired working directory
```
##Data Exloration and Summary##
###Create a dataframe and table###
To create a data frame we use the data.frame() function. In this example we create a data frame named "mydata." This data frame has 2 columns (Y1 and Y2) and 3 rows.
```{r createdf, echo=TRUE}
mydata <- data.frame(Y1 = c("treatment 1", "treatment 2", "treatment 3"), Y2 = c(35, 23, 30)) # Create data frame with 2 columns and 3 rows
mydata # see the contents of the data frame
str(mydata) # see the structure of the data frame
class(mydata) # see the class of the data frame
library(pander) # we need to install pander first
pander(mydata, caption = "Table showing my data.")
```
###Summary Statistics###
Summary statistics are classified as measures of central tendency and measures of dispersion. Mean, and median are measures of central tendency. Variance, standard deviation, range minimum and maximum are measures of dispersion. For positive random variables, the coefficient of variation is a measure of variation relative to the mean.
We use the *clover.txt* data set provided in the first assignment. These data represent the mass of clover plants grown for different periods at three different temperatures. Temperatures are in the first column coded as 1 for 5-15 C, 2 for 10-20 C and 3 for 15-25 C. The second column contains the number of days of growth and the last column contains the log of the plant mass in g. Column names are in the first row of the file, so we specify *header = TRUE* in the line of code to read the data. Data are placed in a data frame named "clover.
```{r summarystats1, echo = TRUE}
clover <- read.csv("Datasets/Lab01clover.txt", header = TRUE) # read in data.
help(clover) # Read about the nature of the data set.
(avg.lnwt <- mean(clover$lnwt)) # obtain the average lnwtance; note the use of $ to select parts of an object and the outer parentheses to display the result.
(med.lnwt <- median(clover$lnwt)) # median or 50th percentile
quantile(clover$lnwt, 0.5) # 50th percentile is the same as the median
(var.lnwt <- var(clover$lnwt)) # variance
(std.lnwt <- sd(clover$lnwt)) # standard deviation
sum(clover$lnwt)/length(clover$lnwt) # manual calculation of average
sum( (clover$lnwt - avg.lnwt) ^ 2 ) / (length(clover$lnwt) - 1)
(rng.lnwt <- range(clover$lnwt)) # output is a vector of lenght 2
min(clover$lnwt) # minimum braking lnwtance
max(clover$lnwt) # maximum braking lnwtance
(cv.lnwt <- std.lnwt / avg.lnwt) # coefficient of variation
```
The **coefficient of variation** is the standard deviation in units of the average. A value greater than 1 is high and indicates that there is a large amount of variation unexplained. Values below 0.30 are very low and indicate that there is little unexplained variation. For laboratory or controlled field experiments the CV should be below 0.5, whereas for observational studies an acceptable CV can be up to 1.0.
### Frequency Table and Histogram
We can create a frequency table by dividing the range of lnwt into bins according to a general guideline: number of bins = 1 + 3.3 log(sample size). For this we use the cut() function. Then we create the frequency table using the xtabs() function and plot the histogram. All of this is done automatically by R using the hist() function. We will worry about making the graphs look pretty later on.
```{r freqhist, echo = TRUE, include = TRUE}
(sample.size <- length(clover$lnwt))
(nbins <- max(8, (1 + log(sample.size, 2)))) # we round down for bins
clover$bin <- cut(clover$lnwt, breaks = nbins) # create column with bin
(freq.table <- as.data.frame(xtabs( ~clover$bin))) # creates frequency table
pander(freq.table) # format the table a little better
plot(freq.table, xlim = c()) # make histogram "by hand"
hist(clover$lnwt) # by default it uses Sturges rule for bins.
```
Note that by adding parenthesis outside the whole expression we make the new object and display it at the same time.The function as.data.frame transforms the output of xtabs() into a table that is easy to understand.
The histogram can be used to visually assess the type of distribution the data may exhibit.
Data can be easily summarized by using the summary() function of R. This function will produce results that take into account the type of variable in the object it operates on.
```{r summaryfunc, echo = TRUE, include = TRUE}
summary(clover$lnwt) # summary of a numeric variable or vector
class(clover$lnwt) # a numeriv vector contains real numbers
summary(clover$bin) # this is a charcater variable coded as a factor
class(clover$bin)
summary(clover) # gives the summary for all variables
# What kind of result or object does summary() produce?
class(summary(clover)) # This is functional programming; nested functions
mode(summary(clover))
str(summary(clover)) # See the inside of an object!!
summary(clover)[1,1] # get the first element of the table
summary(clover)[1,] # get the first row of the summary table
summary(clover)[,1] # get the first column of the summary table
```
The function str() is really important to see what objects are composed of and to use parts of those objects. Any part of an object can be referenced and modified using its "address."
### Box-and-Whisker Plot and 5-Number Summary
Data can be summarized using a boxplot and or the 5-number summary. These contain information about the average, median, first quartile, third quartile, range and outliers. The *range* argument determines how far the plot whiskers extend out from the box. If range is positive, the whiskers extend to the most extreme data point which is no more than *range* times the interquartile range (IQR) from the box. Data points that fall outside the whiskers are plotted and can be interpreted as "outliers" or values more extreme than expected. Outliers are so only on the basis of some preconceived expectation, usually a more or less normal distribution. A range value of zero causes the whiskers to extend to the data extremes, so no outliers are shown.
```{r boxplot1, echo = TRUE, include = TRUE}
myrange.factor = 1.5 # save number in a named object for later use
boxplot(clover$lnwt, range = myrange.factor)
text(x = 0.70, y = median(clover$lnwt), label = "Median")
text(x = 0.65, y = quantile(clover$lnwt, 0.75), label = "3rd quartile or 75% quantile")
text(x = 1.30, y = quantile(clover$lnwt, 0.75), label = "Upper Hinge")
text(x = 0.65, y = quantile(clover$lnwt, 0.25), label = "1st quartile or 25% quantile")
text(x = 1.30, y = quantile(clover$lnwt, 0.25), label = "Lower Hinge")
# To get the location of the top "fence" we need to identify the datapoint that is closest to but no greater than the upper hinge plus 1.5 * interquartile range.
(iqrange <- IQR(clover$lnwt)) # Calculate the interquartile range
(uplimit <- quantile(clover$lnwt, 0.75) + myrange.factor * iqrange)
(rows.where.lt.uplimit <- which(clover$lnwt <= uplimit)) # gives row numbers of those that are less than uplimit
clover$lnwt[rows.where.lt.uplimit] # gives lnwt in those rows
(up.fence <- max(clover$lnwt[rows.where.lt.uplimit])) # gives the value we seek
(outl <- clover$lnwt[clover$lnwt > up.fence]) # No outliers!
text(x = 0.65, y = up.fence, label = "Upper Fence")
help(boxplot) # get more information about the boxplot.
```
If we choose a smaller value for the *range* argument of boxplot we can make some values appear like outliers. Note that 1.5 is a commonly accepted value for range. Just to see how the plots behave with a different range, we copy the code above and re-run it after changing the range value.
```{r boxplot2, echo = TRUE, include = TRUE}
myrange.factor = 0.5
boxplot(clover$lnwt, range = myrange.factor)
text(x = 0.70, y = median(clover$lnwt), label = "Median")
text(x = 0.65, y = quantile(clover$lnwt, 0.75), label = "3rd quartile or 75% quantile")
text(x = 1.30, y = quantile(clover$lnwt, 0.75), label = "Upper Hinge")
text(x = 0.65, y = quantile(clover$lnwt, 0.25), label = "1st quartile or 25% quantile")
text(x = 1.30, y = quantile(clover$lnwt, 0.25), label = "Lower Hinge")
# To get the location of the top "fence" we need to identify the datapoint that is closest to but no greater than the upper hinge plus 1.5 * interquartile range.
(iqrange <- IQR(clover$lnwt)) # Calculate the interquartile range
(uplimit <- quantile(clover$lnwt, 0.75) + myrange.factor * iqrange)
(rows.where.lt.uplimit <- which(clover$lnwt <= uplimit)) # gives row numbers of those that are less than uplimit
clover$lnwt[rows.where.lt.uplimit] # gives lnwt in those rows
(up.fence <- max(clover$lnwt[rows.where.lt.uplimit])) # gives the value we seek
(outl <- clover$lnwt[clover$lnwt > up.fence]) # We have outliers!
text(x = 0.65, y = up.fence, label = "Upper Fence")
```
The values for the five-number summary can be obtained directly with the fivenum() function:
```{r fiven, echo = TRUE, include = TRUE}
pander(cbind(c("min", "lower hinge", "median", "upper hinge", "max"), fivenum(clover$lnwt)))
```
### Analyses by Groups
The clover data has groups defined by the temperature treatments. Because the group numbers are just lables, we need to transform them into what is called "factor" class in R. This way, we can apply functions that obtain summaries and figures for each group. In the case of the clover, we expect that plants to grow faster at higher temperature; let's see what the data say.
```{r grouped, echo=TRUE, message=FALSE, warning=FALSE, include=TRUE}
clover$temp <- factor(clover$temp)
summary(clover)
str(clover)
boxplot(lnwt ~ temp, data = clover, notch = TRUE) # boxplots by treatment
help(by) # by produces a list with an element for each value of the grouping variable.
pander(sumry.by.temp <- by(data = clover$lnwt, INDICES = clover$temp, FUN = summary))
(fiven.by.temp <- by(data = clover$lnwt, INDICES = clover$temp, FUN = fivenum))
```