-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy path02.0_Rcomputation.Rmd
962 lines (515 loc) · 45.5 KB
/
02.0_Rcomputation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
---
output:
html_document:
fig_caption: yes
number_sections: yes
theme: readable
toc: yes
toc_depth: 3
---
# Statistics with R {#chStatsR}
## Learning Objectives for Chapter
1. Install R and RStudio.
1. Find information about objects and functions in R.
1. Create and use R scripts and projects in RStudio.
1. Read data into R from csv files and line input.
1. Data curation: validate and properly organize and manipulate data.
1. Perform simple assignments and calculations.
1. Use vectorized operations.
1. Create, study and use data frames.
1. Create simple plots and export them.
1. Customize global and project options.
1. Identify, list and describe function of each pane.
1. Describe the efects of the parameters of the Uniform distribution
1. Calculate the expected value of a random variable with Uniform distribution.
1. Explain what the pdf of the Uniform distribution shows (y axis, X axis)
1. Draw a sketch of the sum of 2 Uniform distributions.
1. Give an example of a random variable that might be modelled using the Uniform distribution.
### Background on R and RStudio
W. N. Venables, D. M. Smith and the R Core Team [@R-base] state: "R is an integrated suite of software facilities for data manipulation, calculation and graphical display." They specifically leave out the word "statistics" to emphasize that R is actually an "environment" or fully planned coherent system for programming. RStudio furthers this idea and provides an IDE or integrated development environment for programming in R and other languages. In fact, this book and all the figures and interactive examples and simulations were completely programmed, compiled and deployed using R in the RStudio IDE. The only way to get a good idea of what R and Rstudio can do is to use them with a curious and exploratory approach.
For this class, we will working with RStudio, however both R and RStudio programs needs to be installed to work in RStudio. R contains the software that performs the calculations using R-programming language, whereas RStudio provides a more user-friendly interface.
R is the current state-of-the-art Open Source software for statistical analysis and reproducible research. Reproducible research requires that scientific articles be published with the data and the code so anyone can very the results. When working on your computer, you can download and install R from <a href="https://cran.r-project.org" target="new">here</a>.
We will use a friendly user interface to work with R: RStudio. When working on your computer, you can download and install RStudio <a href="https://www.rstudio.com/products/rstudio/download3/" target="new">here</a>. RStudio is actually an IDE (Interactive Development Environment) that is also free Open Source. Take a look at the intro video <a href="https://www.rstudio.com/products/rstudio/" target="new">here</a>.
Both installers have instructions to guide you. If you run into problems, bring your computer to office hours and we will help you.
R consists of a programming environment with a "line" interface that interprets and runs the code and produces results. Results are not sent to permanent files or output unless you specifically ask for it. This is a main point in the philosophy of R that differs dramatically from SAS. R force you to specify what you want to see or produce in specific.
We will use the "knitr" package to produce the lab reports. You have to install and the load knitr. Start RStudio and type *options(repos = c(CRAN = "http://cran.rstudio.com"))* <ENTER> to specify where your computer will find the packages. Then, type *install.packages(knitr)* <ENTER>. You can also click on the Packages tab, then click on Install at the upper left and type "knitr" in the space provided. Then, click on the Install button.
The reports are actually written in Markdown, so we need to install *rmarkdown*. As an exercise, repeat the procedure above to install this package. For an intro video about rmarkdown, go <a href="http://rmarkdown.rstudio.com/lesson-1.html" target="new">here</a>. The installation instructions, as well as more links for R Markdown are also found on that page.
### R Basics
**Do not worry about the details of markdown and knitr!** You will be given all code you need to use in the course, and most reports will be formatted for you so you can learn reproducible research, but you will not have to be an expert in programming or markdown!
For the basic features of R Programming we will use *"A (very) short introduction to R"* <a href="https://cran.r-project.org/doc/contrib/Torfs+Brauer-Short-R-Intro.pdf" target="new"> found here</a>.
In the lab we will create a lab report based on some sections of this document.
### Exploring RStudio
RStudio have 4 panes in the window. These windows and many other features of RStudio are explained in the video below. Make sure to watch the video as part of your homework and send questions to your instructors.
<br>
<a href="https://resources.rstudio.com/wistia-rstudio-essentials-2/rstudioessentialsprogrammingpart1-2" target="new"><img src="images/RStudioVideo01.png" width="640" height="362" title="RStudio Video1" alt="RStudio Video"></a>
<br>
When RStudio is opened, four windows that appear :
- the Terminal or Console window, which is where R computes the command line code you enter,
- the Source or Markdown File window, which is where you can edit and save lines of code and send code to the console to calculate,
- the Environment and History window, which is where working datasets and objects are listed, and
- the Files/Plots/Packages/Help window, which is where you can see what other files are in your working directory, view plots, install and load packages and find help on functions.
You can customize the windows and adjust the size of each of these windows by clicking and dragging the borders between them, and minimize/maximize them depending on your preference.
<br>
```{r WindowPanes, echo=FALSE, out.width="80%", fig.pos="center"}
knitr::include_graphics("images/CH2Windows.png")
```
<br>
### Terminal or Console Window
The Terminal or Console window is where the R program computes command line code. R is ready to receive a command when there is a
">" on the left of each line in the Terminal is called the cursor. It is important to note that command line code is case sensitive, so be careful to include capital and lower case letters when appropriate. Next to the cursor is where you type a complete line of code and press "enter" to be prompted with the computed result. Some of the code is straight forward using universal symbols such as "+", "-", "/" and "*" for basic arithmetic that is similar to commands in excel.
Take some time to practice simple arithmetic commands in the Console window. After inputting the code press "Enter" to perform the calculation.
```{r Calculations, echo=FALSE}
knitr::include_graphics("images/CH2Console.png")
```
When R returns the calculation it will include a "[1]" when there is only a single line to the calculation. For other R calculations, the results may be listed across multiple lines where each line will be numbered within the brackets. You can also build upon this code by including parentheses "()" and powers "^" similar to code from excel.
In R studio you can easily add balanced parenthesis ( ), brackets [ ], braces { }, single ' ' and double " " quotation marks by selecting the text to be enclosed and then typing the beginning character (, [, {, ', or ". RStudio will insert both the beginning and end parenthesis etc. making them "balanced."
```{r Calculations2, echo=FALSE}
knitr::include_graphics("images/CH2Console2.png")
```
If you enter a command and the code is not complete, the cursor changes to "+" and R is waiting for you to finish the command. A common example of this is when your code includes multiple parentheses and an open or closed end of the parentheses is missing.
```{r IncompleteCode, echo=FALSE}
knitr::include_graphics("images/CH2IncompleteError.png")
```
If parts of the code is missing or incorrectly entered, you will recieve an **error** message with some detail of the error. Using the "up arrow" with the cursor in the Console allows you to scroll through previous commands; this is helpful when making corrections to a previous command that did not compute, especially if the code is complex.
```{r ErrorMessage, echo=FALSE}
knitr::include_graphics("images/CH2Error.png")
```
To clear the history in the Console, you can click on the broom button or press "Ctrl" + "L".
### Source (Markdown) Window
The Markdown window is similar to a notepad; this is where you can input and save command line code, include instructions and attach notes. All of the computer labs for this class will have you working with instructions provided in R Markdown, saved in .rmd files. These files will contain written instructions and then "chunks" within the R Markdown file where you can input and send command line code to the Terminal window for computation. The "chunks" will appear as gray lines that always start with three ticks and a lowercase r in brackets and end with another three ticks. See the example below of an empty chunk.
Code entered in these "chunks" can be sent to the Console for computation several ways. The fastest method to compute a single line of code is by having the cursor on the same line as the code and pressing "Ctrl" + "Enter" (for Windows) or "command" + "return" (for Macs). To run multiple lines of code within a single chunk, you can press "Ctrl" + "Alt" + "C" (for Windows) or "command" + "option" + "C" (for Macs). You can also go to the top right bar of the Markdown window and click on "Run" which will give you multiple options to run a single line, an entire chunk, all of the code in a document, etc. The third option is to click on the green arrow at the top right corner of the chunk, which will run the entire chunk of code.
```{r RunningChunk, echo=FALSE}
knitr::include_graphics("images/CH2RunChunk.png")
```
Code that is entered in white space is not computable; code can only be sent to the Console for computation when it is in the gray space known as a chunk. You can also include non-computable notes within a chunk when it is prefaced with a "#". Entering these notes can be useful as code gains complexity.
```{r BasicFunctions, echo=FALSE}
#two ways to calculate the sum of 1 and 5
1 + 5
sum(c(1,5))
#sum() is the function to calculate the sum of the two numbers.
#c() is the function used to combine numbers into a list/vector
#two ways to calculate the mean of 8, 4, and 6
(8 + 4 + 6) / 3
mean(c(8,4,6))
#mean() is the function used for calculating the mean.
```
### Saving Files
The benefit of using RMarkdown is the ability to save instructions, codes and notes. There are two ways that you can save the file (1) as the working .Rmd file, which will allow the Markdown file to be accessed for future edits and (2) as a .html file that provides a finished easy-to-read file that can be opened up in a browser complete with all plots. For each lab, you will be submitting both the .Rmd and .html files.
Saving as a .Rmd file can be done by either clicking on the floppy disk button on the top of the Markdown pane, by pressing "Ctrl" + "S" or going to the upper left File button in the RStudio pane.
```{r RmdFile, echo=FALSE}
knitr::include_graphics("images/CH2RmdSave.png")
```
Saving as a .html file can be done by clicking on the Knit icon that has the ball of yarn at the top of the Markdown pane.
```{r html File}
knitr::include_graphics("images/CH2htmlSave.png")
```
### Files, Plots, Packages, Help Window
### Files
This pane will includes all of the files that are currently available in your working directory. Here you can read and save files and set your working directory. Your working directory is the default location R will use as the file path for importing and saving files on the computer. If there is a particular location or folder on your computer that is where you plan to import and save files from R, you will need to set this file path to your working directory in R. This simplifies the need to re-write the entire file path everytime you plan to import or save data.
In the Files tab, you can click on "More" and set your working directory to the desired location on your computer.
```{r WorkingDirectory}
knitr::include_graphics("images/CH2WD.png")
```
You can also use the function getwd() to check the location of your current working directory and setwd() to set your working directory, where you need to enter the designated path file within the function for the latter. In order to set the working directory "by hand" you need to know the path used by the operating system for your drive. It is probably something like "E:/". Select a file within your flash drive when prompted.
setwd(dirname(file.choose()))
After you set your working directory, verify that it is set at the appropriate location with getwd().
### Plots
This pane will allow you to see each of the plot images that you have created individually when the code is entered in the Console. Within this tab you have the ability to open up the plot in a separate window, to copy the plot to your clipboard, and to export the plot as a PDF or JPEG image.
```{r Plot}
knitr::include_graphics("images/CH2Plot.png")
```
### Packages
The Packages pane is where we can install and access "packages" in our library for different functions. At the beginning of each of our computer labs, there will be a list of packages that we will be using for different functions throughout each lab. We load each of these packages with the library() function. However the first time you use a package, we need to install the package on the local computer. This step requires internet access to install a package on your computer, but once installed it will be saved in your library. Click on the "Install" button and then enter the name of the package you wish to install on your computer. Then once installed we need to load the package into our current working session by using the library() function or by manually checking the box next to the function. Using the library() function will automatically check this box in the Packages pane
```{r Packages}
library(car)
knitr::include_graphics("images/CH2Packages.png")
```
### Help
RStudio has a pane specifically designed to find information about functions, objects, data and many other things in R. The help pane has an entry area for keywords at the top right. There is a basic toolbar that can be used to navigate pages as in a browser. You can also use help() with the name of a function within it to prompt this pane.
There is a large community of R users all over the world. This community provides lots of help and support through multiple websites. Whenever you are stuck and cannot find the information you need in a form that you understand using R help, you should use a seach engine, typically Google, and pose your question directly as a web search. This works remarkably well if you use the correct keyworkds. Always end your query with "in R" so the search is much more focused.
```{r Help}
help(help) #allows you to access more information on the help() function within the Help pane
knitr::include_graphics("images/CH2Help.png")
```
### Environment and History Window
The Environment and History window shows you data and objects that are loaded into R. The History tab will provide you a list of all code that has been entered into the Console during your active working session. The Environment will list data imported from external files (.xls, .txt, .csv) and objects that are defined within R. These objects include vectors and data frames.
### Vectors and Data Frames
```{r VectorsAndDataFrames}
knitr::include_graphics("images/CH2VectorDataFrame.png")
```
A vector is a basic data structure in R, which is a string of values of the same data type. For example, a vector will consist of a string of numbers (1, 2, 3) or a string of characters ("one", "two", "three"), but not both. You can use "<-" to designate a name to a given object, and it will then appear in the Environment window. Vectors only have a single dimension, their length, which is what distinguishes them from data frames.
```{r Vectors}
myvector <- c(1,2,3)
myvector
myvector2 <- c("one","two","three")
myvector2
help(c) #help() allows you to search for a description of a given function
#c() is a function that combines values to form a vector
```
One of the most common way to have data in R is using objects called data frames. These are table-like structures with rows and columns like a spreadsheet. Data frames are two-dimensional data sets that consist of a width and height, or more accurately the length of their columns (or number of rows) and the number of columns (or length of their rows). Data frames can also be thought of as multiple vectors (columns) of the same length that are bound together. Each column can have only one type of data, but different columns can have different data types. Data and objects in general in R can be of various types, called "modes." Whenever we enter data into R and do not specify the type, R will assign it the data type that does not lose any information. For example, if we enter A, 1, 3, 4, t, i, 7.5, into a column of a data frame, R will make a column of type character, because that is the only mode that can contain letters and numbers, but the numbers are stored as characters. The mode of objects can be changed, with restrictions.
Another data structure is a matrix, which is a two-dimensional table. Matrices may look like data frames in their structure, but they are different in that it can only contain one type of data, i.e. a table of numeric values only. Due to this limitation, we will primarily be working with data frames throughout the class.
```{r DataFrames}
mymatrix <- cbind(myvector, myvector2)
mymatrix
help(cbind)
#cbind() is a function that combines columns
mydataframe <- as.data.frame(cbind(myvector, myvector2))
mydataframe
help(as.data.frame)
#as.data.frame() is a function that converts an object to a dataframe
```
"<-" and "=" mean the same thing. They assign the value on the right into the object on the left: object.name <- operation. The "=" sign is also used to give values to the arguments of functions, where "<-" does not work. For example, to make an object called "my.random.numbers" with 5 random numbers from a normal distribution with mean 3 and standard deviation 2 we write:
`my.random.numbers <- rnorm(n = 3, mean = 3, sd = 2)`
### Importing Data
Data files can be imported from your local computer for manipulation and analysis in R and new files can be created and saved back to your local computer from R. We will be importing data from select files for most of our computer labs to perform statistical analyses and creating new .RMD and .HTML files in R.
Depending on how a data file is saved on the computer, as a text, csv or excel file, there are different functions for uploading the data into R. For each of the different functions, two important arguments to include are the file path to direct you to the file and designating that the column names, also known as headers, should be preserved when imported into R. When the desired data file is located within your working directory, you do not need to enter the complete file path, you can just include the name of the file in quotations. If the file is outside of your working directory, you will need to designate the complete file path in the quotations.
read.csv() imports .csv files into your Environment
read.table() imports .txt files into your Environment
read_excel() imports .xls and .xlsx files into your Environment
```{r ImportingData}
knitr::include_graphics("images/CH2Import2.png")
```
Another option is to click on the Import Dataset button at the top of the Environment Window tab and to manually select the location of the file. Here you can designate if the file is from Excel or from a Text (readr); the latter will also give you the option to designate if your file is comma (.csv) or tab (.txt) separated. This will also provide you with code for importing the data, including the complete file path. The code will be provided at the beginning of each lab for how to import the dataset we will be working with.
### Functions
Functions take arguments, perform operations on them and return results. For example, in mean(x = c(8, 4, 6)), mean() is the function and the argument called x receives the value c(8, 4, 6), which is a vector made up of the numbers 8, 4, and 6 in that order. The function returns a result that is the number 6.
Users interact with R mostly using functions. R and R packages provide thousands of functions, and each user can create new custom fuctions. Suppose we want to make a plot of a polynomial of order 3 for many different values of the parameters. We define a function that calculates the values of the polynomial and plots them all in one step.
``` {r}
plotPoly <- function(b0, b1, b2, b3, xmin, xmax){ # name function and arguments
x <- seq(from = xmin, to = xmax, length.out = 100) # create x values
y <- b0 + b1 * x + b2 * x^2 + b3 * x^3 # calculate y values
plot(y ~ x, type = "l") # plot the results
}
# Now that the function is defined we can make plots for any values of the arguments pretty quickly
plotPoly(b0 = 1, b1 = 0.5, b2 = 0.03, b3 = -0.01, xmin = -5, xmax = 10)
# Adjust the range of x to get a better view
plotPoly(b0 = 1, b1 = 0.5, b2 = 0.03, b3 = -0.01, xmin = -10, xmax = 10)
```
Command line code varies in its complexity, but **it is never expected for all of the code to be memorized**. This is why the following window, R Markdown, is an important tool to save code and attach instructions and notes.
R can do basic arithmetic operations that can be completed on a calculator, but what is particularly useful is the ability to perform these operations on vectors and data frames
```{r}
a <- c(1, 2, 3, 4, 5)
b <- c(10, 20, 30, 40, 50)
a + 100
## [1] 101 102 103 104 105
a + b
## [1] 11 22 33 44 55
(a + b) / 10
## [1] 1.1 2.2 3.3 4.4 5.5
```
We can also rename existing objects, perform arithmetic functions across vectors, and print out the values within defined objects
```{r}
a <- 1:5
b <- 1:5
ab.sum <- a + b
ab.diff <- a - b
ab.prod <- a * b
ab.sum
## [1] 2 4 6 8 10
ab.diff
## [1] 0 0 0 0 0
ab.prod
## [1] 1 4 9 16 25
```
There are also basic summary statistics that can be calculated with different functions in R. These are important statistics that we will continue to use throughout the class to extrapolate different information from datasets.
Lets take the numbers 1-10 and extrapolate different summary statistics from the data.
### Summary Statistics Functions
Table: (\#tab:SumStatFun) Summary Statistics for the length, summation, minimum, maximum, median, mean, range, variance, standard deviation and summary of a set of data
| Function | Example | Result returned |
|-------------:|:-------------:|:-----------------:|
| length() | length(1:10) | 10 |
| sum() | sum(1:10) | 55 |
| min() | min(1:10) | 1 |
| max() | max(1:10) | 10 |
| median() | median(1:10) | 5.5 |
| mean() | mean(1:10) | 5.5 |
| range() | range(1:10) | 1 10 |
| var() | var(1:10) | 9.16 |
| sd() | sd(1:10) | 3.03 |
| | |Min. 1, 1st Q.3.25|
| summary() | summary(1:10) |Med. 5.5, Mean 5.5|
| | |3rd Q.7.75, Max 10|
```{r}
length(1:10)
sum(1:10)
min(1:10)
max(1:10)
median(1:10)
mean(1:10)
range(1:10)
var(1:10)
sd(1:10)
summary(1:10)
```
<br>
## Exercises
1. For data set E1 consisting of 12 values, calculate the following summary statistics
```{r}
E1 <- c(14, 18, 47, 3, 112, 63, 32, 27, 91, 81, 4, -71)
```
a. mean
b. range
c. variance
d. standard deviation
2. For data set E2 consisting of 12 values, calculate the following summary statistics
```{r}
E2 <- c(8, 128, -103, 92, 14, 44, -67, 9, 31, 12, -7, 0)
```
a. mean
b. range
c. variance
d. standard deviation
3. Subtract dataset E1 from dataset E2 and calculate the following summary statistics
a. length
b. mean
c. range
d. variance
e. standard deviation
4. Add 100 to each of the 12 values you calculated in Question 3, how does this change the mean, range, variance and standard deviation?
## Laboratory Exercises
### Plant Sciences
#### Introduction to R and RStudio
If you are using a lab computer and not your personal computer, insert your flash drive and set the working directory to be in your drive. This way you will have all of your files with you in your drive. Files that you save to the lab computers are wiped every night. The following script chunk allows you to manually select your working directory by selecting a file within it.
Once you have installed the necessary packages, you can render this Rmd file and produce the html file that can be viewed in any browser. **For full credit you will submit both the Rmd and html files you complete.**
#### Create a dataframe and table
To create a data frame we use the data.frame() function. In this example we create a data frame named "mydata." This data frame has 2 columns (Y1 and Y2) and 3 rows.
```{r PsCreateDf, echo=TRUE}
mydata <- data.frame(Y1 = c("treatment 1", "treatment 2", "treatment 3"), Y2 = c(35, 23, 30)) # Create data frame with 2 columns and 3 rows
mydata # see the contents of the data frame
str(mydata) # see the structure of the data frame
class(mydata) # see the class of the data frame
library(pander) # we need to install pander first
pander(mydata, caption = "Table showing my data.") #pander() creates a table for the data
```
#### Summary Statistics
Summary statistics are classified as measures of central tendency and measures of dispersion. Mean, and median are measures of central tendency. Variance, standard deviation, range minimum and maximum are measures of dispersion. For positive random variables, the coefficient of variation is a measure of variation relative to the mean.
We use the *clover.txt* data set provided in the first assignment. These data represent the mass of clover plants grown for different periods at three different temperatures. Temperatures are in the first column coded as 1 for 5-15 C, 2 for 10-20 C and 3 for 15-25 C. The second column contains the number of days of growth and the last column contains the log of the plant mass in g. Column names are in the first row of the file, so we specify *header = TRUE* in the line of code to read the data. Data are placed in a data frame named "clover.
```{r ps summarystats1, echo = TRUE}
clover <- read.csv("Datasets/Lab01clover.txt", header = TRUE) # read in data.
help(clover) # Read about the nature of the data set.
(avg.lnwt <- mean(clover$lnwt)) # obtain the average lnwtance; note the use of $ to select parts of an object and the outer parentheses to display the result.
(med.lnwt <- median(clover$lnwt)) # median or 50th percentile
quantile(clover$lnwt, 0.5) # 50th percentile is the same as the median
(var.lnwt <- var(clover$lnwt)) # variance
(std.lnwt <- sd(clover$lnwt)) # standard deviation
sum(clover$lnwt)/length(clover$lnwt) # manual calculation of average
sum( (clover$lnwt - avg.lnwt) ^ 2 ) / (length(clover$lnwt) - 1)
(rng.lnwt <- range(clover$lnwt)) # output is a vector of lenght 2
min(clover$lnwt) # minimum braking lnwtance
max(clover$lnwt) # maximum braking lnwtance
(cv.lnwt <- std.lnwt / avg.lnwt) # coefficient of variation
```
The **coefficient of variation** is the standard deviation in units of the average. A value greater than 100% is high and indicates that there is a large amount of variation unexplained. Values below 0.30 are low and indicate that there is little unexplained variation. For controlled field experiments the CV should be below 0.5, whereas for observational studies an acceptable CV can be up to 1.0.
#### Frequency Table and Histogram
We can create a frequency table by dividing the range of lnwt into bins according to a general guideline: number of bins = 1 + 3.3 log(sample size). For this we use the cut() function. Then we create the frequency table using the xtabs() function and plot the histogram. All of this is done automatically by R using the hist() function. We will worry about making the graphs look pretty later on.
```{r ps freqhist, echo = TRUE, include = TRUE}
(sample.size <- length(clover$lnwt))
(nbins <- max(8, (1 + log(sample.size, 2)))) # we round down for bins
clover$bin <- cut(clover$lnwt, breaks = nbins) # create column with bin
(freq.table <- as.data.frame(xtabs( ~clover$bin))) # creates frequency table
pander(freq.table) # format the table a little better
plot(freq.table, xlim = c()) # make histogram "by hand"
hist(clover$lnwt) # by default it uses Sturges rule for bins.
```
Note that by adding parenthesis outside the whole expression we make the new object and display it at the same time.The function as.data.frame transforms the output of xtabs() into a table that is easy to understand.
The histogram can be used to visually assess the type of distribution the data may exhibit.
Data can be easily summarized by using the summary() function of R. This function will produce results that take into account the type of variable in the object it operates on.
```{r ps summaryfunc, echo = TRUE, include = TRUE}
summary(clover$lnwt) # summary of a numeric variable or vector
class(clover$lnwt) # a numeriv vector contains real numbers
summary(clover$bin) # this is a character variable coded as a factor
class(clover$bin)
summary(clover) # gives the summary for all variables
# What kind of result or object does summary() produce?
class(summary(clover)) # This is functional programming; nested functions
mode(summary(clover))
str(summary(clover)) # See the inside of an object!!
summary(clover)[1,1] # get the first element of the table
summary(clover)[1,] # get the first row of the summary table
summary(clover)[,1] # get the first column of the summary table
```
The function str() is really important to see what objects are composed of and to use parts of those objects. Any part of an object can be referenced and modified using its "address."
#### Box-and-Whisker Plot and 5-Number Summary
Data can be summarized using a boxplot and or the 5-number summary. These contain information about the average, median, first quartile, third quartile, range and outliers. The *range* argument determines how far the plot whiskers extend out from the box. If range is positive, the whiskers extend to the most extreme data point which is no more than *range* times the interquartile range (IQR) from the box. Data points that fall outside the whiskers are plotted and can be interpreted as "outliers" or values more extreme than expected. Outliers are so only on the basis of some preconceived expectation, usually a more or less normal distribution. A range value of zero causes the whiskers to extend to the data extremes, so no outliers are shown.
```{r PsBoxplot1, echo = TRUE, include = TRUE}
myrange.factor <- 1.5
boxplot(clover$lnwt, range = myrange.factor)
text(x = 0.70, y = median(clover$lnwt), label = "Median")
text(x = 0.65, y = quantile(clover$lnwt, 0.75), label = "3rd quartile or 75% quantile")
text(x = 1.30, y = quantile(clover$lnwt, 0.75), label = "Upper Hinge")
text(x = 0.65, y = quantile(clover$lnwt, 0.25), label = "1st quartile or 25% quantile")
text(x = 1.30, y = quantile(clover$lnwt, 0.25), label = "Lower Hinge")
# To get the location of the top "fence" we need to identify the datapoint that is closest to but no greater than the upper hinge plus 1.5 * interquartile range.
(iqrange <- IQR(clover$lnwt)) # Calculate the interquartile range
(uplimit <- quantile(clover$lnwt, 0.75) + myrange.factor * iqrange)
(rows.where.lt.uplimit <- which(clover$lnwt <= uplimit)) # gives row numbers of those that are less than uplimit
clover$lnwt[rows.where.lt.uplimit] # gives lnwt in those rows
(up.fence <- max(clover$lnwt[rows.where.lt.uplimit])) # gives the value we seek
(outl <- clover$lnwt[clover$lnwt > up.fence]) # No outliers!
text(x = 0.65, y = up.fence, label = "Upper Fence")
help(boxplot) # get more information about the boxplot.
```
If we choose a smaller value for the *range* argument of boxplot we can make some values appear like outliers. Note that 1.5 is a commonly accepted value for range. Just to see how the plots behave with a different range, we copy the code above and re-run it after changing the range value.
```{r ps boxplot2, echo = TRUE, include = TRUE}
myrange.factor = 0.5
boxplot(clover$lnwt, range = myrange.factor)
text(x = 0.70, y = median(clover$lnwt), label = "Median")
text(x = 0.65, y = quantile(clover$lnwt, 0.75), label = "3rd quartile or 75% quantile")
text(x = 1.30, y = quantile(clover$lnwt, 0.75), label = "Upper Hinge")
text(x = 0.65, y = quantile(clover$lnwt, 0.25), label = "1st quartile or 25% quantile")
text(x = 1.30, y = quantile(clover$lnwt, 0.25), label = "Lower Hinge")
# To get the location of the top "fence" we need to identify the datapoint that is closest to but no greater than the upper hinge plus 1.5 * interquartile range.
(iqrange <- IQR(clover$lnwt)) # Calculate the interquartile range
(uplimit <- quantile(clover$lnwt, 0.75) + myrange.factor * iqrange)
(rows.where.lt.uplimit <- which(clover$lnwt <= uplimit)) # gives row numbers of those that are less than uplimit
clover$lnwt[rows.where.lt.uplimit] # gives lnwt in those rows
(up.fence <- max(clover$lnwt[rows.where.lt.uplimit])) # gives the value we seek
(outl <- clover$lnwt[clover$lnwt > up.fence]) # We have outliers!
text(x = 0.65, y = up.fence, label = "Upper Fence")
```
The values for the five-number summary can be obtained directly with the fivenum() function:
```{r ps fiven, echo = TRUE, include = TRUE}
pander(cbind(c("min", "lower hinge", "median", "upper hinge", "max"), fivenum(clover$lnwt)))
```
#### Analyses by Groups
The clover data has groups defined by the temperature treatments. Because the group numbers are just lables, we need to transform them into what is called "factor" class in R. This way, we can apply functions that obtain summaries and figures for each group. In the case of the clover, we expect that plants to grow faster at higher temperature; let's see what the data say.
```{r PsGrouped, echo=TRUE, message=FALSE, warning=FALSE, include=TRUE}
clover$temp <- factor(clover$temp)
summary(clover)
str(clover)
boxplot(lnwt ~ temp, data = clover, notch = TRUE) # boxplots by treatment
help(by) # by produces a list with an element for each value of the grouping variable.
sumry.by.temp <- by(data = clover$lnwt, INDICES = clover$temp, FUN = summary)
pander(sumry.by.temp)
(fiven.by.temp <- by(data = clover$lnwt, INDICES = clover$temp, FUN = fivenum))
```
### Animal Sciences
#### Introduction to R and RStudio
If you are using a lab computer and not your personal computer, insert your flash drive and set the working directory to be in your drive. This way you will have all of your files with you in your drive. Files that you save to the lab computers are wiped every night. The following script chunk allows you to manually select your working directory by selecting a file within it.
Once you have installed the necessary packages, you can render this Rmd file and produce the html file that can be viewed in any browser. **For full credit you will submit both the Rmd and html files you complete.**
#### Create a dataframe and table
To create a data frame we use the data.frame() function. In this example we create a data frame named "mydata." This data frame has 2 columns (Y1 and Y2) and 3 rows.
```{r as createdf, echo=TRUE}
mydata <- data.frame(Y1 = c("treatment 1", "treatment 2", "treatment 3"), Y2 = c(35, 23, 30)) # Create data frame with 2 columns and 3 rows
mydata # see the contents of the data frame
str(mydata) # see the structure of the data frame
class(mydata) # see the class of the data frame
library(pander) # we need to install pander first
pander(mydata, caption = "Table showing my data.") #pander() creates a table for the data
```
#### Summary Statistics
Summary statistics are classified as measures of central tendency and measures of dispersion. Mean, and median are measures of central tendency. Variance, standard deviation, range minimum and maximum are measures of dispersion. For positive random variables, the coefficient of variation is a measure of variation relative to the mean.
We use the *heifer.xlsx* data set provided in the first assignment. These data represent the birth weight of heifers across three different years (2005, 2006, 2007). Column names are in the first row of the file, so we specify *col_names = TRUE* in the line of code to read the data. Data are placed in a data frame named "heifer".
```{r as summarystats1, echo = TRUE}
library(readxl)
# read in data.
heifer <- read_excel("Datasets/HeiferDataLab1.xlsx", col_names = TRUE)
(avg.bwt <- mean(heifer$Birth_Wt)) # obtain the average bwt
# note the use of $ to select parts of an object and the outer
# parentheses to display the result.
(med.bwt <- median(heifer$Birth_Wt)) # median or 50th percentile
quantile(heifer$Birth_Wt, 0.5) # 50th percentile is the same as the median
(var.bwt <- var(heifer$Birth_Wt)) # variance
(std.bwt <- sd(heifer$Birth_Wt)) # standard deviation
sum(heifer$Birth_Wt)/length(heifer$Birth_Wt) # manual calculation of average
sum( (heifer$Birth_Wt - avg.bwt) ^ 2 ) / (length(heifer$Birth_Wt) - 1)
(rng.bwt <- range(heifer$Birth_Wt)) # output is a vector of lenght 2
min(heifer$Birth_Wt) # minimum braking bwtance
max(heifer$Birth_Wt) # maximum braking bwtance
(cv.bwt <- std.bwt / avg.bwt) # coefficient of variation
```
The **coefficient of variation** is the standard deviation in units of the average. A value greater than 1 is high and indicates that there is a large amount of variation unexplained. Values below 0.30 are very low and indicate that there is little unexplained variation. For laboratory or controlled field experiments the CV should be below 0.5, whereas for observational studies an acceptable CV can be up to 1.0.
#### Frequency Table and Histogram
We can create a frequency table by dividing the range of bwt into bins according to a general guideline: number of bins = 1 + 3.3 log(sample size). For this we use the cut() function. Then we create the frequency table using the xtabs() function and plot the histogram. All of this is done automatically by R using the hist() function. We will worry about making the graphs look pretty later on.
```{r as freqhist, echo = TRUE, include = TRUE}
(sample.size <- length(heifer$Birth_Wt))
(nbins <- max(8, (1 + log(sample.size, 2)))) # we round down for bins
heifer$bin <- cut(heifer$Birth_Wt, breaks = nbins) # create column with bin
(freq.table <- as.data.frame(xtabs( ~heifer$bin))) # creates frequency table
pander(freq.table) # format the table a little better
plot(freq.table, xlim = c()) # make histogram "by hand"
hist(heifer$Birth_Wt) # by default it uses Sturges rule for bins.
```
Note that by adding parenthesis outside the whole expression we make the new object and display it at the same time.The function as.data.frame transforms the output of xtabs() into a table that is easy to understand.
The histogram can be used to visually assess the type of distribution the data may exhibit.
Data can be easily summarized by using the summary() function of R. This function will produce results that take into account the type of variable in the object it operates on.
```{r as summaryfunc, echo = TRUE, include = TRUE}
summary(heifer$Birth_Wt) # summary of a numeric variable or vector
class(heifer$Birth_Wt) # a numeriv vector contains real numbers
summary(heifer$bin) # this is a charcater variable coded as a factor
class(heifer$bin)
summary(heifer) # gives the summary for all variables
# What kind of result or object does summary() produce?
class(summary(heifer)) # This is functional programming; nested functions
mode(summary(heifer))
str(summary(heifer)) # See the inside of an object!!
summary(heifer)[1,1] # get the first element of the table
summary(heifer)[1,] # get the first row of the summary table
summary(heifer)[,1] # get the first column of the summary table
```
The function str() is really important to see what objects are composed of and to use parts of those objects. Any part of an object can be referenced and modified using its "address."
#### Box-and-Whisker Plot and 5-Number Summary
Data can be summarized using a boxplot and or the 5-number summary. These contain information about the average, median, first quartile, third quartile, range and outliers. The *range* argument determines how far the plot whiskers extend out from the box. If range is positive, the whiskers extend to the most extreme data point which is no more than *range* times the interquartile range (IQR) from the box. Data points that fall outside the whiskers are plotted and can be interpreted as "outliers" or values more extreme than expected. Outliers are so only on the basis of some preconceived expectation, usually a more or less normal distribution. A range value of zero causes the whiskers to extend to the data extremes, so no outliers are shown.
```{r as boxplot1, echo = TRUE, include = TRUE}
myrange.factor = 1.5 # save number in a named object for later use
boxplot(heifer$Birth_Wt, range = myrange.factor)
text(x = 0.70, y = median(heifer$Birth_Wt), label = "Median")
text(x = 0.65, y = quantile(heifer$Birth_Wt, 0.75), label = "3rd quartile or 75% quantile")
text(x = 1.30, y = quantile(heifer$Birth_Wt, 0.75), label = "Upper Hinge")
text(x = 0.65, y = quantile(heifer$Birth_Wt, 0.25), label = "1st quartile or 25% quantile")
text(x = 1.30, y = quantile(heifer$Birth_Wt, 0.25), label = "Lower Hinge")
# To get the location of the top "fence" we need to identify the datapoint that is closest to but no greater than the upper hinge plus 1.5 * interquartile range.
(iqrange <- IQR(heifer$Birth_Wt)) # Calculate the interquartile range
(uplimit <- quantile(heifer$Birth_Wt, 0.75) + myrange.factor * iqrange)
(rows.where.lt.uplimit <- which(heifer$Birth_Wt <= uplimit)) # gives row numbers of those that are less than uplimit
heifer$Birth_Wt[rows.where.lt.uplimit] # gives bwt in those rows
(up.fence <- max(heifer$Birth_Wt[rows.where.lt.uplimit])) # gives the value we seek
(outl <- heifer$Birth_Wt[heifer$Birth_Wt > up.fence]) # 10 outliers
text(x = 0.65, y = up.fence, label = "Upper Fence")
help(boxplot) # get more information about the boxplot.
```
If we choose a smaller value for the *range* argument of boxplot we can make some values appear like outliers. Note that 1.5 is a commonly accepted value for range. Just to see how the plots behave with a different range, we copy the code above and re-run it after changing the range value.
```{r as boxplot2, echo = TRUE, include = TRUE}
myrange.factor = 0.5
boxplot(heifer$Birth_Wt, range = myrange.factor)
text(x = 0.70, y = median(heifer$Birth_Wt), label = "Median")
text(x = 0.65, y = quantile(heifer$Birth_Wt, 0.75), label = "3rd quartile or 75% quantile")
text(x = 1.30, y = quantile(heifer$Birth_Wt, 0.75), label = "Upper Hinge")
text(x = 0.65, y = quantile(heifer$Birth_Wt, 0.25), label = "1st quartile or 25% quantile")
text(x = 1.30, y = quantile(heifer$Birth_Wt, 0.25), label = "Lower Hinge")
# To get the location of the top "fence" we need to identify the datapoint that is closest to but no greater than the upper hinge plus 1.5 * interquartile range.
(iqrange <- IQR(heifer$Birth_Wt)) # Calculate the interquartile range
(uplimit <- quantile(heifer$Birth_Wt, 0.75) + myrange.factor * iqrange)
(rows.where.lt.uplimit <- which(heifer$Birth_Wt <= uplimit)) # gives row numbers of those that are less than uplimit
heifer$Birth_Wt[rows.where.lt.uplimit] # gives bwt in those rows
(up.fence <- max(heifer$Birth_Wt[rows.where.lt.uplimit])) # gives the value we seek
(outl <- heifer$Birth_Wt[heifer$Birth_Wt > up.fence]) # We have 90 outliers!
text(x = 0.65, y = up.fence, label = "Upper Fence")
```
<!-- **PUT THIS IN THE CORRECT PLACE** -->
<!-- Note that the uniform distribution U(0, 1) is the distribution of the random numbers generated by the function RAND() in Excel. For more details on this distribution, see Uniform Distribution at <https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)>. -->
<!-- \begin{equation} -->
<!-- f(x) = -->
<!-- \begin{cases} -->
<!-- \frac{1}{b-a} & \quad \text{if } a \leq x \leq b \\ -->
<!-- 0 & \quad \text{otherwise } -->
<!-- \end{cases} -->
<!-- (\#eq:unifPdf) -->
<!-- \end{equation} -->
<!-- ```{r UnifDist, fig.cap="Probability density function for a continuous uniform distribution."} -->
<!-- ``` -->
<!-- **DOWN TO HERE** -->
The values for the five-number summary can be obtained directly with the fivenum() function:
```{r as fiven, echo = TRUE, include = TRUE}
pander(cbind(c("min", "lower hinge", "median", "upper hinge", "max"), fivenum(heifer$Birth_Wt)))
```
#### Analyses by Groups
The heifer data has groups defined by the year. Because the group numbers are just labels, we need to transform them into what is called "factor" class in R. This way, we can apply functions that obtain summaries and figures for each group. We can see if there are any differences in the birth weight across the three years for heifers.
```{r as grouped, echo=TRUE, message=FALSE, warning=FALSE, include=TRUE}
heifer$year <- factor(heifer$Birth_Year)
summary(heifer)
str(heifer)
boxplot(Birth_Wt ~ year, data = heifer, notch = TRUE) # boxplots by treatment
help(by) # by produces a list with an element for each value of the grouping variable.
pander(sumry.by.temp <- by(data = heifer$Birth_Wt, INDICES = heifer$year, FUN = summary))
(fiven.by.temp <- by(data = heifer$Birth_Wt, INDICES = heifer$year, FUN = fivenum))
```