---
output:
  html_document:
    fig_caption: yes
    number_sections: yes
    theme: readable
    toc: yes
    toc_depth: 3
editor_options:
  chunk_output_type: console
---
```{r setup0, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r message=FALSE, warning=FALSE, paged.print=FALSE, echo=FALSE, include=FALSE}
# load packages for chapter
options(digits = 10)
library(bookdown)
library(emmeans)
library(ggplot2)
library(dplyr)
library(kableExtra)
library(knitr)
library(tables)
library(pander)
library(multcomp)
library(agricolae)
library(nlme)
library(car)
library(tidyr)
```
# Data Manipulation, Exploration and Summaries {#chData}
One of the goals of statistics is to summarize and display data in ways that are easily understood. Data are the basis of much knowledge and advancement, and they are more permanent than models. As we gather more data, models change, but the data remain the same, and models should agree well with all data. Therefore, it is essential for data to be obtained in a valid manner, well organized, and properly preserved and curated. This chapter has two main sections. The first gives a very brief but practical introduction to data curation; its main point is to help students get started with their own data in a way that avoids major pitfalls and wasted time. The second and larger section shows the basic methods to summarize and display data, including a preliminary and intuitive introduction to the concepts of population, parameter, sample and statistic. The main point of the second section is to learn those basic methods, such as the calculation of the sample average and standard deviation. The concepts will be formalized in [Chapter 6](#chSampling).
## Learning Objectives for Chapter
1. Create well organized and documented data files.
1. Distinguish between population parameters and sampling statistics.
1. List, define and calculate measures of central tendency for data and populations.
1. List, define and calculate measures of dispersion for data and populations.
1. Explain organization of one- and two-way data using subscripts.
1. Make histograms for sample data, explain histogram parts and interpret them.
1. Calculate the 5-number summary and make a box and whisker plot with all elements labeled.
1. Assess the normality of data using the histogram and boxplot.
1. Calculate the coefficient of variation and describe its weaknesses.
## Data curation
<!-- See wikipedia entry on <a href="https://en.wikipedia.org/wiki/Data_curation" target = "_blank">Data Curation</a>. -->
<!-- See message from Duncan Lang and Vessela Ensberg: -->
Data analysis depends completely on having good data that are correctly organized. Data curation has recently become an extremely important discipline. We have the ability to gather and store very large amounts of data, but those data are useful only if they are well organized, maintained and validated. Most of us work with data using our senses, hands and brains to organize and check them. For example, we look at two numbers on the screen and decide whether they are equal or not. This may work when we handle a few numbers and pieces of information, but not for more than that, even if it feels like it works. We must use computers and validated logic and programs to manage data.
There are several guidelines and tricks for data entry and curation; we mention a few here.
**Entry and validation**
- Only one point of entry and storage for each piece of information or data. No datum should be entered at more than one point or stored in more than one place (except for backups, of course).
If the same datum can be entered in different places, it is possible to have inconsistent values. When data have to be corrected, many locations have to be accessed for the correction, which is not desirable. Moreover, multiple entries increase the probability of making a mistake. There should be a single point of entry with redundant validation and checkpoints.
**Metadata**
Metadata is data about data. It is the information that allows one to understand and interpret the data. Metadata should include:
1. Who
1. What
1. Where
1. When
1. Why
- Do not mix data and metadata in the same portion of the file.
- Metadata should answer: Where do the data come from? What methods were used? Who entered the data? When and where? What are the units? Who is the contact? Do the data come from a larger time series? Describe the experimental conditions.
**Format**
- Keep data in a format that is open, widely adopted, widely supported and stored without loss (this last part is particularly important for media). Data should be uncompressed and unencrypted.
1. Open and free formats.
1. Video: MOV, MPEG
1. Sound: WAVE, MP3
1. Numbers and text: ASCII (.txt and .csv)
1. Images: TIFF, JPEG 2000
Word, Excel, and other proprietary formats are not suitable because they are neither free nor widely interpretable. Software updates can make data in those formats obsolete and unreadable.
**Naming conventions**
Data require that we have identifiers for many categories. Identifying names have to be short to be useful, for example, as filenames. They have to be unique, and they must allow enough values to account for all possible names necessary. Names and labels should be informative and easily readable, and they should not include non-ASCII characters or spaces.
- Use a validated file naming convention. You will have to handle files as if each were a datapoint.
**Data organization**
- Format data area as a rectangular set of cells.
- One row per observation or number. Nothing unrelated to the observation in the row.
- A row is analogous to a RECORD in a relational database structure.
- One column per variable. Specify units. A column is analogous to a FIELD in a relational database structure.
- Short column names without spaces or non-ASCII characters. Styles that work with R can be my.col.name.cm, my_col_name_cm or MyColNameCm.
- Never enter values that can be calculated based on other primary values.
- No Excel or other formatting except for dates.
- Enter dates as 10-Mar-2018 (see the parsing example after this list). 10-5-18 is ambiguous: it could be October 5, 2018, or May 10, 1918, or other combinations.
- No spaces before or after visible characters. No leading or trailing spaces.
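Unambiguous date strings like the 10-Mar-2018 style recommended above can be parsed directly in R. A minimal sketch (the `format` string matches that style; parsing the abbreviated month name with `%b` assumes an English locale):
```{r dateExample}
# An unambiguous date: day, abbreviated month name, four-digit year
as.Date("10-Mar-2018", format = "%d-%b-%Y")
# "10-5-18" cannot be parsed reliably because the field order is unknown
```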
The following figure shows a messy data file created in Excel. See if you can spot all the problems and errors in the file. Interestingly, some errors are not visible at all. Can you guess which ones those are?^[There are invisible leading and trailing spaces in several fields.]
<br>
```{r messyExcelData, echo=FALSE, message=FALSE, fig.cap="Example of what NOT TO DO with data. This file has multiple errors and violations of the principles listed above."}
include_graphics("images/messyData.png")
```
<br>
In contrast with the data organization above, the data shown in the figure below have proper formatting and are ready to be transferred to R or other statistical packages.
<br>
```{r TidyExcelData, echo=FALSE, message=FALSE, fig.cap="Example of how data can be properly formatted and organized in a spreadsheet."}
include_graphics("images/ProperlyFormattedData.png")
```
<br>
### Other Resources for Data Sciences
The Data Carpentry web site has some good guidelines:
http://www.datacarpentry.org/semester-biology/materials/tidy-data/
The New England <a href = "https://library.umassmed.edu/resources/necdmc/modules" target = "_blank">Collaborative Data Management Curriculum</a> has extensive material and lesson plans on data management.
## Populations and samples
Within a statistical framework, the term population refers to the entire set of objects we are interested in for a particular question or study. Although we are usually interested in populations whose membership is at least partially determined prior to the study, we must explicitly define the population we are referring to, and we can define it in the way that best suits the study. Suppose that we are studying the damage that rust causes to wheat production. Our clientele is in a specific region that has certain types of soils and climate. In that case it is a lot more practical and useful to define the population as all the fields with wheat in that region within certain dates. It would not be useful or practical to define the population as all of the wheat fields in the world.
<br>
```{block ChoosingPopulation, type='stattip'}
- The population or universe being studied and from which data are obtained has to be explicitly and unambiguously defined.
- The definition of the population can be arbitrary, but it should be consistent with the goals of the study.
```
<br>
As we will see in more detail in the next chapter, the set of objects and the population of values are not the same thing, and we should use the proper names for them. The set of all California Condors is the **universe** of interest and the set of all of their weights is the **population**. A given universe of interest may contain many populations because each object has multiple interesting characteristics. Usually, no confusion is created by using the term "population" to mean both the universe and the populations for a specific characteristic of interest.
If we observe or measure all of the members of a population we have done a *census* and we know the complete distribution of the variables measured. Frequently, even if we measure every *available* object from a population, we do not have the whole population about which we want to make statistical statements. The population may be a virtual construct from which the realized set we can observe originated. Think, for example, of the population of California condors. When we study the biology of the few individuals that are left (170 in captivity and 280 in the wild in 2016) we are not interested exclusively in those individuals; we want to know how the class "California Condor" works, both in captivity and in the wild, and including future individuals. Thus, the present population can be considered a sample or realization of the virtual class or population "California Condor." Even if we measure all of them, we will not have the mean for all condors, past and future. The same idea applies to most populations that are thought of as abstract classes, such as humans, UCD students, polar bears, delta smelt, etc.
Data come from observing samples from populations. The concepts of sample and sampling are main topics in Chapter 6 on [Random Variables, Sampling and Distributions](#chSampling). For now, just think of a sample as a subset of a population. We explore data from samples and calculate summary and descriptive statistics to paint a picture of what the population might be like. These are called "descriptive" because there is no attempt to make general or statistical statements about the population. At this point the statistics are just different ways to look at the data collected. One of the main purposes is to give depictions that can be communicated and understood quickly and succinctly, as opposed to having to look at all data points at once. Data description and summaries are obtained with statistics, tables and graphical representations.
For example, the population of interest may be the weights of newborn dairy calves in California. In the figure below (\@ref(fig:MapCows)), two random samples were drawn from the California newborn dairy calf population to study the weight of newborn calves. The sample mean, more correctly referred to as the **average**, was calculated for each sample.
<br>
```{r MapCows, message=FALSE, warning=FALSE, paged.print=FALSE, out.width = '60%', fig.align='center', echo=FALSE, fig.cap ="Example of random samples of newborn calf weight collected from dairy cows in northern and southern California."}
knitr::include_graphics("images/Map_cows.png")
```
<br>
## Measures of central tendency
A measure of central tendency is a summary measure that represents the center point of a whole data set and indicates the central location of the data. The central tendency can be calculated for quantitative variables. "Central" does not have any meaning for characteristics that are unordered categorical variables, such as color, brand, model, species, coat pattern, etc. The three measures of central tendency we consider are mean, median, and mode. The term "mean" is reserved for the whole population, so we should use either "sample mean" or "average" when referring to the calculation for a sample. The other two terms have to be preceded by either "sample" or "population" to be clear about what they are. In the rest of this chapter we will be talking about samples, unless we specifically indicate that we are referring to population values.
The **average** is a central tendency measure that represents the center of mass of a set of values. It is calculated by summing all values and dividing by the number of values or sample size. We usually use the letter "r" to refer to sample size, but it is also common to use the letter "n" for it, in both instances in lower case.
<br>
$$\bar{Y} = \frac{1}{r} \ \sum_{i =1}^r \ Y_i$$
<br>
Characteristics of the average:
1. Easy to calculate and to operate with in equations.
1. Highly susceptible to change due to extreme values.
1. Unbiased estimator of the population mean.
1. Minimizes the sum of squared deviations.
Let's draw a random sample from a $\chi^2$ distribution with 2 degrees of freedom and explore the properties of the average.
<br>
```{r AverageProperties, message=FALSE, out.width='70%', fig.align='center', fig.cap="Numerical calculation of sum of squared deviations from sample values to each X value. The sum is minimized when X equals the average for the sample."}
set.seed(1221) # To always get the same random sample.
# Random sample of size 100 from a Chi-squared distribution
# with 2 degrees of freedom.
smpl <- rchisq(n = 100, df = 2)
# Get average and save into object called "avg"
(avg <- mean(smpl))
# What happens when we add the extreme value 200?
mean(c(smpl, 200))
# Calculate the sum of squared deviations for values from -2 to 8
# and view the results in a plot.
plot(sapply(sort(c(-20:80/10, mean(smpl))), FUN = function(i) sum((smpl - i)^2)) ~
sort(c(-20:80/10, mean(smpl))),
type = "l",
lwd = 4,
col = "chocolate",
ylab = "Sum of squared deviations",
xlab = "Value to calculate deviation to")
arrows(x0 = mean(smpl) , y0 = 1500,
x1 = mean(smpl), y1 = sum((smpl - mean(smpl))^2),
angle = 25, length = 0.2,
lwd = 2)
text(x = mean(smpl) , y = 1900 ,
labels = c("Sum of squared deviations \n is minimized when \n X equals the average"))
```
<br>
The **median** of a sample data set is the middle value when the data are arranged in ascending or descending order. The median is the value that leaves 50% of the data below it and 50% above it. The median is simple to calculate: when observations are ranked and r is even, it is the average of the two middle values; if r is odd, the median is the observation that leaves (r-1)/2 values below it and (r-1)/2 above it.
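A quick check of both cases in R, with tiny made-up samples just for illustration:
```{r medianExample}
# Odd r: the middle ranked value
median(c(3, 1, 7))       # 3
# Even r: the average of the two middle ranked values
median(c(3, 1, 7, 10))   # (3 + 7) / 2 = 5
```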
Characteristics of the median:
1. Very hard to operate with in equations, as are most "order" statistics.
1. Robust and resistant against extreme values; useful for skewed distributions.
1. It minimizes the sum of absolute deviations.
<br>
```{r MedianProperties, message=FALSE, out.width='70%', fig.align='center', fig.cap="Numerical calculation of sum of absolute deviations from sample values to each X value. The sum is minimized when X equals the median for the sample."}
plot(sapply(sort(c(-20:80/10, median(smpl))), FUN = function(i) sum(abs(smpl - i))) ~
sort(c(-20:80/10, median(smpl))),
type = "l",
lwd = 4,
col = "chocolate",
ylab = "Sum of absolute deviations",
xlab = "Value to calculate deviation to")
arrows(x0 = median(smpl) , y0 = 350,
x1 = median(smpl), y1 = sum(abs(smpl - median(smpl))),
angle = 25, length = 0.2,
lwd = 2)
text(x = median(smpl) , y = 400 ,
labels = c("Sum of absolute deviations \n is minimized when \n X equals the median"))
```
<br>
The **mode** of a sample data set is the value that occurs most frequently, or the central value of a histogram's tallest bar. Since multiple values can occur several times, one or more modes may be present for a data set. Alternatively, there may not be a most frequent value in a data set and, thus, no mode value.
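R has no built-in function for the statistical mode (R's `mode()` returns the storage mode of an object, not the most frequent value). A minimal sketch of a helper, here given the hypothetical name `sample.mode`, that returns all most frequent values:
```{r modeFunction}
# Return the most frequent value(s) in a vector
sample.mode <- function(x) {
  tab <- table(x)                       # frequency of each distinct value
  modes <- names(tab)[tab == max(tab)]  # keep the value(s) with top frequency
  type.convert(modes, as.is = TRUE)     # restore the original type
}
sample.mode(c(1, 2, 2, 3, 3, 4))  # two modes: 2 and 3
```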
Sometimes, data are referred to as being bimodal because they have two values that are clearly more frequent than the rest, although strictly only one of the two is the true mode. Using "mode" in this lax manner is useful because it pinpoints distributions for which it may be tricky to make confidence intervals with a single central location. In cases of bimodal distributions we should look for ways to classify the data or the population into strata or groups such that each group has a unimodal distribution. Then we can provide measures of central tendency and dispersion for each group.
<br>
```{r Bimodal, message=FALSE, echo=FALSE, out.width='70%', fig.align='center', fig.cap="Example of bimodal distribution. A single central confidence interval would not be very useful. These data should be explored to find a way to stratify into groups."}
set.seed(1333)
smpl2 <- c(rnorm(n = 50, mean = 10, sd = 1), rnorm(n = 50, mean = 20, sd = 2))
plot(density(smpl2, adjust = 0.5),
type = "l",
lwd = 4,
col = "chocolate",
ylab = "Probability density",
xlab = "Random variable value",
main = "")
arrows(x0 = median(smpl2) , y0 = 0.10,
x1 = median(smpl2), y1 = 0,
angle = 25, length = 0.2,
lwd = 2)
text(x = median(smpl2) , y = 0.11 ,
labels = c("Median"), cex = 1.5)
arrows(x0 = mean(smpl2) , y0 = 0.07,
x1 = mean(smpl2), y1 = 0,
angle = 25, length = 0.2,
lwd = 2,
col = "blue")
text(x = mean(smpl2) , y = 0.08 ,
labels = c("Mean"), cex = 1.5, col = "blue")
```
<br>
## Measures of dispersion
Measures of dispersion are important parameters and statistics that describe the spread of the data around its central measure (e.g., the mean). Knowing the variability in conjunction with the measure of central tendency gives us a more complete summary of the data or population.
The most important measure of dispersion is the **variance**, which numerically describes the variability of values in a data set. The **standard deviation** is the square root of the variance, and it expresses dispersion in the same units as the random variable, whereas the variance is in those units squared. For example, the random variable defined as the girth growth rate of a randomly selected valley oak tree in California during the spring of 2018 has units of mm day^-1^, its variance has units of mm^2^ day^-2^ and its standard deviation has units of mm day^-1^. The population variance of random variable $X$ is denoted by $\sigma^2_X = V\{X\}$ and its standard deviation is $\sigma_X$.
The variance is estimated from a sample with the following equation, where $S^2$ stands for *sample variance*. Note that the numerator is the sum of squared deviations about the average.
<br>
$$S^2 = \frac{1}{r-1} \ \sum_{i=1}^r \ (Y_i - \bar{Y})^2$$
<br>
```{r}
# The var function calculates the sample variance
var(smpl)
```
For the standard deviation, which is the square root of the variance, we use:
<br>
$$S = \sqrt{\frac{{\sum_{i=1}^r(Y_i - \bar{Y})^2}}{r-1}}$$
<br>
```{r}
# The sd function calculates the sample standard deviation
sd(smpl)
sqrt(var(smpl))
```
Another measure of dispersion that is useful when we want to avoid excessive influence of extreme data values is the *MAD*, or mean absolute deviation. As shown in Figure \@ref(fig:MedianProperties), the MAD is the average of the absolute deviations with respect to the median.
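The MAD can be computed directly, as in the sketch below. Note that R's built-in `mad()` function returns a different statistic, the *median* absolute deviation (scaled by a constant), so here we compute the mean absolute deviation by hand:
```{r madExample}
# Mean absolute deviation about the median
mean(abs(smpl - median(smpl)))
```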
The final measure of dispersion is the range. The **range** is the difference between the largest and smallest values in a data set. The range is the most ambiguous measure of dispersion and is not well suited for describing dispersion in large data sets, because it depends on only two values of the sample.
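In R, `range()` returns the minimum and maximum; their difference gives the range as a single number:
```{r rangeExample}
range(smpl)        # minimum and maximum
diff(range(smpl))  # the range as a single value
```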
## Frequency tables and histograms
Sample data sets from a population can be summarized in a **frequency table** as a first step in data summary and analysis (Table \@ref(tab:FreqTable)). A frequency table classifies all observations into a reasonable number of classes and reports the number of observations that fall within each class. Classes must be non-overlapping, and it is best to avoid observations falling exactly on the boundary between two classes. We will use data on the percentage of sucrose in 100 beets sampled by F. J. Hills in Woodland, CA in 1949. The `hist` function creates a histogram and frequency table automatically. We will use the function with the default arguments first, and then we will modify the number and width of classes.
<br>
```{r SucroseHist01, echo=FALSE, message=FALSE, fig.cap="Histogram of sucrose percentage in 100 beets. The horizontal axis is the percentage of sucrose in each beet. Height of bars represents the number of observations that falls within the base of each bar."}
# Read in data
beets <- read.csv(file = "Datasets/BeetSucrose.csv", strip.white = TRUE,
header = TRUE)
# Check structure of data frame
str(beets)
# Plot histogram and save all numerical data used for it.
beet.hist <- hist(beets$sucrose.pcnt, xlab = "Percentage of sucrose",
ylab = "Frequency or number of beets represented by each bar",
main = "",
col = "skyblue")
# Look at the structure of the results R used to create the histogram
str(beet.hist)
```
<br>
The component `breaks` in `beet.hist` has the list of class or bar boundaries, whereas `counts` has the frequencies for each bar and `mids` has the bar midpoints. If we want to make sure no observation falls on a boundary, we proceed as follows. First, determine the precision of the measurements. By inspecting the data we see that sugarbeet sucrose concentration was measured with a precision of 0.1\%, because values have only one figure to the right of the decimal point. Then, define the lowest boundary as the "floor" or integer part of the smallest sucrose value plus half of the precision. The smallest value in the data is `min(beets$sucrose.pcnt)` = `r min(beets$sucrose.pcnt)`, so we choose `floor(min(beets$sucrose.pcnt)) + 0.05` = `r floor(min(beets$sucrose.pcnt)) + 0.05` as the lowest boundary. We also define the class width with the same precision as the measurements. To get the class width, we first determine the number of classes k using Sturges' rule:
$$ k = 1 + 3.3 \times \log_{10}(r)$$
where k is the number of classes (i.e., bins), r is the sample size (100) and the log is in base 10. In this case the equation yields 7.6, which R rounded to 7, as shown by the number of bars in the graph and the number of counts in the `beet.hist$counts` vector. Then the width of each bar can be gauged by dividing the range of values by the number of classes. By setting our class number k = 9 we will use a class width smaller than what R selected by default.
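The Sturges calculation can be checked directly in R. Note that R's own `nclass.Sturges` uses an equivalent base-2 form, $\lceil \log_2(r) + 1 \rceil$, so its suggestion can differ slightly:
```{r sturgesExample}
# Sturges' rule for the number of classes
r <- length(beets$sucrose.pcnt)
1 + 3.3 * log10(r)  # approximately 7.6 for r = 100
```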
`(max(beets$sucrose.pcnt) -min(beets$sucrose.pcnt)) / 9` = `r (max(beets$sucrose.pcnt) -min(beets$sucrose.pcnt)) / 9`
We round up to the nearest number with the same precision as the original data, 1.5. This yields classes starting at 4.05 and in increments of 1.5. R has a very handy function for creating the classes and assigning the observations to each class: `cut`, where `breaks` is the list of class boundaries, which we create with the `seq` function.
<br>
```{r FreqTable, message=FALSE}
# Create class boundaries
(breaks <- seq(from = 4.05, to = 18, by = 1.5))
# Add a variable with the class to which each observation belongs
(beets$class <- cut(beets$sucrose.pcnt,
breaks = breaks,
dig.lab = 4))
# Compute the frequency table
beet.freq.table <- as.data.frame(table(beets$class))
# Change the name of the first column
names(beet.freq.table)[1] <- "Class"
# Calculate the relative frequencies
beet.freq.table$Rel.Freq <- beet.freq.table$Freq / sum(beet.freq.table$Freq)
# Display table
kable(beet.freq.table, caption = "Absolute and relative frequency distribution of sucrose content in a sample for 100 sugarbeets.") %>%
kable_styling(full_width = FALSE)
```
<br>
The table is best graphed into a histogram using the `hist` function. Instead of using the default classes calculated by R, we use the `breaks` we calculated.
<br>
```{r SucroseHist02, echo=FALSE, message=FALSE, fig.cap="Histogram of sucrose percentage in 100 beets with classes specified manually using the 'breaks' argument."}
beet.hist2 <- hist(beets$sucrose.pcnt, xlab = "Percentage of sucrose",
breaks = breaks,
ylab = "Frequency or number of beets represented by each bar",
main = "",
col = "skyblue")
```
<br>
## Grouped data
Sometimes data are reported only as histograms, and we are interested in getting the average and variance of the sample the histogram represents. We can use the following formulas for data grouped in such a way. If we have all the data, we do not need these formulas; they give approximations to the sample average and variance when the only information available is a frequency table.
<br>
$$ \bar{X} = \sum_{i=1}^k \ X_{m \ i} \ f(i) \\[25pt]
S^2 = \sum_{i=1}^k \ (X_{m \ i} - \bar{X})^2 \ f(i) $$
<br>
where $X_{m \ i}$ is the midpoint and $f(i)$ is the relative frequency of class $i$.
For the sugarbeet example we extract the necessary data from the `beet.hist2` object:
<br>
```{r}
# Get midpoints
Xmi <- beet.hist2$mids
# Get relative frequencies
fi <- beet.hist2$counts / sum(beet.hist2$counts)
# Calculate average
(Xbar <- sum(Xmi * fi))
# Calculate sample variance
sum((Xmi - Xbar)^2 * fi)
```
<br>
## Five-number summary
Another form of initial data analysis can be conducted by computing a five-number summary. The **five-number summary** is a set of descriptive statistics that provides information about the frequency distribution of a data set. These descriptive statistics are as follows:
- Minimum (Min.): Smallest value in the data set.
- Lower quartile (Q~1~): The median for the lower half of the data set that separates the first 25% of the values from the other 75%.
- Median: The value that falls in the middle of the data set.
- Upper Quartile (Q~3~): The median for the upper half of the data set that separates the last 25% of the values from the lower 75%.
- Maximum (Max.): Largest value in the data set.

Not included in the five-number summary is the interquartile range (IQR), which is useful for describing the middle 50% of the values in the data set. The IQR is calculated as:
$$IQR = Q_3 - Q_1$$
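In R the IQR can be obtained from the quartiles or with the built-in `IQR` function (these can differ slightly from the hinges reported by `fivenum`, because quartiles can be computed in several ways):
```{r iqrExample}
# IQR from the quartiles and with the built-in function
quantile(beets$sucrose.pcnt, 0.75) - quantile(beets$sucrose.pcnt, 0.25)
IQR(beets$sucrose.pcnt)
```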
The five-number summary can be displayed graphically in a box-and-whisker plot using the `boxplot` function in R. By default, the boxplot extends the fences or whiskers to the most extreme data points that are no more than 1.5 times the interquartile range from the box. In this case the boxplot can show points beyond the whiskers, which are called *outliers*. Figure \@ref(fig:boxplot) shows the boxplot using the R default. Therefore, in the boxplot shown, the whiskers are not the maximum and minimum of the data.
<br>
```{r boxplot, message=FALSE, warning=FALSE, out.width = '50%', fig.align='center', echo=FALSE, fig.cap ="Boxplot for sugarbeet sucrose content."}
# Min - Q1 - Q2 - Q3 - Max
fivenum(beets$sucrose.pcnt)
boxplot(beets$sucrose.pcnt)
text(x = 0.70, y = median(beets$sucrose.pcnt), label = "Median")
text(x = 0.65, y = quantile(beets$sucrose.pcnt, 0.75), label = "3rd quartile or 75% quantile")
text(x = 1.30, y = quantile(beets$sucrose.pcnt, 0.75), label = "Upper Hinge")
text(x = 0.65, y = quantile(beets$sucrose.pcnt, 0.25), label = "1st quartile or 25% quantile")
text(x = 1.30, y = quantile(beets$sucrose.pcnt, 0.25), label = "Lower Hinge")
text(x = 0.8, y = 17.2, label = "Upper Fence")
text(x = 0.8, y = 6.3, label = "Lower Fence")
```
<br>
## Coefficient of variation
The final topic to be discussed in this chapter is the statistic known as the coefficient of variation. The **coefficient of variation (CV)** is simply the ratio of the standard deviation to the mean.
$$ CV_{population} = \frac{\sigma}{\mu}$$
Or
$$ CV_{sample} = \frac{S}{\bar{X}}$$
```{r}
# Example values for a sample average and standard deviation
PRmean <- 194.9
PRsd <- 50.29
# The CV is the ratio of the standard deviation to the average
(CV <- PRsd / PRmean)
```
What is the use of the coefficient of variation? The CV provides a dimensionless value that permits comparison of variability between random variables that are in different dimensions (i.e., have different types of units) or that have ranges of values that are very different from each other.
For example, we may be interested in determining whether the seed mass produced by an individual plant is more or less variable than its leaf area. Mass and area are different dimensions, so it makes no sense to compare the values of their variances or standard deviations directly. The question "is 1 g greater than 1 m^2^?" is not answerable, and therefore it is not a meaningful question. If we divide the standard deviation of each variable by its average, the units cancel out and we obtain a dimensionless coefficient that can be compared.
<!-- Include figure with example of CV comparison. -->
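As a small illustration with made-up numbers (the values below are hypothetical), the variances of seed mass and leaf area are not comparable, but their CVs are:
```{r cvComparison}
seed.mass <- c(1.2, 0.9, 1.5, 1.1, 1.3)  # grams
leaf.area <- c(210, 180, 260, 150, 240)  # square centimeters
# Dimensionless CVs can be compared across different units
sd(seed.mass) / mean(seed.mass)
sd(leaf.area) / mean(leaf.area)
```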
Frequently, tolerances or acceptable values of certain variables are expressed as proportions or percentages of the mean or average value. This makes sense because, usually, the needed precision in instruments, construction, machines, etc. increases as the units become smaller or lighter. For example, vehicle scales may have a precision of 2%. Accordingly, vehicles within 2% of the nominal value are allowable. It would not make sense to establish a single fixed tolerance for all vehicles. A range of 800 lb might make sense for loaded trucks up to 80,000 lb, but it would be ridiculous for a smart car that has a maximum weight of 2,000 lb. The coefficient of variation allows quick and meaningful comparisons of variability across vehicle categories.
The coefficient of variation is also useful for establishing standards for acceptable variability in lab results. A CV smaller than 10% is achievable and acceptable in indoor lab experiments with a high level of environmental control. However, such a level is very difficult to achieve in field experiments, where CVs of 20-40% are quite frequent.
**CV pitfalls**
The CV has issues with zero in a couple of ways, which makes sense because its denominator, the mean or average, can be very close to zero. First, the CV is only meaningful for random variables that are measured on *ratio* scales, like mass, length, area and volume. The ratio scale is characterized by having a well-defined zero, and because of this it makes sense to use ratios on those scales: to say that 2 cm is twice as long as 1 cm makes sense. Temperature in °C is an interval scale variable, not a ratio scale variable; it does not make sense to say that 10 °C is twice as hot as 5 °C, because 0 °C is an arbitrary zero.
Second, the CV is not very useful when the variable measured has a range of values close to zero, particularly when the mean or average is also close to zero. When the mean is close to zero, minor differences in the mean cause large changes in the CV.
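A tiny made-up example of this instability: two samples with exactly the same standard deviation but slightly different means near zero have wildly different CVs.
```{r cvNearZero}
x1 <- c(-0.2, 0.1, 0.3, -0.1, 0.2)
x2 <- x1 + 0.2          # same spread, mean shifted by 0.2
sd(x1) / mean(x1)       # large CV: mean is only 0.06
sd(x2) / mean(x2)       # much smaller CV: mean is 0.26
```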
## Exercises and Solutions
Wilder and Rypstra (2004) examined the effect of praying mantis excrement on the behavior of wolf spiders, testing whether cues from an introduced predator (the praying mantis) would change the movement rate of the native wolf spider. They put 15 wolf spiders in individual containers; inside each container there were two semicircles of filter paper. One semicircle was smeared with praying mantis excrement and the other was left without excrement. The researchers observed each spider for one hour and calculated the spider's mean walking speed while it moved across first the excrement semicircle and then the non-excrement semicircle (data were modified for the purpose of this exercise and are not the original true data).
1. Calculate the average speed and sample variance for each treatment
2. Calculate the difference in speed between treatments for each spider. Report the average difference.
3. Calculate the sample variance for the difference between treatments.
4. Calculate the estimated variance of the averages of difference between treatments.
5. What is the median, first and third quartile for difference between treatments?
6. What is the coefficient of variation for both treatments?
**Data**
```{r}
Spider <- 1:15
Nexc <- c(2.5, 5.5, 1.1, 2.7, 2.8, 1.6, 3.2, 4.5, 5, 6.9, 2.2, 3.9, 3.8, 3.5, 5.7)
Wexc <- c(0.4, 1.9, 1.2, 2.6, 4.3, 0.3, 1, 1.5, 3.3, 2.6, 0.7, 1.4, 2.1, 3.4, 2.3)
Diff <- c(2.1, 3.6, -0.1, 0.1, -1.5, 1.3, 2.2, 3, 1.7, 4.3, 1.5, 2.5, 1.7, 0.1, 3.4)
```
**Solutions**
1. Average speed, no excrement: 3.66 cm/s; variance: 2.63 (cm/s)^2^.
Average speed with excrement: 1.93 cm/s; variance: 1.36 (cm/s)^2^.
```{r}
Nomean<-mean(Nexc) # Mean value for no excrement treatment
Novar<-var(Nexc) # Variance for no excrement treatment
Wmean<-mean(Wexc) # Mean value for excrement treatment
Wvar<-var(Wexc) # Variance for excrement treatment
```
2. Average difference in speed between treatments: 1.73 cm/s
```{r}
DifAvg<-mean(Diff)
```
3. Sample variance for the difference between treatments: 2.48 (cm/s)^2^
```{r}
DifVar<-var(Diff)
```
4. Estimated variance of the average difference between treatments: 0.166 (cm/s)^2^
```{r}
Estvar = DifVar/length(Diff) # Variance of difference divided by the number of observations (n)
```
5. Median, first and third quartiles for the difference between treatments: 1.70, 0.70 and 2.75 cm/s
```{r}
sum<-fivenum(Diff) # Get five-number summary
```
6. CV for both treatments:
No excrement: 0.443
With excrement: 0.604
```{r}
CVNoexc <- sqrt(Novar) / Nomean # CV for no excrement treatment
CVWexc <- sqrt(Wvar) / Wmean    # CV for excrement treatment
```
<br><br>
Table: (\#tab:spider) Wolf spider walking speed with and without mantis excrement.

| Spider # | cm/sec, no excrement | cm/sec, with excrement | Difference |
|---------:|---------------------:|-----------------------:|-----------:|
| 1 | 2.5 | 0.4 | 2.1 |
| 2 | 5.5 | 1.9 | 3.6 |
| 3 | 1.1 | 1.2 | -0.1 |
| 4 | 2.7 | 2.6 | 0.1 |
| 5 | 2.8 | 4.3 | -1.5 |
| 6 | 1.6 | 0.3 | 1.3 |
| 7 | 3.2 | 1.0 | 2.2 |
| 8 | 4.5 | 1.5 | 3.0 |
| 9 | 5 | 3.3 | 1.7 |
| 10 | 6.9 | 2.6 | 4.3 |
| 11 | 2.2 | 0.7 | 1.5 |
| 12 | 3.9 | 1.4 | 2.5 |
| 13 | 3.8 | 2.1 | 1.7 |
| 14 | 3.5 | 3.4 | 0.1 |
| 15 | 5.7 | 2.3 | 3.4 |
<br><br>
## Homework
1. Samples are taken from a population of heifers with known mean and variance of weight gains equal to 80 kg and 324 kg^2^, respectively.
a. What are the mean and variance of the average weight gain of random samples of 20 heifers?
n = ?
Mean = ?
Variance = ?
b. What are the mean and variance of the average weight gain of random samples of 30 heifers?
n = ?
Mean = ?
Variance = ?
c. How many heifers have to be sampled for the variance of the average weight to equal 12.96 ?
2. The mean and standard deviation (kg/ha) of wheat yield in two different soils are given in the table below.

| Soil   | Mean | SD  |
|:-------|-----:|----:|
| soil A | 3000 | 450 |
| soil B | 4000 | 600 |
a. What are the mean and variance of the differences between sample averages (average of B - average of A) when independent samples of size 9 are taken from fields in soil A and samples of size 16 are taken from fields in soil B?
n (soil A) = ?
n (soil B) = ?
Mean = ?
Variance = ?
3. Heights of many individual plants were measured for plants of two varieties. Find summary statistics for both plant varieties.
| Variety 1 | Variety 2 |
|----------:|----------:|
| 14.9 | 3.1 |
| 9.5 | 11.8 |
| 10.6 | 15.4 |
| 19.8 | 13.3 |
| 20.2 | 8.2 |
| 14.8 | 18.3 |
| 16.8 | 20.7 |
| 14.6 | 6.6 |
| 11.2 | 8.5 |
| 18.3 | 9.8 |
| 20.9 | 9.1 |
| 13.3 | 18.1 |
| 18.4 | 28.0 |
| 17.7 | 7.4 |
| 16.5 | 12.9 |
| 16.7 | 15.7 |
| 8.3 | 8.7 |
| 10.1 | 9.0 |
| 16.7 | 12.6 |
| 16.3 | 13.6 |
| 14.9 | 12.2 |
| 12.6 | 19.2 |
| 12.4 | 9.9 |
| 12.5 | 8.1 |
| 11.7 | 13.4 |
| 16.3 | 23.3 |
| 14.9 | 19.7 |
| 9.5 | |
| 19.3 | |
| 14.9 | |
| 14.8 | |
| 7.9 | |
| 14.2 | |
| 9.8 | |
a. r = ?
mean = ?
sample variance = ?
q1 = ?
min = ?
median = ?
max = ?
q3 = ?
IQR = ?
k (number of classes) = ?
b. Create a boxplot for both plant varieties.
## Laboratory Exercises
### Plant Sciences Lab
Prepare an .Rmd document starting with the following text, where you substitute the corresponding information for author name and date.
```
---
title: "Lab01: Data Exploration and Summaries"
author: "YourFirstName YourLastName"
date: "today's date here"
output: html_document
---
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
#### Instructions
For this lab you will modify this file and submit it with the file name changed so that it has your email ID (the part before the @) in lower case instead of "email." Do not add spaces to the file name.
This is a markdown document. You will type your code and run it one line at a time as you add it in the lines indicated below. Add code **ONLY** in the areas between `{r}` and ` ``` `. These areas are highlighted in light grey. Run each line and experiment until you get the result you want. Keep the lines that worked and move on. At any time you can check whether your document "knits" by clicking on the Knit HTML icon at the top. Once you have completed all the work, knit your document and save the html file produced with the same file name but with an html extension (Lab01email.html).
**Submit BOTH files for your lab report using the appropriate Canvas tool**
For each part and question below, type your code in the grey area below, between the sets of back-ticks (```) to perform the desired computation and get output. Type your answers below the corresponding grey area. A completed example follows.
#### Example Question with Answer
Create a vector with 5 random numbers from the standard normal distribution. Call it "x." Calculate the average and variance of the values in x. Create a second 5-number vector using the same procedure. Call it "y." Calculate the sum of the elements in y. Make a nicely formatted table showing x and y as columns. Calculate x + y and then sum(x) + sum(y). Discuss the difference between the last two calculations.
```{r}
library(pander) # to do nice tables
x = rnorm(5)
mean(x)
var(x)
y = rnorm(5)
sum(y)
pander(data.frame(x = x, y = y))
x + y
sum(x) + sum(y)
```
`x + y` and `sum(x) + sum(y)` are totally different. The former results in a vector of five numbers, each number being the sum of the corresponding elements in x and y. The latter results in a single number that is the sum of the 5 numbers in x plus the 5 numbers in y. (Full credit!)
#### Part 1. Data input [10 points]
Load the iris data that comes with R into a data frame called myiris, and get its structure. Use help() to learn what the iris data are about. When were the data first published? Hint: you need help. [5 points]
```{r iris.info, echo = TRUE, include = TRUE}
myiris <- iris # put iris data into dataframe object or container called myiris
str(myiris)
myiris <- read.table(header = TRUE, text = "
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3 1.4 0.1 setosa
4.3 3 1.1 0.1 setosa
5.8 4 1.2 0.2 setosa
5.7 4.4 1.5 0.4 setosa
5.4 3.9 1.3 0.4 setosa
5.1 3.5 1.4 0.3 setosa
5.7 3.8 1.7 0.3 setosa
5.1 3.8 1.5 0.3 setosa
5.4 3.4 1.7 0.2 setosa
5.1 3.7 1.5 0.4 setosa
4.6 3.6 1 0.2 setosa
5.1 3.3 1.7 0.5 setosa
4.8 3.4 1.9 0.2 setosa
5 3 1.6 0.2 setosa
5 3.4 1.6 0.4 setosa
5.2 3.5 1.5 0.2 setosa
5.2 3.4 1.4 0.2 setosa
4.7 3.2 1.6 0.2 setosa
4.8 3.1 1.6 0.2 setosa
5.4 3.4 1.5 0.4 setosa
5.2 4.1 1.5 0.1 setosa
5.5 4.2 1.4 0.2 setosa
4.9 3.1 1.5 0.2 setosa
5 3.2 1.2 0.2 setosa
5.5 3.5 1.3 0.2 setosa
4.9 3.6 1.4 0.1 setosa
4.4 3 1.3 0.2 setosa
5.1 3.4 1.5 0.2 setosa
5 3.5 1.3 0.3 setosa
4.5 2.3 1.3 0.3 setosa
4.4 3.2 1.3 0.2 setosa
5 3.5 1.6 0.6 setosa
5.1 3.8 1.9 0.4 setosa
4.8 3 1.4 0.3 setosa
5.1 3.8 1.6 0.2 setosa
4.6 3.2 1.4 0.2 setosa
5.3 3.7 1.5 0.2 setosa
5 3.3 1.4 0.2 setosa
7 3.2 4.7 1.4 versicolor
6.4 3.2 4.5 1.5 versicolor
6.9 3.1 4.9 1.5 versicolor
5.5 2.3 4 1.3 versicolor
6.5 2.8 4.6 1.5 versicolor
5.7 2.8 4.5 1.3 versicolor
6.3 3.3 4.7 1.6 versicolor
4.9 2.4 3.3 1 versicolor
6.6 2.9 4.6 1.3 versicolor
5.2 2.7 3.9 1.4 versicolor
5 2 3.5 1 versicolor
5.9 3 4.2 1.5 versicolor
6 2.2 4 1 versicolor
6.1 2.9 4.7 1.4 versicolor
5.6 2.9 3.6 1.3 versicolor
6.7 3.1 4.4 1.4 versicolor
5.6 3 4.5 1.5 versicolor
5.8 2.7 4.1 1 versicolor
6.2 2.2 4.5 1.5 versicolor
5.6 2.5 3.9 1.1 versicolor
5.9 3.2 4.8 1.8 versicolor
6.1 2.8 4 1.3 versicolor
6.3 2.5 4.9 1.5 versicolor
6.1 2.8 4.7 1.2 versicolor
6.4 2.9 4.3 1.3 versicolor
6.6 3 4.4 1.4 versicolor
6.8 2.8 4.8 1.4 versicolor
6.7 3 5 1.7 versicolor
6 2.9 4.5 1.5 versicolor
5.7 2.6 3.5 1 versicolor
5.5 2.4 3.8 1.1 versicolor
5.5 2.4 3.7 1 versicolor
5.8 2.7 3.9 1.2 versicolor
6 2.7 5.1 1.6 versicolor
5.4 3 4.5 1.5 versicolor
6 3.4 4.5 1.6 versicolor
6.7 3.1 4.7 1.5 versicolor
6.3 2.3 4.4 1.3 versicolor
5.6 3 4.1 1.3 versicolor
5.5 2.5 4 1.3 versicolor
5.5 2.6 4.4 1.2 versicolor
6.1 3 4.6 1.4 versicolor
5.8 2.6 4 1.2 versicolor
5 2.3 3.3 1 versicolor
5.6 2.7 4.2 1.3 versicolor
5.7 3 4.2 1.2 versicolor
5.7 2.9 4.2 1.3 versicolor
6.2 2.9 4.3 1.3 versicolor
5.1 2.5 3 1.1 versicolor
5.7 2.8 4.1 1.3 versicolor
6.3 3.3 6 2.5 virginica
5.8 2.7 5.1 1.9 virginica
7.1 3 5.9 2.1 virginica
6.3 2.9 5.6 1.8 virginica
6.5 3 5.8 2.2 virginica
7.6 3 6.6 2.1 virginica
4.9 2.5 4.5 1.7 virginica
7.3 2.9 6.3 1.8 virginica
6.7 2.5 5.8 1.8 virginica
7.2 3.6 6.1 2.5 virginica
6.5 3.2 5.1 2 virginica
6.4 2.7 5.3 1.9 virginica
6.8 3 5.5 2.1 virginica
5.7 2.5 5 2 virginica
5.8 2.8 5.1 2.4 virginica
6.4 3.2 5.3 2.3 virginica
6.5 3 5.5 1.8 virginica
7.7 3.8 6.7 2.2 virginica
7.7 2.6 6.9 2.3 virginica
6 2.2 5 1.5 virginica
6.9 3.2 5.7 2.3 virginica
5.6 2.8 4.9 2 virginica
7.7 2.8 6.7 2 virginica
6.3 2.7 4.9 1.8 virginica
6.7 3.3 5.7 2.1 virginica
7.2 3.2 6 1.8 virginica
6.2 2.8 4.8 1.8 virginica
6.1 3 4.9 1.8 virginica
6.4 2.8 5.6 2.1 virginica
7.2 3 5.8 1.6 virginica
7.4 2.8 6.1 1.9 virginica
7.9 3.8 6.4 2 virginica
6.4 2.8 5.6 2.2 virginica
6.3 2.8 5.1 1.5 virginica
6.1 2.6 5.6 1.4 virginica
7.7 3 6.1 2.3 virginica
6.3 3.4 5.6 2.4 virginica
6.4 3.1 5.5 1.8 virginica
6 3 4.8 1.8 virginica
6.9 3.1 5.4 2.1 virginica
6.7 3.1 5.6 2.4 virginica
6.9 3.1 5.1 2.3 virginica
5.8 2.7 5.1 1.9 virginica
6.8 3.2 5.9 2.3 virginica
6.7 3.3 5.7 2.5 virginica
6.7 3 5.2 2.3 virginica
6.3 2.5 5 1.9 virginica
6.5 3 5.2 2 virginica
6.2 3.4 5.4 2.3 virginica
5.9 3 5.1 1.8 virginica
")
## add code to ask for information about dataframe object iris
help(iris)
```
When were the data first published? Your answer here:
#### Part 2. Coefficient of variation [15 points]
Do a web search and explain the use of the coefficient of variation and one of its main disadvantages. Hint: what happens when your mean is zero or close to zero? [15 points]
Your answer here:
#### Part 3. Five number summary table of iris data [15 points]
Make a data frame and a nicely formatted table containing the following statistics: mean, median, range, minimum, maximum, standard deviation, coefficient of variation and sample size for each of the measurements (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) in myiris data. What variable has the most variation relative to the average? [15 points]
```{r summary.table1, echo=FALSE}
mymeans <- sapply(X = myiris[ , 1:4], FUN = mean) # put means in a column
mymedians <- sapply(X = myiris[, 1:4], FUN = median)
# add formula to get the column with minima here
mymins <- sapply(X = myiris[, 1:4], FUN = min)
mymaxs <- sapply(X = myiris[, 1:4], FUN = max)
myranges <- mymaxs - mymins
# add formula to get standard deviation here
mysds <- sapply(X = myiris[, 1:4], FUN = sd)