---
#output:
# md_document:
# variant: markdown_github
output: pdf_document
---
## Fifth article: Review of getting subsets of a data frame, constructing data frames
### Alan E. Berger December 9, 2020
### version 1
### available at https://github.com/AlanBerger/Practice-programming-exercises-for-R
## Motivation
Data frames, which are analogous to Excel spreadsheets for which the entries in each column are of the same
"type" and each column has the same number of rows, are a fundamental way of handling data in R.
I'll first review various ways of extracting subsets of a data frame, and then review several ways to
construct a data frame from multiple vectors or from smaller data frames. This is a fairly long article -
it is not intended to be read all at one time.
R often has several equivalent ways of doing something. Perhaps this is because R was developed in
a collaborative fashion by several people, each with their own favorite ways of writing various programming constructs, so
they all got included. One can choose one's favorite ways of doing things, but one needs to be familiar with all
the commonly used constructs in order to be able to understand code written by others (and, importantly,
to understand code in instructions and examples for R packages one wants to use).
## Review of how to get specified subsets of a data frame: A. getting a single column
This material is taken/modified from a pinned post of mine
"Examples of extracting as a vector a single column from a data frame" in the Week 2 Discussion Forum for
the R Programming Course in the Johns Hopkins Data Science Specialization on Coursera.
If you are taking this course, then note in the Week 1 pinned posts in the Discussion Forum, that Leonard Greski has
written a very good more general article on getting row and column subsets from a data
frame "Forms of the Extract Operator in R" (this article also contains some more advanced material covered later
in the R course, so read what is relevant to where you are in the R Programming course and refer back later as you
learn more R); and one is also well advised to read Al Warren's pinned post in the Week 2 Discussion
Forum "Subsetting with bracket notation".
Getting (often referred to as *extracting*) a single column from a data frame is a common step in an R function,
and one usually will want to get the column in the form of a vector, not as a data frame with that one column.
Let's see how to get, as a vector, for example the sulfate column of a simple example data frame:
```
df <- data.frame(sulfate = c(4.79, 1.46, 4.28, NA), nitrate = c(0.299, NA, 4.280, 3.560))
df
sulfate nitrate
1 4.79 0.299
2 1.46 NA
3 4.28 4.280
4 NA 3.560
```
To get the sulfate column as a vector you can use any of the following 5 equivalent statements:
```
df[["sulfate"]] # double brackets
[1] 4.79 1.46 4.28 NA
# or
df$sulfate # note sulfate does not need to be in quotes for the $ form of extraction
# (but a text string with blanks in it would need to be)
[1] 4.79 1.46 4.28 NA
# or
df[, "sulfate"] # single brackets but note the comma so we get all the rows in the sulfate column
[1] 4.79 1.46 4.28 NA
# or one could use the column number
df[[1]]
[1] 4.79 1.46 4.28 NA
df[, 1]
[1] 4.79 1.46 4.28 NA
# Note that df["sulfate"] # single brackets, no comma,
# is a 1 column data frame containing the sulfate column; if you are getting
# 1 column from a data frame you will usually want it as a vector
df["sulfate"] # single brackets gives a data frame
sulfate
1 4.79
2 1.46
3 4.28
4 NA
class(df["sulfate"])
[1] "data.frame"
# Note for example the mean function "expects" a vector and
# will return NA and give a not very informative message if you
# "feed it" a data frame
mean(df["sulfate"])
[1] NA
Warning message:
In mean.default(df["sulfate"]) :
argument is not numeric or logical: returning NA
# If pollutant is an R variable containing the text string "sulfate"
# then these will work to extract the column as a vector
pollutant <- "sulfate"
df[[pollutant]]
[1] 4.79 1.46 4.28 NA
# or
df[, pollutant]
[1] 4.79 1.46 4.28 NA
# BUT NOT
df$pollutant
NULL
```
`df$pollutant` does NOT work since pollutant is NOT an actual column name; it is a variable *containing*
the text string "sulfate". The $ form of getting/extracting a column from a data frame as a vector uses the name typed
after the $ literally, rather than looking up the value of a variable (those are the R "rules" and we have to live with them).
And note R does NOT even warn you about this type of mistake - it just cheerfully gives back
NULL which can lead to v e r y mysterious bugs. Similarly, mistyping the name of a column in the
following example commands results in NULL (with NO warning): `df$sulfffate` and also `df[["sulffffate"]]`.
Programming requires very careful attention to details - one might be tempted to think R should be
able to "figure out" what you meant, but recall what type of mischief an "auto correct" in a word
processor or message app can create - and in a programming language you wouldn't even get to view
in real time what the "compiler" had done to your code. Better to know that if you program correctly
exactly what you want, some "gremlin" won't be changing it!
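
Since a mistyped column name silently gives NULL with `$` and `[[ ]]`, one way to protect oneself is to check the name before extracting the column. The following is only a minimal sketch: `get_column` is a hypothetical helper name made up for this illustration, not a function in base R.

```r
df <- data.frame(sulfate = c(4.79, 1.46, 4.28, NA), nitrate = c(0.299, NA, 4.280, 3.560))

# get_column is a hypothetical helper, not part of base R
get_column <- function(data, column.name) {
    if (!(column.name %in% names(data))) {
        stop("no column named '", column.name, "' in the data frame")
    }
    data[[column.name]]   # double brackets return the column as a vector
}

get_column(df, "sulfate")      # the sulfate column as a vector
is.null(df[["sulffffate"]])    # TRUE: the mistyped name silently gives NULL
# get_column(df, "sulffffate") would instead stop with an error message
```

With a check like this the mistake shows up where it was made, rather than later as a mysterious NULL.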
## Review of how to get specified subsets of a data frame: B. subsetting rows and/or columns
If v is a vector of row indices (that are in the range of the number of rows of the data frame df)
one can get the rows of df corresponding to v:
```
df <- data.frame(sulfate = c(4.79, 1.46, 4.28, NA), nitrate = c(0.299, NA, 4.280, 3.560))
df
sulfate nitrate
1 4.79 0.299
2 1.46 NA
3 4.28 4.280
4 NA 3.560
v <- c(1, 3, 2, 2, 2)
df[v, ]
sulfate nitrate
1 4.79 0.299
3 4.28 4.280
2 1.46 NA
2.1 1.46 NA
2.2 1.46 NA
# note reordering and repeats are allowed
# note the indication of repeats in the row numbers R generates (R "does not like"
# duplicate row names and so does modifications to make them unique)
```
Note R does not issue warnings or errors for row indices that are "out of range",
it just fills in NA's:
```
v <- c(1, 3, 2, 6, 2) # 6 is out of the range of the number of rows of df
df[v, ]
sulfate nitrate
1 4.79 0.299
3 4.28 4.280
2 1.46 NA
NA NA NA
2.1 1.46 NA
```
One can use negative row indices to **exclude** those rows:
```
df
sulfate nitrate
1 4.79 0.299
2 1.46 NA
3 4.28 4.280
4 NA 3.560
v <- c(-1, -3) # exclude rows 1, 3 and keep the rest
df[v, ]
sulfate nitrate
2 1.46 NA
4 NA 3.56
```
One can also specify desired columns (and repeats of columns):
```
v <- c(1, 3, 2, 2, 2)
# if we just want the first column (sulfate) with these rows
# we can do
df[v, 1]
[1] 4.79 4.28 1.46 1.46 1.46
# or
df[v, "sulfate"]
[1] 4.79 4.28 1.46 1.46 1.46
# or
df[v, ]$sulfate
[1] 4.79 4.28 1.46 1.46 1.46
# we can also specify several columns
w <- c(1, 2, 1, 2)
df[v, w]
sulfate nitrate sulfate.1 nitrate.1
1 4.79 0.299 4.79 0.299
3 4.28 4.280 4.28 4.280
2 1.46 NA 1.46 NA
2.1 1.46 NA 1.46 NA
2.2 1.46 NA 1.46 NA
```
Note R "does not like" duplicate column names and so does modifications to make them unique.
Also one can use a logical vector V having the same number of entries as df has rows; rows where the
corresponding entry of V is TRUE are kept, rows where the corresponding entry of V is FALSE are
not kept.
```
df
sulfate nitrate
1 4.79 0.299
2 1.46 NA
3 4.28 4.280
4 NA 3.560
logicalVector <- c(T, F, F, T)
df[logicalVector, ]
sulfate nitrate
1 4.79 0.299
4 NA 3.560
```
## The **which** function
If one has a vector V, one can ask for which entries of V some logical condition is TRUE; R's **which** function
does this: the conceptual description is
`which(some logical condition on each entry of V)`
returns the vector of the **indices** of V for which the condition is TRUE.
(Any entries of V that are NA will be considered to yield FALSE, so those indices of V will **not** be included
in the result.) If there are no indices for which the condition is TRUE, the which function returns an empty
integer vector (`integer(0)`).
For example:
```
df
sulfate nitrate
1 4.79 0.299
2 1.46 NA
3 4.28 4.280
4 NA 3.560
result <- which(df[["sulfate"]] > 2)
result
[1] 1 3
# any entries of V that are NA are considered to yield FALSE
# one can then use result to get the rows of df for which the condition was TRUE
df[result, ]
sulfate nitrate
1 4.79 0.299
3 4.28 4.280
# Note the result of having an NA involved in the following:
df # repeating what df is
sulfate nitrate
1 4.79 0.299
2 1.46 NA
3 4.28 4.280
4 NA 3.560
V <- df[["sulfate"]] > 2 # a logical vector with a value for each row of df
V
[1] TRUE FALSE TRUE NA
df[V, ] # keeps rows of the data frame where V is TRUE, but note the effect of the NA
sulfate nitrate
1 4.79 0.299
3 4.28 4.280
NA NA NA
```
I like the conceptual viewpoint of the which function. Apparently it is rather universal in
that, for example, **IDL** has the corresponding function called **where** and **MATLAB** has a
corresponding function called **find**.
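
The difference between indexing rows with the result of which and indexing with the logical vector itself, when an NA is involved, can be seen side by side in the following small sketch. The variable names are made up for this illustration.

```r
df <- data.frame(sulfate = c(4.79, 1.46, 4.28, NA), nitrate = c(0.299, NA, 4.280, 3.560))
V <- df[["sulfate"]] > 2      # TRUE FALSE TRUE NA

rows.from.which <- df[which(V), ]          # which drops the NA, so 2 rows
rows.from.logical <- df[V, ]               # the NA in V adds a row of NA's, so 3 rows
rows.na.as.false <- df[V & !is.na(V), ]    # treat NA as FALSE explicitly, so 2 rows

nrow(rows.from.which)
nrow(rows.from.logical)
nrow(rows.na.as.false)
```

Either `which(V)` or `V & !is.na(V)` avoids the extra row of NA's; which one to use is a matter of taste.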
## The **%in%** function
The **%in%** function addresses the question of whether or not each entry of some vector v occurs in another vector w.
It returns a logical vector z with z[k] being TRUE if v[k] is equal to some entry in w, and z[k] being FALSE if
v[k] is not equal to any entry in w. This can be used to obtain a logical vector for use in selecting
rows of a data frame.
An example of how %in% behaves:
```
?"%in%" # look at the help on %in% Note because of the special
# character % one needs to "protect" %in% by enclosing it in
# either quotes or apostrophes
v <- c(1, 2, 3, 4, 5, NA)
w <- c(12, 3, 8, 22, 4)
v %in% w
[1] FALSE FALSE TRUE TRUE FALSE FALSE
v <- c(1, 2, 3, 4, 5, NA)
w <- c(12, 3, 8, 22, 4, NA)
v %in% w
[1] FALSE FALSE TRUE TRUE FALSE TRUE
# note %in% will declare a match for an NA in v if there is an NA in w
```
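
As a small sketch of using %in% to select rows of a data frame: here the `id` column and the vector `wanted` are made up for this illustration.

```r
df <- data.frame(id = c("a", "b", "c", "d"),
                 sulfate = c(4.79, 1.46, 4.28, NA),
                 stringsAsFactors = FALSE)
wanted <- c("d", "b", "z")          # "z" does not occur in the id column
keep <- df[["id"]] %in% wanted      # FALSE TRUE FALSE TRUE
df[keep, ]                          # the rows with id b and d, in the order they occur in df
```

Note the selected rows come out in the order they have in df, not in the order of the entries of `wanted`, and an entry of `wanted` that matches nothing is simply ignored.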
## Creating data frames
### Reading in a data file as a data frame
As noted above, data frames are a fundamental way that R handles data. Many data files that are text files (as opposed
to binary files) are naturally suitable for reading in as a data frame using for example **read.csv**
(where the column separator (delimiter) is a comma), or
more generally **read.table**. The options for read.table also apply to read.csv, but note some of the important
default choices are different, in particular the default for "telling R" whether there is a header line containing
column names is **header = TRUE** for read.csv and **header = FALSE** for read.table, and the default column separator
for read.csv is `sep = ","` while
for read.table one should usually specify it since the default is "white space"; a common column separator other
than comma is a tab, which is specified by `sep = "\t"` (the backslash "tells" R to interpret the t in a special way).
To start with, when learning R, there are 2 other options one should be aware of: **stringsAsFactors = FALSE** "tells" R to
read in character columns as character data, not as factors. Reading in character columns as factors was the default before
R version 4.0.0; starting with R 4.0.0 the default is stringsAsFactors = FALSE (unless a column is to be used
as a factor in a statistical analysis it is likely better to have it read in as character data).
The second option one should be aware of when starting to learn read.csv and read.table is **na.strings**. This option lets
one specify the character string (or several character strings) that should be interpreted as NA (the default is `"NA"`).
For example `na.strings = c("NA", "data is missing", "not available", "the experimenter dropped the sample", "the experimenter was texting when
the data should have been measured")`. If there is character data other than NA signifying missing data in a column that
should be read in as numeric data, and R is not "informed" about this, then R will read in that column as factor or
character data, which can lead to "issues" that are best avoided by properly reading in the data.
Another option to be aware of if one is dealing with a file that has "non-standard for R" column
names is **check.names**. If it is set equal to TRUE (the default), then R will modify the column names it reads in to conform with what
R considers standard. That means blank spaces and other characters that are not letters, digits, periods or underscores will get replaced by a period.
I find this rather annoying since I like to use long descriptive column headers in files I create, and as long as I am not
having R use the column names other than to write them back out after I have done some analysis on the data, it is OK to
instruct R to leave the column names alone (by setting check.names = FALSE).
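
The options discussed above can be tried out without creating a file, since read.csv (via read.table) accepts a `text` argument in place of a file name. The following is a minimal sketch; the contents of `csv.text` are made up for this illustration.

```r
# csv.text stands in for the contents of a small comma separated file
csv.text <- "sample id,sulfate,nitrate
s1,4.79,0.299
s2,not available,NA
s3,4.28,4.280"

dat <- read.csv(text = csv.text, header = TRUE, stringsAsFactors = FALSE,
                na.strings = c("NA", "not available"), check.names = FALSE)
colnames(dat)              # "sample id" is kept as is since check.names = FALSE
class(dat[["sulfate"]])    # "numeric" since "not available" was read in as NA
is.na(dat[["sulfate"]])    # FALSE TRUE FALSE
```

Leaving "not available" out of na.strings in this sketch would cause the sulfate column to be read in as character (or factor) data rather than numeric data.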
## Creating a data frame from smaller data frames and/or vectors and matrices: A. the **data.frame** function
Looking over the R help on **data.frame**, one sees that it can combine objects that are or can be converted to be
data frames into one combined data frame. As with read.csv
and read.table, one may well want to use the option **stringsAsFactors = FALSE**
(unless one needs to have one or more factor columns). (The data frame function will do *recycling* on rows
but I would recommend having the number of rows in objects being combined into a data frame all be the same.)
Note the R **rep** (replicate) function can be used to replicate "patterns", for example:
```
rep(c(1,2), times = 4) # repeat the pattern 4 times
[1] 1 2 1 2 1 2 1 2
# rep can also be used this way:
rep(c(1,2), each = 4)
[1] 1 1 1 1 2 2 2 2
```
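
To illustrate the recycling mentioned above, here is a small sketch comparing a column that data.frame recycles on its own with the same column written out explicitly using rep. The names `implicit` and `explicit` are made up for this illustration.

```r
implicit <- data.frame(x = 1:4, y = c(1, 2))                  # y is recycled to 1 2 1 2
explicit <- data.frame(x = 1:4, y = rep(c(1, 2), times = 2))  # the same column, stated explicitly

implicit[["y"]]
explicit[["y"]]
```

Writing the rep call out makes the intended number of rows visible in the code, which is in the spirit of the recommendation above.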
## Some examples with the data.frame function:
```
df1 <- data.frame(sulfate = c(4.79, 1.46, 4.28, NA), nitrate = c(0.299, NA, 4.280, 3.560))
df1 # our continuing example data frame
sulfate nitrate
1 4.79 0.299
2 1.46 NA
3 4.28 4.280
4 NA 3.560
v1 <- c(1,2,3,4)
v2 <- c(TRUE, FALSE, TRUE, TRUE)
v3 <- c("a", "b", "c", "d")
m1 <- matrix(1:12, nrow = 4, ncol = 3)
m1
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
df <- data.frame(df1, v1, v2, m1, v3, stringsAsFactors = FALSE)
df
sulfate nitrate v1 v2 X1 X2 X3 v3
1 4.79 0.299 1 TRUE 1 5 9 a
2 1.46 NA 2 FALSE 2 6 10 b
3 4.28 4.280 3 TRUE 3 7 11 c
4 NA 3.560 4 TRUE 4 8 12 d
# The row names of df are the row names of the first argument of data.frame, i.e.,
# the first of the objects being combined into one data frame
# If one wants to change some column names one could do, for example,
colnames(df)[c(5, 6, 7)] <- c("m1", "m2", "m3")
df
sulfate nitrate v1 v2 m1 m2 m3 v3
1 4.79 0.299 1 TRUE 1 5 9 a
2 1.46 NA 2 FALSE 2 6 10 b
3 4.28 4.280 3 TRUE 3 7 11 c
4 NA 3.560 4 TRUE 4 8 12 d
# and similarly with row names
rownames(df) <- c("r1", "r2", "r3", "r4")
df
sulfate nitrate v1 v2 m1 m2 m3 v3
r1 4.79 0.299 1 TRUE 1 5 9 a
r2 1.46 NA 2 FALSE 2 6 10 b
r3 4.28 4.280 3 TRUE 3 7 11 c
r4 NA 3.560 4 TRUE 4 8 12 d
```
## Creating a data frame from smaller data frames and/or vectors and matrices: B. **rbind**
If one has several data frames that have the same number of columns AND the same column names,
then one can "stack them vertically" using the **rbind** (row bind) function. For example with 2 data frames
df1 and df2:
```
df1 <- data.frame(sulfate = c(4.79, 1.46, 4.28, NA), nitrate = c(0.299, NA, 4.280, 3.560))
df1
sulfate nitrate
1 4.79 0.299
2 1.46 NA
3 4.28 4.280
4 NA 3.560
df2 <- data.frame(sulfate = c(24.79, 21.46, 24.28, NA), nitrate = c(2.299, NA, 2.280, 2.560))
df2
sulfate nitrate
1 24.79 2.299
2 21.46 NA
3 24.28 2.280
4 NA 2.560
# then one can do
df <- rbind(df1, df2)
df
sulfate nitrate
1 4.79 0.299
2 1.46 NA
3 4.28 4.280
4 NA 3.560
5 24.79 2.299
6 21.46 NA
7 24.28 2.280
8 NA 2.560
# The column names for all the items being "rbinded" must be the same
# (except for an important exception described below).
# One will get an error message if the column names don't match: for example below I have
# the column names in df2 not matching those in df1
colnames(df2)[2] <- "another.name"
df2
sulfate another.name
1 24.79 2.299
2 21.46 NA
3 24.28 2.280
4 NA 2.560
df <- rbind(df1, df2) # gets error message
Error in match.names(clabs, names(xi)) :
names do not match previous names
```
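
If the columns of the two data frames really do correspond and only the names differ, one possible remedy is to make the column names agree before calling rbind. This is only a sketch, using shortened versions of df1 and df2.

```r
df1 <- data.frame(sulfate = c(4.79, 1.46), nitrate = c(0.299, NA))
df2 <- data.frame(sulfate = c(24.79, 21.46), another.name = c(2.299, NA))

identical(colnames(df1), colnames(df2))   # FALSE, so rbind(df1, df2) would give an error

colnames(df2) <- colnames(df1)            # assumes the columns really do correspond, in order
df <- rbind(df1, df2)
dim(df)                                   # 4 rows, 2 columns
```

Checking the column names with identical before binding is a cheap way to catch this kind of mismatch early.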
## Creating a data frame from smaller data frames and/or vectors and matrices: C. Using rbind in a loop
In some circumstances one might be reading in or constructing a succession of data frames, each with
the same number of columns and the same column names, and want to combine them vertically.
One can do this in a loop if one initializes an empty data frame via `df <- data.frame()`.
One can rbind any data frame to this empty data frame; this is the exception to the rule
on having the same number of columns and the same column names, so this "conceptual" for loop will work:
```
df <- data.frame() # initialize an empty data frame
for (i in some.set) {   # some.set is a placeholder for whatever is being looped over
    # read in or derive a data frame dfi corresponding to i (each dfi must have the same
    # number of columns and the same column names)
    df <- rbind(df, dfi)
}
# after this loop the data frame df will consist of all the data frames dfi stacked vertically
```
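
A runnable version of the conceptual loop above might look like the following sketch. Here `list.of.dfs` is a made up stand-in for data frames that would in practice be read in or derived inside the loop.

```r
# list.of.dfs is a made up stand-in for data frames that would be read in or derived
list.of.dfs <- list(
    data.frame(sulfate = c(4.79, 1.46), nitrate = c(0.299, NA)),
    data.frame(sulfate = c(24.79, 21.46), nitrate = c(2.299, NA)),
    data.frame(sulfate = 4.28, nitrate = 4.280)
)

df <- data.frame()                 # initialize an empty data frame
for (i in seq_along(list.of.dfs)) {
    dfi <- list.of.dfs[[i]]        # in practice: read in or derive dfi here
    df <- rbind(df, dfi)
}
dim(df)                            # 5 rows, 2 columns

# when the data frames are already in a list, this stacks them without a loop
df.all <- do.call(rbind, list.of.dfs)
```

The `do.call(rbind, ...)` form is an alternative worth knowing about when all the pieces are already available in a list.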
## Creating a data frame from smaller data frames and/or vectors and matrices: D. the **cbind** function
The **cbind** (column bind) function can combine data frames or combine objects that are or can be converted to be
data frames. For data frames, cbind is the same as data.frame except that the default for cbind is `check.names = FALSE`.
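
The effect of the different check.names defaults can be seen in the following small sketch. The `site` vector and the column name "site name" are made up for this illustration.

```r
df1 <- data.frame(sulfate = c(4.79, 1.46), nitrate = c(0.299, NA))
site <- c("north", "south")      # made up values for illustration

with.cbind <- cbind(df1, "site name" = site)
with.data.frame <- data.frame(df1, "site name" = site)

colnames(with.cbind)             # third column name is left as "site name"
colnames(with.data.frame)        # third column name is changed to "site.name"
```
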
Hope this review was informative.
The next set of exercises will get into practicing using and creating data frames.
= = = = = = = = = = = = = = = = = = = = = = = =
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/ or send a letter
to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA. There is a full version of this license at this web site:
https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode
Some of the material above (Review of how to get specified subsets of a data frame: getting a single column) was
taken/modified from a post of mine in the Discussion Forum for the R Programming Course in
the Johns Hopkins Data Science Specialization on Coursera, as noted above.
As such Coursera and Coursera authorized Partners retain additional rights to that material as described in
their "Terms of Use" https://www.coursera.org/about/terms
Note the reader should not infer any endorsement or recommendation or approval for the material in this article from
any of the sources or persons cited above or any other entities mentioned in this article.