forked from clauswilke/dataviz
```{r echo = FALSE, message = FALSE}
# run setup script
source("_common.R")
library(lubridate)
library(forcats)
library(tidyr)
library(ggrepel)
```
# Coordinate systems and axes {#coordinate-systems-axes}
To make any sort of data visualization, we need to define position scales, which determine where in a graphic different data values are located. We cannot visualize data without placing different data points at different locations, even if we just arrange them next to each other along a line. For regular 2d visualizations, two numbers are required to uniquely specify a point, and therefore we need two position scales. These two scales are usually but not necessarily the *x* and *y* axes of the plot. We also have to specify the relative geometric arrangement of these scales. Conventionally, the *x* axis runs horizontally and the *y* axis vertically, but we could choose other arrangements. For example, we could have the *y* axis run at an acute angle relative to the *x* axis, or we could have one axis run in a circle and the other run radially. The combination of a set of position scales and their relative geometric arrangement is called a *coordinate system.*
## Cartesian coordinates
The most widely used coordinate system for data visualization is the 2d *Cartesian coordinate system*, where each location is uniquely specified by an *x* and a *y* value. The *x* and *y* axes run orthogonally to each other, and data values are placed in an even spacing along both axes (Figure \@ref(fig:cartesian-coord)). The two axes are continuous position scales, and they can represent both positive and negative real numbers. To fully specify the coordinate system, we need to specify the range of numbers each axis covers. In Figure \@ref(fig:cartesian-coord), the *x* axis runs from -2.2 to 3.2 and the *y* axis runs from -2.2 to 2.2. Any data values between these axis limits are placed at the respective location in the plot. Any data values outside the axis limits are discarded.
(ref:cartesian-coord) Standard Cartesian coordinate system. The horizontal axis is conventionally called *x* and the vertical axis *y*. The two axes form a grid with equidistant spacing. Here, both the *x* and the *y* grid lines are separated by units of one. The point (2, 1) is located two *x* units to the right and one *y* unit above the origin (0, 0). The point (-1, -1) is located one *x* unit to the left and one *y* unit below the origin.
```{r cartesian-coord, fig.asp = 0.8, fig.cap = '(ref:cartesian-coord)'}
df_points <- data.frame(x = c(-1, 0, 2),
                        y = c(-1, 0, 1),
                        label = c("(-1, -1)", "(0, 0)", "(2, 1)"),
                        vjust = c(1.4, -.8, -.8),
                        hjust = c(1.1, 1.1, -.1))

df_segments <- data.frame(x0 = c(0, 2, 0, -1),
                          x1 = c(2, 2, -1, -1),
                          y0 = c(1, 0, -1, 0),
                          y1 = c(1, 1, -1, -1))

df_labels <- data.frame(x = c(-1, -.5, 1, 2),
                        y = c(-.5, -1, 1, 0.5),
                        vjust = c(.5, 1.3, -.3, .5),
                        hjust = c(1.1, .5, .5, -.1),
                        label = c("y = -1", "x = -1", "x = 2", "y = 1"))

ggplot(df_points, aes(x, y)) +
  geom_hline(yintercept = 0, color = "gray60") +
  geom_vline(xintercept = 0, color = "gray60") +
  geom_segment(data = df_segments, aes(x = x0, xend = x1, y = y0, yend = y1),
               linetype = 2) +
  geom_point(size = 3, color = "#0072B2") +
  geom_text(aes(label = label, vjust = vjust, hjust = hjust), size = 12/.pt) +
  geom_text(data = df_labels, aes(label = label, hjust = hjust, vjust = vjust), size = 12/.pt) +
  coord_fixed(xlim = c(-2.2, 3.2), ylim = c(-2.2, 2.2), expand = FALSE) +
  xlab("x axis") +
  ylab("y axis") +
  theme_minimal_grid() +
  theme(axis.ticks.length = grid::unit(0, "pt"))
```
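
To make the role of the axis limits concrete, here is a minimal base-R sketch of a linear position scale. The helper `position_on_axis()` is hypothetical and not part of any package. It maps a data value to a relative position between 0 and 1 along the axis and returns `NA` for values outside the limits, which is one simple way of expressing that such values are discarded.

```r
# Hypothetical helper: map data values to relative positions (0 to 1)
# along a linear axis; values outside the limits are dropped (NA).
position_on_axis <- function(value, limits) {
  pos <- (value - limits[1]) / (limits[2] - limits[1])
  pos[value < limits[1] | value > limits[2]] <- NA
  pos
}

# x axis limits used in the Cartesian coordinate figure
x_limits <- c(-2.2, 3.2)

# The last value (5) is an invented example that lies outside the limits
x_positions <- position_on_axis(c(-1, 0, 2, 5), x_limits)
print(x_positions)
```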
Data values usually aren't just numbers, however. They come with units. For example, if we're measuring temperature, the values may be measured in degrees Celsius or Fahrenheit. Similarly, if we're measuring distance, the values may be measured in kilometers or miles, and if we're measuring duration, the values may be measured in minutes, hours, or days. In a Cartesian coordinate system, the spacing between grid lines along an axis corresponds to discrete steps in these data units. In a temperature scale, for example, we may have a grid line every 10 degrees Fahrenheit, and in a distance scale, we may have a grid line every 5 kilometers.
A Cartesian coordinate system can have two axes representing two different units. This situation arises quite commonly whenever we're mapping two different types of variables to *x* and *y*. For example, in Figure \@ref(fig:temp-normals-vs-time), we plotted temperature vs. days of the year. The *y* axis of Figure \@ref(fig:temp-normals-vs-time) is measured in degrees Fahrenheit, with a grid line every 20 degrees, and the *x* axis is measured in months, with a grid line at the first of every third month. Whenever the two axes are measured in different units, we can stretch or compress one relative to the other and maintain a valid visualization of the data (Figure \@ref(fig:temperature-normals-Houston)). Which version is preferable may depend on the story we want to convey. A tall and narrow figure emphasizes change along the *y* axis and a short and wide figure does the opposite. Ideally, we want to choose an aspect ratio that provides a clear and accurate presentation of the data.
(ref:temperature-normals-Houston) Daily temperature normals for Houston, TX. Temperature is mapped to the *y* axis and day of the year to the *x* axis. Parts (a), (b), and (c) show the same figure in different aspect ratios. All three parts are valid visualizations of the temperature data. Data source: NOAA.
```{r temperature-normals-Houston, fig.width = 7.5, fig.cap = '(ref:temperature-normals-Houston)'}
temps_wide <- filter(ncdc_normals,
                station_id %in% c(
                  "USW00014819", # Chicago, IL 60638
                  "USC00516128", # Honolulu, HI 96813
                  "USW00027502", # Barrow, AK 99723, coldest point in the US
                  "USC00042319", # Death Valley, CA 92328 hottest point in the US
                  "USW00093107", # San Diego, CA 92145
                  "USW00012918", # Houston, TX 77061
                  "USC00427606" # Salt Lake City, UT 84103
                )) %>%
  mutate(location = fct_recode(factor(station_id),
                               "Chicago" = "USW00014819",
                               "Honolulu" = "USC00516128",
                               "Barrow, AK" = "USW00027502",
                               "Death Valley" = "USC00042319",
                               "San Diego" = "USW00093107",
                               "Houston" = "USW00012918",
                               "Salt Lake City, UT" = "USC00427606")) %>%
  select(-station_id, -flag) %>%
  spread(location, temperature) %>%
  arrange(date)

temps_wide_label <- mutate(temps_wide,
  label = ifelse(date %in% c(ymd("0000-01-01"), ymd("0000-04-01"),
                             ymd("0000-07-01"), ymd("0000-10-01")),
                 format(date, "%b 1st"), ""))

temp_plot <- ggplot(temps_wide_label, aes(x = date, y = `Houston`)) +
  geom_line(size = 1, color = "#0072B2") +
  scale_x_date(name = "month", limits = c(ymd("0000-01-01"), ymd("0001-01-03")),
               breaks = c(ymd("0000-01-01"), ymd("0000-04-01"), ymd("0000-07-01"),
                          ymd("0000-10-01"), ymd("0001-01-01")),
               labels = c("Jan", "Apr", "Jul", "Oct", "Jan"), expand = c(2/366, 0)) +
  scale_y_continuous(limits = c(50, 90),
                     name = "temperature (°F)") +
  theme_minimal_grid()

plot_grid(plot_grid(temp_plot, temp_plot, rel_widths = c(1, 2), labels = "auto"),
          temp_plot, rel_heights = c(1.5, 1), labels = c("", "c"), label_y = c(1, 1.15), ncol = 1)
```
If on the other hand the *x* and the *y* axes are measured in the same units, then the grid spacings for the two axes should be equal, such that the same distance along the *x* or *y* axis corresponds to the same number of data units. As an example, we can plot the temperature in Houston, TX against the temperature in San Diego, CA, for every day of the year (Figure \@ref(fig:temperature-normals-Houston-San-Diego)a). Since the same quantity is plotted along both axes, we need to make sure that the grid lines form perfect squares, as is the case in Figure \@ref(fig:temperature-normals-Houston-San-Diego).
(ref:temperature-normals-Houston-San-Diego) Daily temperature normals for Houston, TX, plotted versus the respective temperature normals of San Diego, CA. The first days of the months January, April, July, and October are highlighted to provide a temporal reference. (a) Temperatures are shown in degrees Fahrenheit. (b) Temperatures are shown in degrees Celsius. Data source: NOAA.
```{r temperature-normals-Houston-San-Diego, fig.width = 8.5, fig.asp = 0.5, fig.cap = '(ref:temperature-normals-Houston-San-Diego)'}
tempsplot_F <- ggplot(temps_wide_label, aes(x = `San Diego`, y = `Houston`)) +
  geom_path(size = 1, color = "#0072B2") +
  geom_text_repel(aes(label = label), point.padding = .4, color = "black",
                  min.segment.length = 0) +
  coord_fixed(xlim = c(45, 85), ylim = c(48, 88),
              expand = FALSE) +
  scale_color_continuous_qualitative(guide = "none") +
  scale_x_continuous(breaks = c(10*(5:8))) +
  xlab("temperature in San Diego (°F)") +
  ylab("temperature in Houston (°F)") +
  theme_minimal_grid()

# Fahrenheit to Celsius conversion
F2C <- function(t) {(t-32)*5/9}

# same plot as above, with data and axis limits converted to Celsius
tempsplot_C <- ggplot(temps_wide_label, aes(x = F2C(`San Diego`), y = F2C(`Houston`))) +
  geom_path(size = 1, color = "#0072B2") +
  geom_text_repel(aes(label = label), point.padding = .4, color = "black",
                  min.segment.length = 0) +
  coord_fixed(xlim = F2C(c(45, 85)), ylim = F2C(c(48, 88)),
              expand = FALSE) +
  scale_color_continuous_qualitative(guide = "none") +
  scale_x_continuous(breaks = c(5*(2:6))) +
  xlab("temperature in San Diego (°C)") +
  ylab("temperature in Houston (°C)") +
  theme_minimal_grid()

plot_grid(tempsplot_F, tempsplot_C, labels = "auto")
```
You may wonder what happens if you change the units of your data. After all, units are arbitrary, and your preferences might be different from somebody else's. A change in units is a linear transformation, where we add or subtract a number to or from all data values and/or multiply all data values by another number. Fortunately, Cartesian coordinate systems are invariant under such linear transformations. Therefore, you can change the units of your data and the resulting figure will not change as long as you change the axes accordingly. As an example, compare Figures \@ref(fig:temperature-normals-Houston-San-Diego)a and \@ref(fig:temperature-normals-Houston-San-Diego)b. Both show the same data, but in part (a) the temperature units are degrees Fahrenheit and in part (b) they are degrees Celsius. Even though the grid lines are in different locations and the numbers along the axes are different, the two data visualizations look exactly the same.
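
The invariance under a change of units can be checked numerically. The following base-R sketch assumes a hypothetical helper, `relative_position()`, that returns where a value falls between the two axis limits. The example temperatures are invented for illustration. If both the data and the axis limits are converted from Fahrenheit to Celsius, the relative positions should not change.

```r
# Fahrenheit to Celsius conversion (a linear transformation)
f_to_c <- function(t) (t - 32) * 5 / 9

# Hypothetical helper: relative position (0 to 1) of values between axis limits
relative_position <- function(value, limits) {
  (value - limits[1]) / (limits[2] - limits[1])
}

temps_f  <- c(50, 60, 70, 80)  # example temperatures in °F (made up)
limits_f <- c(45, 85)          # axis limits in °F

pos_f <- relative_position(temps_f, limits_f)
pos_c <- relative_position(f_to_c(temps_f), f_to_c(limits_f))

# The positions agree up to floating-point error
print(all.equal(pos_f, pos_c))
```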
## Nonlinear axes
In a Cartesian coordinate system, the grid lines along an axis are spaced evenly both in data units and in the resulting visualization. We refer to the position scales in these coordinate systems as *linear*. While linear scales generally provide an accurate representation of the data, there are scenarios where nonlinear scales are preferred. In a nonlinear scale, even spacing in data units corresponds to uneven spacing in the visualization, or conversely even spacing in the visualization corresponds to uneven spacing in data units.
The most commonly used nonlinear scale is the *logarithmic scale* or *log scale* for short. Log scales are linear in multiplication, such that a unit step on the scale corresponds to multiplication by a fixed value. To create a log scale, we need to log-transform the data values while exponentiating the numbers that are shown along the axis grid lines. This process is demonstrated in Figure \@ref(fig:linear-log-scales), which shows the numbers 1, 3.16, 10, 31.6, and 100 placed on linear and log scales. The numbers 3.16 and 31.6 may seem a strange choice, but they were chosen because they are exactly half-way between 1 and 10 and between 10 and 100 on a log scale. We can see this by observing that $10^{0.5} = \sqrt{10} \approx 3.16$ and equivalently $3.16 \times 3.16 \approx 10$. Similarly, $10^{1.5} = 10 \times \sqrt{10} \approx 31.6$.
(ref:linear-log-scales) Relationship between linear and logarithmic scales. The dots correspond to data values 1, 3.16, 10, 31.6, 100, which are evenly-spaced numbers on a logarithmic scale. We can display these data points on a linear scale, we can log-transform them and then show them on a linear scale, or we can show them on a logarithmic scale. Importantly, the correct axis title for a logarithmic scale is the name of the variable shown, not the logarithm of that variable.
```{r linear-log-scales, fig.width = 7.5, fig.cap = '(ref:linear-log-scales)'}
df <- data.frame(x = c(1, 3.16, 10, 31.6, 100))

xaxis_lin <- ggplot(df, aes(x, y = 1)) +
  geom_point(size = 3, color = "#0072B2") +
  scale_y_continuous(limits = c(0.8, 1.2), expand = c(0, 0), breaks = 1) +
  theme_minimal_grid() +
  theme(axis.ticks.length = grid::unit(0, "pt"),
        axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.title = element_text(face = "plain"),
        plot.margin = margin(3.5, 20, 3.5, 20))

xaxis_log <- ggplot(df, aes(log10(x), y = 1)) +
  geom_point(size = 3, color = "#0072B2") +
  scale_y_continuous(limits = c(0.8, 1.2), expand = c(0, 0), breaks = 1) +
  theme_minimal_grid() +
  theme(axis.ticks.length = grid::unit(0, "pt"),
        axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.title = element_text(face = "plain"),
        plot.margin = margin(3.5, 20, 3.5, 20))

plotlist <-
  align_plots(xaxis_lin + scale_x_continuous(limits = c(0, 100)) +
                ggtitle("original data, linear scale"),
              xaxis_log + scale_x_continuous(limits = c(0, 2)) +
                xlab(expression(paste("log"["10"], "(x)"))) +
                ggtitle("log-transformed data, linear scale"),
              xaxis_lin + scale_x_log10(limits = c(1, 100), breaks = c(1, 3.16, 10, 31.6, 100),
                                        labels = c("1", "3.16", "10", "31.6", "100")) +
                ggtitle("original data, logarithmic scale"),
              xaxis_lin + scale_x_log10(limits = c(1, 100), breaks = c(1, 3.16, 10, 31.6, 100),
                                        labels = c("1", "3.16", "10", "31.6", "100")) +
                xlab(expression(paste("log"["10"], "(x)"))) +
                ggtitle("logarithmic scale with incorrect axis title"),
              align = 'vh')

plot_grid(plotlist[[1]], plotlist[[2]], plotlist[[3]], stamp_bad(plotlist[[4]]), ncol = 1)
```
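
The even spacing of these five values on a log scale can be verified with a few lines of base R. This is only a sketch of the arithmetic, using the exact values $\sqrt{10}$ and $10\sqrt{10}$ in place of the rounded 3.16 and 31.6. The gaps between neighboring values grow on a linear scale but are all equal after a base-10 log transformation.

```r
# Values that are evenly spaced on a log scale
x <- c(1, sqrt(10), 10, 10 * sqrt(10), 100)

# On a linear scale, the gaps between neighbors grow
linear_gaps <- diff(x)

# After a log10 transform, every gap is the same (half a unit)
log_gaps <- diff(log10(x))

print(round(x, 2))
print(linear_gaps)
print(log_gaps)
```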
Because multiplication on a log scale looks like addition on a linear scale, log scales are the natural choice for any data that have been obtained by multiplication or division. In particular, ratios should generally be shown on a log scale. As an example, I have taken the number of inhabitants in each county in Texas and have divided it by the median number of inhabitants across all Texas counties. The resulting ratio is a number that can be larger or smaller than 1. A ratio of exactly 1 implies that the corresponding county has the median number of inhabitants. When visualizing these ratios on a log scale, we can see clearly that the population numbers in Texas counties are symmetrically distributed around the median, and that the most populous counties have over 100 times more inhabitants than the median while the least populous counties have over 100 times fewer inhabitants (Figure \@ref(fig:texas-counties-pop-ratio-log)). By contrast, for the same data, a linear scale obscures the differences between a county with median population number and a county with a much smaller population number than median (Figure \@ref(fig:texas-counties-pop-ratio-lin)).
(ref:texas-counties-pop-ratio-log) Population numbers of Texas counties relative to their median value. Select counties are highlighted by name. The dashed line indicates a ratio of 1, corresponding to a county with median population number. The most populous counties have more than 100 times as many inhabitants as the median county, and the least populous counties have more than 100 times fewer inhabitants than the median county. Data source: 2010 Decennial U.S. Census.
```{r texas-counties-pop-ratio-log, fig.width = 7.5, fig.asp = 0.6, fig.cap = '(ref:texas-counties-pop-ratio-log)'}
set.seed(3878)

US_census %>% filter(state == "Texas") %>%
  select(name, pop2010) %>%
  extract(name, "county", regex = "(.+) County") %>%
  mutate(popratio = pop2010/median(pop2010)) %>%
  arrange(desc(popratio)) %>%
  mutate(index = 1:n(),
         label = ifelse(index <= 3 | index > n()-3 | runif(n()) < .04, county, ""),
         label_large = ifelse(index <= 6, county, "")) -> tx_counties

ggplot(tx_counties, aes(x = index, y = popratio)) +
  geom_hline(yintercept = 1, linetype = 2, color = "grey40") +
  geom_point(size = 0.5, color = "#0072B2") +
  geom_text_repel(aes(label = label), point.padding = .4, color = "black", min.segment.length = 0) +
  scale_y_log10(breaks = c(.01, .1, 1, 10, 100),
                name = "population number / median",
                labels = label_log10) +
  scale_x_continuous(limits = c(.5, nrow(tx_counties) + .5), expand = c(0, 0),
                     breaks = NULL, #c(1, 50*(1:5)),
                     name = "Texas counties, from most to least populous") +
  theme_minimal_hgrid() +
  theme(axis.line = element_blank(),
        plot.margin = margin(7, 14, 7, 7))
```
(ref:texas-counties-pop-ratio-lin) Population sizes of Texas counties relative to their median value. By displaying a ratio on a linear scale, we have overemphasized ratios > 1 and have obscured ratios < 1. As a general rule, ratios should not be displayed on a linear scale. Data source: 2010 Decennial U.S. Census.
```{r texas-counties-pop-ratio-lin, fig.width = 7.5, fig.asp = 0.6, fig.cap = '(ref:texas-counties-pop-ratio-lin)'}
counties_lin <- ggplot(tx_counties, aes(x = index, y = popratio)) +
  geom_point(size = 0.5, color = "#0072B2") +
  geom_text_repel(aes(label = label_large), point.padding = .4, color = "black", min.segment.length = 0) +
  scale_y_continuous(name = "population number / median") +
  scale_x_continuous(limits = c(.5, nrow(tx_counties) + .5), expand = c(0, 0),
                     breaks = NULL, #c(1, 50*(1:5)),
                     name = "Texas counties, from most to least populous") +
  theme_minimal_hgrid() +
  theme(axis.line = element_blank(),
        plot.margin = margin(7, 14, 7, 7))

stamp_bad(counties_lin)
```
On a log scale, the value 1 is the natural midpoint, similar to the value 0 on a linear scale. We can think of values greater than 1 as representing multiplications and values less than 1 divisions. For example, we can write $10 = 1\times 10$ and $0.1 = 1/10$. The value 0, on the other hand, can never appear on a log scale. It lies infinitely far from 1. One way to see this is to consider that $\log(0) = -\infty$. Or, alternatively, consider that to go from 1 to 0, it takes either an infinite number of divisions by a finite value (e.g., $1/10/10/10/10/10/10\dots = 0$) or alternatively one division by infinity (i.e., $1/\infty = 0$).
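
These properties follow directly from the logarithm, as the following short base-R sketch illustrates. The ratios are arbitrary example values. Positions on a base-10 log scale are symmetric around the value 1, every further division by 10 moves one more unit to the left, and the value 0 has no finite position.

```r
ratios <- c(0.01, 0.1, 1, 10, 100)

# On a log scale, positions are symmetric around the value 1
log_positions <- log10(ratios)
print(log_positions)

# Each further division by 10 moves one more unit to the left
# without ever reaching the position of 0
repeated_divisions <- log10(1 / 10^(1:5))
print(repeated_divisions)

# The value 0 itself lies infinitely far from 1
zero_position <- log10(0)
print(zero_position)
```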
Log scales are frequently used when the data set contains numbers of very different magnitudes. For the Texas counties shown in Figures \@ref(fig:texas-counties-pop-ratio-log) and \@ref(fig:texas-counties-pop-ratio-lin), the most populous one (Harris) had 4,092,459 inhabitants in the 2010 U.S. Census while the least populous one (Loving) had 82. So a log scale would be appropriate even if we hadn't divided population numbers by their median to turn them into ratios. But what would we do if there were a county with 0 inhabitants? This county could not be shown on the logarithmic scale, because it would lie at minus infinity. In this situation, the recommendation is sometimes to use a square-root scale, which uses a square root transformation instead of a log transformation (Figure \@ref(fig:sqrt-scales)). Just like a log scale, a square-root scale compresses larger numbers into a smaller range, but unlike a log scale, it allows for the presence of 0.
(ref:sqrt-scales) Relationship between linear and square-root scales. The dots correspond to data values 0, 1, 4, 9, 16, 25, 36, 49, which are evenly-spaced numbers on a square-root scale, since they are the squares of the integers from 0 to 7. We can display these data points on a linear scale, we can square-root-transform them and then show them on a linear scale, or we can show them on a square-root scale.
```{r sqrt-scales, fig.width = 7.5, fig.asp = 0.464, fig.cap = '(ref:sqrt-scales)'}
df <- data.frame(x = c(0, 1, 4, 9, 16, 25, 36, 49))

xaxis_lin <- ggplot(df, aes(x, y = 1)) +
  geom_point(size = 3, color = "#0072B2") +
  scale_y_continuous(limits = c(0.8, 1.2), expand = c(0, 0), breaks = 1) +
  theme_minimal_grid() +
  theme(axis.ticks.length = grid::unit(0, "pt"),
        axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.title = element_text(face = "plain"),
        plot.margin = margin(3.5, 20, 3.5, 20))

xaxis_sqrt <- ggplot(df, aes(sqrt(x), y = 1)) +
  geom_point(size = 3, color = "#0072B2") +
  scale_y_continuous(limits = c(0.8, 1.2), expand = c(0, 0), breaks = 1) +
  theme_minimal_grid() +
  theme(axis.ticks.length = grid::unit(0, "pt"),
        axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.title = element_text(face = "plain"),
        plot.margin = margin(3.5, 20, 3.5, 20))

plotlist <-
  align_plots(xaxis_lin + scale_x_continuous(limits = c(0, 50)) +
                ggtitle("original data, linear scale"),
              xaxis_sqrt + scale_x_continuous(limits = c(0, 7.07)) +
                xlab(expression(sqrt(x))) +
                ggtitle("square-root-transformed data, linear scale"),
              xaxis_sqrt + scale_x_continuous(limits = c(0, 7.07), breaks = c(0, 1, sqrt(5), sqrt(10*(1:5))),
                                              labels = c(0, 1, 5, 10*(1:5)), name = "x") +
                expand_limits(expand = c(0, 1)) +
                ggtitle("original data, square-root scale"),
              align = 'vh')

plot_grid(plotlist[[1]], plotlist[[2]], plotlist[[3]], ncol = 1)
```
I see two problems with square-root scales. First, while on a linear scale one unit step corresponds to addition or subtraction of a constant value and on a log scale it corresponds to multiplication by or division by a constant value, no such rule exists for a square-root scale. The meaning of a unit step on a square-root scale depends on the scale value at which we're starting. Second, it is unclear how to best place axis ticks on a square-root scale. To obtain evenly spaced ticks, we would have to place them at squares, but axis ticks at, for example, positions 0, 4, 16, 36, 64 (every second square) would be highly unintuitive. Alternatively, we could place them at linear intervals (10, 20, 30, etc.), but this would result in either too few axis ticks near the low end of the scale or too many near the high end. In Figure \@ref(fig:sqrt-scales), I have placed the axis ticks at positions 0, 1, 5, 10, 20, 30, 40, and 50 on the square-root scale. These values are arbitrary but provide a reasonable covering of the data range.
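
The first problem can be made concrete with a short base-R sketch, which uses the squares of the integers 0 to 7 as example data. The squares are evenly spaced after a square-root transformation, and 0 has a finite position. However, one unit step on the square-root scale corresponds to a different amount in data units depending on where we start, whereas one unit step on a base-10 log scale always corresponds to a factor of 10.

```r
squares <- (0:7)^2  # 0, 1, 4, 9, 16, 25, 36, 49

# Squares are evenly spaced on a square-root scale, and 0 has a finite position
sqrt_positions <- sqrt(squares)
print(sqrt_positions)

# One unit step on the square-root scale, expressed in data units:
# the size of the step grows with the starting value
step_sizes <- diff(squares)
print(step_sizes)

# Contrast: on a log10 scale, one unit step is always a factor of 10
log_step_factors <- 10^(1:4) / 10^(0:3)
print(log_step_factors)
```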
Despite these problems with square-root scales, they are valid position scales and I do not discount the possibility that they have appropriate applications. For example, just like a log scale is the natural scale for ratios, one could argue that the square-root scale is the natural scale for data that come in squares. One scenario in which data are naturally squares is in the context of geographic regions. If we show the areas of geographic regions on a square-root scale, we are highlighting the regions' linear extent from East to West or North to South. These extents could be relevant, for example, if we are wondering how long it might take to drive across a region. Figure \@ref(fig:northeast-state-areas) shows the areas of states in the U.S. Northeast on both a linear and a square-root scale. Even though the areas of these states are quite different (Figure \@ref(fig:northeast-state-areas)a), the time it will take to drive across each state will more closely resemble the figure on the square-root scale (Figure \@ref(fig:northeast-state-areas)b) than the figure on the linear scale (Figure \@ref(fig:northeast-state-areas)a).
(ref:northeast-state-areas) Areas of Northeastern U.S. states. (a) Areas shown on a linear scale. (b) Areas shown on a square-root scale. Data source: Google.
```{r northeast-state-areas, fig.width = 8.5, fig.asp = 0.4, fig.cap = '(ref:northeast-state-areas)'}
# areas in square miles
# source: Google, 01/07/2018
northeast_areas <- read.csv(text = "state_abr,area
NY,54556
PA,46055
ME,35385
MA,10565
VT,9616
NH,9349
NJ,8723
CT,5543
RI,1212")
northeast_areas$state_abr <- factor(northeast_areas$state_abr, levels = northeast_areas$state_abr)

areas_base <- ggplot(northeast_areas, aes(x = state_abr, y = area)) +
  geom_col(fill = "#56B4E9") +
  ylab("area (square miles)") +
  xlab("state") +
  theme_minimal_hgrid()

p1 <- areas_base + scale_y_sqrt(limits = c(0, 55000), breaks = c(0, 1000, 5000, 10000*(1:5)),
                                expand = c(0, 0))

p2 <- areas_base + scale_y_continuous(limits = c(0, 55000), breaks = 10000*(0:6), expand = c(0, 0))

plot_grid(p2, p1, labels = "auto")
```
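
As a rough numerical illustration of this argument, we can compare the largest and the smallest state in the figure. The sketch below treats each state as if it were a square, which is of course only a crude approximation, and takes the side length of that square as a stand-in for the state's linear extent. The area values are the ones used in the figure.

```r
# Areas in square miles, taken from the data used in the figure
areas <- c(NY = 54556, RI = 1212)

# Side length (in miles) of a square with the same area;
# a crude stand-in for a state's linear extent
extents <- sqrt(areas)

area_ratio   <- areas[["NY"]] / areas[["RI"]]
extent_ratio <- extents[["NY"]] / extents[["RI"]]

print(round(extents, 1))
print(c(area_ratio = area_ratio, extent_ratio = extent_ratio))
```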
## Coordinate systems with curved axes
All coordinate systems we have encountered so far used two straight axes positioned at a right angle to each other, even if the axes themselves established a non-linear mapping from data values to positions. There are other coordinate systems, however, where the axes themselves are curved. In particular, in the *polar* coordinate system, we specify positions via an angle and a radial distance from the origin, and therefore the angle axis is circular (Figure \@ref(fig:polar-coord)).
(ref:polar-coord) Relationship between Cartesian and polar coordinates. (a) Three data points shown in a Cartesian coordinate system. (b) The same three data points shown in a polar coordinate system. We have taken the *x* coordinates from part (a) and used them as angular coordinates and the *y* coordinates from part (a) and used them as radial coordinates. The circular axis runs from 0 to 4 in this example, and therefore *x* = 0 and *x* = 4 are the same locations in this coordinate system.
```{r polar-coord, fig.width = 7.5, fig.asp = 0.5, fig.cap = '(ref:polar-coord)'}
df_points <- data.frame(x = c(1, 3.5, 0),
                        y = c(3, 4, 0),
                        label = c("(1, 3)", "(3.5, 4)", "(0, 0)"),
                        vjust_polar = c(1.6, 1, 1.6),
                        hjust_polar = c(.5, -.1, 0.5),
                        vjust_cart = c(1.6, 1.6, -.6),
                        hjust_cart = c(0.5, 1.1, -.1))

df_segments <- data.frame(x0 = c(0, 1, 2, 3, 0, 0, 0, 0),
                          x1 = c(0, 1, 2, 3, 4, 4, 4, 4),
                          y0 = c(0, 0, 0, 0, 1, 2, 3, 4),
                          y1 = c(4, 4, 4, 4, 1, 2, 3, 4))

p_cart <- ggplot(df_points, aes(x, y)) +
  geom_point(size = 3, color = "#0072B2") +
  geom_text(aes(label = label, vjust = vjust_cart, hjust = hjust_cart), size = 12/.pt) +
  scale_x_continuous(limits = c(-0.5, 4.5), expand = c(0, 0)) +
  scale_y_continuous(limits = c(-0.5, 4.5), expand = c(0, 0)) +
  coord_fixed() +
  xlab("x axis") +
  ylab("y axis") +
  theme_minimal_grid() +
  theme(axis.ticks = element_blank(),
        axis.ticks.length = grid::unit(0, "pt"))

p_polar <- ggplot(df_points, aes(x, y)) +
  geom_segment(data = df_segments, aes(x = x0, xend = x1, y = y0, yend = y1),
               size = theme_minimal_grid()$panel.grid.major$size, color = theme_minimal_grid()$panel.grid.major$colour) +
  geom_point(size = 3, color = "#0072B2") +
  geom_text(aes(label = label, vjust = vjust_polar, hjust = hjust_polar), size = 12/.pt) +
  scale_x_continuous(limits = c(0, 4)) +
  scale_y_continuous(limits = c(0, 4)) +
  coord_polar() +
  xlab("x values (circular axis)") +
  ylab("y values (radial axis)") +
  theme_minimal_grid() +
  background_grid(major = "none") +
  theme(axis.line.x = element_blank(),
        axis.ticks = element_line(color = "black"))

plot_grid(p_cart, p_polar, labels = "auto")
```
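
The mapping from polar coordinates to locations on the page can be sketched in a few lines of base R. The helper `polar_to_xy()` is hypothetical, and it assumes one particular convention: the circular axis starts at the 12 o'clock position and runs clockwise, covering the range from 0 to 4 as in the figure. Under this assumption, the angle values 0 and 4 end up at the same location.

```r
# Hypothetical helper: convert (angle value, radius) to plotting positions.
# Assumed convention: the circular axis starts at 12 o'clock and runs clockwise.
polar_to_xy <- function(angle_value, radius, angle_limits = c(0, 4)) {
  # fraction of the full circle covered by angle_value
  frac <- (angle_value - angle_limits[1]) / (angle_limits[2] - angle_limits[1])
  theta <- 2 * pi * frac
  data.frame(px = radius * sin(theta), py = radius * cos(theta))
}

# The three points from the polar coordinate figure
points_xy <- polar_to_xy(c(1, 3.5, 0), c(3, 4, 0))
print(round(points_xy, 3))

# Angle values 0 and 4 land on the same location for any given radius
start_xy <- polar_to_xy(0, 2)
end_xy   <- polar_to_xy(4, 2)
print(all.equal(start_xy, end_xy))
```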
Polar coordinates can be useful for data of a periodic nature, such that data values at one end of the scale can be logically joined to data values at the other end. For example, consider the days in a year. December 31st is the last day of the year, but it is also one day before the first day of the year. If we want to show how some quantity varies over the year, it can be appropriate to use polar coordinates with the angle coordinate specifying each day. Let's apply this concept to the temperature normals of Figure \@ref(fig:temp-normals-vs-time). Because temperature normals are average temperatures that are not tied to any specific year, Dec. 31st can be thought of as 365 days later than Jan. 1st (temperature normals include Feb. 29) and also one day earlier. By plotting the temperature normals in a polar coordinate system, we emphasize this cyclical property they have (Figure \@ref(fig:temperature-normals-polar)). In comparison to Figure \@ref(fig:temp-normals-vs-time), the polar version highlights how similar the temperatures are in Death Valley, Houston, and San Diego from late fall to early spring. In the Cartesian coordinate system, this fact is obscured because the temperature values in late December and in early January are shown in opposite parts of the figure and therefore don't form a single visual unit.
(ref:temperature-normals-polar) Daily temperature normals for four selected locations in the U.S., shown in polar coordinates. The radial distance from the center point indicates the daily temperature in Fahrenheit, and the days of the year are arranged counter-clockwise starting with Jan. 1st at the 6:00 position.
```{r temperature-normals-polar, fig.width = 7.5, fig.cap = '(ref:temperature-normals-polar)'}
temps_long <- gather(temps_wide, location, temperature, -month, -day, -date) %>%
  filter(location %in% c("Chicago",
                         "Death Valley",
                         "Houston",
                         "San Diego")) %>%
  mutate(location = factor(location, levels = c("Death Valley",
                                                "Houston",
                                                "San Diego",
                                                "Chicago")))

ggplot(temps_long, aes(x = date, y = temperature, color = location)) +
  geom_line(size = 1) +
  scale_x_date(name = "date", expand = c(0, 0)) +
  scale_y_continuous(limits = c(0, 105), expand = c(0, 0),
                     breaks = seq(-30, 90, by = 30),
                     name = "temperature (°F)") +
  scale_color_OkabeIto(order = c(1:3, 7), name = NULL) +
  # start at the 6:00 position and run counter-clockwise
  coord_polar(theta = "x", start = pi, direction = -1) +
  theme_minimal_grid()
```
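
The difference between a straight and a circular time axis can be expressed as a difference in how we measure the separation between two days. The base-R sketch below uses a hypothetical helper, `circular_distance()`, and numbers the days of a 366-day cycle from 1 to 366. It is meant only to illustrate why late December and early January are neighbors in the polar figure.

```r
days_in_cycle <- 366  # temperature normals include Feb. 29

# Hypothetical helper: shortest separation between two days on a circular axis
circular_distance <- function(day_a, day_b, cycle = days_in_cycle) {
  d <- abs(day_a - day_b) %% cycle
  pmin(d, cycle - d)
}

jan_1  <- 1
dec_31 <- 366

linear_gap   <- abs(dec_31 - jan_1)               # separation along a straight axis
circular_gap <- circular_distance(dec_31, jan_1)  # separation around the circle

print(c(linear = linear_gap, circular = circular_gap))
```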
A second setting in which we encounter curved axes is in the context of geospatial data, i.e., maps. Locations on the globe are specified by their latitude and longitude. But because the earth is a sphere, drawing latitude and longitude as Cartesian axes is misleading and not recommended (Figure \@ref(fig:continental-usa-four-projections)). Instead, we use various types of non-linear projections that attempt to minimize artifacts and that strike different balances between conserving areas or angles relative to the true shapes on the globe (Figure \@ref(fig:continental-usa-four-projections)).
(ref:continental-usa-four-projections) The continental U.S.A. shown in four different coordinate systems. The Cartesian latitude and longitude system maps latitude and longitude of each location onto a regular Cartesian coordinate system. This mapping causes substantial distortions in both areas and angles relative to their true values on the 3d globe. The Robinson projection is commonly used to project the entire world. It preserves neither areas nor angles perfectly but attempts to strike a balance between the two. The Lambert projection preserves areas but distorts angles. The Transverse Mercator projection preserves angles but distorts areas.
```{r continental-usa-four-projections, fig.width = 8.5, fig.cap = '(ref:continental-usa-four-projections)'}
library(sf)
library(maps)

usa <- st_as_sf(map('state', plot = FALSE, fill = TRUE))

# unprojected longitude and latitude, drawn as Cartesian axes
p1 <- ggplot(usa) +
  geom_sf() +
  theme_minimal_grid(12) +
  theme(panel.grid.major = element_line(color = "gray30"),
        axis.text = element_text(color = "gray30"),
        plot.margin = margin(11, 21, 7, 7))

usa_robin <- st_transform(usa, st_crs("+proj=robin +lat_0=0 +lon_0=0 +x_0=0 +y_0=0")) # Robinson
p2 <- ggplot(usa_robin) +
  geom_sf() +
  theme_minimal_grid(12) +
  theme(panel.grid.major = element_line(color = "gray30"),
        axis.text = element_text(color = "gray30"),
        plot.margin = margin(11, 21, 7, 7))

usa_laea <- st_transform(usa, st_crs("+proj=laea +lat_0=30 +lon_0=-100")) # Lambert equal area
p3 <- ggplot(usa_laea) +
  geom_sf() +
  theme_minimal_grid(12) +
  theme(panel.grid.major = element_line(color = "gray30"),
        axis.text = element_text(color = "gray30"),
        plot.margin = margin(21, 21, 7, 7))

usa_tmerc <- st_transform(usa, st_crs("+proj=tmerc +lat_0=35 +lon_0=-100 +x_0=0 +y_0=0")) # Transverse Mercator
p4 <- ggplot(usa_tmerc) +
  geom_sf() +
  theme_minimal_grid(12) +
  theme(panel.grid.major = element_line(color = "gray30"),
        axis.text = element_text(color = "gray30"),
        plot.margin = margin(21, 21, 7, 7))

plot_grid(p1, p2, p3, p4,
          labels = c("Cartesian latitude and longitude", "Robinson", "Lambert equal area", "Transverse Mercator"),
          label_x = 0.02, label_y = 0.98,
          hjust = 0, vjust = 1,
          label_fontface = "plain")
```
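
One way to see why treating longitude and latitude as Cartesian axes distorts the map is to compute how much ground distance one degree of longitude covers at different latitudes. The base-R sketch below assumes a perfectly spherical earth with a mean radius of 6371 km, and it uses 25 and 49 degrees north as approximate southern and northern latitudes of the continental U.S. Both are simplifying assumptions made for illustration. On a sphere, one degree of latitude covers the same distance everywhere, whereas one degree of longitude shrinks with the cosine of the latitude.

```r
earth_radius_km <- 6371  # assumed mean radius of a spherical earth

# Ground distance covered by one degree of longitude at a given latitude
km_per_degree_longitude <- function(latitude_deg) {
  (pi / 180) * earth_radius_km * cos(latitude_deg * pi / 180)
}

# Ground distance covered by one degree of latitude (constant on a sphere)
km_per_degree_latitude <- (pi / 180) * earth_radius_km

# Approximate southern and northern latitudes of the continental U.S.
lon_km <- km_per_degree_longitude(c(south = 25, north = 49))

print(round(lon_km, 1))
print(round(km_per_degree_latitude, 1))
```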