---
title: "Data cleaning"
output: github_document
---
I explicitly use this package to teach data cleaning, so I have refactored my old cleaning code into several scripts, which I also include as compiled Markdown reports. Caveat: these are realistic cleaning scripts, not the highly polished ones people write with 20/20 hindsight :) I wouldn't necessarily clean the data the same way again (and I would download more recent data!), but at this point there is great value in reproducing the data I've been using for ~5 years.
Cleaning history
* 2010: The first time I documented cleaning this dataset. I started with
delimited files I exported from Excel. Not present in this repo.
* 2014: I re-cleaned the data and (mostly) forced myself to pull it straight
out of the spreadsheets, using the `gdata` package. It was kind of painful, due to encoding and other issues. See the scripts as they stood then in [v0.1.0](https://github.com/jennybc/gapminder/tree/v0.1.0/data-raw).
* 2015: I revisited the cleaning and switched to `readxl`, which was much less painful. This is the present-day state.
```{r results='asis', echo = FALSE, warning = FALSE}
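library(tidyverse)
library(stringr)
library(knitr)
library(here)

# List the files in data-raw and split each name on "_" and "."
# into script number, slug, and extension.
x <- tibble(fls = list.files(here("data-raw"))) %>%
  mutate(fls_basename = basename(fls)) %>%
  separate(fls_basename, c("script", "slug", "ext"), "[_\\.]")

# Keep only the numbered scripts and their R, Markdown, and TSV files.
# The extension pattern is anchored so that, e.g., "Rmd" does not match.
x <- x %>%
  filter(script %>% str_detect("^[0-9]+"),
         ext %>% str_detect("^(R|r|md|tsv)$")) %>%
  select(-slug)

# One nested data frame per script number.
y <- x %>%
  group_by(script) %>%
  nest()

# Turn a vector of file names into comma-separated Markdown links.
collapse_md_links <- function(x) {
  x %>% {
    paste0("[", ., "](", ., ")")
  } %>%
    paste(collapse = ", ")
}

# Build one table row per script: links to its R script, report, and data.
# Accept both ".R" and ".r", consistent with the filter above.
jfun <- function(z) {
  tibble(
    r_script = z$fls[z$ext %in% c("R", "r")] %>% collapse_md_links(),
    notebook = z$fls[z$ext == "md"] %>% collapse_md_links(),
    tsv = z$fls[z$ext == "tsv"] %>% collapse_md_links()
  )
}

y$data %>%
  map_df(jfun) %>%
  kable()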
```
```{r eval = FALSE, echo = FALSE}
devtools::session_info()
```