data-raw/README.Rmd

---
title: "Data cleaning"
output: github_document
---

I explicitly use this package to teach data cleaning, so have refactored my old cleaning code into several scripts. I also include them as compiled Markdown reports. Caveat: these are realistic cleaning scripts! Not the highly polished ones people write with 20/20 hindsight :) I wouldn't necessarily clean it the same way again (and I would download more recent data!), but at this point there is great value in reproducing the data I've been using for ~5 years.

Cleaning history
  
  * 2010: The first time I documented cleaning this dataset. I started with
  delimited files I exported from Excel. Not present in this repo.
  * 2014: I re-cleaned the data and (mostly) forced myself to pull it straight
  out of the spreadsheets. Used the `gdata` package. It was kind of painful, due to encoding and other issues. See the scripts in this state in [v0.1.0](https://github.com/jennybc/gapminder/tree/v0.1.0/data-raw).
  * 2015: I revisited the cleaning and switched to `readxl`. This was much less painful. Present day.

```{r results='asis', echo = FALSE, warning = FALSE}
library(tidyverse)
library(stringr)
library(knitr)
library(here)

x <- tibble(fls = list.files(here("data-raw"))) %>%
  mutate(fls_basename = basename(fls)) %>% 
  separate(fls_basename, c("script", "slug", "ext"), "[_\\.]")
x <- x %>% 
  filter(script %>% str_detect("^[0-9]+"),
         ext %>% str_detect("R|r|md|tsv")) %>% 
  select(-slug)
y <- x %>%
  group_by(script) %>% 
  nest()

collapse_md_links <- function(x) {
  x %>% {
    paste0("[", ., "](", ., ")")
    } %>% 
    paste(collapse = ", ")
}
jfun <- function(z) {
  tibble(
    r_script = z$fls[z$ext == "R"] %>% collapse_md_links(),
    notebook = z$fls[z$ext == "md"] %>% collapse_md_links(),
    tsv = z$fls[z$ext == "tsv"] %>% collapse_md_links()
  )
}

y$data %>% 
  map_df(jfun) %>% 
  kable()
```

```{r eval = FALSE, echo = FALSE}
devtools::session_info()
```