giving up in NZStats excel formats

dicook · dicook · commit 557289fe1326 · 2020-01-16T12:35:10.000+11:00
diff --git a/01-intro.Rmd b/01-intro.Rmd
@@ -66,3 +66,6 @@ Statistical thinking is about understanding our world by modeling the variation
 
 Computational statistical thinking is an exciting new way of doing statistics that makes use of the computational tools of today. To understand randomness, we sample, re-sample, simulate or generate values from a model. We use these to learn how the problem might look if we'd collected different data, or if particular conditions hold. It allows us to create a sandbox to play in, a virtual world to examine randomness and variation.
 
+## Workflow
+
+`here`
diff --git a/02-accessing-data.Rmd b/02-accessing-data.Rmd
@@ -101,11 +101,61 @@ passengers <- read_xlsx(here::here("data", "WebAirport_FY_1986-2019.xlsx"), shee
 datatable(passengers)
 ```
 
+### Your turn: Munging spreadsheets
 
+The book [Spreadsheet Munging Strategies](https://nacnudus.github.io/spreadsheet-munging-strategies/index.html) by Duncan Garmonsway is a very good resource for dealing with complicated Excel spreadsheets. We recommend working through the examples in this book to familiarise yourself with ways to deal with messy spreadsheets, and to incorporate information such as special formatting into the data.
 
-### Your turn: Australian Bureau of Statistics data
+The [case study](https://nacnudus.github.io/spreadsheet-munging-strategies/vaccinations.html#vaccinations) developed from Bob Rudis's post on CDC vaccination data is especially recommended.
+
+<!--
+### Your turn: New Zealand Census data
+
+StatsNZ makes [tables of data from the five-yearly censuses](http://nzdotstat.stats.govt.nz/) publicly available. Take a look at the 2018 Census data, the population and migration data. You need to expand the cells, and select the levels to use, to have the numbers broken down by age group, sex and ethnicity. (A sample file `NZ_census.xlsx` is provided as an example.) In Excel format, the variables and their levels are stored in the header, with a twist: there is one header row for each variable, in a multicolumn format. (The R package `tidyxl` has the capacity to deal with multiple headers like this, but requires `xlsx` format. The sample file has been opened and saved in this format, and the first two blank lines in the original file were also manually removed.)
+
+Luckily, choosing the `csv` format will provide the data in tidy long form. 
+
+GIVING UP ON THE XLS FORMAT - IT'S JUST REALLY IRREGULAR - AND EVEN tidyxl CANNOT HANDLE IT.
+-->
+
+```{r eval=FALSE, echo=FALSE}
+library(tidyxl)
+library(unpivotr)
+NZ_all_cells <-
+  xlsx_cells(here::here("data", "NZ_census.xlsx")) %>%
+  dplyr::filter(!is_blank) %>%
+  select(row, col, data_type, character, numeric) %>%
+  dplyr::filter(row > 1)
+year <-
+  NZ_all_cells %>%
+  dplyr::filter(col >= 2, row == 5) %>%
+  select(row, col, year = character)
+ethnic <-
+  NZ_all_cells %>%
+  dplyr::filter(col >= 2, row == 4) %>%
+  select(row, col, ethnic = character)
+sex <-
+  NZ_all_cells %>%
+  dplyr::filter(col >= 2, row == 3) %>%
+  select(row, col, sex = character)
+age <-
+  NZ_all_cells %>%
+  dplyr::filter(col >= 2, row == 2) %>%
+  select(row, col, age = character)
+NZ <- read_xlsx(here::here("data", "NZ_census.xlsx"), skip=5, n_max=1) %>%
+  gather(v, count, -Area) %>%
+  dplyr::filter(!is.na(count)) %>%
+  select(-v)
+NZ <- NZ %>%
+  mutate(year = year$year) %>%
+  mutate(ethnic = rep(ethnic$ethnic, c(rep(3, length(ethnic$ethnic)-1), 1))) %>%
+  mutate(sex = rep(sex$sex, 43))
+# NZ <- read_xls(here::here("data", "NZ_census.xls"), skip=2)
+# NZ <- read_csv("data/NZ_census.csv")
+```
+
+
+### Example: SPSS binary
 
-ABS - excel
 
 PISA - sav
 
diff --git a/03a-tidying-data.Rmd b/03a-tidying-data.Rmd
@@ -113,6 +113,19 @@ fly
 
 What are the variables?
 
+## ABS Datapack
+
+The Australian Bureau of Statistics (ABS) collects, maintains and delivers data and official statistics on a wide range of economic, social, population and environmental matters of importance to Australia. There are many different access points for the data, but most of what is available is aggregated data. Examples of accessing the Census data from the ABS can be found in the `eechidna` package.
+
+1. The individual `csv` files must be held locally. They come in a zip file that can be downloaded from: https://datapacks.censusdata.abs.gov.au/datapacks/
+2. Select: 2016 Census Datapacks, General Community Profile, Commonwealth Electoral Divisions
+3. Download for all of Australia
+4. Unzip the package. This is necessary because the data is delivered in many small csv files. The package also contains the license information detailing appropriate usage, and detailed information about the formats.
+
+```{r eval=FALSE}
+G1_Main <- read_csv(here::here("data", "2016 Census GCP Commonwealth Electoral Divisions for AUST", "2016Census_G01_AUS_CED.csv"))
+```
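+
+Because the datapack arrives as many small csv files, it can be convenient to read them all in one pass. The following is a minimal sketch using base R only, not the approach used elsewhere in this book. The folder, file names and column names are made-up stand-ins so that the chunk runs without the download; with the real datapack, the stand-in files would be replaced by the result of `unzip()` on the downloaded zip file.
+
+```{r eval=FALSE}
+# Stand-in for the unzipped datapack folder: two tiny made-up csv files.
+# With a real download this step would instead be something like
+# unzip(zip_path, exdir = datapack_dir), where zip_path is hypothetical.
+datapack_dir <- file.path(tempdir(), "datapack_demo")
+dir.create(datapack_dir, showWarnings = FALSE)
+write.csv(data.frame(region = c("A", "B"), count = c(10, 20)),
+          file.path(datapack_dir, "demo_G01.csv"), row.names = FALSE)
+write.csv(data.frame(region = c("A", "B"), median_age = c(35, 41)),
+          file.path(datapack_dir, "demo_G02.csv"), row.names = FALSE)
+
+# List every csv file in the folder, with full paths
+csv_files <- list.files(datapack_dir, pattern = "\\.csv$", full.names = TRUE)
+
+# Read each file into a list of data frames, named by file name
+tables <- lapply(csv_files, read.csv)
+names(tables) <- tools::file_path_sans_ext(basename(csv_files))
+```
+
+Each table can then be pulled out by name, for example `tables$demo_G01`.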
+
 ## Messy vs tidy
 
 Messy data is messy in its own way. You can make unique solutions, but then another data set comes along, and you have to again make a unique solution.