Commit 557289f

giving up in NZStats excel formats
1 parent 206cefd commit 557289f

3 files changed

+68
-2
lines changed

01-intro.Rmd

+3
@@ -66,3 +66,6 @@ Statistical thinking is about understanding our world by modeling the variation
6666

6767
Computational statistical thinking is an exciting new way of doing statistics that makes use of the computational tools of today. To understand randomness, we sample, re-sample, simulate or generate values from a model. We use these to learn how the problem might look if we'd collected different data, or if particular conditions hold. It allows us to create a sandbox to play in, a virtual world to examine randomness and variation.
6868

69+
## Workflow
70+
71+
`here`

02-accessing-data.Rmd

+52-2
@@ -101,11 +101,61 @@ passengers <- read_xlsx(here::here("data", "WebAirport_FY_1986-2019.xlsx"), shee
101101
datatable(passengers)
102102
```
103103

104+
### Your turn: Munging spreadsheets
104105

106+
The book [Spreadsheet Munging Strategies](https://nacnudus.github.io/spreadsheet-munging-strategies/index.html) by Duncan Garmonsway is a very good resource for dealing with complicated Excel spreadsheets. We recommend working through the examples in this book to become familiar with ways to deal with messy spreadsheets, and to incorporate information such as special formatting into the data.
105107

106-
### Your turn: Australian Bureau of Statistics data
108+
The [case study](https://nacnudus.github.io/spreadsheet-munging-strategies/vaccinations.html#vaccinations) developed from Bob Rudis's post on CDC vaccination data is especially recommended.
109+
110+
<!--
111+
### Your turn: New Zealand Census data
112+
113+
StatsNZ makes [tables of data from the five-yearly censuses](http://nzdotstat.stats.govt.nz/) publicly available. Take a look at the 2018 Census data, the population and migration data. You need to expand the cells, and select the levels to use, to have the numbers broken down by age group, sex and ethnicity. (A sample file `NZ_census.xlsx` is provided as an example.) In Excel format, the variables and their levels are in the header names, with a twist: there is a row for each variable, in a multicolumn format. (The R package `tidyxl` has the capacity to deal with multiple headers like this, but requires `xlsx` format. The sample file has been opened and saved in this format, and the first two blank lines in the original file were also manually removed.)
114+
115+
Luckily, choosing the `csv` format will provide the data in tidy long form.
116+
117+
GIVING UP ON THE XLS FORMAT - IT'S JUST REALLY IRREGULAR - AND EVEN tidyxl CANNOT HANDLE IT.
118+
-->
119+
120+
```{r eval=FALSE, echo=FALSE}
library(tidyxl)
library(unpivotr)
NZ_all_cells <-
  xlsx_cells(here::here("data", "NZ_census.xlsx")) %>%
  dplyr::filter(!is_blank) %>%
  select(row, col, data_type, character, numeric) %>%
  dplyr::filter(row > 1)
year <-
  NZ_all_cells %>%
  dplyr::filter(col >= 2, row == 5) %>%
  select(row, col, year = character)
ethnic <-
  NZ_all_cells %>%
  dplyr::filter(col >= 2, row == 4) %>%
  select(row, col, ethnic = character)
sex <-
  NZ_all_cells %>%
  dplyr::filter(col >= 2, row == 3) %>%
  select(row, col, sex = character)
age <-
  NZ_all_cells %>%
  dplyr::filter(col >= 2, row == 2) %>%
  select(row, col, age = character)
NZ <- read_xlsx(here::here("data", "NZ_census.xlsx"), skip=5, n_max=1) %>%
  gather(v, count, -Area) %>%
  filter(!is.na(count)) %>%
  select(-v)
NZ <- NZ %>%
  mutate(year = year$year) %>%
  mutate(ethnic = rep(ethnic$ethnic, c(rep(3, length(ethnic$ethnic)-1), 1))) %>%
  mutate(sex = rep(sex$sex, 43))
# NZ <- read_xls(here::here("data", "NZ_census.xls"), skip=2)
# NZ <- read_csv("data/NZ_census.csv")
```
155+
156+
157+
### Example: SPSS binary
107158

108-
ABS - excel
109159

110160
PISA - sav
111161

03a-tidying-data.Rmd

+13
@@ -113,6 +113,19 @@ fly
113113

114114
What are the variables?
115115

116+
## ABS Datapack
117+
118+
The Australian Bureau of Statistics (ABS) collects, maintains and delivers data and official statistics on a wide range of economic, social, population and environmental matters of importance to Australia. There are many different access points for data, but aggregated data is the main type available. Examples of accessing the Census data from the ABS can be found in the `eechidna` package.
119+
120+
1. The individual `csv` files must be held locally. They come from a zip file and can be downloaded from: https://datapacks.censusdata.abs.gov.au/datapacks/
121+
2. Select: 2016 Census Datapacks, General Community Profile, Commonwealth Electoral Divisions
122+
3. Download for all of Australia
123+
4. Unzip the package. This is necessary because the data is delivered in many small csv files. The package also contains the license information detailing appropriate usage, and detailed information about the formats.
124+
125+
```{r eval=FALSE}
126+
G1_Main <- read_csv(here::here("data/2016 Census GCP Commonwealth Electoral Divisions for AUST/", "2016Census_G01_AUS_CED.csv"))
127+
```
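
Because the datapack arrives as many small csv files, it can be convenient to read them all in one pass rather than one at a time. The following is a minimal sketch only, not part of the original material: the helper name `read_datapack_tables` is hypothetical, and it assumes every csv file sits directly inside the unzipped folder.

```r
library(readr)

# Hypothetical helper (an assumption, not from the original text):
# read every csv file in an unzipped datapack folder into a named list
# of data frames, one element per file.
read_datapack_tables <- function(datapack_dir) {
  # Find all csv files directly inside the folder, with full paths
  csv_files <- list.files(datapack_dir, pattern = "\\.csv$", full.names = TRUE)
  # Read each file with readr's read_csv
  tables <- lapply(csv_files, read_csv)
  # Name each table after its file, dropping the ".csv" extension
  names(tables) <- tools::file_path_sans_ext(basename(csv_files))
  tables
}

# Example call, assuming the folder used in the chunk above exists locally:
# tables <- read_datapack_tables(
#   here::here("data", "2016 Census GCP Commonwealth Electoral Divisions for AUST")
# )
# tables[["2016Census_G01_AUS_CED"]]
```

Keeping the tables in a named list means a single table can be pulled out by file name, while the rest remain available for later joins.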
128+
116129
## Messy vs tidy
117130

118131
Messy data is messy in its own way. You can make unique solutions, but then another data set comes along, and you have to again make a unique solution.
