Skip to content

Tidying data in R

Steve Harris edited this page Apr 13, 2016 · 14 revisions

Tidying data

Merging columns

Examples:

  1. Date is given as 3 separate columns; Year, Month and Day
  2. Time is given as 2 separate columns; Hour, Minutes

Dates

Merging columns into a single data field is very easy to do in R. Optionally, I prefer to put that variable back into the imported data frame:

df$date_full <- paste(df$day, df$month, df$year, sep="/")

This will create a new column in the data frame called data_full. It will contain the day, month and year separated by a /. Notice that this is just a String of character as yet, there is no object/concept of a date yet. If you try and subtract 3 days from that column, the result is strange akin to trying to subtract 3 from "Ahmed".

To convert the column to a formatted, conceptualised date, use:

df$date_full <- as.Date(df$date_full, format="%d/%m/%Y)

It is now possible to manipulate the the date_full column as expected.

Time

Time in R requires to be converted to a full DateTime object; there is no concept of time along without a date hence some extra steps are needed

start_time_stamp <- with(df, ISOdatetime(date_year, date_month, date_day, start_hour, start_minute, "00"))
df$start_time <- strftime(df$start_time_stamp, format="%H:%M:%S")

The above statement is rather long but let's simplify it step by step:

  1. We're creating a new variable called start_time_stamp.
  2. We are using a new function called with. This allows us to avoid using data$date_year, data$date_month and just focus on the column names only.
  3. ISOdatetime. This creates a datetime object using a year, month, day, hour, minute and seconds. We do not have seconds in our dataset so I've hardcoded 0 seconds.
  4. strftime function allows us to re-format the start_time_stamp object to only hours, minutes and seconds while allowing us to manipulate time as one would expect.

Splitting Data

Examples:

  1. BP values are given as 120/80 in one column. This can be problematic as the / can be interpreted as a division, resulting in a meaningless value of "1.5" - not what we would expect.

Recommendations: Split BP values into two columns, systolic and diastolic.

If your data already has BP data stored as one column, we can fix that by first installing a package that helps with column manipulations called tidyr

install.packages("tidyr")

then:

separate(data = df, col = BP, into = c("Systolic", "Diastolic"), sep = "\\/")

note the sep variable, it is a "regular expression" and as our data is using a / we need to escape it by using the escape character \\

Cleaning Ordinal data

Often columns in our data has categorical data (e.g. blood group (A, B, AB, O)). Sometimes these can cause problems as computers are very literal, Ab is different from ab which is different from aB. Hence it is useful to have data-validation in your sheets. Alternatively, we can use a function in R to make data lowercase:

lower(df$categorical)

By limiting the number of misspellings in our column, we reduce the chance of introducing errors further down in our analysis. To make factors out of our categorical data:

factor(df$categorical)

This can be use on the fly when creating models that use categorical data, such as Analysis of Variance (ANOVA) models.

Common tools

Issues

  • Readr package
  • BP easier as a concept to grasp comapred to dates/time
  • Perhaps focus on converting datatypes into numbers (e.g. if someone coded '80kg' instead of '80' will result in a string of characters...how to clean that)
Clone this wiki locally