-
Notifications
You must be signed in to change notification settings - Fork 6
Tidying data in R
Examples:
- Date is given as 3 separate columns; Year, Month and Day
- Time is given as 2 separate columns; Hour, Minutes
Merging columns into a single data field is very easy to do in R. Optionally, I prefer to put that variable back into the imported data frame:
df$date_full <- paste(df$day, df$month, df$year, sep="/")
This will create a new column in the data frame called data_full. It will contain the day, month and year separated by a /.
Notice that this is just a String of character as yet, there is no object/concept of a date yet. If you try and subtract 3 days from that column, the result is strange akin to trying to subtract 3 from "Ahmed".
To convert the column to a formatted, conceptualised date, use:
df$date_full <- as.Date(df$date_full, format="%d/%m/%Y)
It is now possible to manipulate the the date_full column as expected.
Time in R requires to be converted to a full DateTime object; there is no concept of time along without a date hence some extra steps are needed
start_time_stamp <- with(df, ISOdatetime(date_year, date_month, date_day, start_hour, start_minute, "00"))
df$start_time <- strftime(df$start_time_stamp, format="%H:%M:%S")
The above statement is rather long but let's simplify it step by step:
- We're creating a new variable called start_time_stamp.
- We are using a new function called
with. This allows us to avoid using data$date_year, data$date_month and just focus on the column names only. -
ISOdatetime. This creates adatetimeobject using a year, month, day, hour, minute and seconds. We do not have seconds in our dataset so I've hardcoded 0 seconds. -
strftimefunction allows us to re-format thestart_time_stampobject to only hours, minutes and seconds while allowing us to manipulate time as one would expect.
Examples:
- BP values are given as 120/80 in one column. This can be problematic as the / can be interpreted as a division, resulting in a meaningless value of "1.5" - not what we would expect.
Recommendations: Split BP values into two columns, systolic and diastolic.
If your data already has BP data stored as one column, we can fix that by first installing a package that helps with column manipulations called tidyr
install.packages("tidyr")
then:
separate(data = df, col = BP, into = c("Systolic", "Diastolic"), sep = "\\/")
note the sep variable, it is a "regular expression" and as our data is using a / we need to escape it by using the escape character \\
Often columns in our data has categorical data (e.g. blood group (A, B, AB, O)). Sometimes these can cause problems as computers are very literal, Ab is different from ab which is different from aB. Hence it is useful to have data-validation in your sheets. Alternatively, we can use a function in R to make data lowercase:
lower(df$categorical)
By limiting the number of misspellings in our column, we reduce the chance of introducing errors further down in our analysis. To make factors out of our categorical data:
factor(df$categorical)
This can be use on the fly when creating models that use categorical data, such as Analysis of Variance (ANOVA) models.
-
dplyrandtidyrlibraries - www.stackoverflow.com
- Readr package
- BP easier as a concept to grasp comapred to dates/time
- Perhaps focus on converting datatypes into numbers (e.g. if someone coded '80kg' instead of '80' will result in a string of characters...how to clean that)
Please contact Steve Harris if you have any questions.