From 7b474fe7fde4be1086c3f27ed2d62d886ddde494 Mon Sep 17 00:00:00 2001
From: timothyleeXQ
Date: Sun, 29 Sep 2019 14:25:54 -0400
Subject: [PATCH 1/4] Completed Class activity 1

Uploading .Rmd response and associated knitted .html file
---
 .gitignore                     |   5 +
 Class Activity 1 Response.Rmd  | 100 +++++++
 Class-Activity-1-Response.html | 473 +++++++++++++++++++++++++++++++++
 3 files changed, 578 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 Class Activity 1 Response.Rmd
 create mode 100644 Class-Activity-1-Response.html

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..f4f606b
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,5 @@
+.Rproj.user
+.Rhistory
+.RData
+.Ruserdata
+*.Rproj
diff --git a/Class Activity 1 Response.Rmd b/Class Activity 1 Response.Rmd
new file mode 100644
index 0000000..5176c17
--- /dev/null
+++ b/Class Activity 1 Response.Rmd
@@ -0,0 +1,100 @@
+---
+title: "Class Activity 1"
+author: Timothy Lee
+output: html_document
+---
+
+
+**Load the libraries tidyr and dplyr**
+
+```{r echo = FALSE, message = FALSE}
+library(tidyr)
+library(dplyr)
+```
+
+**Create a data frame from the swirl-data.csv file called DF1**
+
+```{r echo = FALSE}
+DF1 = read.csv("C:/Users/Timothy/Google Drive/TC Stuff/Analytics/HUDK 4050 & 4051 - Learning Analytics/projects/class-activity-1/swirl-data.csv", header = TRUE)
+```
+
+**The variables are:**
+
+* **`course_name` - the name of the R course the student attempted**
+* **`lesson_name` - the lesson name**
+* **`question_number` - the question number attempted; `correct` - whether the question was answered correctly**
+* **`attempt` - how many times the student attempted the question**
+* **`skipped` - whether the student skipped the question**
+* **`datetime` - the date and time the student attempted the question**
+* **`hash` - anonymized student ID**
+
+**Create a new data frame that only includes the variables `hash`, `lesson_name` and `attempt` called DF2**
+
+```{r}
+DF2 = DF1 %>% select("hash", "lesson_name", "attempt")
+#DF2 = data.frame('Hash' = DF1$hash, 'LessonName' = DF1$lesson_name, 'Attempt' = DF1$attempt)
+```
+
+**Use the `group_by` function to create a data frame that sums all the attempts for each hash by each lesson_name called DF3**
+
+Is this correct?
+```{r}
+DF3 = DF2 %>% filter(!is.na(attempt)) %>% group_by(hash, lesson_name) %>% summarise(sumAttempts = sum(attempt))
+```
+
+**On a scrap piece of paper draw what you think DF3 would look like if all the lesson names were column names**
+
+Something like:
+
+| Hash | Basic Building Blocks | Logic |
+| ------------- |:-------------------------:| -----:|
+| 2864 | 2 | 45 |
+| 4807 | 3 | 100 |
+| 2864 | 45 | 251 |
+
+
+Convert DF3 to this format.
+
+```{r}
+DF3spread = DF3 %>% spread(key = lesson_name, value = sumAttempts)
+```
+
+**Create a new data frame from DF1 called DF4 that only includes the variables `hash`, `lesson_name` and `correct`**
+
+```{r}
+DF4 = DF1 %>% select("hash", "lesson_name", "correct")
+```
+
+**Convert the correct variable so that `TRUE` is coded as the number 1 and `FALSE` is coded as 0**
+```{r}
+codedCorrect = ifelse(DF4$correct == TRUE, yes = 1, no = 0)
+#No need for additional ifelse for NA, since ifelse documentation says "Missing values in test give missing values in the result."
+#I checked and this appears to work.
+DF4 = cbind(DF4, codedCorrect)
+```
+
+**Create a new data frame called DF5 that provides a mean score for each student on each course**
+
+```{r}
+DF5 = DF4 %>% filter(!is.na(codedCorrect)) %>% group_by(hash, lesson_name) %>% summarise(meanScore = mean(codedCorrect))
+```
+
+**Extra credit: Convert the datetime variable into month-day-year format and create a new data frame (DF6) that shows the average correct for each day**
+
+```{r}
+#Select only the relevant variables
+DF6 = DF1 %>% select("correct", "datetime")
+#Convert score to numbers - 1 for correct, 0 for incorrect
+DF6$correct = ifelse(DF6$correct == TRUE, yes = 1, no = 0)
+#Change datetime to proper datetime variables
+DF6$datetime = as.POSIXct(DF6$datetime, origin = "1970-01-01")
+#Drop the time of day (floor, not round, so afternoon attempts stay on their own day) so group_by groups by day and not second
+DF6$datetime = lubridate::floor_date(DF6$datetime, unit = 'day')
+
+DF6 = DF6 %>% filter(!is.na(correct)) %>%
+  group_by(datetime) %>%
+  summarise(averageCorrect = mean(correct))
+DF6
+
+```
+
diff --git a/Class-Activity-1-Response.html b/Class-Activity-1-Response.html
new file mode 100644
index 0000000..ac09697
--- /dev/null
+++ b/Class-Activity-1-Response.html
@@ -0,0 +1,473 @@
+Class-Activity-1-Response.utf8.md
+ + + + + + + +
+

title: "Class Activity 1"
Author: Timothy Lee
output: html_document
---

+

Load the libraries tidyr and dplyr

+

Create a data frame from the swirl-data.csv file called DF1

+

The variables are:

+ +

Create a new data frame that only includes the variables hash, lesson_name and attempt called DF2

+
DF2 = DF1 %>%  select("hash", "lesson_name", "attempt")
+#DF2 = data.frame('Hash' = DF1$hash, 'LessonName' = DF1$lesson_name, 'Attempt' = DF1$attempt)
+

Use the group_by function to create a data frame that sums all the attempts for each hash by each lesson_name called DF3

+

Is this correct?

+
DF3 = DF2 %>% filter(!is.na(attempt)) %>% group_by(hash, lesson_name) %>% summarise(sumAttempts = sum(attempt))
+

On a scrap piece of paper draw what you think DF3 would look like if all the lesson names were column names

+

Something like:

+ + + + + + + + + + + + + + + + + + + + + + + + + +
| Hash | Basic Building Blocks | Logic |
| ---- |:---------------------:| -----:|
| 2864 | 2 | 45 |
| 4807 | 3 | 100 |
| 2864 | 45 | 251 |
+

Convert DF3 to this format.

+
DF3spread = DF3 %>% spread(key = lesson_name, value = sumAttempts)
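The long-to-wide conversion above can be sketched on a toy data frame. This is a minimal illustration rather than the class data: the `long` data frame and its values are hypothetical, and the sketch uses `pivot_wider`, the newer tidyr counterpart of `spread`, on the assumption that tidyr 1.0.0 or later is installed.

```r
library(tidyr)

# Hypothetical long-format summary: one row per student per lesson
long <- data.frame(
  hash = c(2864, 2864, 4807),
  lesson_name = c("Basic Building Blocks", "Logic", "Logic"),
  sumAttempts = c(2, 45, 100)
)

# One row per hash, one column per lesson name;
# student/lesson combinations that never occurred are filled with NA
wide <- pivot_wider(long, names_from = lesson_name, values_from = sumAttempts)
```

With `spread`, the equivalent call would be `spread(long, key = lesson_name, value = sumAttempts)`. Either way, each `hash` ends up on exactly one row.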
+

Create a new data frame from DF1 called DF4 that only includes the variables hash, lesson_name and correct

+
DF4 = DF1 %>% select("hash", "lesson_name", "correct")
+

Convert the correct variable so that TRUE is coded as the number 1 and FALSE is coded as 0

+
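DF4coded = DF4 %>% mutate(codedCorrect = as.numeric(as.logical(correct))) #correct is stored as a factor, so as.numeric() alone returns the level codes (3 and 2), not 1 and 0; converting to logical first avoids having to subtract 2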
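The result of the conversion step above depends on how `correct` was read in. The sketch below assumes it arrived as a factor with an extra empty-string level (as `read.csv` could produce by default before R 4.0); the `correct` vector here is a hypothetical stand-in, not the class data.

```r
# Hypothetical stand-in for the `correct` column, stored as a factor.
# The empty string sorts first, so the levels are "", "FALSE", "TRUE".
correct <- factor(c("TRUE", "FALSE", "", "TRUE"))

# as.numeric() on a factor returns the internal level codes, not 0/1
level_codes <- as.numeric(correct)

# Converting to logical first uses the level labels,
# so TRUE/FALSE map to 1/0 and the empty string becomes NA
coded_correct <- as.numeric(as.logical(correct))
```

Going through `as.logical` keeps the coding independent of how many levels the factor happens to have, which a fixed offset such as subtracting 2 does not.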
+

Create a new data frame called DF5 that provides a mean score for each student on each course

+
DF5 = DF4coded %>% filter(!is.na(codedCorrect)) %>% group_by(hash, lesson_name) %>% summarise(meanScore = mean(codedCorrect))
+

Extra credit Convert the datetime variable into month-day-year format and create a new data frame (DF6) that shows the average correct for each day

+
DF6 = DF1 %>% select("hash", "correct", "datetime") %>% mutate(datetime = as.POSIXct(datetime, origin = "1970-01-01")) %>% filter(!is.na(correct)) %>% group_by(datetime)
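Grouping by the full timestamp gives one group per second, so a per-day average needs the time of day dropped first. A minimal sketch of that step follows, assuming `datetime` holds seconds since 1970-01-01 (as the `origin` argument above suggests) and `correct` is already coded 1/0; the `events` data frame, its values, and the choice of UTC are hypothetical.

```r
library(dplyr)

# Hypothetical rows: epoch-second timestamps and 1/0 correctness
events <- data.frame(
  datetime = c(1568000000, 1568003600, 1568100000),
  correct = c(1, 0, 1)
)

daily <- events %>%
  # Convert to a calendar date so every attempt on one day shares a key
  mutate(day = as.Date(as.POSIXct(datetime, origin = "1970-01-01", tz = "UTC"), tz = "UTC")) %>%
  group_by(day) %>%
  summarise(averageCorrect = mean(correct)) %>%
  # Month-day-year label, as the exercise asks for
  mutate(day_label = format(day, "%m-%d-%Y"))
```

Keeping `day` as a `Date` and adding the formatted label separately means the rows still sort chronologically, which a month-first character string would not do across years.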
+ + + + +
From 776c6f31f899632bd2449602ccb47c3365af7aab Mon Sep 17 00:00:00 2001
From: Timothy Lee
Date: Wed, 8 Apr 2020 16:28:55 +0800
Subject: [PATCH 2/4] update README - course info, links, clean instructor notes

---
 README.md | 59 ++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 34 insertions(+), 25 deletions(-)

diff --git a/README.md b/README.md
index 523ab9a..191d179 100644
--- a/README.md
+++ b/README.md
@@ -1,39 +1,48 @@
-# Class Activity 1
-## Data Manipulation
+# Data Wrangling Basics in R
+
+This repo contains files for an in-class activity introducing the class to data
+wrangling functions from the `dplyr` and `tidyr` packages.
+
+HUDK 4050 is the first of three core courses in the Learning Analytics MS at
+Teachers College, Columbia University focusing on the thinking, methods, and
+conventions in data science. Particular attention is given to the fields of
+Educational Data Mining and Learning Analytics. Refer to the
+[Syllabus](https://github.com/timothyLeeXQ/HUDK-4050-Syllabus) (forked from
+the [main repo](https://github.com/core-methods-in-edm/syllabus) which may
+contain updates for future class iterations) for more information on HUDK 4050.
+
+Other classes in the series are:
+* [HUDK 4051: Learning Analytics:
+  Process and Theory](https://github.com/timothyLeeXQ/HUDK-4051-Syllabus) ([Main
+  repo](https://github.com/la-process-and-theory/syllabus))
+* HUDK 5053: Feature Engineering Studio (Starting in May 2020.
+  [Main repo](https://github.com/feature-engineering-studio/syllabus))
+
+## Instructor Notes
+
+### Data Manipulation
 
 In this repository you will find data describing Swirl activity from the class so far this semester. Please connect RStudio to this repository.
 
-### Instructions
-
+#### Instructions
+
 1. Open a new R Markdown file, please write and run all your commands from within the R Markdown document
 2. Delete the contents of the Markdown file and insert a new code block
 3. Load the libraries `tidyr` and `dplyr`
-4. Create a data frame from the `swirl-data.csv` file called `DF1`
-
-The variables are:
-
-`course_name` - the name of the R course the student attempted
-`lesson_name` - the lesson name
-`question_number` - the question number attempted
-`correct` - whether the question was answered correctly
-`attempt` - how many times the student attempted the question
-`skipped` - whether the student skipped the question
-`datetime` - the date and time the student attempted the question
-`hash` - anonymyzed student ID
-
+4. Create a data frame from the `swirl-data.csv` file called `DF1`. The variables are:
+    - `course_name` - the name of the R course the student attempted
+    - `lesson_name` - the lesson name
+    - `question_number` - the question number attempted
+    - `correct` - whether the question was answered correctly
+    - `attempt` - how many times the student attempted the question
+    - `skipped` - whether the student skipped the question
+    - `datetime` - the date and time the student attempted the question
+    - `hash` - anonymised student ID
 5. Create a new data frame that only includes the variables `hash`, `lesson_name` and `attempt` called `DF2`
-
 6. Use the `group_by` function to create a data frame that sums all the attempts for each `hash` by each `lesson_name` called `DF3`
-
 7. On a scrap piece of paper draw what you think `DF3` would look like if all the lesson names were column names
-
 8. Convert `DF3` to this format
-
 9. Create a new data frame from `DF1` called `DF4` that only includes the variables `hash`, `lesson_name` and `correct`
-
 10. Convert the `correct` variable so that `TRUE` is coded as the **number** `1` and `FALSE` is coded as `0`
-
 11. Create a new data frame called `DF5` that provides a mean score for each student on each course
-
 12. **Extra credit** Convert the `datetime` variable into month-day-year format and create a new data frame (`DF6`) that shows the average correct for each day
-
From 4d2331b323d8a9a045df9c9eb581718a1245f7c0 Mon Sep 17 00:00:00 2001
From: Timothy Lee
Date: Wed, 8 Apr 2020 16:29:03 +0800
Subject: [PATCH 3/4] add .gitattributes

---
 .gitattributes | 2 ++
 1 file changed, 2 insertions(+)
 create mode 100644 .gitattributes

diff --git a/.gitattributes b/.gitattributes
new file mode 100644
index 0000000..32a4ddf
--- /dev/null
+++ b/.gitattributes
@@ -0,0 +1,2 @@
+*.html linguist-detectable=false
+*.Rmd linguist-language=R
\ No newline at end of file
From 7bd9e05280cb386ae86c70ba0f4c3ee25f156be5 Mon Sep 17 00:00:00 2001
From: Timothy Lee
Date: Wed, 8 Apr 2020 16:34:33 +0800
Subject: [PATCH 4/4] update readme with class activity number

---
 README.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 191d179..40aa1a6 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,8 @@
 # Data Wrangling Basics in R
 
-This repo contains files for an in-class activity introducing the class to data
-wrangling functions from the `dplyr` and `tidyr` packages.
+This repo contains files for an in-class activity (class activity 1)
+introducing the class to data wrangling functions from the `dplyr` and `tidyr`
+packages.
 
 HUDK 4050 is the first of three core courses in the Learning Analytics MS at
 Teachers College, Columbia University focusing on the thinking, methods, and