core-methods-in-edm · timothyLeeXQ · Sep 29, 2019 · Apr 8, 2020 · Apr 8, 2020 · Apr 8, 2020
diff --git a/.gitattributes b/.gitattributes
@@ -0,0 +1,2 @@
+*.html linguist-detectable=false
+*.Rmd linguist-language=R
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,5 @@
+.Rproj.user
+.Rhistory
+.RData
+.Ruserdata
+*.Rproj
diff --git a/Class Activity 1 Response.Rmd b/Class Activity 1 Response.Rmd
@@ -0,0 +1,100 @@
+---
+title: "Class Activity 1"
+author: "Timothy Lee"
+output: html_document
+---
+
+
+**Load the libraries tidyr and dplyr**
+
+```{r echo = FALSE, message = FALSE}
+library(tidyr)
+library(dplyr)
+```
+
+**Create a data frame from the swirl-data.csv file called DF1**
+
+```{r echo = FALSE}
+DF1 = read.csv("C:/Users/Timothy/Google Drive/TC Stuff/Analytics/HUDK 4050 & 4051 - Learning Analytics/projects/class-activity-1/swirl-data.csv", header = TRUE)
+```
+
+**The variables are:**
+
+* **`course_name` - the name of the R course the student attempted**
+* **`lesson_name` - the lesson name**
+* **`question_number` - the question number attempted**
+* **`correct` - whether the question was answered correctly**
+* **`attempt` - how many times the student attempted the question**
+* **`skipped` - whether the student skipped the question**
+* **`datetime` - the date and time the student attempted the question**
+* **`hash` - anonymised student ID**
+
+**Create a new data frame that only includes the variables `hash`, `lesson_name` and `attempt` called DF2**
+
+```{r}
+DF2 = DF1 %>%  select("hash", "lesson_name", "attempt")
+#DF2 = data.frame('Hash' = DF1$hash, 'LessonName' = DF1$lesson_name, 'Attempt' = DF1$attempt)
+```
+
+**Use the `group_by` function to create a data frame that sums all the attempts for each hash by each lesson_name called DF3**
+
+Is this correct?
+```{r}
+DF3 = DF2 %>% filter(!is.na(attempt)) %>% group_by(hash, lesson_name) %>% summarise(sumAttempts = sum(attempt))
+```
+
+**On a scrap piece of paper draw what you think DF3 would look like if all the lesson names were column names**
+
+Something like:
+
+| Hash          | Basic Building Blocks     | Logic |
+| ------------- |:-------------------------:| -----:|
+| 2864          | 2                         | 45    |
+| 4807          | 3                         | 100   |
+| 2864          | 45                        | 251   |
+
+
+Convert DF3 to this format.
+
+```{r}
+DF3spread = DF3 %>% spread(key = lesson_name, value = sumAttempts)
+```
+
+**Create a new data frame from DF1 called DF4 that only includes the variables `hash`, `lesson_name` and `correct`**
+
+```{r}
+DF4 = DF1 %>% select("hash", "lesson_name", "correct")
+```
+
+**Convert the correct variable so that `TRUE` is coded as the number 1 and `FALSE` is coded as 0**
+```{r}
+codedCorrect = ifelse(DF4$correct == TRUE, yes = 1, no = 0)
+#No need for additional ifelse for NA, since ifelse documentation says "Missing values in test give missing values in the result."
+#I checked and this appears to work.
+DF4 = cbind(DF4, codedCorrect)
+```
+
+**Create a new data frame called DF5 that provides a mean score for each student on each course**
+
+```{r}
+DF5 = DF4 %>% filter(!is.na(codedCorrect)) %>% group_by(hash, lesson_name) %>% summarise(meanScore = mean(codedCorrect))
+```
+
+**Extra credit: Convert the datetime variable into month-day-year format and create a new data frame (DF6) that shows the average correct for each day**
+
+```{r}
+#Select only the relevant variables
+DF6 = DF1 %>% select("correct", "datetime")
+#Convert score to numbers - 1 for correct, 0 for incorrect
+DF6$correct = ifelse(DF6$correct == TRUE, yes = 1, no = 0)
+#Change datetime to proper datetime variables
+DF6$datetime = as.POSIXct(DF6$datetime, origin = "1970-01-01")
+#Omit time so group_by groups by day and not second
+DF6$datetime = lubridate::floor_date(DF6$datetime, unit = 'day')
+
+DF6 = DF6 %>% filter(!is.na(correct)) %>%
+  group_by(datetime) %>%
+  summarise(averageCorrect = mean(correct))
+DF6
+
+```
+
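Note on the reshaping step in the response file above: the group, sum, and spread sequence can be tried without the CSV. The following is a minimal sketch, not the author's code. The `toy` data frame and its values are made up for illustration, and it assumes dplyr 1.0.0 or later (for `.groups`) and tidyr 1.0.0 or later, where `pivot_wider()` is the recommended successor to `spread()`.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for swirl-data.csv (values are made up)
toy <- data.frame(
  hash        = c(1, 1, 1, 2, 2),
  lesson_name = c("Logic", "Logic", "Vectors", "Logic", "Vectors"),
  attempt     = c(1, 2, 1, NA, 3)
)

toy_wide <- toy %>%
  filter(!is.na(attempt)) %>%                    # drop rows with no recorded attempt
  group_by(hash, lesson_name) %>%                # one group per student per lesson
  summarise(sumAttempts = sum(attempt), .groups = "drop") %>%
  pivot_wider(names_from = lesson_name,          # lesson names become column names
              values_from = sumAttempts)         # cells hold the summed attempts

toy_wide
```

Each `hash` ends up on exactly one row. A student with no remaining rows for a lesson gets `NA` in that lesson's column, which is also how `spread()` behaves by default.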
diff --git a/Class-Activity-1-Response.html b/Class-Activity-1-Response.html
diff --git a/README.md b/README.md
@@ -1,39 +1,49 @@
-# Class Activity 1
-## Data Manipulation
+# Data Wrangling Basics in R
+
+This repo contains files for an in-class activity (class activity 1)
+introducing the class to data wrangling functions from the `dplyr` and `tidyr`
+packages.
+
+HUDK 4050 is the first of three core courses in the Learning Analytics MS at
+Teachers College, Columbia University focusing on the thinking, methods, and
+conventions in data science. Particular attention is given to the fields of
+Educational Data Mining and Learning Analytics. Refer to the
+[Syllabus](https://github.com/timothyLeeXQ/HUDK-4050-Syllabus) (forked from
+the [main repo](https://github.com/core-methods-in-edm/syllabus) which may
+contain updates for future class iterations) for more information on HUDK 4050.
+
+Other classes in the series are:
+* [HUDK 4051: Learning Analytics:
+ Process and Theory](https://github.com/timothyLeeXQ/HUDK-4051-Syllabus) ([Main
+ repo](https://github.com/la-process-and-theory/syllabus))
+* HUDK 5053: Feature Engineering Studio (Starting in May 2020.
+ [Main repo](https://github.com/feature-engineering-studio/syllabus))
+
+## Instructor Notes
+
+### Data Manipulation
 
 In this repository you will find data describing Swirl activity from the class so far this semester. Please connect RStudio to this repository.
 
-### Instructions
-  
+#### Instructions
+
 1. Open a new R Markdown file, please write and run all your commands from within the R Markdown document  
 2. Delete the contents of the Markdown file and insert a new code block
 3. Load the libraries  `tidyr` and `dplyr`
-4. Create a data frame from the `swirl-data.csv` file called `DF1`
-
-The variables are:
-
-`course_name` - the name of the R course the student attempted  
-`lesson_name` - the lesson name  
-`question_number` - the question number attempted
-`correct` - whether the question was answered correctly  
-`attempt` - how many times the student attempted the question  
-`skipped` - whether the student skipped the question  
-`datetime` - the date and time the student attempted the question  
-`hash` - anonymyzed student ID  
-
+4. Create a data frame from the `swirl-data.csv` file called `DF1`. The variables are:
+   - `course_name` - the name of the R course the student attempted
+   - `lesson_name` - the lesson name
+   - `question_number` - the question number attempted
+   - `correct` - whether the question was answered correctly
+   - `attempt` - how many times the student attempted the question
+   - `skipped` - whether the student skipped the question
+   - `datetime` - the date and time the student attempted the question
+   - `hash` - anonymised student ID
 5. Create a new data frame that only includes the variables `hash`, `lesson_name` and `attempt` called `DF2`
-
 6. Use the `group_by` function to create a data frame that sums all the attempts for each `hash` by each `lesson_name` called `DF3`
-
 7. On a scrap piece of paper draw what you think `DF3` would look like if all the lesson names were column names
-
 8. Convert `DF3` to this format  
-
 9. Create a new data frame from `DF1` called `DF4` that only includes the variables `hash`, `lesson_name` and `correct`
-
 10. Convert the `correct` variable so that `TRUE` is coded as the **number** `1` and `FALSE` is coded as `0`  
-
 11. Create a new data frame called `DF5` that provides a mean score for each student on each course
-
 12. **Extra credit** Convert the `datetime` variable into month-day-year format and create a new data frame (`DF6`) that shows the average correct for each day
-
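Note on step 12 (extra credit): one possible way to get a month-day-year label and a per-day average, sketched in base R only. This is an illustration under stated assumptions, not the intended solution. It assumes `datetime` holds Unix timestamps in seconds and that days should be counted in UTC. The `toy` data frame and its values are made up.

```r
# Hypothetical toy data: Unix timestamps in seconds (values are made up)
toy <- data.frame(
  correct  = c(TRUE, FALSE, TRUE, NA, TRUE),
  datetime = c(1569718800, 1569780000, 1569805200, 1569808800, 1569841200)
)

# Code TRUE as 1 and FALSE as 0; NA stays NA
toy$correct <- as.numeric(toy$correct)

# Convert to date-times (UTC assumed), then keep only the calendar day
# as a month-day-year string, which discards the time of day
stamp <- as.POSIXct(toy$datetime, origin = "1970-01-01", tz = "UTC")
toy$day <- format(stamp, "%m-%d-%Y", tz = "UTC")

# Drop unanswered rows, then average within each day
toy <- toy[!is.na(toy$correct), ]
daily <- aggregate(correct ~ day, data = toy, FUN = mean)

daily
```

Formatting the timestamp truncates to the calendar day rather than rounding to the nearest one, so an 18:00 attempt stays on its own date. Month-day-year strings do not sort chronologically across years, so a `Date` column is the safer grouping key if ordering matters.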