2 changes: 2 additions & 0 deletions .gitattributes
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
*.html linguist-detectable=false
*.Rmd linguist-language=R
5 changes: 5 additions & 0 deletions .gitignore
@@ -0,0 +1,5 @@
.Rproj.user
.Rhistory
.RData
.Ruserdata
*.Rproj
100 changes: 100 additions & 0 deletions Class Activity 1 Response.Rmd
@@ -0,0 +1,100 @@
---
title: "Class Activity 1"
author: "Timothy Lee"
output: html_document
---


**Load the libraries tidyr and dplyr**

```{r echo = FALSE, message = FALSE}
library(tidyr)
library(dplyr)
```

**Create a data frame from the swirl-data.csv file called DF1**

```{r echo = FALSE}
# Relative path: swirl-data.csv sits in the repository root next to this file
DF1 = read.csv("swirl-data.csv", header = TRUE)
```

**The variables are:**

* **`course_name` - the name of the R course the student attempted**
* **`lesson_name` - the lesson name**
* **`question_number` - the question number attempted**
* **`correct` - whether the question was answered correctly**
* **`attempt` - how many times the student attempted the question**
* **`skipped` - whether the student skipped the question**
* **`datetime` - the date and time the student attempted the question**
* **`hash` - anonymized student ID**

**Create a new data frame that only includes the variables `hash`, `lesson_name` and `attempt` called DF2**

```{r}
DF2 = DF1 %>% select("hash", "lesson_name", "attempt")
#DF2 = data.frame('Hash' = DF1$hash, 'LessonName' = DF1$lesson_name, 'Attempt' = DF1$attempt)
```

**Use the `group_by` function to create a data frame that sums all the attempts for each hash by each lesson_name called DF3**

Is this correct?
```{r}
DF3 = DF2 %>% filter(!is.na(attempt)) %>% group_by(hash, lesson_name) %>% summarise(sumAttempts = sum(attempt))
```
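
One way to answer the "Is this correct?" question is to run the same pipeline on a tiny data frame whose totals can be checked by hand. The sketch below uses invented values (the `toy` data frame is hypothetical, not taken from `swirl-data.csv`) and adds `.groups = "drop"`, which assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data standing in for DF2 (values invented for illustration)
toy <- data.frame(
  hash        = c(1, 1, 1, 2, 2),
  lesson_name = c("Logic", "Logic", "Vectors", "Logic", "Logic"),
  attempt     = c(1, 2, NA, 1, 1)
)

# Drop missing attempts, then sum the attempts per student per lesson
toy_sums <- toy %>%
  filter(!is.na(attempt)) %>%
  group_by(hash, lesson_name) %>%
  summarise(sumAttempts = sum(attempt), .groups = "drop")

toy_sums
```

Note that filtering out `NA` rows removes a student-lesson pair entirely when all of its attempts are missing (student 1 on "Vectors" here), whereas `sum(attempt, na.rm = TRUE)` without the filter would keep that pair with a total of 0. Which behaviour is wanted depends on how missing attempts should be interpreted.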

**On a scrap piece of paper draw what you think DF3 would look like if all the lesson names were column names**

Something like:

| Hash | Basic Building Blocks | Logic |
| ------------- |:-------------------------:| -----:|
| 2864 | 2 | 45 |
| 4807 | 3 | 100 |
| 2864 | 45 | 251 |


**Convert DF3 to this format**

```{r}
DF3spread = DF3 %>% spread(key = lesson_name, value = sumAttempts)
```
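
`spread()` still works, but the tidyr documentation marks it as superseded and recommends `pivot_wider()` for new code. A minimal sketch of the equivalent call, assuming tidyr 1.0.0 or later and using a hypothetical `long` data frame with invented values in the same shape as DF3:

```r
library(tidyr)

# Hypothetical long-format data shaped like DF3 (values invented for illustration)
long <- data.frame(
  hash        = c(1, 1, 2),
  lesson_name = c("Logic", "Vectors", "Logic"),
  sumAttempts = c(3, 5, 2)
)

# One row per hash, one column per lesson; missing combinations become NA
wide <- pivot_wider(long, names_from = lesson_name, values_from = sumAttempts)

wide
```

Student 2 has no "Vectors" row in the long data, so that cell is `NA` in the wide result, which matches what `spread()` does by default.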

**Create a new data frame from DF1 called DF4 that only includes the variables `hash`, `lesson_name` and `correct`**

```{r}
DF4 = DF1 %>% select("hash", "lesson_name", "correct")
```

**Convert the correct variable so that `TRUE` is coded as the number 1 and `FALSE` is coded as 0**
```{r}
codedCorrect = ifelse(DF4$correct == TRUE, yes = 1, no = 0)
# No additional ifelse() is needed for NA: the ifelse documentation says
# "Missing values in test give missing values in the result."
# I checked and this appears to work.
DF4 = cbind(DF4, codedCorrect)
```
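
If `correct` was read in as a logical column, coercing it directly gives the same coding as the `ifelse()` call, with `NA` preserved. The sketch below compares the two on a hypothetical vector; whether `read.csv` actually produced a logical column for this file is an assumption worth checking with `class(DF4$correct)`.

```r
# Hypothetical logical vector with a missing value
correct <- c(TRUE, FALSE, NA, TRUE)

# The approach used above
coded_ifelse <- ifelse(correct == TRUE, yes = 1, no = 0)

# Direct coercion: TRUE becomes 1, FALSE becomes 0, NA stays NA
coded_numeric <- as.numeric(correct)

identical(coded_ifelse, coded_numeric)
```

The `ifelse()` version is the more forgiving of the two if the column turns out to be character, since `"TRUE" == TRUE` still evaluates to `TRUE` in R, while `as.numeric("TRUE")` yields `NA` with a warning.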

**Create a new data frame called DF5 that provides a mean score for each student on each course**

```{r}
# DF4 has no course_name column, so scores are averaged per student per lesson
DF5 = DF4 %>% filter(!is.na(codedCorrect)) %>% group_by(hash, lesson_name) %>% summarise(meanScore = mean(codedCorrect))
```

**Extra credit: Convert the `datetime` variable into month-day-year format and create a new data frame (DF6) that shows the average correct for each day**

```{r}
#Select only the relevant variables
DF6 = DF1 %>% select("correct", "datetime")
#Convert score to numbers - 1 for correct, 0 for incorrect
DF6$correct = ifelse(DF6$correct == TRUE, yes = 1, no = 0)
#Change datetime to proper datetime variables
DF6$datetime = as.POSIXct(DF6$datetime, origin = "1970-01-01")
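#Omit time so group_by groups by day and not second.
#floor_date truncates to the start of the day; round_date would push afternoon times into the next day
DF6$datetime = lubridate::floor_date(DF6$datetime, unit = 'day')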

DF6 = DF6 %>%
  filter(!is.na(correct)) %>%
  group_by(datetime) %>%
  summarise(averageCorrect = mean(correct))
DF6
```
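
The prompt also asks for month-day-year format. A minimal sketch of one way to get there is to truncate each timestamp to a calendar day with `as.Date()`, average per day, then add a formatted label with `format()`. The timestamps and scores below are invented, and treating the times as UTC is an assumption; the real data may call for a different time zone.

```r
library(dplyr)

# Hypothetical attempts: two on 1 September (morning and evening), one on 2 September
toy <- data.frame(
  datetime = as.POSIXct(
    c("2019-09-01 08:00:00", "2019-09-01 20:00:00", "2019-09-02 09:00:00"),
    tz = "UTC"
  ),
  correct = c(1, 0, 1)
)

daily <- toy %>%
  # as.Date() drops the time of day, so the 20:00 attempt stays on 1 September
  mutate(day = as.Date(datetime, tz = "UTC")) %>%
  group_by(day) %>%
  summarise(averageCorrect = mean(correct)) %>%
  # Month-day-year label for display; keep `day` as a Date for correct sorting
  mutate(day_mdy = format(day, "%m-%d-%Y"))

daily
```

Keeping the `Date` column alongside the formatted string is a deliberate choice: month-day-year strings sort alphabetically by month, not chronologically, so the label is best used only for presentation.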

473 changes: 473 additions & 0 deletions Class-Activity-1-Response.html

Large diffs are not rendered by default.

60 changes: 35 additions & 25 deletions README.md
@@ -1,39 +1,49 @@
# Class Activity 1
## Data Manipulation
# Data Wrangling Basics in R

This repo contains files for an in-class activity (class activity 1)
introducing the class to data wrangling functions from the `dplyr` and `tidyr`
packages.

HUDK 4050 is the first of three core courses in the Learning Analytics MS at
Teachers College, Columbia University focusing on the thinking, methods, and
conventions in data science. Particular attention is given to the fields of
Educational Data Mining and Learning Analytics. Refer to the
[Syllabus](https://github.com/timothyLeeXQ/HUDK-4050-Syllabus) (forked from
the [main repo](https://github.com/core-methods-in-edm/syllabus) which may
contain updates for future class iterations) for more information on HUDK 4050.

Other classes in the series are:
* [HUDK 4051: Learning Analytics:
Process and Theory](https://github.com/timothyLeeXQ/HUDK-4051-Syllabus) ([Main
repo](https://github.com/la-process-and-theory/syllabus))
* HUDK 5053: Feature Engineering Studio (Starting in May 2020.
[Main repo](https://github.com/feature-engineering-studio/syllabus))

## Instructor Notes

### Data Manipulation

In this repository you will find data describing Swirl activity from the class so far this semester. Please connect RStudio to this repository.

### Instructions
#### Instructions

1. Open a new R Markdown file. Please write and run all your commands from within the R Markdown document
2. Delete the contents of the Markdown file and insert a new code block
3. Load the libraries `tidyr` and `dplyr`
4. Create a data frame from the `swirl-data.csv` file called `DF1`

The variables are:

`course_name` - the name of the R course the student attempted
`lesson_name` - the lesson name
`question_number` - the question number attempted
`correct` - whether the question was answered correctly
`attempt` - how many times the student attempted the question
`skipped` - whether the student skipped the question
`datetime` - the date and time the student attempted the question
`hash` - anonymyzed student ID

4. Create a data frame from the `swirl-data.csv` file called `DF1`. The variables are:
- `course_name` - the name of the R course the student attempted
- `lesson_name` - the lesson name
- `question_number` - the question number attempted
- `correct` - whether the question was answered correctly
- `attempt` - how many times the student attempted the question
- `skipped` - whether the student skipped the question
- `datetime` - the date and time the student attempted the question
- `hash` - anonymised student ID
5. Create a new data frame that only includes the variables `hash`, `lesson_name` and `attempt` called `DF2`

6. Use the `group_by` function to create a data frame that sums all the attempts for each `hash` by each `lesson_name` called `DF3`

7. On a scrap piece of paper draw what you think `DF3` would look like if all the lesson names were column names

8. Convert `DF3` to this format

9. Create a new data frame from `DF1` called `DF4` that only includes the variables `hash`, `lesson_name` and `correct`

10. Convert the `correct` variable so that `TRUE` is coded as the **number** `1` and `FALSE` is coded as `0`

11. Create a new data frame called `DF5` that provides a mean score for each student on each course

12. **Extra credit** Convert the `datetime` variable into month-day-year format and create a new data frame (`DF6`) that shows the average correct for each day