---
title: "Assignment 3: K Means Clustering"
---

In this assignment we will be applying the K-means clustering algorithm we looked at in class. At the following link you can find a description of K-means:

https://www.cs.uic.edu/~wilkinson/Applets/cluster.html


```{r}
library(tidyr)
library(dplyr)
library(klaR)
```

Now, upload the file "Class_Motivation.csv" from the Assignment 3 Repository as a data frame called "K1".
```{r}

K1 <- read.csv("Class_Motivation.csv",header = TRUE,stringsAsFactors = FALSE)

```

This file contains the self-reported motivation scores for a class over five weeks. We are going to look for patterns in motivation over this time and sort people into clusters based on those patterns.

But before we do that, we will need to manipulate the data frame into a structure that can be analyzed by our clustering algorithm.

The algorithm will treat each row as a person and every column as a value belonging to that person, so we need to remove the id variable.

```{r}

K2 <- K1[,-1]
```

It is important to think about the meaning of missing values when clustering. We could treat them as having meaning or we could remove those people who have them. Neither option is ideal. What problems do you foresee if we recode or remove these values? Write your answers below:

Problems with either approach:

1. Remove: the sample size shrinks, and the people who remain may no longer be representative of the class.
2. Recode: the filled-in values are treated as real observations, so they can distort the distances between people and push them into the wrong cluster.


We will remove people with missing values for this assignment, but keep in mind the issues that you have identified.


```{r}

# Create a data frame with only those people who have no missing values.
# na.omit() "omits" every row with a missing value, also known as "listwise deletion":
# it runs down the list deleting rows as it goes.
K3 <- na.omit(K2)

```
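
The trade-off between removing and recoding can be seen on a small example. The sketch below uses a made-up data frame called `toy` (not the class data) and column-mean imputation as one possible recoding, which is an assumption chosen for illustration only.

```r
# Toy data (hypothetical, not from Class_Motivation.csv)
toy <- data.frame(
  week1 = c(1, 2, NA, 4),
  week2 = c(2, NA, 3, 5),
  week3 = c(3, 3, 3, 4)
)

# Option 1: listwise deletion keeps only the complete rows
complete <- na.omit(toy)
nrow(toy) - nrow(complete)  # number of people dropped

# Option 2: replace each NA with the mean of its column
imputed <- as.data.frame(lapply(toy, function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}))

# Imputation keeps every row but shrinks the spread of the affected columns
sapply(toy, sd, na.rm = TRUE)
sapply(imputed, sd)
```

Deletion loses people, while mean imputation keeps them but makes them look more "average" than they may really be.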

Another pre-processing step used in K-means is to standardize the values so that every week is on the same scale (mean 0, standard deviation 1). We do this because we want to treat each week as equally important: if we do not standardise, the week with the largest spread will have the greatest impact on which clusters are formed. We standardise the values by using the "scale()" command.

```{r}

K3 <- scale(K3) # centre each column (week) on 0 and divide by its standard deviation, so that no week dominates the distances

```
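
As a quick check of what `scale()` does, the sketch below standardises a made-up data frame (`toy`, an assumption for illustration) and compares the result with the calculation done by hand.

```r
# Two hypothetical weeks measured on very different scales
toy <- data.frame(
  week1 = c(1, 2, 3, 4, 5),
  week2 = c(10, 30, 50, 70, 90)
)

scaled <- scale(toy)  # returns a matrix, not a data frame

# The same calculation by hand: subtract the column mean, divide by the column SD
manual <- sapply(toy, function(x) (x - mean(x)) / sd(x))

round(colMeans(scaled), 10)  # column means after scaling
apply(scaled, 2, sd)         # column standard deviations after scaling
all.equal(as.numeric(scaled), as.numeric(manual))
```

After scaling, both weeks contribute equally to the distances that K-means uses, even though week2 started with a much larger spread.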


Now we will run the K-means clustering algorithm we talked about in class.

1) The algorithm starts by randomly choosing some starting values (cluster centres)
2) Associates each observation with the centre nearest to it
3) Calculates the mean of each of those clusters
4) Moves each centre to the mean of its cluster
5) Re-associates each observation with the centre now nearest to it
6) Continues this process until the clusters are no longer changing
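
The steps above can be sketched in a few lines of base R. This is a minimal illustration of the basic (Lloyd-style) procedure, not the code behind R's `kmeans()`, which uses the Hartigan-Wong variant by default. The function name `simple_kmeans` and the toy data are assumptions made for this example.

```r
# simple_kmeans is a hypothetical helper written for illustration only
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # 1) choose k random observations as the starting centres
  centres <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # 2) squared Euclidean distance from every observation to every centre
    d <- sapply(seq_len(k), function(j) {
      rowSums((x - matrix(centres[j, ], nrow(x), ncol(x), byrow = TRUE))^2)
    })
    new_cluster <- max.col(-d, ties.method = "first")  # index of the nearest centre
    # 6) stop once the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # 3-4) move each centre to the mean of the observations assigned to it
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centres[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centres)
}

# Two well-separated made-up groups of 10 points each
set.seed(1)
toy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)
fit_toy <- simple_kmeans(toy, 2)
table(fit_toy$cluster)
```

Because the starting centres are random, the cluster numbers (1 or 2) can swap between runs even when the grouping itself is the same.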

Notice that in this case we have 5 variables and in class we only had 2. We cannot visualise this process directly with 5 variables, because it would take a five-dimensional plot.

Also, we need to choose the number of clusters we think are in the data. We will start with 2.

```{r}

fit <- kmeans(K3, 2)

#We have created an object called "fit" that contains all the details of our clustering, including which observations belong to each cluster.

#We can access the cluster assignments by typing "fit$cluster". The labels above the values are the original row numbers, so gaps show where we deleted rows with missing values.

fit$cluster

#We can also attach these clusters to the scaled data by using the "data.frame" command to create a new data frame called K4.

K4<-data.frame(K3,fit$cluster)

#Have a look at the K4 dataframe. Let's change the names of the variables to make them more convenient with the names() command.
names(K4)<- c("1", "2", "3", "4", "5", "Cluster")

```
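
Because the starting values are random, the cluster assignments can change every time the chunk above is run. A common way to make the result repeatable is to call `set.seed()` immediately before `kmeans()` and to try several random starts with `nstart`. The sketch below uses made-up data, and the seed value 123 and `nstart = 25` are arbitrary choices for illustration.

```r
# Made-up standardised data: 20 people, 5 weeks
set.seed(1)
toy <- scale(matrix(rnorm(100), ncol = 5))

# Setting the same seed right before each call gives the same clustering
set.seed(123)
fit_a <- kmeans(toy, centers = 2, nstart = 25)  # keep the best of 25 random starts

set.seed(123)
fit_b <- kmeans(toy, centers = 2, nstart = 25)

identical(fit_a$cluster, fit_b$cluster)
```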

Now we need to visualize the clusters we have created. To do so we want to play with the structure of our data. What would be most useful would be if we could visualize average motivation by cluster, by week. To do this we will need to convert our data from wide to long format. Remember your old friends tidyr and dplyr!

First let's use tidyr to convert from wide to long format.
```{r}
K5 <- gather(K4,Week,Motivation,-Cluster)
```
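
In tidyr 1.0.0 and later, `gather()` still works but is superseded by `pivot_longer()`. The sketch below shows both on a small made-up data frame (`wide`, an assumption for illustration); the two results contain the same values, only in a different row order.

```r
library(tidyr)

# Hypothetical wide data: one row per person, one column per week
wide <- data.frame(
  `1` = c(0.5, -0.5),
  `2` = c(1.0, -1.0),
  Cluster = c(1, 2),
  check.names = FALSE
)

# The approach used in this assignment
long_gather <- gather(wide, Week, Motivation, -Cluster)

# The newer equivalent
long_pivot <- pivot_longer(wide, cols = -Cluster,
                           names_to = "Week", values_to = "Motivation")

long_gather
long_pivot
```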

Now let's use dplyr to average our motivation values by week and by cluster.

```{r}
K6 <- K5 %>%
  group_by(Week, Cluster) %>%
  summarise(Motivation = mean(Motivation)) # name the summary column directly

```

Now it's time to do some visualization:

https://www.cs.uic.edu/~wilkinson/TheGrammarOfGraphics/GOG.html

And you can see the range of available graphics in ggplot here:

http://ggplot2.tidyverse.org/reference/index.html

We are going to create a line plot similar to the one created in the school dropout paper we looked at in class (Bowers, 2010). It will have motivation on the Y-axis and weeks on the X-axis. To do this we will want our week variable to be treated as a number, but because it was created from variable names it is currently being treated as a character variable. You can see this if you click on the arrow on the left of K6 in the Data pane. Week is designated by "chr". To convert it to numeric, we use the as.numeric command.

Likewise, since "cluster" is not numeric but rather a categorical label we want to convert it from an "integer" format to a "factor" format so that ggplot does not treat it as a number. We can do this with the as.factor() command.

```{r}

K6$Week <- as.numeric(K6$Week)

K6$Cluster <- as.factor(K6$Cluster)

```

Now we can plot our line plot using the ggplot command, "ggplot()".

- The first argument in a ggplot is the dataframe we are using: K6
- Next is what is called an aesthetic (aes), the aesthetic tells ggplot which variables to use and how to use them. Here we are using the variables "Week" and "Motivation" on the x and y axes and we are going to color the lines using the "Cluster" variable
- Then we are going to tell ggplot which type of plot we want to use by specifying a "geom", in this case a line plot: geom_line()
- Finally we are going to clean up our axes labels: xlab("Week") & ylab("Average Motivation")

```{r}
library(ggplot2)
ggplot(K6,aes(x=Week,y=Motivation,color=Cluster))+
  geom_line()+
  xlab("Week")+
  ylab("Average Motivation")

```

What patterns do you see in the plot?

The two clusters show different trends in motivation: one rises smoothly, while the other peaks in the third week and then drops sharply. Overall, the two clusters trend in opposite directions.

It would be useful to determine how many people are in each cluster. We can do this easily with dplyr.

```{r}
K7 <- table(K4$Cluster)
K7
#we have 15 students in cluster 1 and 8 in cluster 2
```
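
The same count can be produced with dplyr's `count()`. The sketch below uses a made-up data frame (`toy`, an assumption for illustration) in place of K4.

```r
library(dplyr)

# Hypothetical cluster assignments for five people
toy <- data.frame(Cluster = c(1, 1, 2, 1, 2))

# One row per cluster, with the number of people in column n
sizes <- count(toy, Cluster)
sizes
```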

Look at the number of people in each cluster, now repeat this process for 3 rather than 2 clusters. Which cluster grouping do you think is more informative? Write your answer below:

```{r}
fit1 <- kmeans(K3, 3) # after scaling, set K = 3 to create 3 clusters

fit1$cluster # check which cluster each observation was assigned to

# Attach these clusters to the scaled data with data.frame() to create a new data frame called KK4.
KK4<-data.frame(K3,fit1$cluster)

# Rename the variables with names() to make them more convenient.
names(KK4)<- c("1", "2", "3", "4", "5", "Cluster")

KK5 <- gather(KK4,Week,Motivation,-Cluster)
# Week is a character variable here because gather() builds it from the column names; Cluster is numeric.

# Average the motivation values by week and by cluster.
KK6 <- KK5 %>%
  group_by(Week,Cluster)%>%
  summarise(mean(Motivation))

# Rename the columns: the summary column is otherwise called `mean(Motivation)`, which is awkward to refer to in ggplot.
names(KK6)<-c("Week","Cluster","Motivation")


KK6$Week <- as.numeric(KK6$Week)
KK6$Cluster <- as.factor(KK6$Cluster)
# Cluster is a categorical label rather than a number, so convert it from integer to factor so that ggplot does not treat it as a continuous variable.

# Draw the graph
ggplot(KK6,aes(x=Week,y=Motivation,color=Cluster))+
  geom_line()+
  xlab("Week")+
  ylab("Average Motivation")

KK7<-table(KK4$Cluster)
KK7 # There are 9 students in cluster 1, 7 in cluster 2 and 7 in cluster 3.


# Assess the 2-cluster and 3-cluster solutions by comparing the summed within-cluster variance of motivation
var_k2<-K5%>%
  group_by(Cluster)%>%
  summarise(var(Motivation))%>%
  dplyr::select(2)

var_k3<-KK5%>%
  group_by(Cluster)%>%
  summarise(var(Motivation))%>%
  dplyr::select(2)

paste("Decide: 3 clusters is better/more informative than 2 clusters. Answer:",sum(var_k2)-sum(var_k3)>0)
```
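
Within-cluster variation almost always shrinks as more clusters are added, so a lower value for three clusters does not by itself show that three is the better choice. A common complement is to look at how quickly the total within-cluster sum of squares (`tot.withinss` in the object returned by `kmeans()`) falls as k grows, and to pick the k where the curve flattens. The sketch below uses made-up data, and the range of k and `nstart = 20` are arbitrary choices for illustration.

```r
# Made-up standardised data: 30 people, 5 weeks
set.seed(42)
toy <- scale(matrix(rnorm(150), ncol = 5))

# Total within-cluster sum of squares for k = 1 to 6
ks <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 20)$tot.withinss
})

data.frame(k = ks, tot_withinss = wss)

# An "elbow" plot: look for the k after which the drop levels off
plot(ks, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```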

## Part II

Using the data collected for Assignment 2 (which classes students were in), cluster the students, then redraw the graph of the class but color the students according to the cluster they are in.
```{r}
df<-read.csv("hudk4050-classes.csv")

#Clean the data
#Tidy the data by removing all spaces and hyphens and converting to upper case
df<-do.call(cbind,lapply(df,function(x) toupper(gsub("\\s|-","",x))))

df<-data.frame(df)
df<-filter(df,First.Name!="ZIMO") #drop ZIMO's dirty data
df<-unite(df,Full.Name,First.Name,Last.Name)

#Cluster on the course columns only; Full.Name is an identifier, not a feature
fit2<-kmodes(dplyr::select(df,-Full.Name),4)
df1<-data.frame(df,fit2$cluster)

df2<-df%>%
  gather(Class,Course,-Full.Name)%>%
  filter(Course!="")%>%
  dplyr::select(1,3)

#Adding column Count:"1" for spread
df2$Count<-1

#Creating a matrix for person-course
person_course<-spread(df2,Course,Count,fill=0)

library(tibble)
person_course<-person_course %>%
  remove_rownames %>%
  column_to_rownames(var="Full.Name")

#Creating a matrix for person_person
person_course<-as.matrix(person_course)
course_person<-t(person_course)
person_person_matrix<-person_course%*%course_person

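#Create a graph from the person-person adjacency matrix
library(igraph)
#The matrix is symmetric (shared courses), so the graph is undirected
g<-graph_from_adjacency_matrix(person_person_matrix,mode="undirected",diag=FALSE)
#Look up each vertex's cluster by name so the colours follow the matrix row order
plot(g,vertex.color=df1$fit2.cluster[match(V(g)$name,df1$Full.Name)])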
```
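
The person-to-person matrix above comes from multiplying the person-by-course matrix by its transpose: each cell then counts the courses two students share. The sketch below shows this on three made-up students and courses (all names are hypothetical), and also shows why cluster labels should be matched to the matrix rows by name rather than by position.

```r
# Hypothetical person-by-course matrix: 1 = enrolled, 0 = not enrolled
person_course_toy <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    0, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ANA", "BEN", "CHO"),
                  c("COURSE_A", "COURSE_B", "COURSE_C"))
)

# Each cell counts the courses shared by the row person and the column person
person_person_toy <- person_course_toy %*% t(person_course_toy)
diag(person_person_toy) <- 0  # ignore each person's overlap with themselves
person_person_toy

# Cluster labels stored in a different order from the matrix rows
clusters <- c(CHO = 2, ANA = 1, BEN = 1)

# Index by name so each label lines up with the right row
aligned <- clusters[rownames(person_person_toy)]
aligned
```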

## Part III

In class activity 6 you clustered students in the class by their answers to a questionnaire. Create a visualization that shows the overlap between these clusters and the clusters generated in Part II.

```{r}
df_6<-read.csv("HUDK405019-clustering.csv",encoding ="UTF-8",stringsAsFactors = FALSE)
df_6<-df_6%>%
  unite(Full.Name,First.Name,Last.Name)

#Convert the index numbers of the data frame into the student names.

row.names(df_6) <- df_6$Full.Name

df_6$Full.Name <- NULL

#Wrangle data using dplyr to include only the numerical values.

#Remove location variables
df_7 <- dplyr::select(df_6, 1:11)

#Remove any letters
df_8<-lapply(df_7,function(x)gsub("[A-Za-z]","",x))
df_8<-data.frame(df_8)

#Convert all variables to numeric
for (i in seq_along(df_8)){
  #if a column is a factor, it must be converted to character first
  df_8[,i]<-as.numeric(as.character(df_8[,i]))
}

#Scale the data so that no variable has undue influence
df_8 <- data.frame(scale(df_8))

#Replace missing values with the average score, i.e. zero, because the data are now scaled
df_8 <- df_8 %>% mutate_all(~ ifelse(is.na(.), 0, .))

#Handle the columns with latitudes and longitudes
df_9 <- dplyr::select(df_6, 13:14)

#Change names for convenience
names(df_9) <- c("lattitude", "longitude")

#Remove any letters and question marks
df_9 <- df_9 %>% mutate_all(~ gsub("[a-zA-Z]", "", .))
df_9 <- df_9 %>% mutate_all(~ sub("[?]", "", .))

#Remove anything after the first non-numeric character in lattitude
df_9$lattitude <- sub(",.*$","", df_9$lattitude)
df_9$lattitude <- sub("°.*$","", df_9$lattitude)

#Remove anything before the first non-numeric character in longitude
df_9$longitude <- gsub(".*,","",df_9$longitude)
df_9$longitude <- sub("°.*$","", df_9$longitude)

#Convert all variables to numeric
df_9 <- df_9 %>% mutate_all(~ as.numeric(.))


fit3<-kmeans(df_8,4)

fit3$cluster

df_10<-data.frame(df_8,df_9,fit3$cluster)

#graph created in activity 6
library(ggplot2)
ggplot(df_10, aes(longitude, lattitude, color = as.factor(fit3.cluster))) +
  geom_point(size = 3)

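#Combine the two sets of clusters
Full.Name<-toupper(row.names(df_6))
df_10<-data.frame(Full.Name,df_10)
df_11<-inner_join(df_10,df1,by="Full.Name")
#Replace the names with an index; use the joined row count rather than a hard-coded 36
df_11$Full.Name<-as.factor(seq_len(nrow(df_11)))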


#Overlay two point layers
#Diamond: cluster based on the questionnaire (fit3)
#Filled circle: cluster based on the class network (fit2)
ggplot(df_11,aes(x=Full.Name,y=fit3.cluster,color=as.factor(fit3.cluster)))+
  geom_point(size=5,pch=5)+
  geom_point(data=df_11,aes(x=Full.Name,y=fit2.cluster,color=as.factor(fit2.cluster)),size=5,pch=19)+
  xlab("Students")+
  ylab("Clusters")+
  labs(color='Diamond:Cluster1;Circle:Cluster2')


```
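
A cross-tabulation is another way to look at the overlap between two clusterings: each cell counts the students who fall in a given pair of clusters. The sketch below uses made-up cluster labels for eight students (both vectors are assumptions for illustration, not results from the data above).

```r
# Hypothetical cluster labels for the same eight students
questionnaire_cluster <- c(1, 1, 2, 2, 3, 3, 1, 2)
network_cluster       <- c(1, 1, 1, 2, 2, 2, 1, 2)

# Rows: questionnaire clusters; columns: class-network clusters
overlap <- table(questionnaire = questionnaire_cluster,
                 network = network_cluster)
overlap

# Share of each questionnaire cluster that lands in each network cluster
round(prop.table(overlap, margin = 1), 2)
```

Cluster numbers are arbitrary labels, so a strong overlap shows up as one large cell per row rather than as matching numbers on the diagonal.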

## Please render your code as an .html file using knitr and submit a pull request with both your .Rmd and .html files to the Assignment 3 repository.