-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpickCells.Rmd
330 lines (264 loc) · 14.1 KB
/
pickCells.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
---
title: "Picking CCLE cell lines for pathway analysis"
output:
github_document
---
This is for an R01 application. We want to study cellular signaling using proteomics. The idea is that current knowledge about signaling is very fuzzy, due to the overlap of signaling components between pathways, e.g. both EGFR and FGFR signal through the MAPK signaling cascade.
We were criticized about how model CCLE cell lines are picked for the study. So I want to see whether I can find a way to choose the cell lines based on the gene expression level of the pathway components.
The read count RNA-seq data are downloaded from the CCLE. The data will have to be normalized before use. To do so, DESeq2 will be used.
```{r packages, message=FALSE, echo=FALSE}
# required packages
library(tidyverse)
library(ggthemes)
library(DESeq2)
library(robustbase)
library(ggpubr)
library(qdapTools)
library(colorRamps)
library(RColorBrewer)
library(gplots)
```
```{r import data}
# import the raw count data, there are two rows of unwanted materials, so I will have to skip it
raw.counts <- read.delim("../../CCLE_DepMap_18Q2_RNAseq_reads_20180502.gct",
header = TRUE, stringsAsFactors = FALSE,
skip = 2)
raw.counts[1:5, 1:5]
```
DESeq2 requires a matrix with the gene name as row names, the count matrix; and a separated matrix for annotation containing the column names (which is the cell lines in this case).
The count matrix,
```{r data import and normalization using DESeq2, message = FALSE}
# DESeq2 normalization.
# make the count data matrix
count.data <- as.matrix(raw.counts[, 3:1078])
row.names(count.data) <- raw.counts$Name # the row names are the Ensembl IDs
count.data[1:5, 1:5]
# annotation matrix, this contain a single column named condition, which is the name of the cell lines
```
And the annotation matrix,
```{r}
coldata <- as.matrix(data.frame(condition = names(raw.counts[, 3:1078])))
row.names(coldata) <- names(raw.counts[, 3:1078]) # also make the cell lines as the col.names
head(coldata, 3)
# create the DESeq object
dds <- DESeqDataSetFromMatrix(countData = count.data,
colData = coldata,
design = ~condition) # design is just the condition in this case
# then create the normalization factor using estimateSizeFactors()
dds <- estimateSizeFactors(dds)
# retrieve the normalized count using counts(, normalized = TRUE)
nor.CCLE <- raw.counts[, 1:2]
colnames(nor.CCLE) <- c("Ensembl", "Gene.names")
nor.CCLE <- cbind(nor.CCLE, data.frame(counts(dds, normalized = TRUE), stringsAsFactors = FALSE), row.names = NULL)
```
The normalized read counts,
```{r}
nor.CCLE[1:5,1:5]
```
Using the EGFR pathway as an example, I want to see what algorithm will provide the most obvious choice cell lines with a high expression of the gene members of a particular pathway. The data of pathway members are be downloaded from reactome.org as .tsv. And I want to try the 1) rank sum, 2) mean, 3) median and the 4) sum of gene expression.
```{r import the genes involved regulations according to Reactome, message = FALSE, warning=FALSE}
# import the EGFR pathway data and extract the member genes
egfr <- read.delim("../../signaling_genes/by_EGFR.tsv",
header = TRUE, stringsAsFactors = FALSE)
egfr <- egfr %>% separate(MoleculeName, into = c("Molecule" , "Gene.name"), sep = " ") %>%
filter(MoleculeType == "Proteins")
egfr <- egfr$Gene.name[!duplicated(egfr$Gene.name)]
```
```{r try ranking the pathway components}
# Testing which method is best for separating the cell lines
# copy the df into a new df, so the original will not be modified
rank.expression <- nor.CCLE
# Computing the rank sum:
# Rank order all of the gene expressions, producing a matrix
rank.expression <- apply(rank.expression[, 3:1078], 2,
FUN = rank, ties.method = "min")
# add the Ensembl IDs and gene names back to the matrix, produce a df
rank.expression <- cbind(nor.CCLE[, 1:2], rank.expression)
# then select genes that are only in the egfr pathway
egfr.rank <- rank.expression[rank.expression$Gene.names %in% egfr, ]
# add up the rank of the egfr pathway components
egfr.matrix <- data.frame(Ensembl = rank.expression$Ensembl, stringsAsFactors = FALSE)
egfr.matrix <- data.frame(Rank.sum = colSums(egfr.rank[, 3:1078]), stringsAsFactors = FALSE)
# Median, mean and sum
# make a df containig the normalized expression of egfr components
egfr.ccle <- nor.CCLE[nor.CCLE$Gene.names %in% egfr, ]
egfr.matrix$Median <- robustbase::colMedians(as.matrix(egfr.ccle[, 3:1078]))
egfr.matrix$Mean <- colMeans(as.matrix(egfr.ccle[, 3:1078]))
egfr.matrix$Sum <- colSums(as.matrix(egfr.ccle[, 3:1078]))
```
```{r clean up and visualize}
egfr.matrix$Cell.lines <- row.names(egfr.matrix)
egfr.matrix.long <- egfr.matrix %>% gather(key = "Methods", value = "Expressions", 1:4)
ggplot(egfr.matrix.long) +
geom_point(mapping = aes(x = Cell.lines, y = Expressions)) +
facet_wrap(~ Methods, scales = "free_y")
```
Based on the plots, mean and sum seem to give the best separation (they are basically the same). By comparing mean and sum to rank sum and median, it also indicates that the outlier cell lines have certain highly expressed genes within the pathway, but not the over-expression of all of the pathway component annotated by reactome.org. I don't think this will adversely affect the choice of cell lines. Because in the end, we want to see the over-activated and to find out the best marker(s) for a particular pathway.
The outlier cell lines will be selected as demonstrated in the following plot.
```{r extracting outlier cells}
cut.off <- quantile(egfr.matrix$Sum, 0.75) + 1.5 * IQR(egfr.matrix$Sum)
egfr.matrix$x.axis <- 1:nrow(egfr.matrix)
egfr.matrix$Picked <- egfr.matrix$Sum >= cut.off
scatter.plot <- ggscatter(egfr.matrix,
x = "x.axis", y = "Sum",
color = "Picked", palette = c("Grey", "Red"),
xlab = "Cell lines",
ylab = "Sum expression of EGFR pathway components")
box <- ggboxplot(egfr.matrix, x = 1, y = "Sum",
alpha = 0.5,
fill = "grey") + clean_theme()
combined.plot <- ggarrange(scatter.plot, box,
ncol = 2, align = "h",
widths = c(9, 1), heights = c(1,1),
common.legend = TRUE)
combined.plot
```
With the method figured out, I downloaded the data of 11 signaling pathways that we will propose to study from the reactome.org.
```{r reactome gene list}
# Reading and cleaning up of the reactome files
# read the files containing the signaling components into a list
signaling.files <- list.files(path = "../../signaling_genes", pattern = "by_",
full.names = TRUE) # full name extract the path
# use the file list created above to read the files into a list of df
# this dfs contain the data from reactome, which have to clean up further to extract the genes
gene.list <- lapply(signaling.files, read.delim, header = TRUE, stringsAsFactors = FALSE)
# extract the pathway names from the file list
pathway.names <- lapply(signaling.files, str_sub, 26, -5)
# add pathway names back to the file list
names(gene.list) <- pathway.names
# the reactome data actually contains a lot of information, which I don't need,
# so write a function and clean all of them up with lapply()
extract.genes <- function(x, MoleculeType, MoleculeName) {
x <- x %>%
filter(MoleculeType == "Proteins") %>%
separate(MoleculeName, into = c("Uniprot", "Gene.names"), sep = " ")
x <- x[!duplicated(x$Gene.names), ]
x <- x$Gene.names
}
# clean up the gene list with the new function
gene.list <- lapply(gene.list, FUN = extract.genes)
names(gene.list) <- str_c("Reactome.", letters[1:length(names(gene.list))])
```
There are 1016 genes or 736 unique genes in total. The smallest pathway contains 24 genes and 202 for the largest.
```{r}
gene.list.long <- list2df(gene.list)
names(gene.list.long)[1:2] <- c("Gene.names", "Pathways")
gene.list.long <- gene.list.long %>% separate(Pathways, into = c("Sources", "Pathways"), sep = "\\.")
ggplot(gene.list.long) +
geom_bar(mapping = aes(x = Pathways)) +
#facet_wrap(~ Sources) +
theme(axis.text.x = element_text(angle = 60, hjust = 1),
panel.background = element_rect(fill = "white", colour = "grey50")) +
ylab("Gene count")
```
```{r match the signaling component and calculate the sum of expressions}
# create a new list to whole the list of df with pathway component expression values
ccle.pathways <- list()
# match the whole CCLE expression data with the list containing pathway components
# then add Ensembl IDs as the gene names
for (i in 1:length(gene.list)) {
ccle.pathways[[i]] <- nor.CCLE[nor.CCLE$Gene.names %in% gene.list[[i]],]
row.names(ccle.pathways[[i]]) <- ccle.pathways[[i]][,1]
}
# rename of the dfs in the list with each respective signaling pathway
names(ccle.pathways) <- names(gene.list)
# remove the columns containing Ensembl IDs and gene names, so that I can do calculations
ccle.pathways <- lapply(ccle.pathways, FUN = select, 3:1078)
# calculate the sums of expression of pathway component in each cell lines
sum.gene.expression <- lapply(ccle.pathways, function(x) data.frame(Sums = colSums(x)))
# convert the sum of expression list into a wide DF
cell.pathway.sum <- Reduce(f = function(x, y) cbind(x, y),
x = sum.gene.expression)
colnames(cell.pathway.sum) <- names(sum.gene.expression)
cell.pathway.sum$Cells <- rownames(cell.pathway.sum)
```
```{r calculate cut off and pick cell lines, echo=FALSE}
# I want to pick the cut off base on the boxplot,
# the whisker of the boxplot is 75 % quantile + 1.5 * IQR
# write a function to calculate the cut off for each pathway for picking the cells
cal.cut.off <- function(x) {
cut.off <- quantile(x[1][,1], 0.75) + 1.5 * IQR(x[1][,1])
}
# make the list containing the cut off values
cut.off.list <- lapply(sum.gene.expression, FUN = cal.cut.off)
# turn the list into a df
cut.off.list <- data.frame(Cut.off = unlist(cut.off.list))
########
# picking the cells here
########
# at this point, the dfs contain 2 columns,
# the sum of component expression and the quantile that the cell lines are sitting in each pathway
# make a list of df to hold the cells
cells.list <- list()
# the for loop to get the cells. Have to use two lists for this
# 1. the list containing the sum of gene expressions
# 2. the list containing the cut off
for (i in 1:length(sum.gene.expression)) {
cells.list[[i]] <- sum.gene.expression[[i]][sum.gene.expression[[i]][,1] >=
cut.off.list[i, 1], ,
drop = FALSE]
# add a column containing the pathway names, just to preserve it for later
cells.list[[i]][1:nrow(cells.list[[i]]), 2] <- "+"
colnames(cells.list[[i]])[2] <- names(sum.gene.expression[i])
# the cell line names are the row names now, I want to convert it into a column
cells.list[[i]][1:nrow(cells.list[[i]]), 3] <- row.names(cells.list[[i]])
cells.list[[i]] <- select(cells.list[[i]], Cells = 3, 2)
}
# reduce repeat the function in a sequential manner, until it finished the list, I think
# the input is a list, then I added the function with 2 inputs and run full_join()
cells.list <- purrr::reduce(cells.list, function(df1, df2) full_join(df1, df2, by = "Cells"))
# convert to long form to match with quantile data
cells.list.long.for.tissues <- cells.list %>%
gather(key = "Pathways", value = "Picked", -Cells) %>%
filter(Picked == "+") %>%
separate(col = Cells, into = c("Cells", "Tissues"), sep = "_", extra = "merge")
cells.list.long.for.merging.and.clustering <- cells.list %>%
gather(key = "Pathways", value = "Picked", -Cells) %>%
filter(Picked == "+") %>%
distinct(Cells)
# match picked cells with sum of expression data. For clustering
cells.list.sum <- left_join(cells.list.long.for.merging.and.clustering,
cell.pathway.sum, by = c("Cells"))
```
And there are 259 cell lines from 23 tissue types. The number of cell lines for each tissue type range from 3 and 4 for AUTONOMIC_GANGLIA and PROSTATE, respectively to 130 cell lines from fibroblast.
```{r plotting cells data, echo=FALSE}
ggplot(cells.list.long.for.tissues) +
geom_bar(mapping = aes(x = Tissues)) +
theme(axis.text.x = element_text(angle = 60, hjust = 1),
panel.background = element_rect(fill = "white", colour = "grey50")) +
guides(fill=FALSE)
```
The pathways that can be studied in each tissue type are also different. Interestingly, HAEMATOPOIETIC_AND_LYMPHOID_TISSUE has all of the 11 pathways.
```{r pathway percent in each tissue type, echo=FALSE}
# I want to see which the percentage of each pathway in each tissue type
# then get a count table for the cell lines, basically how many cell line is in each pathway in each tissue type
x <- data.frame(table(cells.list.long.for.tissues[, 2:3]))
# then group the data by tissue type and calculate the percentage of pathway
y <- group_by(x, Tissues) %>% mutate(Percent = Freq/sum(Freq) * 100)
# Clean up the table
z <- y %>% mutate(Percent = ifelse(Percent == 0, NA, Percent)) %>%
select(-Freq) %>%
spread(key = Pathways, value = Percent)
# plot it out
ggplot(y) +
geom_bar(mapping = aes(x = Tissues, y = Percent, fill = Pathways),
stat = "identity", position = "stack") +
theme(axis.text.x = element_text(angle = 60, hjust = 1),
panel.background = element_rect(fill = "white")) +
scale_fill_manual(values = colorRampPalette(brewer.pal(9, "Set1"))(11))
```
It is probably easier to pick cells after clustering, e.g. pathway h and pathway g are probably more active in the same cluster of cells.
```{r clustering}
# log2 transform the expression
log.sum <- cells.list.sum %>%
separate(col = Cells, into = c("Cells", "Tissues"), sep = "_", extra = "merge")
log.sum[, 3:13] <- apply(log.sum[, 3:13], 2, log2)
# normalize each pathway
log.sum[, 3:13] <- apply(log.sum[, 3:13], 2, scale)
colnames(log.sum)[3:13] <- letters[1:11]
heatmap.2(t(log.sum[,3:12]),
col = rev(redblue(300)),
trace = "none",
labCol = NA,
xlab = "Cell lines")
```