-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathDemoReport.qmd
183 lines (133 loc) · 8.74 KB
/
DemoReport.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
---
title: "Demo Report"
author: Solomon Chak # Change this to your name!
format: html
editor: visual
date: "`r format(Sys.time(), '%d %B, %Y')`"
toc: true
toc-depth: 3
toc_float: true
number-sections: true
number-depth: 2
---
# Introduction
See [here](https://solomonchak.github.io/R_Intro/Project_Day1.html) for the background of the project.
Here, we hypothesize that TEs are contributing to the differentiation between queens and workers in eusocial shrimps. We expect to find more TEs in differentially expressed (DE) genes than in non-DE genes.
::: {.callout-important appearance="minimal" icon="false"}
## Modify and include this paragraph in your report:
We expect ... \[describe the patterns in the data that would support the hypothesis.\]
Apart from the contribution of all TEs, we will also test whether \[your assigned TE\] has a significant contribution to worker-queen differentiation. \[Add 1-2 sentences about your assigned TE based on a [scholar search](https://scholar.google.com/) and add the proper in-text citation with a link to the publication.\]
:::
# Methods & Results
From the eusocial shrimp *Synalpheus elizabethae*, we obtained RNA-seq data from three queens and three workers. The transcriptome is assembled with Trinity ([Haas et al. 2013](https://www.nature.com/articles/nprot.2013.084)) and transposable elements were identified with Repeat Masker ([Tarailo-Graovac and Chen 2009](https://currentprotocols.onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0410s25)). We performed differential gene expression analysis using Galaxy ([Blankenberg et al. 2010](https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/0471142727.mb1910s89)) and DESeq2 ([Love et al. 2014](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8?ref=https://githubhelp.com)). All statistical analyses were performed in R and detailed below.
```{r, collapse = TRUE, warning = FALSE, error=FALSE, message = FALSE}
library("tidyr")
library("dplyr")
library("ggpubr")
library("stringr")
library("plyr")
```
## Differential expressed transcripts
Among 29,833 transcripts that were shared among the six samples, 122 are up regulated and 141 are down regulated.
```{r}
# Read Galaxy output
df = read.delim(file = "input.matrix.Q_vs_W.DESeq2.txt", row.names = NULL)
# rename the first column
names(df)[1] = "seqid"
# Report nrow
print(paste("Number of total transcript = ", nrow(df)))
# MA plot
ggmaplot(df, size = 0.5)
```
## Transcribed transposable elements
From the transcriptome, 8,970 transcripts contained transposable elements. The more abundant TE class was DNA transposons, followed by LTR and LINE, with SINE being the least abundant.
```{r}
# Load results from Repeat Masker
df_transcript_te = read.table("Elizabethae_Assembly_Trinity.fasta.out", header = F, skip = 3, fill = T ,col.names = c("SW_score", "perc_div", "perc_del", "perc_inc", "seqid", "start", "end", "left", "strand","repeat", "TE_class_family", "TE_start", "TE_end", "TE_left", "ID", "duplicate"), )
# Some columns have numbers in () that needs to be removed.
df_transcript_te$left = gsub("[()]","",df_transcript_te$left)
df_transcript_te$TE_start = gsub("[()]","",df_transcript_te$TE_start)
df_transcript_te$TE_left = gsub("[()]","",df_transcript_te$TE_left)
# Remove duplicates generated by Repeat Masker
df_transcript_te = filter(df_transcript_te, !duplicate=="*")
# Keep first entry for each seqid
df_transcript_te = group_modify((group_by(df_transcript_te, seqid)), ~.x[1,])
# Get TE_class & TE_family from TE_class_family
df_transcript_te$TE_class = str_split(df_transcript_te$TE_class_family, "/", simplify= T)[,1]
df_transcript_te$TE_family = str_split(df_transcript_te$TE_class_family, "/", simplify= T)[,2]
# Keep only the four major classes of TE (DNA, LINE, LTR and SINE).
df_transcript_te = filter(df_transcript_te, TE_class%in%c("DNA", "LINE", "LTR", "SINE"))
# Calculate TE length
df_transcript_te = mutate(df_transcript_te, TE_length = abs(end - start)+1)
# Report nrow
print(paste("Number of TEs identified = ", nrow(df_transcript_te)))
# Plot TE distribution
ggplot(df_transcript_te, aes(TE_class)) +
geom_bar() +
coord_flip() +
theme_bw(base_size=14) +
labs(y = "TE Class", x = "Count")
```
::: {.callout-important appearance="minimal" icon="false"}
## Your Report
Your report should skip 2.1 and 2.2 and begin with prediction 2, and then prediction 3. Use a section header that is informative (i.e., don't use "Prediction 2" as the title).
However, for Quarto to run, you still need to include all the code chunks from above. Modify the line \
{r}
to be\
{r, echo=FALSE, warning=FALSE, message=FALSE, fig.show='hide', results='hide'}
so that it will run behind the scene but not display the results or figures.
:::
## Are there more TEs in differentially expressed transcripts?
If TEs are contributing to the evolution of the worker-queen differentiation, we expect to find more TEs in differentially expressed (DE) genes than in non-DE genes. Therefore, we performed a Chi-sqared test of independence on the binomial predictor variable DE (DE vs. NS, for differentially expressed vs. non-significant) against the response variable TE (TE vs. not, for having TE vs. not).
The results showed that the variables DE and TE are independent (Chi-squared = 2.160, d.f. = 1, p = 0.142). This means that the proportions of transcripts having TEs or not are not statistically different between transcripts that are and are not differentially expressed between queens and workers
```{r}
# Combine the df and df_transcript_te data sets
df_tranTE = left_join(df, df_transcript_te, "seqid")
# Create DE and TE columns
df_tranTE = mutate(df_tranTE,
DE = case_when(padj < 0.05 ~ 'DE',
padj >= 0.05 ~ 'NS'),
TE = case_when(TE_class %in% c("DNA",
"LTR",
"LINE",
"SINE") ~ 'TE',
.default = 'Not')
)
# Chi-squared test
chisq.test(table(df_tranTE$TE, df_tranTE$DE))
# Plot
ggplot(data = df_tranTE, aes(x = TE, fill = TE)) +
geom_bar() +
facet_grid(DE~., scales = "free_y") +
labs(x = "Transposable Elements", y = "Count")
```
Further, we tested and showed that TE class (DNA, LTR, LINE, and SINE) did not affect the variable DE (Chi-squared = 1.204, d.f. = 3, p = 0.752).
```{r}
chisq.test(with(df_tranTE, table(TE_class, DE)))
ggplot(data = df_tranTE, aes(x = DE)) +
geom_bar() +
facet_wrap(~TE_class, scales = "free_y") +
labs(x = "", y = "Count")
```
# Conclusion
In summary, we found that transcripts that have TEs are not more differentially expressed. Also, the class of TE does not affect whether a transcribe is differentially expressed. These results do not support the hypothesis that TEs are contributing to the evolution of the worker-queen differentiation.
Although the results did not support the conclusion, the dichotomous classification of transcripts into merely differentially expressed or not may conceal important pattern in the strength of differential expression. Further analysis should consider using a continuous variable based on the fold change of differential expression.
::: {.callout-important appearance="minimal" icon="false"}
## Your conclusion
Your conclusion should briefly summarize the results from predictions 2 and 3. There is no need to report statistics here. Most importantly, you should relate your results to the hypothesis.
Include a brief discussion of unexpected results or discrepancy between results.
:::
# Grading rubric
::: {.callout-note appearance="minimal" icon="false"}
Total = 150 points.
| | Excellent | Accetpable | Does not meet expectation |
|---------------------------------------------------------|:---------:|:----------:|:-------------------------:|
| Introduction: Expected pattern | 5 | 3 | 1 |
| Introduction: TE background | 10 | 5 | 3 |
| Prediction 1 | 40 | 20 | 10 |
| Prediction 2 | 40 | 20 | 10 |
| Interpretation of p values | 20 | 10 | 5 |
| Relating results to hypothesis | 20 | 10 | 5 |
| Discussion | 10 | 5 | 3 |
| Clear annotation of codes, grammar and clarity of texts | 5 | 3 | 1 |
:::