DemoReport.qmd

---
title: "Demo Report"
author: Solomon Chak # Change this to your name!
format: html
editor: visual
date: "`r format(Sys.time(), '%d %B, %Y')`"
toc: true
toc-depth: 3
toc_float: true
number-sections: true
number-depth: 2
---

# Introduction

See [here](https://solomonchak.github.io/R_Intro/Project_Day1.html) for the background of the project.

Here, we hypothesize that TEs are contributing to the differentiation between queens and workers in eusocial shrimps. We expect to find more TEs in differentially expressed (DE) genes than in non-DE genes.

::: {.callout-important appearance="minimal" icon="false"}
## Modify and include this paragraph in your report:

We expect ... \[describe the patterns in the data that would support the hypothesis.\]

Apart from the contribution of all TEs, we will also test whether \[your assigned TE\] has a significant contribution to worker-queen differentiation. \[Add 1-2 sentences about your assigned TE based on a [scholar search](https://scholar.google.com/) and add the proper in-text citation with a link to the publication.\]
:::

# Methods & Results

From the eusocial shrimp *Synalpheus elizabethae*, we obtained RNA-seq data from three queens and three workers. The transcriptome is assembled with Trinity ([Haas et al. 2013](https://www.nature.com/articles/nprot.2013.084)) and transposable elements were identified with Repeat Masker ([Tarailo-Graovac and Chen 2009](https://currentprotocols.onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0410s25)). We performed differential gene expression analysis using Galaxy ([Blankenberg et al. 2010](https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/0471142727.mb1910s89)) and DESeq2 ([Love et al. 2014](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8?ref=https://githubhelp.com)). All statistical analyses were performed in R and detailed below.

```{r, collapse = TRUE, warning = FALSE, error=FALSE, message = FALSE}
library("tidyr")
library("dplyr")
library("ggpubr")
library("stringr")
library("plyr")
```

## Differential expressed transcripts

Among 29,833 transcripts that were shared among the six samples, 122 are up regulated and 141 are down regulated.

```{r}
# Read Galaxy output
df = read.delim(file = "input.matrix.Q_vs_W.DESeq2.txt", row.names = NULL)

# rename the first column
names(df)[1] = "seqid"

# Report nrow
print(paste("Number of total transcript = ", nrow(df)))

# MA plot
ggmaplot(df, size = 0.5) 
```

## Transcribed transposable elements 

From the transcriptome, 8,970 transcripts contained transposable elements. The more abundant TE class was DNA transposons, followed by LTR and LINE, with SINE being the least abundant.

```{r}
# Load results from Repeat Masker
df_transcript_te = read.table("Elizabethae_Assembly_Trinity.fasta.out", header = F, skip = 3, fill = T ,col.names = c("SW_score", "perc_div", "perc_del", "perc_inc", "seqid", "start", "end", "left", "strand","repeat", "TE_class_family", "TE_start", "TE_end", "TE_left", "ID", "duplicate"), )

# Some columns have numbers in () that needs to be removed.
df_transcript_te$left = gsub("[()]","",df_transcript_te$left)
df_transcript_te$TE_start = gsub("[()]","",df_transcript_te$TE_start)
df_transcript_te$TE_left = gsub("[()]","",df_transcript_te$TE_left)

# Remove duplicates generated by Repeat Masker
df_transcript_te = filter(df_transcript_te, !duplicate=="*")

# Keep first entry for each seqid
df_transcript_te = group_modify((group_by(df_transcript_te, seqid)), ~.x[1,])

# Get TE_class & TE_family from TE_class_family
df_transcript_te$TE_class = str_split(df_transcript_te$TE_class_family, "/", simplify= T)[,1] 
df_transcript_te$TE_family = str_split(df_transcript_te$TE_class_family, "/", simplify= T)[,2] 

# Keep only the four major classes of TE (DNA, LINE, LTR and SINE).
df_transcript_te =  filter(df_transcript_te, TE_class%in%c("DNA", "LINE", "LTR", "SINE"))

# Calculate TE length
df_transcript_te = mutate(df_transcript_te, TE_length = abs(end - start)+1)

# Report nrow
print(paste("Number of TEs identified = ", nrow(df_transcript_te)))

# Plot TE distribution
ggplot(df_transcript_te, aes(TE_class)) + 
  geom_bar() + 
  coord_flip()  + 
  theme_bw(base_size=14) +
  labs(y = "TE Class", x = "Count")
```

::: {.callout-important appearance="minimal" icon="false"}
## Your Report

Your report should skip 2.1 and 2.2 and begin with prediction 2, and then prediction 3. Use a section header that is informative (i.e., don't use "Prediction 2" as the title).

However, for Quarto to run, you still need to include all the code chunks from above. Modify the line \
{r}

to be\
{r, echo=FALSE, warning=FALSE, message=FALSE, fig.show='hide', results='hide'}

so that it will run behind the scene but not display the results or figures.
:::

## Are there more TEs in differentially expressed transcripts?

If TEs are contributing to the evolution of the worker-queen differentiation, we expect to find more TEs in differentially expressed (DE) genes than in non-DE genes. Therefore, we performed a Chi-sqared test of independence on the binomial predictor variable DE (DE vs. NS, for differentially expressed vs. non-significant) against the response variable TE (TE vs. not, for having TE vs. not).

The results showed that the variables DE and TE are independent (Chi-squared = 2.160, d.f. = 1, p = 0.142). This means that the proportions of transcripts having TEs or not are not statistically different between transcripts that are and are not differentially expressed between queens and workers

```{r}
# Combine the df and df_transcript_te data sets
df_tranTE = left_join(df, df_transcript_te, "seqid")

# Create DE and TE columns
df_tranTE = mutate(df_tranTE, 
                   DE = case_when(padj < 0.05 ~ 'DE', 
                                  padj >= 0.05 ~ 'NS'),
                   TE = case_when(TE_class %in% c("DNA", 
                                                  "LTR", 
                                                  "LINE", 
                                                  "SINE") ~ 'TE',
                                  .default = 'Not')
                   )

# Chi-squared test
chisq.test(table(df_tranTE$TE, df_tranTE$DE))

# Plot 
ggplot(data = df_tranTE, aes(x = TE, fill = TE)) + 
  geom_bar() + 
  facet_grid(DE~., scales = "free_y") +
  labs(x = "Transposable Elements", y = "Count")
```

Further, we tested and showed that TE class (DNA, LTR, LINE, and SINE) did not affect the variable DE (Chi-squared = 1.204, d.f. = 3, p = 0.752).

```{r}
chisq.test(with(df_tranTE, table(TE_class, DE)))

ggplot(data = df_tranTE, aes(x = DE)) + 
  geom_bar() + 
  facet_wrap(~TE_class, scales = "free_y") +
  labs(x = "", y = "Count")
```

# Conclusion

In summary, we found that transcripts that have TEs are not more differentially expressed. Also, the class of TE does not affect whether a transcribe is differentially expressed. These results do not support the hypothesis that TEs are contributing to the evolution of the worker-queen differentiation.

Although the results did not support the conclusion, the dichotomous classification of transcripts into merely differentially expressed or not may conceal important pattern in the strength of differential expression. Further analysis should consider using a continuous variable based on the fold change of differential expression.

::: {.callout-important appearance="minimal" icon="false"}
## Your conclusion

Your conclusion should briefly summarize the results from predictions 2 and 3. There is no need to report statistics here. Most importantly, you should relate your results to the hypothesis.

Include a brief discussion of unexpected results or discrepancy between results.
:::

# Grading rubric

::: {.callout-note appearance="minimal" icon="false"}
Total = 150 points.

|                                                         | Excellent | Accetpable | Does not meet expectation |
|---------------------------------------------------------|:---------:|:----------:|:-------------------------:|
| Introduction: Expected pattern                          |     5     |     3      |             1             |
| Introduction: TE background                             |    10     |     5      |             3             |
| Prediction 1                                            |    40     |     20     |            10             |
| Prediction 2                                            |    40     |     20     |            10             |
| Interpretation of p values                              |    20     |     10     |             5             |
| Relating results to hypothesis                          |    20     |     10     |             5             |
| Discussion                                              |    10     |     5      |             3             |
| Clear annotation of codes, grammar and clarity of texts |     5     |     3      |             1             |
:::