epiverse-trace · avallecam · May 1, 2025 · Apr 3, 2025 · Apr 3, 2025 · Apr 3, 2025
diff --git a/episodes/superspreading-estimate.Rmd b/episodes/superspreading-estimate.Rmd
@@ -99,10 +99,13 @@ Let's practice this using the `mers_korea_2015` linelist and contact data from t
 epi_contacts <-
   epicontacts::make_epicontacts(
     linelist = outbreaks::mers_korea_2015$linelist,
-    contacts = outbreaks::mers_korea_2015$contacts
+    contacts = outbreaks::mers_korea_2015$contacts,
+    directed = TRUE
   )
 ```
 
+With the argument `directed = TRUE` we configure a directed graph. These directions incorporate our hypothesis of the **infector-infectee** pair: from the probable source patient to the secondary case.
+
 ```{r,eval=FALSE}
 # visualise contact network
 epicontacts::vis_epicontacts(epi_contacts)
@@ -130,7 +133,7 @@ withr::with_envvar(c(OPENSSL_CONF = file.path("/dev", "null")), {
 
 Contact data from a transmission chain can provide information on which infected individuals came into contact with others. We expect to have the infector (`from`) and the infectee (`to`) plus additional columns of variables related to their contact, such as location (`exposure`) and date of contact.
 
-Following [tidy data](https://tidyr.tidyverse.org/articles/tidy-data.html#tidy-data) principles, the observation unit in our contact dataset is the **infector-infectee** pair. Although one infector can infect multiple infectees, from contact tracing investigations we may record contacts linked to more than one infector (e.g. within a household). But we should expect to have unique infector-infectee pairs, because typically each infected person will have acquired the infection from one other.
+Following [tidy data](https://tidyr.tidyverse.org/articles/tidy-data.html#tidy-data) principles, the observation unit in our contact data frame is the **infector-infectee** pair. Although one infector can infect multiple infectees, from contact tracing investigations we may record contacts linked to more than one infector (e.g. within a household). But we should expect to have unique infector-infectee pairs, because typically each infected person will have acquired the infection from one other.
 
 To ensure these unique pairs, we can check on replicates for infectees:
 
@@ -144,47 +147,95 @@ epi_contacts %>%
 
 :::::::::::::::::::::::::::
 
-When each infector-infectee row is unique, the number of entries per infector corresponds to the number of secondary cases generated by that individual.
+Our goal is to get the number of secondary cases caused by the observed infected individuals. In the contact data frame, when each infector-infectee pair is unique, the number of rows per infector corresponds to the number of secondary cases generated by that individual.
 
 ```{r}
-# count secondary cases per infector
-infector_secondary <- epi_contacts %>%
+# count secondary cases per infector in contacts
+epi_contacts %>%
   purrr::pluck("contacts") %>%
   dplyr::count(from, name = "secondary_cases")
 ```
 
-But this output only contains number of secondary cases for reported infectors, not for each of the individuals in the whole `epicontacts` object.
+But this output only contains the number of secondary cases for reported infectors in the contact data, not for **all** the individuals in the whole `<epicontacts>` object.
 
-To get this, first, we can use `epicontacts::get_id()` to get the full list of unique identifiers ("id") from the `epicontacts` class object. Second, join it with the count secondary cases per infector stored in the `infector_secondary` object. Third, replace the missing values with `0` to express no report of secondary cases from them.
+Instead, from `{epicontacts}` we can use the function `epicontacts::get_degree()`. The argument `type = "out"` gets the **out-degree** of each **node** in the contact network from the `<epicontacts>` class object. In a directed network, the out-degree is the number of outgoing edges (infectees) emanating from a node (infector) ([Nykamp DQ, accessed: 2025](https://mathinsight.org/definition/node_degree)). 
+
+Also, the argument `only_linelist = TRUE` will only include individuals in the linelist data frame. During outbreak investigations, we expect a registry of **all** the observed infected individuals in the linelist data. However, anyone not linked with a potential infector or infectee will not appear in the contact data. Thus, the argument `only_linelist = TRUE` will protect us against missing this later set of individuals when counting the number of secondary cases caused by all the observed infected individuals. They will appear in the `<integer>` vector output as `0` secondary cases. 
 
 ```{r,message=FALSE,warning=FALSE}
-all_secondary <- epi_contacts %>%
-  # extract ids in contact *and* linelist using "which" argument
-  epicontacts::get_id(which = "all") %>%
-  # transform vector to dataframe to use left_join()
-  tibble::enframe(name = NULL, value = "from") %>%
-  # join count secondary cases per infectee
-  dplyr::left_join(infector_secondary) %>%
-  # infectee with missing secondary cases are replaced with zero
-  tidyr::replace_na(
-    replace = list(secondary_cases = 0)
-  )
+# Count secondary cases per subject in contacts and linelist
+all_secondary_cases <- epicontacts::get_degree(
+  x = epi_contacts,
+  type = "out",
+  only_linelist = TRUE
+)
 ```
 
-From a histogram of the `all_secondary` object, we can identify the **individual-level variation** in the number of secondary cases. Three cases were related to more than 20 secondary cases, while the complementary cases with less than five or zero secondary cases.
+::::::::::::::::::::: caution
+
+At `epicontacts::get_degree()` we use the `only_linelist = TRUE` argument.
+This is to count the number of secondary cases caused by all the observed infected individuals,
+which includes individuals in contacts and linelist data frames.
+
+:::::::::::::::::::::
+
+::::::::::::::::::::: spoiler
+
+### When to use 'only_linelist = FALSE'?
+
+The assumption that 
+"the linelist will include all individuals in contacts and linelist"
+may not work for all situations.
+
+For example, if during the registry of observed infections, 
+the contact data included more subjects than the ones available in the linelist data,
+then you need to consider only the individuals from the contact data.
+In that situation,
+at `epicontacts::get_degree()` we use the `only_linelist = FALSE` argument.
+
+Find here a printed [reproducible example](https://reprex.tidyverse.org/):
+
+```r
+# Three subjects on linelist
+sample_linelist <- tibble::tibble(
+  id = c("id1", "id2", "id3")
+)
+
+# Four infector-infectee pairs with Five subjects in contact data
+sample_contact <- tibble::tibble(
+  from = c("id1","id1","id2","id4"),
+  to = c("id2","id3","id4","id5")
+)
+
+# make an epicontacts object
+sample_net <- epicontacts::make_epicontacts(
+  linelist = sample_linelist,
+  contacts = sample_contact,
+  directed = TRUE
+)
+
+# count secondary cases per subject from linelist only
+epicontacts::get_degree(x = sample_net, type = "out", only_linelist = TRUE)
+#> id1 id2 id3 
+#>   2   1   0
 
-```{r,echo=FALSE,eval=FALSE}
-# arrange in descendant order of secondary cases
-all_secondary %>%
-  dplyr::arrange(dplyr::desc(secondary_cases))
+# count secondary cases per subject from contact only
+epicontacts::get_degree(x = sample_net, type = "out", only_linelist = FALSE)
+#> id1 id2 id4 id3 id5 
+#>   2   1   1   0   0
 ```
 
+:::::::::::::::::::::
+
+From a histogram of the `all_secondary_cases` object, we can identify the **individual-level variation** in the number of secondary cases. Three cases were related to more than 20 secondary cases, while the complementary cases with less than five or zero secondary cases.
+
 <!-- Visualizing the number of secondary cases on a histogram will help us to relate this with the statistical distribution to fit: -->
 
 ```{r}
 ## plot the distribution
-all_secondary %>%
-  ggplot(aes(secondary_cases)) +
+all_secondary_cases %>%
+  tibble::enframe() %>%
+  ggplot(aes(value)) +
   geom_histogram(binwidth = 1) +
   labs(
     x = "Number of secondary cases",
@@ -279,11 +330,11 @@ In epidemiology, [negative binomial](https://en.wikipedia.org/wiki/Negative_bino
 
 Calculate the distribution of secondary cases for Ebola using the `ebola_sim_clean` object from `{outbreaks}` package.
 
-Is the offspring distribution of Ebola skewed or overdispersed?
+- Is the offspring distribution of Ebola skewed or overdispersed?
 
 :::::::::::::::::: hint
 
-**Note:** This dataset has `r nrow(ebola_sim_clean$linelist)` cases. Running `epicontacts::vis_epicontacts()` may overload your session!
+⚠️ **Optional step:** This dataset has `r nrow(ebola_sim_clean$linelist)` cases. Running `epicontacts::vis_epicontacts()` may take several minutes and use significant memory for large outbreaks such as the Ebola linelist. If you're on an older or slower computer, you can skip this step.
 
 ::::::::::::::::::
 
@@ -293,31 +344,22 @@ Is the offspring distribution of Ebola skewed or overdispersed?
 ## first, make an epicontacts object
 ebola_contacts <-
   epicontacts::make_epicontacts(
-    linelist = ebola_sim_clean$linelist,
-    contacts = ebola_sim_clean$contacts
+    linelist = outbreaks::ebola_sim_clean$linelist,
+    contacts = outbreaks::ebola_sim_clean$contacts,
+    directed = TRUE
   )
 
-# count secondary cases
-
-ebola_infector_secondary <- ebola_contacts %>%
-  purrr::pluck("contacts") %>%
-  dplyr::count(from, name = "secondary_cases")
-
-ebola_secondary <- ebola_contacts %>%
-  # extract ids in contact *and* linelist using "which" argument
-  epicontacts::get_id(which = "all") %>%
-  # transform vector to dataframe to use left_join()
-  tibble::enframe(name = NULL, value = "from") %>%
-  # join count secondary cases per infectee
-  dplyr::left_join(ebola_infector_secondary) %>%
-  # infectee with missing secondary cases are replaced with zero
-  tidyr::replace_na(
-    replace = list(secondary_cases = 0)
-  )
+# count secondary cases per subject in contacts and linelist
+ebola_secondary <- epicontacts::get_degree(
+  x = ebola_contacts,
+  type = "out",
+  only_linelist = TRUE
+)
 
 ## plot the distribution
 ebola_secondary %>%
-  ggplot(aes(secondary_cases)) +
+  tibble::enframe() %>%
+  ggplot(aes(value)) +
   geom_histogram(binwidth = 1) +
   labs(
     x = "Number of secondary cases",
@@ -343,8 +385,7 @@ library(fitdistrplus)
 
 ```{r}
 ## fit distribution
-offspring_fit <- all_secondary %>%
-  dplyr::pull(secondary_cases) %>%
+offspring_fit <- all_secondary_cases %>%
   fitdistrplus::fitdist(distr = "nbinom")
 
 offspring_fit
@@ -392,10 +433,10 @@ fit_density <-
 # plot offspring distribution with density fit
 ggplot() +
   geom_histogram(
-    data = all_secondary,
+    data = all_secondary_cases %>% tibble::enframe(),
     mapping =
       aes(
-        x = secondary_cases,
+        x = value,
         y = after_stat(density)
       ), fill = "white", color = "black",
     binwidth = 1
@@ -441,11 +482,11 @@ When $k$ approaches infinity ($k \rightarrow \infty$) the variance equals the me
 
 ::::::::::::::::::::::: challenge
 
-Use the distribution of secondary cases from the `ebola_sim_clean` object from `{outbreaks}` package.
+From the previous challenge, use the distribution of secondary cases from the `ebola_sim_clean` object from `{outbreaks}` package.
 
-Fit a negative binomial distribution to estimate the mean and dispersion parameter of the offspring distribution.
+Fit a negative binomial distribution to estimate the mean and dispersion parameter of the offspring distribution. Try to estimate the uncertainty of the dispersion parameter from the Standard Error to 95% Confidence Intervals.
 
-Does the estimated dispersion parameter of Ebola provide evidence of an individual-level variation in transmission?
+- Does the estimated dispersion parameter of Ebola provide evidence of an individual-level variation in transmission?
 
 :::::::::::::: hint
 
@@ -457,7 +498,6 @@ Review how we fitted a negative binomial distribution using the `fitdistrplus::f
 
 ```{r}
 ebola_offspring <- ebola_secondary %>%
-  dplyr::pull(secondary_cases) %>%
   fitdistrplus::fitdist(distr = "nbinom")
 
 ebola_offspring
@@ -467,11 +507,11 @@ ebola_offspring
 ## extract the "size" parameter
 ebola_mid <- ebola_offspring$estimate[["size"]]
 
-## calculate the 95% confidence intervals using the standard error estimate and
+## calculate the 95% confidence intervals using the
+## standard error estimate and
 ## the 0.025 and 0.975 quantiles of the normal distribution.
 
 ebola_lower <- ebola_mid + ebola_offspring$sd[["size"]] * qnorm(0.025)
-
 ebola_upper <- ebola_mid + ebola_offspring$sd[["size"]] * qnorm(0.975)
 
 # ebola_mid
@@ -480,7 +520,7 @@ ebola_upper <- ebola_mid + ebola_offspring$sd[["size"]] * qnorm(0.975)
 ```
 
 From the number secondary cases distribution we estimated a dispersion parameter $k$ of
-`r round(ebola_mid, 3)`, with a 95% Confidence Interval from `r round(ebola_lower, 3)` to `r round(ebola_upper, 3)`.
+`r round(ebola_mid, 2)`, with a 95% Confidence Interval from `r round(ebola_lower, 2)` to `r round(ebola_upper, 2)`.
 
 For dispersion parameter estimates higher than one we get low distribution variance, hence, low individual-level variation in transmission.