diff --git a/episodes/superspreading-estimate.Rmd b/episodes/superspreading-estimate.Rmd index 2a5d2f74..0f59abee 100644 --- a/episodes/superspreading-estimate.Rmd +++ b/episodes/superspreading-estimate.Rmd @@ -99,10 +99,13 @@ Let's practice this using the `mers_korea_2015` linelist and contact data from t epi_contacts <- epicontacts::make_epicontacts( linelist = outbreaks::mers_korea_2015$linelist, - contacts = outbreaks::mers_korea_2015$contacts + contacts = outbreaks::mers_korea_2015$contacts, + directed = TRUE ) ``` +With the argument `directed = TRUE` we configure a directed graph. These directions incorporate our hypothesis of the **infector-infectee** pair: from the probable source patient to the secondary case. + ```{r,eval=FALSE} # visualise contact network epicontacts::vis_epicontacts(epi_contacts) @@ -130,7 +133,7 @@ withr::with_envvar(c(OPENSSL_CONF = file.path("/dev", "null")), { Contact data from a transmission chain can provide information on which infected individuals came into contact with others. We expect to have the infector (`from`) and the infectee (`to`) plus additional columns of variables related to their contact, such as location (`exposure`) and date of contact. -Following [tidy data](https://tidyr.tidyverse.org/articles/tidy-data.html#tidy-data) principles, the observation unit in our contact dataset is the **infector-infectee** pair. Although one infector can infect multiple infectees, from contact tracing investigations we may record contacts linked to more than one infector (e.g. within a household). But we should expect to have unique infector-infectee pairs, because typically each infected person will have acquired the infection from one other. +Following [tidy data](https://tidyr.tidyverse.org/articles/tidy-data.html#tidy-data) principles, the observation unit in our contact data frame is the **infector-infectee** pair. Although one infector can infect multiple infectees, from contact tracing investigations we may record contacts linked to more than one infector (e.g. within a household). But we should expect to have unique infector-infectee pairs, because typically each infected person will have acquired the infection from one other. To ensure these unique pairs, we can check on replicates for infectees: @@ -144,47 +147,95 @@ epi_contacts %>% ::::::::::::::::::::::::::: -When each infector-infectee row is unique, the number of entries per infector corresponds to the number of secondary cases generated by that individual. +Our goal is to get the number of secondary cases caused by the observed infected individuals. In the contact data frame, when each infector-infectee pair is unique, the number of rows per infector corresponds to the number of secondary cases generated by that individual. ```{r} -# count secondary cases per infector -infector_secondary <- epi_contacts %>% +# count secondary cases per infector in contacts +epi_contacts %>% purrr::pluck("contacts") %>% dplyr::count(from, name = "secondary_cases") ``` -But this output only contains number of secondary cases for reported infectors, not for each of the individuals in the whole `epicontacts` object. +But this output only contains the number of secondary cases for reported infectors in the contact data, not for **all** the individuals in the whole `` object. -To get this, first, we can use `epicontacts::get_id()` to get the full list of unique identifiers ("id") from the `epicontacts` class object. Second, join it with the count secondary cases per infector stored in the `infector_secondary` object. Third, replace the missing values with `0` to express no report of secondary cases from them. +Instead, from `{epicontacts}` we can use the function `epicontacts::get_degree()`. The argument `type = "out"` gets the **out-degree** of each **node** in the contact network from the `` class object. In a directed network, the out-degree is the number of outgoing edges (infectees) emanating from a node (infector) ([Nykamp DQ, accessed: 2025](https://mathinsight.org/definition/node_degree)). + +Also, the argument `only_linelist = TRUE` will only include individuals in the linelist data frame. During outbreak investigations, we expect a registry of **all** the observed infected individuals in the linelist data. However, anyone not linked with a potential infector or infectee will not appear in the contact data. Thus, the argument `only_linelist = TRUE` will protect us against missing this later set of individuals when counting the number of secondary cases caused by all the observed infected individuals. They will appear in the `` vector output as `0` secondary cases. ```{r,message=FALSE,warning=FALSE} -all_secondary <- epi_contacts %>% - # extract ids in contact *and* linelist using "which" argument - epicontacts::get_id(which = "all") %>% - # transform vector to dataframe to use left_join() - tibble::enframe(name = NULL, value = "from") %>% - # join count secondary cases per infectee - dplyr::left_join(infector_secondary) %>% - # infectee with missing secondary cases are replaced with zero - tidyr::replace_na( - replace = list(secondary_cases = 0) - ) +# Count secondary cases per subject in contacts and linelist +all_secondary_cases <- epicontacts::get_degree( + x = epi_contacts, + type = "out", + only_linelist = TRUE +) ``` -From a histogram of the `all_secondary` object, we can identify the **individual-level variation** in the number of secondary cases. Three cases were related to more than 20 secondary cases, while the complementary cases with less than five or zero secondary cases. +::::::::::::::::::::: caution + +At `epicontacts::get_degree()` we use the `only_linelist = TRUE` argument. +This is to count the number of secondary cases caused by all the observed infected individuals, +which includes individuals in contacts and linelist data frames. + +::::::::::::::::::::: + +::::::::::::::::::::: spoiler + +### When to use 'only_linelist = FALSE'? + +The assumption that +"the linelist will include all individuals in contacts and linelist" +may not work for all situations. + +For example, if during the registry of observed infections, +the contact data included more subjects than the ones available in the linelist data, +then you need to consider only the individuals from the contact data. +In that situation, +at `epicontacts::get_degree()` we use the `only_linelist = FALSE` argument. + +Find here a printed [reproducible example](https://reprex.tidyverse.org/): + +```r +# Three subjects on linelist +sample_linelist <- tibble::tibble( + id = c("id1", "id2", "id3") +) + +# Four infector-infectee pairs with Five subjects in contact data +sample_contact <- tibble::tibble( + from = c("id1","id1","id2","id4"), + to = c("id2","id3","id4","id5") +) + +# make an epicontacts object +sample_net <- epicontacts::make_epicontacts( + linelist = sample_linelist, + contacts = sample_contact, + directed = TRUE +) + +# count secondary cases per subject from linelist only +epicontacts::get_degree(x = sample_net, type = "out", only_linelist = TRUE) +#> id1 id2 id3 +#> 2 1 0 -```{r,echo=FALSE,eval=FALSE} -# arrange in descendant order of secondary cases -all_secondary %>% - dplyr::arrange(dplyr::desc(secondary_cases)) +# count secondary cases per subject from contact only +epicontacts::get_degree(x = sample_net, type = "out", only_linelist = FALSE) +#> id1 id2 id4 id3 id5 +#> 2 1 1 0 0 ``` +::::::::::::::::::::: + +From a histogram of the `all_secondary_cases` object, we can identify the **individual-level variation** in the number of secondary cases. Three cases were related to more than 20 secondary cases, while the complementary cases with less than five or zero secondary cases. + ```{r} ## plot the distribution -all_secondary %>% - ggplot(aes(secondary_cases)) + +all_secondary_cases %>% + tibble::enframe() %>% + ggplot(aes(value)) + geom_histogram(binwidth = 1) + labs( x = "Number of secondary cases", @@ -279,11 +330,11 @@ In epidemiology, [negative binomial](https://en.wikipedia.org/wiki/Negative_bino Calculate the distribution of secondary cases for Ebola using the `ebola_sim_clean` object from `{outbreaks}` package. -Is the offspring distribution of Ebola skewed or overdispersed? +- Is the offspring distribution of Ebola skewed or overdispersed? :::::::::::::::::: hint -**Note:** This dataset has `r nrow(ebola_sim_clean$linelist)` cases. Running `epicontacts::vis_epicontacts()` may overload your session! +⚠️ **Optional step:** This dataset has `r nrow(ebola_sim_clean$linelist)` cases. Running `epicontacts::vis_epicontacts()` may take several minutes and use significant memory for large outbreaks such as the Ebola linelist. If you're on an older or slower computer, you can skip this step. :::::::::::::::::: @@ -293,31 +344,22 @@ Is the offspring distribution of Ebola skewed or overdispersed? ## first, make an epicontacts object ebola_contacts <- epicontacts::make_epicontacts( - linelist = ebola_sim_clean$linelist, - contacts = ebola_sim_clean$contacts + linelist = outbreaks::ebola_sim_clean$linelist, + contacts = outbreaks::ebola_sim_clean$contacts, + directed = TRUE ) -# count secondary cases - -ebola_infector_secondary <- ebola_contacts %>% - purrr::pluck("contacts") %>% - dplyr::count(from, name = "secondary_cases") - -ebola_secondary <- ebola_contacts %>% - # extract ids in contact *and* linelist using "which" argument - epicontacts::get_id(which = "all") %>% - # transform vector to dataframe to use left_join() - tibble::enframe(name = NULL, value = "from") %>% - # join count secondary cases per infectee - dplyr::left_join(ebola_infector_secondary) %>% - # infectee with missing secondary cases are replaced with zero - tidyr::replace_na( - replace = list(secondary_cases = 0) - ) +# count secondary cases per subject in contacts and linelist +ebola_secondary <- epicontacts::get_degree( + x = ebola_contacts, + type = "out", + only_linelist = TRUE +) ## plot the distribution ebola_secondary %>% - ggplot(aes(secondary_cases)) + + tibble::enframe() %>% + ggplot(aes(value)) + geom_histogram(binwidth = 1) + labs( x = "Number of secondary cases", @@ -343,8 +385,7 @@ library(fitdistrplus) ```{r} ## fit distribution -offspring_fit <- all_secondary %>% - dplyr::pull(secondary_cases) %>% +offspring_fit <- all_secondary_cases %>% fitdistrplus::fitdist(distr = "nbinom") offspring_fit @@ -392,10 +433,10 @@ fit_density <- # plot offspring distribution with density fit ggplot() + geom_histogram( - data = all_secondary, + data = all_secondary_cases %>% tibble::enframe(), mapping = aes( - x = secondary_cases, + x = value, y = after_stat(density) ), fill = "white", color = "black", binwidth = 1 @@ -441,11 +482,11 @@ When $k$ approaches infinity ($k \rightarrow \infty$) the variance equals the me ::::::::::::::::::::::: challenge -Use the distribution of secondary cases from the `ebola_sim_clean` object from `{outbreaks}` package. +From the previous challenge, use the distribution of secondary cases from the `ebola_sim_clean` object from `{outbreaks}` package. -Fit a negative binomial distribution to estimate the mean and dispersion parameter of the offspring distribution. +Fit a negative binomial distribution to estimate the mean and dispersion parameter of the offspring distribution. Try to estimate the uncertainty of the dispersion parameter from the Standard Error to 95% Confidence Intervals. -Does the estimated dispersion parameter of Ebola provide evidence of an individual-level variation in transmission? +- Does the estimated dispersion parameter of Ebola provide evidence of an individual-level variation in transmission? :::::::::::::: hint @@ -457,7 +498,6 @@ Review how we fitted a negative binomial distribution using the `fitdistrplus::f ```{r} ebola_offspring <- ebola_secondary %>% - dplyr::pull(secondary_cases) %>% fitdistrplus::fitdist(distr = "nbinom") ebola_offspring @@ -467,11 +507,11 @@ ebola_offspring ## extract the "size" parameter ebola_mid <- ebola_offspring$estimate[["size"]] -## calculate the 95% confidence intervals using the standard error estimate and +## calculate the 95% confidence intervals using the +## standard error estimate and ## the 0.025 and 0.975 quantiles of the normal distribution. ebola_lower <- ebola_mid + ebola_offspring$sd[["size"]] * qnorm(0.025) - ebola_upper <- ebola_mid + ebola_offspring$sd[["size"]] * qnorm(0.975) # ebola_mid @@ -480,7 +520,7 @@ ebola_upper <- ebola_mid + ebola_offspring$sd[["size"]] * qnorm(0.975) ``` From the number secondary cases distribution we estimated a dispersion parameter $k$ of -`r round(ebola_mid, 3)`, with a 95% Confidence Interval from `r round(ebola_lower, 3)` to `r round(ebola_upper, 3)`. +`r round(ebola_mid, 2)`, with a 95% Confidence Interval from `r round(ebola_lower, 2)` to `r round(ebola_upper, 2)`. For dispersion parameter estimates higher than one we get low distribution variance, hence, low individual-level variation in transmission.