Skip to content

use epicontacts::get_degree() to replace wrangling steps from epicontacts to fitdistrplus #169

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
May 1, 2025
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
156 changes: 98 additions & 58 deletions episodes/superspreading-estimate.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -99,10 +99,13 @@ Let's practice this using the `mers_korea_2015` linelist and contact data from t
epi_contacts <-
epicontacts::make_epicontacts(
linelist = outbreaks::mers_korea_2015$linelist,
contacts = outbreaks::mers_korea_2015$contacts
contacts = outbreaks::mers_korea_2015$contacts,
directed = TRUE
)
```

With the argument `directed = TRUE` we configure a directed graph. These directions incorporate our hypothesis of the **infector-infectee** pair: from the probable source patient to the secondary case.

```{r,eval=FALSE}
# visualise contact network
epicontacts::vis_epicontacts(epi_contacts)
Expand Down Expand Up @@ -130,7 +133,7 @@ withr::with_envvar(c(OPENSSL_CONF = file.path("/dev", "null")), {

Contact data from a transmission chain can provide information on which infected individuals came into contact with others. We expect to have the infector (`from`) and the infectee (`to`) plus additional columns of variables related to their contact, such as location (`exposure`) and date of contact.

Following [tidy data](https://tidyr.tidyverse.org/articles/tidy-data.html#tidy-data) principles, the observation unit in our contact dataset is the **infector-infectee** pair. Although one infector can infect multiple infectees, from contact tracing investigations we may record contacts linked to more than one infector (e.g. within a household). But we should expect to have unique infector-infectee pairs, because typically each infected person will have acquired the infection from one other.
Following [tidy data](https://tidyr.tidyverse.org/articles/tidy-data.html#tidy-data) principles, the observation unit in our contact data frame is the **infector-infectee** pair. Although one infector can infect multiple infectees, from contact tracing investigations we may record contacts linked to more than one infector (e.g. within a household). But we should expect to have unique infector-infectee pairs, because typically each infected person will have acquired the infection from one other.

To ensure these unique pairs, we can check on replicates for infectees:

Expand All @@ -144,47 +147,95 @@ epi_contacts %>%

:::::::::::::::::::::::::::

When each infector-infectee row is unique, the number of entries per infector corresponds to the number of secondary cases generated by that individual.
Our goal is to get the number of secondary cases caused by the observed infected individuals. In the contact data frame, when each infector-infectee pair is unique, the number of rows per infector corresponds to the number of secondary cases generated by that individual.

```{r}
# count secondary cases per infector
infector_secondary <- epi_contacts %>%
# count secondary cases per infector in contacts
epi_contacts %>%
purrr::pluck("contacts") %>%
dplyr::count(from, name = "secondary_cases")
```

But this output only contains number of secondary cases for reported infectors, not for each of the individuals in the whole `epicontacts` object.
But this output only contains the number of secondary cases for reported infectors in the contact data, not for **all** the individuals in the whole `<epicontacts>` object.

To get this, first, we can use `epicontacts::get_id()` to get the full list of unique identifiers ("id") from the `epicontacts` class object. Second, join it with the count secondary cases per infector stored in the `infector_secondary` object. Third, replace the missing values with `0` to express no report of secondary cases from them.
Instead, from `{epicontacts}` we can use the function `epicontacts::get_degree()`. The argument `type = "out"` gets the **out-degree** of each **node** in the contact network from the `<epicontacts>` class object. In a directed network, the out-degree is the number of outgoing edges (infectees) emanating from a node (infector) ([Nykamp DQ, accessed: 2025](https://mathinsight.org/definition/node_degree)).

Also, the argument `only_linelist = TRUE` will only include individuals in the linelist data frame. During outbreak investigations, we expect a registry of **all** the observed infected individuals in the linelist data. However, anyone not linked with a potential infector or infectee will not appear in the contact data. Thus, the argument `only_linelist = TRUE` will protect us against missing this later set of individuals when counting the number of secondary cases caused by all the observed infected individuals. They will appear in the `<integer>` vector output as `0` secondary cases.

```{r,message=FALSE,warning=FALSE}
all_secondary <- epi_contacts %>%
# extract ids in contact *and* linelist using "which" argument
epicontacts::get_id(which = "all") %>%
# transform vector to dataframe to use left_join()
tibble::enframe(name = NULL, value = "from") %>%
# join count secondary cases per infectee
dplyr::left_join(infector_secondary) %>%
# infectee with missing secondary cases are replaced with zero
tidyr::replace_na(
replace = list(secondary_cases = 0)
)
# Count secondary cases per subject in contacts and linelist
all_secondary_cases <- epicontacts::get_degree(
x = epi_contacts,
type = "out",
only_linelist = TRUE
)
```

From a histogram of the `all_secondary` object, we can identify the **individual-level variation** in the number of secondary cases. Three cases were related to more than 20 secondary cases, while the complementary cases with less than five or zero secondary cases.
::::::::::::::::::::: caution

At `epicontacts::get_degree()` we use the `only_linelist = TRUE` argument.
This is to count the number of secondary cases caused by all the observed infected individuals,
which includes individuals in contacts and linelist data frames.

:::::::::::::::::::::

::::::::::::::::::::: spoiler

### When to use 'only_linelist = FALSE'?

The assumption that
"the linelist will include all individuals in contacts and linelist"
may not work for all situations.

For example, if during the registry of observed infections,
the contact data included more subjects than the ones available in the linelist data,
then you need to consider only the individuals from the contact data.
In that situation,
at `epicontacts::get_degree()` we use the `only_linelist = FALSE` argument.

Find here a printed [reproducible example](https://reprex.tidyverse.org/):

```r
# Three subjects on linelist
sample_linelist <- tibble::tibble(
id = c("id1", "id2", "id3")
)

# Four infector-infectee pairs with Five subjects in contact data
sample_contact <- tibble::tibble(
from = c("id1","id1","id2","id4"),
to = c("id2","id3","id4","id5")
)

# make an epicontacts object
sample_net <- epicontacts::make_epicontacts(
linelist = sample_linelist,
contacts = sample_contact,
directed = TRUE
)

# count secondary cases per subject from linelist only
epicontacts::get_degree(x = sample_net, type = "out", only_linelist = TRUE)
#> id1 id2 id3
#> 2 1 0

```{r,echo=FALSE,eval=FALSE}
# arrange in descendant order of secondary cases
all_secondary %>%
dplyr::arrange(dplyr::desc(secondary_cases))
# count secondary cases per subject from contact only
epicontacts::get_degree(x = sample_net, type = "out", only_linelist = FALSE)
#> id1 id2 id4 id3 id5
#> 2 1 1 0 0
```

:::::::::::::::::::::

From a histogram of the `all_secondary_cases` object, we can identify the **individual-level variation** in the number of secondary cases. Three cases were related to more than 20 secondary cases, while the complementary cases with less than five or zero secondary cases.

<!-- Visualizing the number of secondary cases on a histogram will help us to relate this with the statistical distribution to fit: -->

```{r}
## plot the distribution
all_secondary %>%
ggplot(aes(secondary_cases)) +
all_secondary_cases %>%
tibble::enframe() %>%
ggplot(aes(value)) +
geom_histogram(binwidth = 1) +
labs(
x = "Number of secondary cases",
Expand Down Expand Up @@ -279,11 +330,11 @@ In epidemiology, [negative binomial](https://en.wikipedia.org/wiki/Negative_bino

Calculate the distribution of secondary cases for Ebola using the `ebola_sim_clean` object from `{outbreaks}` package.

Is the offspring distribution of Ebola skewed or overdispersed?
- Is the offspring distribution of Ebola skewed or overdispersed?

:::::::::::::::::: hint

**Note:** This dataset has `r nrow(ebola_sim_clean$linelist)` cases. Running `epicontacts::vis_epicontacts()` may overload your session!
⚠️ **Optional step:** This dataset has `r nrow(ebola_sim_clean$linelist)` cases. Running `epicontacts::vis_epicontacts()` may take several minutes and use significant memory for large outbreaks such as the Ebola linelist. If you're on an older or slower computer, you can skip this step.

::::::::::::::::::

Expand All @@ -293,31 +344,22 @@ Is the offspring distribution of Ebola skewed or overdispersed?
## first, make an epicontacts object
ebola_contacts <-
epicontacts::make_epicontacts(
linelist = ebola_sim_clean$linelist,
contacts = ebola_sim_clean$contacts
linelist = outbreaks::ebola_sim_clean$linelist,
contacts = outbreaks::ebola_sim_clean$contacts,
directed = TRUE
)

# count secondary cases

ebola_infector_secondary <- ebola_contacts %>%
purrr::pluck("contacts") %>%
dplyr::count(from, name = "secondary_cases")

ebola_secondary <- ebola_contacts %>%
# extract ids in contact *and* linelist using "which" argument
epicontacts::get_id(which = "all") %>%
# transform vector to dataframe to use left_join()
tibble::enframe(name = NULL, value = "from") %>%
# join count secondary cases per infectee
dplyr::left_join(ebola_infector_secondary) %>%
# infectee with missing secondary cases are replaced with zero
tidyr::replace_na(
replace = list(secondary_cases = 0)
)
# count secondary cases per subject in contacts and linelist
ebola_secondary <- epicontacts::get_degree(
x = ebola_contacts,
type = "out",
only_linelist = TRUE
)

## plot the distribution
ebola_secondary %>%
ggplot(aes(secondary_cases)) +
tibble::enframe() %>%
ggplot(aes(value)) +
geom_histogram(binwidth = 1) +
labs(
x = "Number of secondary cases",
Expand All @@ -343,8 +385,7 @@ library(fitdistrplus)

```{r}
## fit distribution
offspring_fit <- all_secondary %>%
dplyr::pull(secondary_cases) %>%
offspring_fit <- all_secondary_cases %>%
fitdistrplus::fitdist(distr = "nbinom")

offspring_fit
Expand Down Expand Up @@ -392,10 +433,10 @@ fit_density <-
# plot offspring distribution with density fit
ggplot() +
geom_histogram(
data = all_secondary,
data = all_secondary_cases %>% tibble::enframe(),
mapping =
aes(
x = secondary_cases,
x = value,
y = after_stat(density)
), fill = "white", color = "black",
binwidth = 1
Expand Down Expand Up @@ -441,11 +482,11 @@ When $k$ approaches infinity ($k \rightarrow \infty$) the variance equals the me

::::::::::::::::::::::: challenge

Use the distribution of secondary cases from the `ebola_sim_clean` object from `{outbreaks}` package.
From the previous challenge, use the distribution of secondary cases from the `ebola_sim_clean` object from `{outbreaks}` package.

Fit a negative binomial distribution to estimate the mean and dispersion parameter of the offspring distribution.
Fit a negative binomial distribution to estimate the mean and dispersion parameter of the offspring distribution. Try to estimate the uncertainty of the dispersion parameter from the Standard Error to 95% Confidence Intervals.

Does the estimated dispersion parameter of Ebola provide evidence of an individual-level variation in transmission?
- Does the estimated dispersion parameter of Ebola provide evidence of an individual-level variation in transmission?

:::::::::::::: hint

Expand All @@ -457,7 +498,6 @@ Review how we fitted a negative binomial distribution using the `fitdistrplus::f

```{r}
ebola_offspring <- ebola_secondary %>%
dplyr::pull(secondary_cases) %>%
fitdistrplus::fitdist(distr = "nbinom")

ebola_offspring
Expand All @@ -467,11 +507,11 @@ ebola_offspring
## extract the "size" parameter
ebola_mid <- ebola_offspring$estimate[["size"]]

## calculate the 95% confidence intervals using the standard error estimate and
## calculate the 95% confidence intervals using the
## standard error estimate and
## the 0.025 and 0.975 quantiles of the normal distribution.

ebola_lower <- ebola_mid + ebola_offspring$sd[["size"]] * qnorm(0.025)

ebola_upper <- ebola_mid + ebola_offspring$sd[["size"]] * qnorm(0.975)

# ebola_mid
Expand All @@ -480,7 +520,7 @@ ebola_upper <- ebola_mid + ebola_offspring$sd[["size"]] * qnorm(0.975)
```

From the number secondary cases distribution we estimated a dispersion parameter $k$ of
`r round(ebola_mid, 3)`, with a 95% Confidence Interval from `r round(ebola_lower, 3)` to `r round(ebola_upper, 3)`.
`r round(ebola_mid, 2)`, with a 95% Confidence Interval from `r round(ebola_lower, 2)` to `r round(ebola_upper, 2)`.

For dispersion parameter estimates higher than one we get low distribution variance, hence, low individual-level variation in transmission.

Expand Down
Loading