You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: superspreading-estimate.md
+72-59
Original file line number
Diff line number
Diff line change
@@ -92,10 +92,13 @@ Let's practice this using the `mers_korea_2015` linelist and contact data from t
92
92
epi_contacts<-
93
93
epicontacts::make_epicontacts(
94
94
linelist=outbreaks::mers_korea_2015$linelist,
95
-
contacts=outbreaks::mers_korea_2015$contacts
95
+
contacts=outbreaks::mers_korea_2015$contacts,
96
+
directed=TRUE
96
97
)
97
98
```
98
99
100
+
With the argument `directed = TRUE` we configure a directed graph. These directions incorporate our hypothesis of the **infector-infectee** pair: from the probable source patient to the secondary case.
Contact data from a transmission chain can provide information on which infected individuals came into contact with others. We expect to have the infector (`from`) and the infectee (`to`) plus additional columns of variables related to their contact, such as location (`exposure`) and date of contact.
112
115
113
-
Following [tidy data](https://tidyr.tidyverse.org/articles/tidy-data.html#tidy-data) principles, the observation unit in our contact dataset is the **infector-infectee** pair. Although one infector can infect multiple infectees, from contact tracing investigations we may record contacts linked to more than one infector (e.g. within a household). But we should expect to have unique infector-infectee pairs, because typically each infected person will have acquired the infection from one other.
116
+
Following [tidy data](https://tidyr.tidyverse.org/articles/tidy-data.html#tidy-data) principles, the observation unit in our contact data frame is the **infector-infectee** pair. Although one infector can infect multiple infectees, from contact tracing investigations we may record contacts linked to more than one infector (e.g. within a household). But we should expect to have unique infector-infectee pairs, because typically each infected person will have acquired the infection from one other.
114
117
115
118
To ensure these unique pairs, we can check on replicates for infectees:
116
119
@@ -137,62 +140,83 @@ epi_contacts %>%
137
140
138
141
:::::::::::::::::::::::::::
139
142
140
-
When each infector-infectee row is unique, the number of entries per infector corresponds to the number of secondary cases generated by that individual.
143
+
Our goal is to get the number of secondary cases caused by the observed infected individuals. At the contact data frame, when each infector-infectee pair is unique, the number of rows per infector corresponds to the number of secondary cases generated by that individual.
141
144
142
145
143
146
```r
144
-
# count secondary cases per infector
145
-
infector_secondary<-epi_contacts %>%
147
+
# count secondary cases per infector in contacts
148
+
epi_contacts %>%
146
149
purrr::pluck("contacts") %>%
147
150
dplyr::count(from, name="secondary_cases")
148
151
```
149
152
150
-
But this output only contains number of secondary cases for reported infectors, not for each of the individuals in the whole `epicontacts` object.
153
+
```output
154
+
from secondary_cases
155
+
1 SK_1 26
156
+
2 SK_11 1
157
+
3 SK_118 1
158
+
4 SK_12 1
159
+
5 SK_123 1
160
+
6 SK_14 38
161
+
7 SK_15 4
162
+
8 SK_16 21
163
+
9 SK_6 2
164
+
10 SK_76 2
165
+
11 SK_87 1
166
+
```
167
+
168
+
But this output only contains the number of secondary cases for reported infectors in the contact data, not for each of the individuals in the whole `<epicontacts>` object.
151
169
152
-
To get this, first, we can use `epicontacts::get_id()` to get the full list of unique identifiers ("id") from the `epicontacts` class object. Second, join it with the count secondary cases per infector stored in the `infector_secondary` object. Third, replace the missing values with `0` to express no report of secondary cases from them.
170
+
Instead, we can use `epicontacts::get_degree()` to get the **out-degree**of each **node** in the contact network from the `<epicontacts>` class object. In a directed network, the out-degree is the number of outgoing edges (infectees) emanating from a node (infector) ([Nykamp DQ, accessed: 2025](https://mathinsight.org/definition/node_degree)).
153
171
154
172
155
173
```r
156
-
all_secondary<-epi_contacts %>%
157
-
# extract ids in contact *and* linelist using "which" argument
158
-
epicontacts::get_id(which="all") %>%
159
-
# transform vector to dataframe to use left_join()
160
-
tibble::enframe(name=NULL, value="from") %>%
161
-
# join count secondary cases per infectee
162
-
dplyr::left_join(infector_secondary) %>%
163
-
# infectee with missing secondary cases are replaced with zero
164
-
tidyr::replace_na(
165
-
replace=list(secondary_cases=0)
166
-
)
174
+
# Count secondary cases per subject in contacts and linelist
175
+
all_secondary<-epicontacts::get_degree(
176
+
x=epi_contacts,
177
+
type="out",
178
+
only_linelist=TRUE
179
+
)
167
180
```
168
181
169
-
From a histogram of the `all_secondary` object, we can identify the **individual-level variation** in the number of secondary cases. Three cases were related to more than 20 secondary cases, while the complementary cases with less than five or zero secondary cases.
182
+
::::::::::::::::::::: caution
183
+
184
+
At `epicontacts::get_degree()` we use the `only_linelist = TRUE` argument.
185
+
This is to count the number of secondary cases caused by the observed infected individuals,
186
+
which includes subjects in contacts and linelist data frames.
170
187
188
+
This assumption may not work for your all situations.
189
+
If you need to consider only the subjects from the contact data,
190
+
at `epicontacts::get_degree()` we use the `only_linelist = FALSE` argument.
171
191
192
+
:::::::::::::::::::::
193
+
194
+
From a histogram of the `all_secondary` object, we can identify the **individual-level variation** in the number of secondary cases. Three cases were related to more than 20 secondary cases, while the complementary cases with less than five or zero secondary cases.
172
195
173
196
<!-- Visualizing the number of secondary cases on a histogram will help us to relate this with the statistical distribution to fit: -->
The number of secondary cases can be used to _empirically_ estimate the **offspring distribution**, which is the number of secondary _infections_ caused by each case. One candidate statistical distribution used to model the offspring distribution is the **negative binomial** distribution with two parameters:
190
214
191
215
-**Mean**, which represents the $R_{0}$, the average number of (secondary) cases produced by a single individual in an entirely susceptible population, and
192
216
193
217
-**Dispersion**, expressed as $k$, which represents the individual-level variation in transmission by single individuals.
From the histogram and density plot, we can identify that the offspring distribution is highly skewed or **overdispersed**. In this framework, the superspreading events (SSEs) are not arbitrary or exceptional, but simply realizations from the right-hand tail of the offspring distribution, which we can quantify and analyse ([Lloyd-Smith et al., 2005](https://www.nature.com/articles/nature04153)).
198
222
@@ -227,11 +251,11 @@ In epidemiology, [negative binomial](https://en.wikipedia.org/wiki/Negative_bino
227
251
228
252
Calculate the distribution of secondary cases for Ebola using the `ebola_sim_clean` object from `{outbreaks}` package.
229
253
230
-
Is the offspring distribution of Ebola skewed or overdispersed?
254
+
-Is the offspring distribution of Ebola skewed or overdispersed?
231
255
232
256
:::::::::::::::::: hint
233
257
234
-
**Note:** This dataset has 5829 cases. Running `epicontacts::vis_epicontacts()` may overload your session!
258
+
**Note:** This dataset has 5829 cases. Running `epicontacts::vis_epicontacts()` may overload your session! Try to avoid this step.
235
259
236
260
::::::::::::::::::
237
261
@@ -243,38 +267,29 @@ Is the offspring distribution of Ebola skewed or overdispersed?
243
267
ebola_contacts<-
244
268
epicontacts::make_epicontacts(
245
269
linelist=ebola_sim_clean$linelist,
246
-
contacts=ebola_sim_clean$contacts
270
+
contacts=ebola_sim_clean$contacts,
271
+
directed=TRUE
247
272
)
248
273
249
-
# count secondary cases
250
-
251
-
ebola_infector_secondary<-ebola_contacts %>%
252
-
purrr::pluck("contacts") %>%
253
-
dplyr::count(from, name="secondary_cases")
254
-
255
-
ebola_secondary<-ebola_contacts %>%
256
-
# extract ids in contact *and* linelist using "which" argument
257
-
epicontacts::get_id(which="all") %>%
258
-
# transform vector to dataframe to use left_join()
259
-
tibble::enframe(name=NULL, value="from") %>%
260
-
# join count secondary cases per infectee
261
-
dplyr::left_join(ebola_infector_secondary) %>%
262
-
# infectee with missing secondary cases are replaced with zero
263
-
tidyr::replace_na(
264
-
replace=list(secondary_cases=0)
265
-
)
274
+
# count secondary cases per subject in contacts and linelist
From a visual inspection, the distribution of secondary cases for the Ebola data set in `ebola_sim_clean` shows an skewed distribution with secondary cases equal or lower than 6. We need to complement this observation with a statistical analysis to evaluate for overdispersion.
280
295
@@ -296,7 +311,6 @@ library(fitdistrplus)
296
311
```r
297
312
## fit distribution
298
313
offspring_fit<-all_secondary %>%
299
-
dplyr::pull(secondary_cases) %>%
300
314
fitdistrplus::fitdist(distr="nbinom")
301
315
302
316
offspring_fit
@@ -328,7 +342,7 @@ From the number secondary cases distribution we estimated a dispersion parameter
328
342
329
343
We can overlap the estimated density values of the fitted negative binomial distribution and the histogram of the number of secondary cases:
@@ -346,11 +360,11 @@ When $k$ approaches infinity ($k \rightarrow \infty$) the variance equals the me
346
360
347
361
::::::::::::::::::::::: challenge
348
362
349
-
Use the distribution of secondary cases from the `ebola_sim_clean` object from `{outbreaks}` package.
363
+
From the previous challenge, use the distribution of secondary cases from the `ebola_sim_clean` object from `{outbreaks}` package.
350
364
351
-
Fit a negative binomial distribution to estimate the mean and dispersion parameter of the offspring distribution.
365
+
Fit a negative binomial distribution to estimate the mean and dispersion parameter of the offspring distribution. Try to estimate the uncertainty of the dispersion parameter from the Standard Error to 95% Confidence Intervals.
352
366
353
-
Does the estimated dispersion parameter of Ebola provide evidence of an individual-level variation in transmission?
367
+
-Does the estimated dispersion parameter of Ebola provide evidence of an individual-level variation in transmission?
354
368
355
369
:::::::::::::: hint
356
370
@@ -363,7 +377,6 @@ Review how we fitted a negative binomial distribution using the `fitdistrplus::f
363
377
364
378
```r
365
379
ebola_offspring<-ebola_secondary %>%
366
-
dplyr::pull(secondary_cases) %>%
367
380
fitdistrplus::fitdist(distr="nbinom")
368
381
369
382
ebola_offspring
@@ -372,21 +385,21 @@ ebola_offspring
372
385
```output
373
386
Fitting of the distribution ' nbinom ' by maximum likelihood
374
387
Parameters:
375
-
estimate Std. Error
376
-
size 2.353899 0.250124609
377
-
mu 0.539300 0.009699219
388
+
estimate Std. Error
389
+
size 0.8539443 0.072505326
390
+
mu 0.3675993 0.009497097
378
391
```
379
392
380
393
381
394
```r
382
395
## extract the "size" parameter
383
396
ebola_mid<-ebola_offspring$estimate[["size"]]
384
397
385
-
## calculate the 95% confidence intervals using the standard error estimate and
398
+
## calculate the 95% confidence intervals using the
399
+
## standard error estimate and
386
400
## the 0.025 and 0.975 quantiles of the normal distribution.
The probability of having clusters of five people is 1.8%. At this stage, given this offspring distribution parameters, a backward strategy may not increase the probability of contain and quarantine more onward cases.
@@ -559,7 +572,7 @@ stats::qpois(
559
572
```
560
573
561
574
```output
562
-
[1] 3
575
+
[1] 2
563
576
```
564
577
565
578
Compare this values with the ones reported by [Lloyd-Smith et al., 2005](https://www.nature.com/articles/nature04153). See figure below:
0 commit comments