use epicontacts::get_degree() to replace wrangling steps from epicontacts to fitdistrplus #169

Open
wants to merge 6 commits into
base: main

Conversation

avallecam
Member

@avallecam avallecam commented Apr 3, 2025

This aims to simplify the epicontacts to fitdistrplus connection, as discussed in epiverse-trace/superspreading#121 (comment)

Also, fixes #170 by using only_linelist = TRUE, so that observed infections without onward transmission (cases with no infectees) are counted as having zero secondary cases.

This also edits some text in unrelated sections to define graph concepts and improve readability.

- simplify the epicontacts to fitdistrplus connection
- use only_linelist = TRUE for cases without infectees
- edit some text to facilitate readability
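
A minimal sketch of the simplified connection, not the exact tutorial code: the out-degree vector returned by `epicontacts::get_degree()` can be passed directly to `fitdistrplus::fitdist()`. The MERS dataset from `{outbreaks}`, the negative binomial offspring distribution, and the object names are assumptions chosen for illustration.

```r
# Build the <epicontacts> object (example dataset assumed for illustration)
epi_contacts <- epicontacts::make_epicontacts(
  linelist = outbreaks::mers_korea_2015$linelist,
  contacts = outbreaks::mers_korea_2015$contacts,
  directed = TRUE
)

# Count secondary cases per observed case, keeping cases with zero infectees
all_secondary_cases <- epicontacts::get_degree(
  x = epi_contacts,
  type = "out",
  only_linelist = TRUE
)

# Fit an offspring distribution to the counts (negative binomial assumed here)
offspring_fit <- fitdistrplus::fitdist(
  data = as.numeric(all_secondary_cases),
  distr = "nbinom"
)

offspring_fit
```

The `as.numeric()` call only drops the id names before fitting; the counts themselves are unchanged.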

github-actions bot commented Apr 3, 2025

Thank you!

Thank you for your pull request 😃

🤖 This automated message can help you check the rendered files in your submission for clarity. If you have any questions, please feel free to open an issue in {sandpaper}.

If you have files that automatically render output (e.g. R Markdown), then you should check for the following:

  • 🎯 correct output
  • 🖼️ correct figures
  • ❓ new warnings
  • ‼️ new errors

Rendered Changes

🔍 Inspect the changes: https://github.com/epiverse-trace/tutorials-middle/compare/md-outputs..md-outputs-PR-169

The following changes were observed in the rendered markdown documents:

 ...eading-estimate-rendered-unnamed-chunk-10-1.png | Bin 17026 -> 9096 bytes
 ...-estimate-rendered-unnamed-chunk-13-1.png (new) | Bin 0 -> 19283 bytes
 ...reading-estimate-rendered-unnamed-chunk-4-1.png | Bin 6139379 -> 6140517 bytes
 ...g-estimate-rendered-unnamed-chunk-8-1.png (new) | Bin 0 -> 9426 bytes
 ...reading-estimate-rendered-unnamed-chunk-9-1.png | Bin 9426 -> 17026 bytes
 md5sum.txt                                         |   2 +-
 network.html                                       |   6 +-
 superspreading-estimate.md                         | 135 +++++++++++----------
 webshot.png                                        | Bin 6139379 -> 6140517 bytes
 9 files changed, 78 insertions(+), 65 deletions(-)
What does this mean?

If you have source files that require output and figures to be generated (e.g. R Markdown), then it is important to make sure the generated figures and output are reproducible.

This output provides a way for you to inspect the output in a diff-friendly manner so that it's easy to see the changes that occur due to new software versions or randomisation.

⏱️ Updated at 2025-04-08 18:28:15 +0000

github-actions bot pushed a commit that referenced this pull request Apr 3, 2025
@avallecam avallecam mentioned this pull request Apr 3, 2025
15 tasks
Member Author

@avallecam avallecam left a comment

minor edits

@avallecam avallecam requested a review from joshwlambert April 3, 2025 19:43
@avallecam avallecam marked this pull request as ready for review April 3, 2025 20:40

To get this, first, we can use `epicontacts::get_id()` to get the full list of unique identifiers ("id") from the `epicontacts` class object. Second, join it with the count secondary cases per infector stored in the `infector_secondary` object. Third, replace the missing values with `0` to express no report of secondary cases from them.
Instead, from `{epicontacts}` we can use the function `epicontacts::get_degree()`. The argument `type = "out"` get the **out-degree** of each **node** in the contact network from the `<epicontacts>` class object. In a directed network, the out-degree is the number of outgoing edges (infectees) emanating from a node (infector) ([Nykamp DQ, accessed: 2025](https://mathinsight.org/definition/node_degree)). Also, the argument `only_linelist = TRUE` include individuals in contacts and linelist data frames.
Member

Also, the argument only_linelist = TRUE include individuals in contacts and linelist data frames.

I think you could be a bit clearer on exactly what this means. Is it only including individuals that are in both the contacts and line list data frames, or only the line list irrespective of the contacts data? And why would these two datasets contain different individuals (i.e. is it more likely that the line list or the contacts data is missing individuals)?

Member Author

Fair point. Suggesting one more paragraph here:

Suggested change
Instead, from `{epicontacts}` we can use the function `epicontacts::get_degree()`. The argument `type = "out"` get the **out-degree** of each **node** in the contact network from the `<epicontacts>` class object. In a directed network, the out-degree is the number of outgoing edges (infectees) emanating from a node (infector) ([Nykamp DQ, accessed: 2025](https://mathinsight.org/definition/node_degree)). Also, the argument `only_linelist = TRUE` include individuals in contacts and linelist data frames.
Instead, from `{epicontacts}` we can use the function `epicontacts::get_degree()`. The argument `type = "out"` gets the **out-degree** of each **node** in the contact network from the `<epicontacts>` class object. In a directed network, the out-degree is the number of outgoing edges (infectees) emanating from a node (infector) ([Nykamp DQ, accessed: 2025](https://mathinsight.org/definition/node_degree)).
Also, the argument `only_linelist = TRUE` will only include individuals in the linelist data frame. During outbreak investigations, we expect the linelist data to register **all** the observed infected individuals. However, anyone not linked with a potential infector or infectee will not appear in the contact data. Thus, the argument `only_linelist = TRUE` protects us against missing this latter set of individuals when counting the number of secondary cases caused by all the observed infected individuals. They will appear in the `<integer>` vector output with `0` secondary cases.
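
To illustrate the paragraph above, here is a small sketch with made-up data in which one observed case (`id4`) has no known infector or infectee, so it appears in the linelist but not in the contact data. The object names and values are hypothetical.

```r
# Four observed cases in the linelist; "id4" is not linked to anyone
toy_linelist <- tibble::tibble(
  id = c("id1", "id2", "id3", "id4")
)

# Two infector-infectee pairs; "id4" does not appear here
toy_contacts <- tibble::tibble(
  from = c("id1", "id1"),
  to = c("id2", "id3")
)

toy_net <- epicontacts::make_epicontacts(
  linelist = toy_linelist,
  contacts = toy_contacts,
  directed = TRUE
)

# Out-degree for every id in the linelist, including the unlinked case
secondary_cases <- epicontacts::get_degree(
  x = toy_net,
  type = "out",
  only_linelist = TRUE
)

secondary_cases
```

Here `id1` should be counted with two secondary cases, and `id2`, `id3` and the unlinked `id4` with zero.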

Comment on lines +178 to +180
This assumption may not work for all situations.
If you need to consider only the individuals from the contact data,
at `epicontacts::get_degree()` we use the `only_linelist = FALSE` argument.
Member

This description is not 100% clear to me. If possible could you expand a bit more on what this means and in what situation the reader might want to use only_linelist = FALSE?

Member Author

Added one sentence with an example:

Suggested change
This assumption may not work for all situations.
If you need to consider only the individuals from the contact data,
at `epicontacts::get_degree()` we use the `only_linelist = FALSE` argument.
This assumption may not work for all situations.
For example, if the contact data registered during the outbreak investigation
include more individuals than the linelist data,
then you need to consider only the individuals from the contact data.
In that situation,
set the `only_linelist = FALSE` argument in `epicontacts::get_degree()`.

Member Author

@joshwlambert would you agree to add this reprex to make the situation more visible? I can add it as a spoiler callout block to make it expandable on demand.

# Three subjects on linelist
sample_linelist <- tibble::tibble(
  id = c("id1", "id2", "id3")
)

# Four infector-infectee pairs with five subjects in contact data
sample_contact <- tibble::tibble(
  from = c("id1","id1","id2","id4"),
  to = c("id2","id3","id4","id5")
)

# make an epicontacts object
sample_net <- epicontacts::make_epicontacts(
  linelist = sample_linelist,
  contacts = sample_contact,
  directed = TRUE
)

# count secondary cases per subject from linelist only
epicontacts::get_degree(x = sample_net, type = "out", only_linelist = TRUE)
#> id1 id2 id3 
#>   2   1   0

# count secondary cases per subject from contact only
epicontacts::get_degree(x = sample_net, type = "out", only_linelist = FALSE)
#> id1 id2 id4 id3 id5 
#>   2   1   1   0   0

Created on 2025-04-08 with reprex v2.1.1

Member Author

@avallecam avallecam Apr 8, 2025

Should we suggest that {epicontacts} provide a Venn diagram-like summary table of the ids found only in the linelist, in both, and only in the contacts?

Something like the draft below, but covering all the intersections, as a quality control step:

sample_linelist <- tibble::tibble(
  id = c("id1", "id2", "id3")
)

sample_contact <- tibble::tibble(
  from = c("id1","id1","id2","id4"),
  to = c("id2","id3","id4","id5")
)

sample_net <- epicontacts::make_epicontacts(
  linelist = sample_linelist,
  contacts = sample_contact,
  directed = TRUE
)

epi_contacts <- epicontacts::make_epicontacts(
  linelist = outbreaks::mers_korea_2015$linelist,
  contacts = outbreaks::mers_korea_2015$contacts,
  directed = TRUE
)

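test_venn <- function(x) {
  ids_linelist <- epicontacts::get_id(x = x, which = "linelist")
  # which = "all" returns the ids found in the linelist or the contacts
  ids_all <- epicontacts::get_id(x = x, which = "all")
  
  # TRUE when the linelist covers every id in the epicontacts object
  out <- length(unique(ids_linelist)) >= length(unique(ids_all))
  
  return(out)
}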

test_venn(x = sample_net)
#> [1] FALSE
test_venn(x = epi_contacts)
#> [1] TRUE

Created on 2025-04-08 with reprex v2.1.1
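
One possible shape for the summary with all the intersections, sketched with base R set operations. `summarise_id_overlap()` is a hypothetical helper, not an existing `{epicontacts}` function, and the toy data repeat the example above.

```r
# Toy data: three ids in the linelist, five ids in the contact data
sample_linelist <- tibble::tibble(
  id = c("id1", "id2", "id3")
)

sample_contact <- tibble::tibble(
  from = c("id1", "id1", "id2", "id4"),
  to = c("id2", "id3", "id4", "id5")
)

sample_net <- epicontacts::make_epicontacts(
  linelist = sample_linelist,
  contacts = sample_contact,
  directed = TRUE
)

# Hypothetical helper: count ids by where they appear in the object
summarise_id_overlap <- function(x) {
  ids_linelist <- unique(epicontacts::get_id(x = x, which = "linelist"))
  ids_contacts <- unique(epicontacts::get_id(x = x, which = "contacts"))

  c(
    linelist_only = length(setdiff(ids_linelist, ids_contacts)),
    common = length(intersect(ids_linelist, ids_contacts)),
    contacts_only = length(setdiff(ids_contacts, ids_linelist))
  )
}

overlap <- summarise_id_overlap(x = sample_net)
overlap
```

A non-zero `contacts_only` count would flag the situation where `only_linelist = TRUE` drops individuals that exist only in the contact data.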


:::::::::::::::::: hint

**Note:** This dataset has `r nrow(ebola_sim_clean$linelist)` cases. Running `epicontacts::vis_epicontacts()` may overload your session!
**Note:** This dataset has `r nrow(ebola_sim_clean$linelist)` cases. Running `epicontacts::vis_epicontacts()` may overload your session! Try to avoid this step.
Member

Try to avoid this step.

I'm not sure if we want to provide code in the tutorials that we advise readers not to run. I think this could potentially be reworded, such as:

⚠️ Optional Step:
epicontacts::vis_epicontacts() provides an interactive network of the outbreak and may take several minutes and use significant memory for large outbreaks such as the Ebola line list.
If you're on an older or slower computer, you can skip this step.

Member Author

I agree. Thanks for proposing an edit. I adapted it here:

Suggested change
**Note:** This dataset has `r nrow(ebola_sim_clean$linelist)` cases. Running `epicontacts::vis_epicontacts()` may overload your session! Try to avoid this step.
⚠️ **Optional step:** This dataset has `r nrow(ebola_sim_clean$linelist)` cases. Running `epicontacts::vis_epicontacts()` may take several minutes and use significant memory for large outbreaks such as the Ebola linelist. If you're on an older or slower computer, you can skip this step.

Member

@joshwlambert joshwlambert left a comment

Nice work @avallecam! I read through the .Rmd file changes and everything looks good. I've left a few comments on the file. I haven't rendered the tutorial to see how it looks as a web page but happy to take a look once this is merged and live and I'll open an issue if I spot anything that needs changing/fixing.

Co-authored-by: Joshua Lambert <[email protected]>
github-actions bot pushed a commit that referenced this pull request Apr 8, 2025
Member Author

@avallecam avallecam left a comment

thanks @joshwlambert. Could you take a look at the edits in response to your questions?



Successfully merging this pull request may close these issues.

use only_linelist = TRUE at epicontacts::get_degree()
2 participants