From ecda6af58af8ef2822e2cd9d802b40136c59c5b3 Mon Sep 17 00:00:00 2001 From: dsweber2 Date: Tue, 12 Mar 2024 17:33:14 -0700 Subject: [PATCH 1/9] very rough draft of nwss docs --- docs/api/covidcast-signals/nwss.md | 139 +++++++++++++++++++++++++++++ 1 file changed, 139 insertions(+) create mode 100644 docs/api/covidcast-signals/nwss.md diff --git a/docs/api/covidcast-signals/nwss.md b/docs/api/covidcast-signals/nwss.md new file mode 100644 index 000000000..76e59f4d9 --- /dev/null +++ b/docs/api/covidcast-signals/nwss.md @@ -0,0 +1,139 @@ +--- +title: NWSS +parent: Data Sources and Signals +grand_parent: COVIDcast Main Endpoint +--- + +# National Wastewater Surveillance System (NWSS) +{: .no_toc} + +* **Source name:** `nwss-wastewater` +* **Earliest issue available:** DATE RELEASED TO API +* **Number of data revisions since 19 May 2020:** 0 +* **Date of last change:** Never +* **Available for:** state, nation (see [geography coding docs](../covidcast_geography.md)) +* **Time type:** day (see [date format docs](../covidcast_times.md)) +* **License:** [LICENSE NAME](../covidcast_licensing.md#APPLICABLE-SECTION) + +The National Wastewater Surveillance System (nwss) is a CDC led effort to track the presence of SARS-CoV-2 in wastewater throughout the United States. +For more, see their [official page](https://www.cdc.gov/nwss/index.html). +The project was launched in September 2020 and continues to today. +## Table of contents +{: .no_toc .text-delta} + +1. TOC {:toc} + +## +The original source is measured at the level of [sample-site](https://www.cdc.gov/nwss/sampling.html), either at or upstream from a regional wastewater treatment plants (these are roughly at the county-level, but are not coterminous with counties); presently we do not provide the site-level or county-aggregated data, but will do so within the coming months. 
+ +To obtain either the state or national level signals from the sample-site level signals, we aggregate using population-served weighted sums, where the total population for a given `geo_value` is the total population served at sites with non-missing values. For example, if there are 3 sample sites in Washington on a particular day, then the numerator is the corresponding site population, and the denominator the sum of the population served by the 3 sample sites (rather than the entire population of the state). + +This source is derived from two API's: the NWSS [metric data](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Wastewater-Metric-Data/2ew6-ywp6/about_data) and the NWSS [concentration data](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Concentration-in-Wastewater/g653-rqe2/about_data). +These sources combine several signals each, which we have broken into separate signals. +They vary across the post-processing method, the normalization method, and the underlying data provider. +## Post Processing methods +| Post-processing method | Description | +|------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| pcr\_conc\_smoothed | PCR measurements (the exact method depends on the data provider), initially provided in virus concentrations per volume of input (type depends on normalization scheme below). This is then normalized as described below, then smoothed using a cubic spline, then aggregated to the given `geo_type` with a population-weighted sum. 
| +| detect\_prop\_15d | The proportion of tests with SARS-CoV-2 detected, meaning a cycle threshold (Ct) value <40 for RT-qPCR or at least 3 positive droplets/partitions for RT-ddPCR, by sewershed over the prior 15-day period. The detection proportion is the 15-day rolling sum of SARS-CoV-2 detections divided by the 15-day rolling sum of the number of tests for each sample site and multiplying by 100, aggregated with a population weighted sum. The result is a percentage. | +| percentile | This metric shows whether SARS-CoV-2 virus levels at a site are currently higher or lower than past historical levels at the same site. 0% means levels are the lowest they have been at the site; 100% means levels are the highest they have been at the site. Note that at either the state or national `geo_types`, this is not the percentile for overall state levels, but the average percentile across sites, weighted by population. | +| ptc\_15d | The percent change in SARS-CoV-2 RNA levels over the 15 days preceding `timestamp`. It is the coefficient of the linear regression of the log-transformed unsmoothed PCR concentration, expressed as a percentage change. | + + +## Normalization methods +| Normalization method | Description | +|----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Flow normalization | This uses viral concentration times flow rate, divided by population served, which results in viral gene copies per person per day. 
It's applied to concentrations measured from unconcentrated wastewater (also labeled `raw_wastewater` in the socrata dataset's `key_plot_id`). It tracks the total number of individuals whose shedding behavior has changed. | +| microbial | This divides by the concentration of one of several potential fecal biomarkers commonly found throughout the population. These include molecular indicators of either viruses or bacteria. The most common viral indicator comes from the pepper mild mottle virus (pmmov), a virus that infects plants that is commonly found in pepper products. The most common bacterial indicators come from Bacteroides HF183 and Lachnospiraceae Lachno3, both common gut bacteria. It's applied to sludge samples (also labeled `post_grit _removal` in the socrata dataset's `key_plot_id`), which have been concentrated in preparation for treatment. The resulting value is unitless, and tracks the proportion of individuals who's shedding behavior has changed. | +## Providers +In the fall of 2023, there was a major shift from Biobot as the primary commercial p +| Provider | Available | Description | +|-------------|--------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| cdc\_verily | 2023/10/30-Today | Data analyzed by [verily](https://verily.com/solutions/public-health/wastewater) on behalf of the CDC | +| nwss | 2020/06/21-Today | Data reported by the respective State, Territorial and Local public health agencies; the actual processing may be done by a private lab such as verily or biobot, or it may be done directly by the agency, or it may be processed by a partnering university. | +| wws | 2021/12/26-Today | Data analyzed by [Wastewater Scan](https://www.wastewaterscan.org/en), a Stanford/Emory nonprofit. 
| +| cdc\_biobot | 2020/06/21(?)-2023/10/30 | Data analyzed by [Biobot](https://biobot.io/) on behalf of the CDC. | + +These contain various derived signals described more below, as well as +aggregating across different methods of normalizing: + +Flow normalization: + +Microbial normalization: + +To derive the state level data from the site-level data, we do a weighted sum +using the populations served. + +| Signal |
**Earliest date available:** | +|-----------------------------------------------|------------------------------------| +| `pcr_con_smoothed_cdc_verily_flow_population` | | +| | | + +## Estimation + +Describe how any relevant quantities are estimated---both statistically and in +terms of the underlying features or inputs. (For example, if a signal is based +on hospitalizations, what specific types of hospitalization are counted?) + +If you need mathematics, we use KaTeX; you can see its supported LaTeX +[here](https://katex.org/docs/supported.html). Inline math is done with *double* +dollar signs, as in $$x = y/z$$, and display math by placing them with +surrounding blank lines, as in + +$$ \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. $$ + +Note that the blank lines are essential. + +### Standard Error + +If this signal is a random variable, e.g. if it is a survey or based on +proportion estimates, describe here how its standard error is reported and the +nature of the random variation. + +### Smoothing + +If the smoothing is unusual or involves extra steps beyond a simple moving +average, describe it here. + +## Limitations + + +NWSS is still expanding to get coverage nationwide: to get an idea of the extent of counties currently covered, see [this map](https://www.cdc.gov/nwss/progress.html). +At a particular point in time. + + +Any limitations in the interpretation of this signal, such as limits in its +geographic coverage, limits in its interpretation (symptoms in a survey aren't +always caused by COVID, our healthcare partner only is part of the market, there +may be a demographic bias in respondents, etc.), known inaccuracies, etc. + +## Missingness + +Describe *all* situations under which a value may not be reported, and what that +means. If the signal ever reports NA, describe what that means and how it is +different from missingness. 
For example: + +When fewer than 100 survey responses are received in a geographic area on a +specific day, no data is reported for that area on that day; an API query for +all reported geographic areas on that day will not include it. + +## Lag and Backfill + +None of the signals suffer from significant lag; most are a day old + +If this signal is reported with a consistent lag, describe it here. + +If this signal is regularly backfilled, describe the reason and nature of the +backfill here. + + +Due to the [cubic spline smoothing](https://en.wikipedia.org/wiki/Smoothing_spline), any of the `pcr_conc_smoothed` signals may change in the 4th and smaller significant figure at intervals of approximately 1-3 days; outside of the most recent data where smoothing directly relies on newly added data-points, these revisions should have approximately a mean of zero. To reduce some of this variation, we have rounded to the 4th significant figure; this digit may still vary. + + +Whenever a new maximum occurs at a location, all historical data for a `percentile` signal has the potential to shift, potentially quite drastically. + +## Source and Licensing + +If the signal has specific licensing or sourcing that should be acknowledged, +describe it here. Also, include links to source websites for data that is +scraped or received from another source. 
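The population-served weighted aggregation that this patch describes can be sketched as follows. This is a minimal illustration in Python rather than the indicator's actual implementation; the function name `population_weighted_value`, the use of `None` for a missing site, and the sample values are all assumptions made for the example.

```python
from typing import Optional, Sequence


def population_weighted_value(
    values: Sequence[Optional[float]],
    populations: Sequence[float],
) -> Optional[float]:
    """Aggregate site-level values into one value for a state or the nation.

    Hypothetical helper: sites with a missing value (None) are dropped, and
    the weights are renormalized over the sites that did report.
    """
    # Keep only the sites that reported a value on this day.
    reported = [(v, p) for v, p in zip(values, populations) if v is not None]
    if not reported:
        return None
    # The denominator is the population served by reporting sites only,
    # not the full population of the state.
    total_population = sum(p for _, p in reported)
    return sum(v * p for v, p in reported) / total_population


# Three hypothetical sites, all reporting on a given day.
all_sites = population_weighted_value([10.0, 20.0, 40.0], [1000, 3000, 6000])
# The same sites on a day when the third one is missing.
two_sites = population_weighted_value([10.0, 20.0, None], [1000, 3000, 6000])
```

Because the denominator changes with the set of reporting sites, the aggregated value can move from day to day even when no individual site changes.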
From b422e1b0262d6fe8c6350d6a4b17e60a08d54014 Mon Sep 17 00:00:00 2001 From: dsweber2 Date: Thu, 28 Mar 2024 10:18:47 -0700 Subject: [PATCH 2/9] A complete draft --- docs/api/covidcast-signals/nwss.md | 175 +++++++++++++++-------------- 1 file changed, 88 insertions(+), 87 deletions(-) diff --git a/docs/api/covidcast-signals/nwss.md b/docs/api/covidcast-signals/nwss.md index 76e59f4d9..cd9401734 100644 --- a/docs/api/covidcast-signals/nwss.md +++ b/docs/api/covidcast-signals/nwss.md @@ -17,123 +17,124 @@ grand_parent: COVIDcast Main Endpoint The National Wastewater Surveillance System (nwss) is a CDC led effort to track the presence of SARS-CoV-2 in wastewater throughout the United States. For more, see their [official page](https://www.cdc.gov/nwss/index.html). -The project was launched in September 2020 and continues to today. +The project was launched in September 2020 and is ongoing. The original data for this source is provided un-versioned via the socrata api as the [metric data](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Wastewater-Metric-Data/2ew6-ywp6/about_data) and [concentration data](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Concentration-in-Wastewater/g653-rqe2/about_data). +This source modifies this data in two ways: first, it splits the data based on provider and normalization, and then it aggregates across `geo_values` so that the signals are available at the state and national level. ## Table of contents {: .no_toc .text-delta} 1. TOC {:toc} -## +## Description The original source is measured at the level of [sample-site](https://www.cdc.gov/nwss/sampling.html), either at or upstream from a regional wastewater treatment plants (these are roughly at the county-level, but are not coterminous with counties); presently we do not provide the site-level or county-aggregated data, but will do so within the coming months. 
-To obtain either the state or national level signals from the sample-site level signals, we aggregate using population-served weighted sums, where the total population for a given `geo_value` is the total population served at sites with non-missing values. For example, if there are 3 sample sites in Washington on a particular day, then the numerator is the corresponding site population, and the denominator the sum of the population served by the 3 sample sites (rather than the entire population of the state). - -This source is derived from two API's: the NWSS [metric data](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Wastewater-Metric-Data/2ew6-ywp6/about_data) and the NWSS [concentration data](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Concentration-in-Wastewater/g653-rqe2/about_data). -These sources combine several signals each, which we have broken into separate signals. -They vary across the post-processing method, the normalization method, and the underlying data provider. -## Post Processing methods -| Post-processing method | Description | -|------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| pcr\_conc\_smoothed | PCR measurements (the exact method depends on the data provider), initially provided in virus concentrations per volume of input (type depends on normalization scheme below). This is then normalized as described below, then smoothed using a cubic spline, then aggregated to the given `geo_type` with a population-weighted sum. 
| -| detect\_prop\_15d | The proportion of tests with SARS-CoV-2 detected, meaning a cycle threshold (Ct) value <40 for RT-qPCR or at least 3 positive droplets/partitions for RT-ddPCR, by sewershed over the prior 15-day period. The detection proportion is the 15-day rolling sum of SARS-CoV-2 detections divided by the 15-day rolling sum of the number of tests for each sample site and multiplying by 100, aggregated with a population weighted sum. The result is a percentage. | -| percentile | This metric shows whether SARS-CoV-2 virus levels at a site are currently higher or lower than past historical levels at the same site. 0% means levels are the lowest they have been at the site; 100% means levels are the highest they have been at the site. Note that at either the state or national `geo_types`, this is not the percentile for overall state levels, but the average percentile across sites, weighted by population. | -| ptc\_15d | The percent change in SARS-CoV-2 RNA levels over the 15 days preceding `timestamp`. It is the coefficient of the linear regression of the log-transformed unsmoothed PCR concentration, expressed as a percentage change. | - - -## Normalization methods -| Normalization method | Description | -|----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| Flow normalization | This uses viral concentration times flow rate, divided by population served, which results in viral gene copies per person per day. 
It's applied to concentrations measured from unconcentrated wastewater (also labeled `raw_wastewater` in the socrata dataset's `key_plot_id`). It tracks the total number of individuals whose shedding behavior has changed. | -| microbial | This divides by the concentration of one of several potential fecal biomarkers commonly found throughout the population. These include molecular indicators of either viruses or bacteria. The most common viral indicator comes from the pepper mild mottle virus (pmmov), a virus that infects plants that is commonly found in pepper products. The most common bacterial indicators come from Bacteroides HF183 and Lachnospiraceae Lachno3, both common gut bacteria. It's applied to sludge samples (also labeled `post_grit _removal` in the socrata dataset's `key_plot_id`), which have been concentrated in preparation for treatment. The resulting value is unitless, and tracks the proportion of individuals who's shedding behavior has changed. | -## Providers -In the fall of 2023, there was a major shift from Biobot as the primary commercial p -| Provider | Available | Description | -|-------------|--------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| cdc\_verily | 2023/10/30-Today | Data analyzed by [verily](https://verily.com/solutions/public-health/wastewater) on behalf of the CDC | -| nwss | 2020/06/21-Today | Data reported by the respective State, Territorial and Local public health agencies; the actual processing may be done by a private lab such as verily or biobot, or it may be done directly by the agency, or it may be processed by a partnering university. | -| wws | 2021/12/26-Today | Data analyzed by [Wastewater Scan](https://www.wastewaterscan.org/en), a Stanford/Emory nonprofit. 
| -| cdc\_biobot | 2020/06/21(?)-2023/10/30 | Data analyzed by [Biobot](https://biobot.io/) on behalf of the CDC. | - -These contain various derived signals described more below, as well as -aggregating across different methods of normalizing: - -Flow normalization: - -Microbial normalization: - -To derive the state level data from the site-level data, we do a weighted sum -using the populations served. - -| Signal |
**Earliest date available:** | -|-----------------------------------------------|------------------------------------| -| `pcr_con_smoothed_cdc_verily_flow_population` | | -| | | - +To generate either the state or national level signals from the sample-site level signals, we aggregate using population-served weighted sums, where the total population for a given `geo_value` is the total population served at sites with non-missing values. +## Signal features +The signals vary across the underlying data provider, the normalization method, and the post-processing method. +### Providers +As a coordinating body, the NWSS receives wastewater data through a number of providers, which have changed as the project has evolved. +Most recently, in the fall of 2023, there was a major shift in the primary direct commercial provider for the NWSS from Biobot to Verily. +Presently, the Biobot data is not present at either socrata endpoint; we are providing a fixed snapshot that was present in November 2023. +The available column below indicates the first date that any location had data from that source. +| Provider | Available | Description | +|--------------|-----------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `cdc_verily` | 2023/10/30-Today | Data analyzed by [Verily](https://verily.com/solutions/public-health/wastewater) on behalf of the CDC directly. | +| `nwss` | 2020/06/21-Today | Data reported by the respective State, Territorial and Local Public Health Agencies; the actual processing may be done by a private lab such as Verily or Biobot, or the agency, or a partnering university. | +| `wws` | 2021/12/26-Today | Data analyzed by [Wastewater Scan](https://www.wastewaterscan.org/en), a Stanford/Emory nonprofit, and then shared with the NWSS. 
| +| `cdc_biobot` | 2020/06/21-2023/10/30 | Data analyzed by [Biobot](https://biobot.io/) on behalf of the CDC. | +### Normalization methods +Direct viral concentration is not a clear indicator of the number and severity of cases in the sewershed. +In mixed drainage/sewage systems for example, the effluent will be significantly diluted whenever there is rain. +There are a number of methods to normalize viral concentration to get a better indicator: +| Normalization method | Description | +|----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `flow_population` | This uses $$\frac{v\cdot r}{p}$$, where $$v$$ is viral concentration, $$r$$ is flow rate, and $$p$$ is population served. The resulting units are viral gene copies per person per day. It's applied to concentrations measured from unconcentrated wastewater (also labeled `raw_wastewater` in the socrata dataset's `key_plot_id`). It tracks the total number of individuals whose shedding behavior has changed. | +| `microbial` | This divides by the concentration of one of several potential fecal biomarkers commonly found throughout the population. These include molecular indicators of either viruses or bacteria. 
The most common viral indicator comes from the pepper mild mottle virus (pmmov), a virus that infects plants that is commonly found in pepper products. The most common bacterial indicators come from Bacteroides HF183 and Lachnospiraceae Lachno3, both common gut bacteria. It's applied to sludge samples (also labeled `post_grit _removal` in the socrata dataset's `key_plot_id`), which have been concentrated in preparation for treatment. The resulting value is unitless, and tracks the proportion of individuals who's shedding behavior has changed. | +### Post Processing methods +Regardless of normalization method, the daily wastewater data is noisy; to make the indicators more useful, the NWSS has provided different methods of post-processing the data: +| Post-processing method | Description | +|------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `pcr_conc_smoothed` | PCR measurements (the exact method depends on the data provider), initially provided in virus concentrations per volume of input (type depends on the [normalization method](#Normalization-methods)). This is then [smoothed](#Smoothing) using a cubic spline, then [aggregated](#Aggregation) to the given `geo_type` with a population-weighted sum. | +| `detect_prop_15d` | The proportion of tests with SARS-CoV-2 detected, meaning a cycle threshold (Ct) value <40 for RT-qPCR or at least 3 positive droplets/partitions for RT-ddPCR, by sewershed over the prior 15-day period. 
The detection proportion is the 15-day rolling sum of SARS-CoV-2 detections divided by the 15-day rolling sum of the number of tests for each sample site, multiplied by 100, and aggregated with a population weighted sum. The result is a percentage. | +| ptc\_15d | The percent change in SARS-CoV-2 RNA levels over the 15 days preceding `timestamp`. It is the coefficient of the linear regression of the log-transformed unsmoothed PCR concentration, expressed as a percentage change. Note that for county and higher level `geo_type`s, this is an average of the percentage change at each site, weighted by population, rather than the percentage change for the entire region. We recommend caution in the use of these signals at aggregated `geo_type`s. | +| `percentile` | This metric shows whether SARS-CoV-2 virus levels at a site are currently higher or lower than past historical levels at the same site. 20% means that 80% of observed values are higher than this value, while 20% are lower. 0% means levels are the lowest they have been at the site, while 100% means they are the highest. Note that at county or higher level `geo_type`s, this is not the percentile for overall state levels, but the *average percentile across sites*, weighted by population, which makes it difficult to meaningfully interpret. We do not recommend its use outside of the site level. | + +### Full signal list +Not every triple of post processing method, provider, and normalization actually contains data. 
Here is a complete list of the actual signals: +| Signal | | +| `pcr_conc_smoothed_cdc_verily_flow_population` | | +| `pcr_conc_smoothed_cdc_verily_microbial` | | +| `pcr_conc_smoothed_cdc_nwss_flow_population` | | +| `pcr_conc_smoothed_cdc_verily_microbial` | | +| `pcr_conc_smoothed_cdc_wws_microbial` | | +| `detect_prop_15d_cdc_verily_flow_population` | | +| `detect_prop_15d_cdc_verily_microbial` | | +| `detect_prop_15d_cdc_nwss_flow_population` | | +| `detect_prop_15d_cdc_verily_microbial` | | +| `detect_prop_15d_cdc_wws_microbial` | | +| `ptc_15d_cdc_verily_flow_population` | | +| `ptc_15d_cdc_verily_microbial` | | +| `ptc_15d_cdc_nwss_flow_population` | | +| `ptc_15d_cdc_verily_microbial` | | +| `ptc_15d_cdc_wws_microbial` | | +| `percentile_cdc_verily_flow_population` | | +| `percentile_cdc_verily_microbial` | | +| `percentile_cdc_nwss_flow_population` | | +| `percentile_cdc_verily_microbial` | | +| `percentile_cdc_wws_microbial` | | + +What is missing? `wws` only normalizes using `microbial` measures, so the four post-processing methods are not present for `*_wws_flow_population`. ## Estimation +### Aggregation +For any given day and signal, we do a population weighted sum of the target signal, with the weight depending only on non-missing and non-zero sample sites for that day in particular. For example, say $$p_i$$ is the population at sample site $$i$$; if there are 3 sample sites in Washington on a particular day, then the weight at site $$i$$ is $$\frac{p_i}{p_1+p_2+p_3}$$. If the next day there are only 2 sample locations, the weight at site $$i$$ becomes $$\frac{p_i}{p_1+p_2}$$. -Describe how any relevant quantities are estimated---both statistically and in -terms of the underlying features or inputs. (For example, if a signal is based -on hospitalizations, what specific types of hospitalization are counted?) 
+ +For `ptc_15d`, the percent change in RNA levels, this average is somewhat difficult to interpret, since it is the average of the percentage changes, rather than the percentage change in the average. +For example, say we have two sample sites, with initial levels at 10 and 100, and both see an increase of 10[^1]. +Their respective percentage increases are 100% and 10%, so the average percent increase is 55%. +Contrast this with the average level, which goes from 55 to 65, for an 18% increase in the average[^2]. -If you need mathematics, we use KaTeX; you can see its supported LaTeX -[here](https://katex.org/docs/supported.html). Inline math is done with *double* -dollar signs, as in $$x = y/z$$, and display math by placing them with -surrounding blank lines, as in +`percentile` has a similar difficulty, although it is a more involved calculation that uses all other values in the time series, rather than just a 15 day window. -$$ \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. $$ -Note that the blank lines are essential. -### Standard Error +### Smoothing -If this signal is a random variable, e.g. if it is a survey or based on -proportion estimates, describe here how its standard error is reported and the -nature of the random variation. +The `pcr_conc_smoothed` based signals all use smoothing splines to generate the averaged value at each location. Specifically, the smoothing uses [`smooth.spline`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/smooth.spline) from R, with `spar=0.5`, a smoothing parameter. That is, if the value at time $$t_i$$ is given by $$y_i$$, for a function $$f$$ represented by splines this minimizes -### Smoothing +$$ \sum_i (y_i - f(t_i))^2 + 16 \frac{\mathrm{tr}(X'X)}{\mathrm{tr}(\Sigma)} \int \big|f^{''}(z)\big|^2~\mathrm{d}z$$ -If the smoothing is unusual or involves extra steps beyond a simple moving -average, describe it here. 
+where $$X_{ij} = B_j(t_i)$$ and $$\Sigma_{ij} = \int B_i^{''}(z) B_j^{''}(z)~\mathrm{d}z$$, where $$B_j$$ is the $$j$$th B-spline basis function. -## Limitations +It is important to note that this means that the value at any point in time depends on future values up to and including the present day, so its use in historical evaluation is suspect. +This is only true for `pcr_conc_smoothed`; all other smoothing is done via 15 day rolling sums. +## Limitations -NWSS is still expanding to get coverage nationwide: to get an idea of the extent of counties currently covered, see [this map](https://www.cdc.gov/nwss/progress.html). -At a particular point in time. +The NWSS is still expanding to get coverage nationwide: to get an idea of the extent of counties currently covered, see [this map](https://www.cdc.gov/nwss/progress.html). -Any limitations in the interpretation of this signal, such as limits in its -geographic coverage, limits in its interpretation (symptoms in a survey aren't -always caused by COVID, our healthcare partner only is part of the market, there -may be a demographic bias in respondents, etc.), known inaccuracies, etc. +Standard errors and sample sizes are not applicable to these signals. ## Missingness -Describe *all* situations under which a value may not be reported, and what that -means. If the signal ever reports NA, describe what that means and how it is -different from missingness. For example: - -When fewer than 100 survey responses are received in a geographic area on a -specific day, no data is reported for that area on that day; an API query for -all reported geographic areas on that day will not include it. +If a sample site has too few individuals, the NWSS does not provide the detailed data, so we cannot include it in our aggregations. ## Lag and Backfill -None of the signals suffer from significant lag; most are a day old - -If this signal is reported with a consistent lag, describe it here. 
- -If this signal is regularly backfilled, describe the reason and nature of the -backfill here. +Due to collection, shipping, processing and reporting time, these signals are subject to some lag. +Typically, this is between 4-6 days. +The raw signals are not typically subject to revisions or backfill, but due to the [cubic spline smoothing](#Smoothing), any of the `pcr_conc_smoothed` signals may change in the 4th and smaller significant figure at intervals of approximately 1-3 days; outside of the most recent data where smoothing directly relies on newly added data-points, these revisions should have approximately a mean of zero. +To reduce some of this variation, we have rounded to the 4th significant figure. +There is also the possibility of the post-processing being subject to significant revisions if bugs are discovered. -Due to the [cubic spline smoothing](https://en.wikipedia.org/wiki/Smoothing_spline), any of the `pcr_conc_smoothed` signals may change in the 4th and smaller significant figure at intervals of approximately 1-3 days; outside of the most recent data where smoothing directly relies on newly added data-points, these revisions should have approximately a mean of zero. To reduce some of this variation, we have rounded to the 4th significant figure; this digit may still vary. +Finally, whenever a new maximum occurs at a location, all historical data for a `percentile` signal has the potential to shift, potentially drastically. +## Source and Licensing -Whenever a new maximum occurs at a location, all historical data for a `percentile` signal has the potential to shift, potentially quite drastically. +This indicator aggregates data originating from the [NWSS](https://www.cdc.gov/nwss/index.html). 
+The site-level data is provided un-versioned via the socrata api as the [metric data](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Wastewater-Metric-Data/2ew6-ywp6/about_data) and [concentration data](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Concentration-in-Wastewater/g653-rqe2/about_data). +As described in the [Provider section](#Providers), the NWSS is aggregating data from [Verily](https://verily.com/solutions/public-health/wastewater), State Territorial and Local public health agencies, [Wastewater Scan](https://www.wastewaterscan.org/en), and [Biobot](https://biobot.io/). -## Source and Licensing -If the signal has specific licensing or sourcing that should be acknowledged, -describe it here. Also, include links to source websites for data that is -scraped or received from another source. +[^1]: These are not realistic values; they are merely chosen to make the example simple to explain. +[^2]: Further complicating this is that the percent change is measured via linear regression, rather than a simple difference. However, this doesn't change the fundamental interpretation, it merely makes the slope more robust to noise. From 871b6862ebd80d2e6b35182a6babb4ca8d723191 Mon Sep 17 00:00:00 2001 From: dsweber2 Date: Thu, 28 Mar 2024 16:26:39 -0700 Subject: [PATCH 3/9] actually previewed, various bugs. Added pop & date info --- docs/api/covidcast-signals/nwss.md | 92 ++++++++++++++++-------------- 1 file changed, 50 insertions(+), 42 deletions(-) diff --git a/docs/api/covidcast-signals/nwss.md b/docs/api/covidcast-signals/nwss.md index cd9401734..495c6d05f 100644 --- a/docs/api/covidcast-signals/nwss.md +++ b/docs/api/covidcast-signals/nwss.md @@ -18,16 +18,18 @@ grand_parent: COVIDcast Main Endpoint The National Wastewater Surveillance System (nwss) is a CDC led effort to track the presence of SARS-CoV-2 in wastewater throughout the United States. 
For more, see their [official page](https://www.cdc.gov/nwss/index.html). The project was launched in September 2020 and is ongoing. The original data for this source is provided un-versioned via the socrata api as the [metric data](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Wastewater-Metric-Data/2ew6-ywp6/about_data) and [concentration data](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Concentration-in-Wastewater/g653-rqe2/about_data). -This source modifies this data in two ways: first, it splits the data based on provider and normalization, and then it aggregates across `geo_values` so that the signals are available at the state and national level. -## Table of contents -{: .no_toc .text-delta} +This source modifies the original data in two ways: first, it splits the data based on provider, normalization, and post-processing method, and then it aggregates across `geo_values` so that the signals are available at the state and national level. -1. TOC {:toc} -## Description The original source is measured at the level of [sample-site](https://www.cdc.gov/nwss/sampling.html), either at or upstream from a regional wastewater treatment plants (these are roughly at the county-level, but are not coterminous with counties); presently we do not provide the site-level or county-aggregated data, but will do so within the coming months. To generate either the state or national level signals from the sample-site level signals, we aggregate using population-served weighted sums, where the total population for a given `geo_value` is the total population served at sites with non-missing values. +## Table of contents +{: .no_toc .text-delta} + +1. TOC +{:toc} + ## Signal features The signals vary across the underlying data provider, the normalization method, and the post-processing method. 
### Providers @@ -35,54 +37,60 @@ As a coordinating body, the NWSS receives wastewater data through a number of pr Most recently, in the fall of 2023, there was a major shift in the primary direct commercial provider for the NWSS from Biobot to Verily. Presently, the Biobot data is not present at either socrata endpoint; we are providing a fixed snapshot that was present in November 2023. The available column below indicates the first date that any location had data from that source. + | Provider | Available | Description | |--------------|-----------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | `cdc_verily` | 2023/10/30-Today | Data analyzed by [Verily](https://verily.com/solutions/public-health/wastewater) on behalf of the CDC directly. | -| `nwss` | 2020/06/21-Today | Data reported by the respective State, Territorial and Local Public Health Agencies; the actual processing may be done by a private lab such as Verily or Biobot, or the agency, or a partnering university. | +| `nwss` | 2020/06/21-Today | Data reported by the respective State, Territorial and Local Public Health Agencies; the actual processing may be done by a private lab such as Verily or Biobot, or the agency itself, or a partnering university. | | `wws` | 2021/12/26-Today | Data analyzed by [Wastewater Scan](https://www.wastewaterscan.org/en), a Stanford/Emory nonprofit, and then shared with the NWSS. | | `cdc_biobot` | 2020/06/21-2023/10/30 | Data analyzed by [Biobot](https://biobot.io/) on behalf of the CDC. | + ### Normalization methods Direct viral concentration is not a clear indicator of the number and severity of cases in the sewershed. In mixed drainage/sewage systems for example, the effluent will be significantly diluted whenever there is rain. 
-There are a number of methods to normalize viral concentration to get a better indicator: +There are two general categories of methods used to normalize viral concentration to get a better indicator in the NWSS datasets: + | Normalization method | Description | |----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | `flow_population` | This uses $$\frac{v\cdot r}{p}$$, where $$v$$ is viral concentration, $$r$$ is flow rate, and $$p$$ is population served. The resulting units are viral gene copies per person per day. It's applied to concentrations measured from unconcentrated wastewater (also labeled `raw_wastewater` in the socrata dataset's `key_plot_id`). It tracks the total number of individuals whose shedding behavior has changed. | | `microbial` | This divides by the concentration of one of several potential fecal biomarkers commonly found throughout the population. These include molecular indicators of either viruses or bacteria. The most common viral indicator comes from the pepper mild mottle virus (pmmov), a virus that infects plants that is commonly found in pepper products. The most common bacterial indicators come from Bacteroides HF183 and Lachnospiraceae Lachno3, both common gut bacteria. 
It's applied to sludge samples (also labeled `post_grit _removal` in the socrata dataset's `key_plot_id`), which have been concentrated in preparation for treatment. The resulting value is unitless, and tracks the proportion of individuals who's shedding behavior has changed. | + ### Post Processing methods Regardless of normalization method, the daily wastewater data is noisy; to make the indicators more useful, the NWSS has provided different methods of post-processing the data: + | Post-processing method | Description | |------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `pcr_conc_smoothed` | PCR measurements (the exact method depends on the data provider), initially provided in virus concentrations per volume of input (type depends on the [normalization method](#Normalization-methods)). This is then [smoothed](#Smoothing) using a cubic spline, then [aggregated](#Aggregation) to the given `geo_type` with a population-weighted sum. | +| `pcr_conc_smoothed` | PCR measurements (the exact method depends on the data provider), initially provided in virus concentrations per volume of input (type depends on the [normalization method](#normalization-methods)). This is then [smoothed](#smoothing) using a cubic spline, then [aggregated](#aggregation) to the given `geo_type` with a population-weighted sum. 
| | `detect_prop_15d` | The proportion of tests with SARS-CoV-2 detected, meaning a cycle threshold (Ct) value <40 for RT-qPCR or at least 3 positive droplets/partitions for RT-ddPCR, by sewershed over the prior 15-day period. The detection proportion is the 15-day rolling sum of SARS-CoV-2 detections divided by the 15-day rolling sum of the number of tests for each sample site and multiplying by 100, aggregated with a population weighted sum. The result is a percentage. | -| ptc\_15d | The percent change in SARS-CoV-2 RNA levels over the 15 days preceding `timestamp`. It is the coefficient of the linear regression of the log-transformed unsmoothed PCR concentration, expressed as a percentage change. Note that for county and higher level `geo_type`s, this is an average of the percentage change at each site, weighted by population, rather than the percentage change for the entire region. We recommend caution in the use of these signals at aggregated `geo_type`s. | +| `ptc_15d` | The percent change in SARS-CoV-2 RNA levels over the 15 days preceding `timestamp`. It is the coefficient of the linear regression of the log-transformed unsmoothed PCR concentration, expressed as a percentage change. Note that for county and higher level `geo_type`s, this is an average of the percentage change at each site, weighted by population, rather than the percentage change for the entire region. We recommend caution in the use of these signals at aggregated `geo_type`s. | | `percentile` | This metric shows whether SARS-CoV-2 virus levels at a site are currently higher or lower than past historical levels at the same site. 20% means that 80% of observed values are higher than this value, while 20% are lower. 0% means levels are the lowest they have been at the site, while 100% means they are the highest. 
Note that at county or higher level `geo_type`s, this is not the percentile for overall state levels, but the *average percentile across sites*, weighted by population, which makes it difficult to meaningfully interpret. We do not recommended its use outside of the site level. | ### Full signal list -Not every triple of post processing method, provider, and normalization actually contains data. Here is a complete list of the actual signals: -| Signal | | -| `prc_conc_smoothed_cdc_verily_flow_population` | | -| `prc_conc_smoothed_cdc_verily_microbial` | | -| `prc_conc_smoothed_cdc_nwss_flow_population` | | -| `prc_conc_smoothed_cdc_verily_microbial` | | -| `prc_conc_smoothed_cdc_wws_microbial` | | -| `detetct_prop_15d_cdc_verily_flow_population` | | -| `detetct_prop_15d_cdc_verily_microbial` | | -| `detetct_prop_15d_cdc_nwss_flow_population` | | -| `detetct_prop_15d_cdc_verily_microbial` | | -| `detetct_prop_15d_cdc_wws_microbial` | | -| `ptc_15d_cdc_verily_flow_population` | | -| `ptc_15d_cdc_verily_microbial` | | -| `ptc_15d_cdc_nwss_flow_population` | | -| `ptc_15d_cdc_verily_microbial` | | -| `ptc_15d_cdc_wws_microbial` | | -| `percentile_cdc_verily_flow_population` | | -| `percentile_cdc_verily_microbial` | | -| `percentile_cdc_nwss_flow_population` | | -| `percentile_cdc_verily_microbial` | | -| `percentile_cdc_wws_microbial` | | - -What is missing? `wws` only normalizes using `microbial` measures, so the 5 different post-processing methods are not present for `*_wws_flow_population`. +Not every triple of post processing method, provider, and normalization actually contains data. 
Here is a complete list of the actual signals, with the total population at all sites which report that signal, as of March 28, 2024: + +| Signal | First Date available | Population served on 03-28-24 | +| `pcr_conc_smoothed_cdc_verily_flow_population` | 2023-11-13 | 3,321,873 | +| `pcr_conc_smoothed_cdc_verily_microbial` | 2023-11-09 | 430,000 | +| `pcr_conc_smoothed_cdc_nwss_flow_population` | 2020-07-05 | 800,704 | +| `pcr_conc_smoothed_cdc_nwss_microbial` | 2020-08-25 | 822,992 | +| `pcr_conc_smoothed_cdc_wws_microbial` | 2022-01-09 | 10,643,646 | +| `detect_prop_15d_cdc_verily_flow_population` | 2023-11-13 | 3,321,873 | +| `detect_prop_15d_cdc_verily_microbial` | 2024-01-09 | 527,710 | +| `detect_prop_15d_cdc_nwss_flow_population` | 2020-07-05 | 800,704 | +| `detect_prop_15d_cdc_nwss_microbial` | 2021-08-25 | 3,046,772 | +| `detect_prop_15d_cdc_wws_microbial` | 2022-01-09 | 42,185,773 | +| `ptc_15d_cdc_verily_flow_population` | 2023-11-15 | 3,321,873 | +| `ptc_15d_cdc_verily_microbial` | 2024-01-11 | 448,710 | +| `ptc_15d_cdc_nwss_flow_population` | 2020-07-12 | 800,704 | +| `ptc_15d_cdc_nwss_microbial` | 2021-08-26 | 2,780,752 | +| `ptc_15d_cdc_wws_microbial` | 2022-01-10 | 41,965,773 | +| `percentile_cdc_verily_flow_population` | 2023-11-13 | 3,321,873 | +| `percentile_cdc_verily_microbial` | 2024-01-09 | 527,710 | +| `percentile_cdc_nwss_flow_population` | 2021-12-02 | 800,704 | +| `percentile_cdc_nwss_microbial` | 2021-12-05 | 3,104,972 | +| `percentile_cdc_wws_microbial` | 2022-01-09 | 42,185,773 | + +What is missing? `wws` only normalizes using `microbial` measures, so the 5 different post-processing methods are not present for `*_wws_flow_population`. TODO need to work out which Biobot's we've got ## Estimation ### Aggregation For any given day and signal, we do a population weighted sum of the target signal, with the weight depending only on non-missing and non-zero sample sites for that day in particular.
For example, say $$p_i$$ is the population at sample site $$i$$; if there are 3 sample sites in Washington on a particular day, then the weight at site $$i$$ is $$\frac{p_i}{p_1+p_2+p_3}$$. If the next day there are only 2 sample locations, the weight at site $$i$$ becomes $$\frac{p_i}{p_1+p_2}$$. @@ -98,19 +106,19 @@ Contrast this with the average level, which goes from 55 to 65, for an 18% perce ### Smoothing -The `pcr_conc_smoothed` based signals all use smoothed splines to generate the averaged value at each location. Specifically, the smoothing uses [`smooth.spline`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/smooth.spline) from R, with `spar=0.5`, a smoothing parameter. Specifically, if the value at time $$t_i$$ is given by $$y_i$$, for a function $$f$$ represented by splines this minimizes +The `pcr_conc_smoothed` based signals all use smoothing splines to generate the averaged value at each location. Specifically, the smoothing uses [`smooth.spline`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/smooth.spline) from R, with `spar=0.5`, a smoothing parameter. If the value at time $$t_i$$ is given by $$y_i$$, for a function $$f$$ represented by splines this minimizes $$ \sum_i (y_i - f(t_i))^2 + 16 \frac{\mathrm{tr}(X'X)}{\mathrm{tr}(\Sigma)} \int \big|f^{''}(z)\big|^2~\mathrm{d}z$$ -where $$X_{ij} = B_j(t_i)$$ and $$\Sigma_{ij} = \int B_j(t_i)$$ where $$B_j$$ is $$j$$th spline evaluated at $$t_i$$. +where $$X_{ij} = B_j(t_i)$$ and $$\Sigma_{ij} = \int B_i^{''}(z) B_j^{''}(z)~\mathrm{d}z$$, with $$B_j$$ the $$j$$th spline basis function. -It is important to note that this means that the value at any point in time depends on future values up to and including the present day, so its use in historical evaluation is suspect. -This is only true for `pcr_conc_smoothed`; all other smoothing is done via 15 day rolling sums.
+It is important to note that this means that the value at any point in time *depends on future values* up to and including the present day, so its use in historical evaluation is suspect. +This is only true for `pcr_conc_smoothed`; all other smoothing is done via simple 15 day trailing rolling sums. ## Limitations -The NWSS is still expanding to get coverage nationwide: to get an idea of the extent of counties currently covered, see [this map](https://www.cdc.gov/nwss/progress.html). +The NWSS is still expanding to get coverage nationwide, so it is currently an uneven sample; the largest signals above cover ~42 million people as of March 2024. Around 80% of the US population, or around 272 million people, is served by municipal wastewater collection systems. Standard errors and sample sizes are not applicable to these signals. @@ -123,17 +131,17 @@ If a sample site has too few individuals, the NWSS does not provide the detailed Due to collection, shipping, processing and reporting time, these signals are subject to some lag. Typically, this is between 4-6 days. -The raw signals are not typically subject to revisions or backfill, but due to the [cubic spline smoothing](#Smoothing), any of the `pcr_conc_smoothed` signals may change in the 4th and smaller significant figure at intervals of approximately 1-3 days; outside of the most recent data where smoothing directly relies on newly added data-points, these revisions should have approximately a mean of zero. +The raw signals are not typically subject to revisions or backfill, but due to the [cubic spline smoothing](#smoothing), any of the `pcr_conc_smoothed` signals may change in the 4th and smaller significant figure at intervals of approximately 1-3 days; outside of the most recent data where smoothing directly relies on newly added data-points, these revisions should have approximately a mean of zero. To reduce some of this variation, we have rounded to the 4th significant figure.
-There is also the possibility of the post-processing being subject to significant revisions if bugs are discovered. +There is also the possibility of the post-processing being subject to significant revisions if bugs are discovered, though this occurs on an ad-hoc basis. -Finally, whenever a new maximum occurs at a location, all historical data for a `percentile` signal has the potential to shift, potentially drastically. +The `percentile` data is subject to potentially more extensive revision if the signal undergoes significant distributional shift. This results in less jitter in the signal but the meaning of the percentiles can change quite drastically if e.g. a new wave infects an order of magnitude more people. ## Source and Licensing This indicator aggregates data originating from the [NWSS](https://www.cdc.gov/nwss/index.html). The site-level data is provided un-versioned via the socrata api as the [metric data](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Wastewater-Metric-Data/2ew6-ywp6/about_data) and [concentration data](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Concentration-in-Wastewater/g653-rqe2/about_data). -As described in the [Provider section](#Providers), the NWSS is aggregating data from [Verily](https://verily.com/solutions/public-health/wastewater), State Territorial and Local public health agencies, [Wastewater Scan](https://www.wastewaterscan.org/en), and [Biobot](https://biobot.io/). +As described in the [Provider section](#providers), the NWSS is aggregating data from [Verily](https://verily.com/solutions/public-health/wastewater), State Territorial and Local public health agencies, [Wastewater Scan](https://www.wastewaterscan.org/en), and [Biobot](https://biobot.io/). [^1]: These are not realistic values; they are merely chosen to make the example simple to explain. 
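Reviewer aside on the aggregation rule this patch documents: a minimal Python sketch, under stated assumptions, of how the weights renormalize over the sites that report on a given day. The function name `weighted_mean` and all site values and populations are hypothetical and chosen only for illustration. Skipping zero-valued sites follows the "non-missing and non-zero" wording in the Aggregation section; the actual pipeline may handle zeros differently.

```python
from typing import Optional, Sequence


def weighted_mean(values: Sequence[Optional[float]],
                  populations: Sequence[float]) -> Optional[float]:
    """Population-weighted mean over the sites that report a usable value.

    A site contributes only if its value is present and non-zero, so the
    denominator is the population served by reporting sites, not the
    population of the whole state.
    """
    reporting = [(v, p) for v, p in zip(values, populations)
                 if v is not None and v != 0]
    total_population = sum(p for _, p in reporting)
    if total_population == 0:
        return None  # no usable site that day, so no value is produced
    return sum(v * p for v, p in reporting) / total_population


# Hypothetical populations served by three sites in one state.
populations = [100, 200, 300]

# Day 1: all three sites report, so the weights are p_i / (p_1 + p_2 + p_3).
day1 = weighted_mean([2.0, 1.0, 4.0], populations)

# Day 2: the second site is missing, so the weights become p_i / (p_1 + p_3).
day2 = weighted_mean([2.0, None, 4.0], populations)
```

Because the denominator changes with the set of reporting sites, the aggregate can move from one day to the next even when every reporting site's value is unchanged.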
From 1b214fb6d537fa34550839c7a77afc9393aa6955 Mon Sep 17 00:00:00 2001 From: dsweber2 Date: Thu, 16 May 2024 16:02:48 -0500 Subject: [PATCH 4/9] don't actually have biobot --- docs/api/covidcast-signals/nwss.md | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/docs/api/covidcast-signals/nwss.md b/docs/api/covidcast-signals/nwss.md index 495c6d05f..a38febe98 100644 --- a/docs/api/covidcast-signals/nwss.md +++ b/docs/api/covidcast-signals/nwss.md @@ -34,16 +34,14 @@ To generate either the state or national level signals from the sample-site leve The signals vary across the underlying data provider, the normalization method, and the post-processing method. ### Providers As a coordinating body, the NWSS receives wastewater data through a number of providers, which have changed as the project has evolved. -Most recently, in the fall of 2023, there was a major shift in the primary direct commercial provider for the NWSS from Biobot to Verily. -Presently, the Biobot data is not present at either socrata endpoint; we are providing a fixed snapshot that was present in November 2023. +Most recently, in the fall of 2023, there was a major shift in the primary direct commercial provider for the NWSS from [Biobot](https://biobot.io/) to Verily. The available column below indicates the first date that any location had data from that source. -| Provider | Available | Description | -|--------------|-----------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `cdc_verily` | 2023/10/30-Today | Data analyzed by [Verily](https://verily.com/solutions/public-health/wastewater) on behalf of the CDC directly. 
| +| Provider | Available | Description | +|--------------|-----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `cdc_verily` | 2023/10/30-Today | Data analyzed by [Verily](https://verily.com/solutions/public-health/wastewater) on behalf of the CDC directly. | | `nwss` | 2020/06/21-Today | Data reported by the respective State, Territorial and Local Public Health Agencies; the actual processing may be done by a private lab such as Verily or Biobot, or the agency itself, or a partnering university. | -| `wws` | 2021/12/26-Today | Data analyzed by [Wastewater Scan](https://www.wastewaterscan.org/en), a Stanford/Emory nonprofit, and then shared with the NWSS. | -| `cdc_biobot` | 2020/06/21-2023/10/30 | Data analyzed by [Biobot](https://biobot.io/) on behalf of the CDC. | +| `wws` | 2021/12/26-Today | Data analyzed by [Wastewater Scan](https://www.wastewaterscan.org/en), a Stanford/Emory nonprofit, and then shared with the NWSS. | ### Normalization methods Direct viral concentration is not a clear indicator of the number and severity of cases in the sewershed. @@ -90,7 +88,7 @@ Not every triple of post processing method, provider, and normalization actually | `percentile_cdc_nwss_microbial` | 2021-12-05 | 3,104,972 | | `percentile_cdc_wws_microbial` | 2022-01-09 | 42,185,773 | -What is missing? `wws` only normalizes using `microbial` measures, so the 5 different post-processing methods are not present for `*_wws_flow_population`. TODO need to work out which Biobot's we've got +What is missing? `wws` only normalizes using `microbial` measures, so the 4 different post-processing methods are not present for `*_wws_flow_population`.
## Estimation ### Aggregation For any given day and signal, we do a population weighted sum of the target signal, with the weight depending only on non-missing and non-zero sample sites for that day in particular. For example, say $$p_i$$ is the population at sample site $$i$$; if there are 3 sample sites in Washington on a particular day, then the weight at site $$i$$ is $$\frac{p_i}{p_1+p_2+p_3}$$. If the next day there are only 2 sample locations, the weight at site $$i$$ becomes $$\frac{p_i}{p_1+p_2}$$. @@ -141,7 +139,7 @@ The `percentile` data is subject to potentially more extensive revision if the s This indicator aggregates data originating from the [NWSS](https://www.cdc.gov/nwss/index.html). The site-level data is provided un-versioned via the socrata api as the [metric data](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Wastewater-Metric-Data/2ew6-ywp6/about_data) and [concentration data](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Concentration-in-Wastewater/g653-rqe2/about_data). -As described in the [Provider section](#providers), the NWSS is aggregating data from [Verily](https://verily.com/solutions/public-health/wastewater), State Territorial and Local public health agencies, [Wastewater Scan](https://www.wastewaterscan.org/en), and [Biobot](https://biobot.io/). +As described in the [Provider section](#providers), the NWSS is aggregating data from [Verily](https://verily.com/solutions/public-health/wastewater), State Territorial and Local public health agencies, [Wastewater Scan](https://www.wastewaterscan.org/en). [^1]: These are not realistic values; they are merely chosen to make the example simple to explain. 
From 5e5e89134129b790495a211e8d6aa75d8ad55ab8 Mon Sep 17 00:00:00 2001 From: Nat DeFries <42820733+nmdefries@users.noreply.github.com> Date: Mon, 10 Jun 2024 17:14:38 -0400 Subject: [PATCH 5/9] intro and license --- docs/api/covidcast-signals/nwss.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/docs/api/covidcast-signals/nwss.md b/docs/api/covidcast-signals/nwss.md index a38febe98..d4c5880b8 100644 --- a/docs/api/covidcast-signals/nwss.md +++ b/docs/api/covidcast-signals/nwss.md @@ -8,22 +8,21 @@ grand_parent: COVIDcast Main Endpoint {: .no_toc} * **Source name:** `nwss-wastewater` -* **Earliest issue available:** DATE RELEASED TO API +* **Earliest issue available:** TODO: DATE RELEASED TO API * **Number of data revisions since 19 May 2020:** 0 * **Date of last change:** Never * **Available for:** state, nation (see [geography coding docs](../covidcast_geography.md)) * **Time type:** day (see [date format docs](../covidcast_times.md)) -* **License:** [LICENSE NAME](../covidcast_licensing.md#APPLICABLE-SECTION) +* **License:** [Public Domain US Government](https://www.usa.gov/government-works) -The National Wastewater Surveillance System (nwss) is a CDC led effort to track the presence of SARS-CoV-2 in wastewater throughout the United States. -For more, see their [official page](https://www.cdc.gov/nwss/index.html). -The project was launched in September 2020 and is ongoing. The original data for this source is provided un-versioned via the socrata api as the [metric data](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Wastewater-Metric-Data/2ew6-ywp6/about_data) and [concentration data](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Concentration-in-Wastewater/g653-rqe2/about_data). 
-This source modifies the original data in two ways: first, it splits the data based on provider, normalization, and post-processing method, and then it aggregates across `geo_values` so that the signals are available at the state and national level. +The [National Wastewater Surveillance System (NWSS)](https://www.cdc.gov/nwss/index.html) is a CDC-led effort to track the presence of SARS-CoV-2 in wastewater throughout the United States. +The project was launched in September 2020 and is ongoing. The data for this source is provided un-versioned via the Socrata API as the [NWSS Public SARS-CoV-2 Concentration in Wastewater Data](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Concentration-in-Wastewater/g653-rqe2/about_data) and the [NWSS Public SARS-CoV-2 Wastewater Metric Data](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Wastewater-Metric-Data/2ew6-ywp6/about_data), the latter containing signals derived from the concentration data. +Delphi modifies the source data in two ways: we split the data into separate signals based on provider, normalization, and post-processing method, and then aggregate it to higher geographic levels. +The source data is reported per [sampling site](https://www.cdc.gov/nwss/sampling.html), either at or upstream from a regional wastewater treatment plant (these are roughly at the county level, but are not coterminous with counties). Presently we do not provide the site-level or county-aggregated data, but will do so within the coming months. -The original source is measured at the level of [sample-site](https://www.cdc.gov/nwss/sampling.html), either at or upstream from a regional wastewater treatment plants (these are roughly at the county-level, but are not coterminous with counties); presently we do not provide the site-level or county-aggregated data, but will do so within the coming months. +State and national level signals are calculated as weighted means of the site data.
The weight assigned to a given input value is the population served by the corresponding wastewater treatment plant. Sites with missing values are implicitly treated as having a weight of 0 or a value equal to the group mean. -To generate either the state or national level signals from the sample-site level signals, we aggregate using population-served weighted sums, where the total population for a given `geo_value` is the total population served at sites with non-missing values. ## Table of contents {: .no_toc .text-delta} @@ -141,6 +140,8 @@ This indicator aggregates data originating from the [NWSS](https://www.cdc.gov/n The site-level data is provided un-versioned via the socrata api as the [metric data](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Wastewater-Metric-Data/2ew6-ywp6/about_data) and [concentration data](https://data.cdc.gov/Public-Health-Surveillance/NWSS-Public-SARS-CoV-2-Concentration-in-Wastewater/g653-rqe2/about_data). As described in the [Provider section](#providers), the NWSS is aggregating data from [Verily](https://verily.com/solutions/public-health/wastewater), State Territorial and Local public health agencies, [Wastewater Scan](https://www.wastewaterscan.org/en). +This data was originally published by the CDC, and is made available here as a convenience to the forecasting community under the terms of the original license, which is [U.S. Government Public Domain](https://www.usa.gov/government-copyright). + [^1]: These are not realistic values; they are merely chosen to make the example simple to explain. [^2]: Further complicating this is that the percent change is measured via linear regression, rather than a simple difference. However, this doesn't change the fundamental interpretation, it merely makes the slope more robust to noise. 
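Reviewer aside on the `ptc_15d` caveat that the footnotes above refer to: a minimal Python sketch, using the doc's own illustrative numbers (two sites at 10 and 100, each rising by 10), of why the mean of per-site percent changes differs from the percent change of the mean level. This is a simplification under two assumptions: it uses an unweighted mean and a simple difference, whereas the documented signal is population-weighted and regression-based. The helper name `percent_change` is hypothetical.

```python
def percent_change(before: float, after: float) -> float:
    """Simple percent change from `before` to `after`."""
    return 100.0 * (after - before) / before


# Illustrative (not realistic) levels at two sites, before and after a rise of 10.
before = [10.0, 100.0]
after = [20.0, 110.0]

# What averaging the per-site signal reports: the mean of the percent changes.
site_changes = [percent_change(b, a) for b, a in zip(before, after)]
mean_of_changes = sum(site_changes) / len(site_changes)

# What one might expect instead: the percent change of the mean level.
change_of_mean = percent_change(sum(before) / len(before),
                                sum(after) / len(after))

print(site_changes)              # per-site changes: 100% and 10%
print(mean_of_changes)           # 55%
print(round(change_of_mean, 1))  # about 18%
```

The gap between the two numbers widens as site levels become more unequal, which is why the doc advises caution with `ptc_15d` at aggregated `geo_type`s.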
From e4a574c24463d13b257fd1fb33b4162360173bcc Mon Sep 17 00:00:00 2001 From: Nat DeFries <42820733+nmdefries@users.noreply.github.com> Date: Mon, 10 Jun 2024 17:30:40 -0400 Subject: [PATCH 6/9] provider table --- docs/api/covidcast-signals/nwss.md | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/docs/api/covidcast-signals/nwss.md b/docs/api/covidcast-signals/nwss.md index d4c5880b8..7cb26f3e9 100644 --- a/docs/api/covidcast-signals/nwss.md +++ b/docs/api/covidcast-signals/nwss.md @@ -31,16 +31,18 @@ State and national level signals are calculated as weighted means of the site da ## Signal features The signals vary across the underlying data provider, the normalization method, and the post-processing method. + ### Providers -As a coordinating body, the NWSS receives wastewater data through a number of providers, which have changed as the project has evolved. -Most recently, in the fall of 2023, there was a major shift in the primary direct commercial provider for the NWSS from [Biobot](https://biobot.io/) to Verily. -The available column below indicates the first date that any location had data from that source. - -| Provider | Available | Description | -|--------------|-----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `cdc_verily` | 2023/10/30-Today | Data analyzed by [Verily](https://verily.com/solutions/public-health/wastewater) on behalf of the CDC directly. | -| `nwss` | 2020/06/21-Today | Data reported by the respective State, Territorial and Local Public Health Agencies; the actual processing may be done by a private lab such as Verily or Biobot, or the agency itself, or a partnering university. 
| -| `wws` | 2021/12/26-Today | Data analyzed by [Wastewater Scan](https://www.wastewaterscan.org/en), a Stanford/Emory nonprofit, and then shared with the NWSS. | +The NWSS acts as a coordinating body, receiving wastewater data through a number of providers. Data providers can change as the project has evolved. +Most recently, in autumn 2023, the primary direct commercial provider for the NWSS changed from [Biobot](https://biobot.io/) to [Verily](https://publichealth.verily.com/). +The following table shows the history of data providers: + +| Provider | Available | Description | +|-|-|-| +| `cdc_verily` | 2023/10/30-Today | Data analyzed by [Verily](https://verily.com/solutions/public-health/wastewater) on behalf of the CDC directly. | +| `nwss` | 2020/06/21-Today | Data reported by the respective state, territorial, and local public health agencies; the actual processing may be done by a private lab such as Verily or Biobot, or the agency itself, or a partnering university. | +| `wws` | 2021/12/26-Today | Data analyzed by [Wastewater Scan](https://www.wastewaterscan.org/en), a Stanford/Emory nonprofit, and then shared with the NWSS. | +| `biobot` | ??-?? | Data analyzed by [Biobot](https://biobot.io/) and then shared with the NWSS. | ### Normalization methods Direct viral concentration is not a clear indicator of the number and severity of cases in the sewershed. 
From e45bce5473d2c9cc0ee8adf1a03eb814a9d4ee62 Mon Sep 17 00:00:00 2001 From: Nat DeFries <42820733+nmdefries@users.noreply.github.com> Date: Tue, 11 Jun 2024 10:35:40 -0400 Subject: [PATCH 7/9] normalization method table --- docs/api/covidcast-signals/nwss.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/docs/api/covidcast-signals/nwss.md b/docs/api/covidcast-signals/nwss.md index 7cb26f3e9..94f867836 100644 --- a/docs/api/covidcast-signals/nwss.md +++ b/docs/api/covidcast-signals/nwss.md @@ -45,14 +45,15 @@ The following table shows the history of data providers: | `biobot` | ??-?? | Data analyzed by [Biobot](https://biobot.io/) and then shared with the NWSS. | ### Normalization methods -Direct viral concentration is not a clear indicator of the number and severity of cases in the sewershed. -In mixed drainage/sewage systems for example, the effluent will be significantly diluted whenever there is rain. -There are two general categories of methods used to normalize viral concentration to get a better indicator in the NWSS datasets: - -| Normalization method | Description | -|----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `flow_population` | This uses $$\frac{v\cdot r}{p}$$, where $$v$$ is viral concentration, $$r$$ is flow 
rate, and $$p$$ is population served. The resulting units are viral gene copies per person per day. It's applied to concentrations measured from unconcentrated wastewater (also labeled `raw_wastewater` in the socrata dataset's `key_plot_id`). It tracks the total number of individuals whose shedding behavior has changed. | -| `microbial` | This divides by the concentration of one of several potential fecal biomarkers commonly found throughout the population. These include molecular indicators of either viruses or bacteria. The most common viral indicator comes from the pepper mild mottle virus (pmmov), a virus that infects plants that is commonly found in pepper products. The most common bacterial indicators come from Bacteroides HF183 and Lachnospiraceae Lachno3, both common gut bacteria. It's applied to sludge samples (also labeled `post_grit _removal` in the socrata dataset's `key_plot_id`), which have been concentrated in preparation for treatment. The resulting value is unitless, and tracks the proportion of individuals who's shedding behavior has changed. | +Direct viral concentration is not a robust indicator of the number and severity of cases in the sewershed. +In wastewater systems that mix drainage and sewage, for example, the effluent will be significantly diluted whenever there is rain. +In order to produce indicators that are more strongly related to COVID levels, signals are corrected in a few different ways. +The two approaches used in the NWSS datasets to normalize viral concentration are as follows: + +| Normalization method | Description | +|-|-| +| `flow_population` | This is calculated as $$\frac{v\cdot r}{p}$$, where $$v$$ is measured viral concentration, $$r$$ is measured flow rate, and $$p$$ is population served. This normalization method is applied to concentrations $$v$$ measured from raw (unconcentrated) wastewater (labeled `raw_wastewater` in the Socrata dataset's `key_plot_id` field). 
The resulting value is in units of viral gene copies per person per day. It tracks the total number of individuals whose shedding behavior has changed. | +| `microbial` | This divides a measurement by the concentration of one of several potential fecal biomarkers. These are molecular indicators of either viruses or bacteria commonly found throughout the population. The most common viral indicator comes from the pepper mild mottle virus (PMMoV), a virus that infects plants and is commonly found in pepper products. The most common bacterial indicators come from Bacteroides HF183 and Lachnospiraceae Lachno3, both common gut bacteria. This normalization method is applied to sludge samples (labeled `post_grit_removal` in the Socrata dataset's `key_plot_id` field), which have been concentrated in preparation for treatment. The resulting value is unitless, and tracks the proportion of individuals whose shedding behavior has changed. | ### Post Processing methods Regardless of normalization method, the daily wastewater data is noisy; to make the indicators more useful, the NWSS has provided different methods of post-processing the data: From 279d34a6f5abc70b74e7bc0377c08607b34b67cc Mon Sep 17 00:00:00 2001 From: Nat DeFries <42820733+nmdefries@users.noreply.github.com> Date: Tue, 11 Jun 2024 11:04:53 -0400 Subject: [PATCH 8/9] formatting --- docs/api/covidcast-signals/nwss.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/docs/api/covidcast-signals/nwss.md b/docs/api/covidcast-signals/nwss.md index 94f867836..0a5e32aac 100644 --- a/docs/api/covidcast-signals/nwss.md +++ b/docs/api/covidcast-signals/nwss.md @@ -56,19 +56,20 @@ The two approaches used in the NWSS datasets to normalize viral concentration ar | `microbial` | This divides a measurement by the concentration of one of several potential fecal biomarkers. These are molecular indicators of either viruses or bacteria commonly found throughout the population. 
The most common viral indicator comes from the pepper mild mottle virus (PMMoV), a virus that infects plants and is commonly found in pepper products. The most common bacterial indicators come from Bacteroides HF183 and Lachnospiraceae Lachno3, both common gut bacteria. This normalization method is applied to sludge samples (labeled `post_grit_removal` in the Socrata dataset's `key_plot_id` field), which have been concentrated in preparation for treatment. The resulting value is unitless, and tracks the proportion of individuals who's shedding behavior has changed. | ### Post Processing methods -Regardless of normalization method, the daily wastewater data is noisy; to make the indicators more useful, the NWSS has provided different methods of post-processing the data: +Regardless of normalization method, the daily wastewater data is noisy; to make the indicators more useful, the NWSS has provided versions of the data that are post-processed in different ways. -| Post-processing method | Description | -|------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `pcr_conc_smoothed` | PCR measurements (the exact method depends on the data provider), initially provided in virus concentrations per volume of input (type depends on the [normalization method](#normalization-methods)). This is then [smoothed](#smoothing) using a cubic spline, then [aggregated](#aggregation) to the given `geo_type` with a population-weighted sum. 
| -| `detect_prop_15d` | The proportion of tests with SARS-CoV-2 detected, meaning a cycle threshold (Ct) value <40 for RT-qPCR or at least 3 positive droplets/partitions for RT-ddPCR, by sewershed over the prior 15-day period. The detection proportion is the 15-day rolling sum of SARS-CoV-2 detections divided by the 15-day rolling sum of the number of tests for each sample site and multiplying by 100, aggregated with a population weighted sum. The result is a percentage. | -| `ptc_15d` | The percent change in SARS-CoV-2 RNA levels over the 15 days preceding `timestamp`. It is the coefficient of the linear regression of the log-transformed unsmoothed PCR concentration, expressed as a percentage change. Note that for county and higher level `geo_type`s, this is an average of the percentage change at each site, weighted by population, rather than the percentage change for the entire region. We recommend caution in the use of these signals at aggregated `geo_type`s. | -| `percentile` | This metric shows whether SARS-CoV-2 virus levels at a site are currently higher or lower than past historical levels at the same site. 20% means that 80% of observed values are higher than this value, while 20% are lower. 0% means levels are the lowest they have been at the site, while 100% means they are the highest. Note that at county or higher level `geo_type`s, this is not the percentile for overall state levels, but the *average percentile across sites*, weighted by population, which makes it difficult to meaningfully interpret. We do not recommended its use outside of the site level. | +| Post-processing method | Description | +|-|-| +| `pcr_conc_smoothed` | PCR measurements (the exact method depends on the data provider), initially provided in virus concentrations per volume of input (type depends on the [normalization method](#normalization-methods)). 
This is then [smoothed](#smoothing) using a cubic spline, then [aggregated](#aggregation) to the given `geo_type` with a population-weighted sum. | +| `detect_prop_15d` | The proportion of tests with SARS-CoV-2 detected, meaning a cycle threshold (Ct) value <40 for RT-qPCR or at least 3 positive droplets/partitions for RT-ddPCR, by sewershed over the prior 15-day period. The detection proportion is the 15-day rolling sum of SARS-CoV-2 detections divided by the 15-day rolling sum of the number of tests for each sample site and multiplied by 100, aggregated with a population-weighted sum. The result is a percentage. | +| `ptc_15d` | The percent change in SARS-CoV-2 RNA levels over the 15 days preceding `timestamp`. It is the coefficient of the linear regression of the log-transformed unsmoothed PCR concentration, expressed as a percentage change. Note that for county and higher level `geo_type`s, this is an average of the percentage change at each site, weighted by population, rather than the percentage change for the entire region. We recommend caution in the use of these signals at aggregated `geo_type`s. | +| `percentile` | This metric shows whether SARS-CoV-2 virus levels at a site are currently higher or lower than past historical levels at the same site. 20% means that 80% of observed values are higher than this value, while 20% are lower. 0% means levels are the lowest they have been at the site, while 100% means they are the highest. Note that at county or higher level `geo_type`s, this is not the percentile for overall state levels, but the *average percentile across sites*, weighted by population, which makes it difficult to meaningfully interpret. We do not recommend its use outside of the site level. | ### Full signal list Not every triple of post processing method, provider, and normalization actually contains data. 
Here is a complete list of the actual signals, with the total population at all sites which report that signal, as of March 28, 2024: | Signal | First Date available | Population served on 03-28-24 | +|-|-|-| | `prc_conc_smoothed_cdc_verily_flow_population` | 2023-11-13 | 3,321,873 | | `prc_conc_smoothed_cdc_verily_microbial` | 2023-11-09 | 430,000 | | `prc_conc_smoothed_cdc_nwss_flow_population` | 2020-07-05 | 800,704 | @@ -91,7 +92,9 @@ Not every triple of post processing method, provider, and normalization actually | `percentile_cdc_wws_microbial` | 2022-01-09 | 42,185,773 | What is missing? `wws` only normalizes using `microbial` measures, so the 5 different post-processing methods are not present for `*_wws_flow_population`. + ## Estimation + ### Aggregation For any given day and signal, we do a population weighted sum of the target signal, with the weight depending only on non-missing and non-zero sample sites for that day in particular. For example, say $$p_i$$ is the population at sample site $$i$$; if there are 3 sample sites in Washington on a particular day, then the weight at site $$i$$ is $$\frac{p_i}{p_1+p_2+p_3}$$. If the next day there are only 2 sample locations, the weight at site $$i$$ becomes $$\frac{p_i}{p_1+p_2}$$. @@ -102,11 +105,9 @@ Contrast this with the average level, which goes from 55 to 65, for an 18% perce `percentile` has a similar difficulty, although it is a more involved calculation that uses all other values in the time series, rather than just a 15 day window. - - ### Smoothing -The `pcr_conc_smoothed` based signals all use smoothed splines to generate the averaged value at each location. Specifically, the smoothing uses [`smooth.spline`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/smooth.spline) from R, with `spar=0.5`, a smoothing parameter. 
If the value at time $$t_i$$ is given by $$y_i$$, for a function $$f$$ represented by splines this minimizes +The `pcr_conc_smoothed` based signals all use smoothed splines to generate an average value at each location. Specifically, the smoothing uses [`smooth.spline`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/smooth.spline) from R, with `spar=0.5`, a smoothing parameter. If the value at time $$t_i$$ is given by $$y_i$$, for a function $$f$$ represented by splines this minimizes $$ \sum_i (y_i - f(t_i))^2 + 16 \frac{\mathrm{tr}(X'X)}{\mathrm{tr}(\Sigma)} \int \big|f^{''}(z)\big|^2~\mathrm{d}z$$ @@ -117,7 +118,6 @@ This is only true for `pcr_conc_smoothed`; all other smoothing is done via simpl ## Limitations - The NWSS is still expanding to get coverage nationwide, so it is currently an uneven sample; the largest signals above cover ~42 million people as of March 2024. Around 80% of the US is served by municipal wastewater collection systems, or around 272 million. Standard errors and sample sizes are not applicable to these signals. From f14edaa41b5263825cb2f08795aa1a31ef7eb667 Mon Sep 17 00:00:00 2001 From: Nat DeFries <42820733+nmdefries@users.noreply.github.com> Date: Mon, 17 Jun 2024 15:37:39 -0400 Subject: [PATCH 9/9] todos --- docs/api/covidcast-signals/nwss.md | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/docs/api/covidcast-signals/nwss.md b/docs/api/covidcast-signals/nwss.md index 0a5e32aac..d805ccf60 100644 --- a/docs/api/covidcast-signals/nwss.md +++ b/docs/api/covidcast-signals/nwss.md @@ -32,9 +32,11 @@ State and national level signals are calculated as weighted means of the site da ## Signal features The signals vary across the underlying data provider, the normalization method, and the post-processing method. +PCR used to measure viral concentrations + ### Providers The NWSS acts as a coordinating body, receiving wastewater data through a number of providers. 
Data providers can change as the project has evolved. -Most recently, in autumn 2023, the primary direct commercial provider for the NWSS changed from [Biobot](https://biobot.io/) to [Verily](https://publichealth.verily.com/). +Most recently, in autumn 2023, the primary direct commercial provider for the NWSS changed from [Biobot](https://biobot.io/) to [Verily](https://publichealth.verily.com/). The measurement method, and thus the meaning of the reported values, varies by provider. Data from different providers varies widely in magnitude for the same nominal reporting units. The following table shows the history of data providers: | Provider | Available | Description | @@ -60,11 +62,13 @@ Regardless of normalization method, the daily wastewater data is noisy; to make | Post-processing method | Description | |-|-| -| `pcr_conc_smoothed` | PCR measurements (the exact method depends on the data provider), initially provided in virus concentrations per volume of input (type depends on the [normalization method](#normalization-methods)). This is then [smoothed](#smoothing) using a cubic spline, then [aggregated](#aggregation) to the given `geo_type` with a population-weighted sum. | +| `pcr_conc_smoothed` | PCR measurements (the exact method depends on the data provider), initially provided in virus concentrations per volume of input (type depends on the [normalization method](#normalization-methods)). This is then [smoothed](#smoothing) using a cubic spline over 14 days, then [aggregated](#aggregation) to the given `geo_type` with a population-weighted sum. | | `detect_prop_15d` | The proportion of tests with SARS-CoV-2 detected, meaning a cycle threshold (Ct) value <40 for RT-qPCR or at least 3 positive droplets/partitions for RT-ddPCR, by sewershed over the prior 15-day period. The detection proportion is the 15-day rolling sum of SARS-CoV-2 detections divided by the 15-day rolling sum of the number of tests for each sample site and multiplied by 100, aggregated with a population-weighted sum. 
The result is a percentage. | -| `ptc_15d` | The percent change in SARS-CoV-2 RNA levels over the 15 days preceding `timestamp`. It is the coefficient of the linear regression of the log-transformed unsmoothed PCR concentration, expressed as a percentage change. Note that for county and higher level `geo_type`s, this is an average of the percentage change at each site, weighted by population, rather than the percentage change for the entire region. We recommend caution in the use of these signals at aggregated `geo_type`s. | +| `ptc_15d` | The percent change in SARS-CoV-2 RNA levels over the 15 days preceding `timestamp`. It is the coefficient of the linear regression of the log-transformed unsmoothed PCR concentration versus time, expressed as a percentage change. Note that for county and higher level `geo_type`s, this is an average of the percentage change at each site, weighted by population, rather than the percentage change for the entire region. We recommend caution in the use of these signals at aggregated `geo_type`s. | | `percentile` | This metric shows whether SARS-CoV-2 virus levels at a site are currently higher or lower than past historical levels at the same site. 20% means that 80% of observed values are higher than this value, while 20% are lower. 0% means levels are the lowest they have been at the site, while 100% means they are the highest. Note that at county or higher level `geo_type`s, this is not the percentile for overall state levels, but the *average percentile across sites*, weighted by population, which makes it difficult to meaningfully interpret. We do not recommend its use outside of the site level. | +TODO: don't aggregate percentile + ### Full signal list Not every triple of post processing method, provider, and normalization actually contains data. 
Here is a complete list of the actual signals, with the total population at all sites which report that signal, as of March 28, 2024: @@ -103,6 +107,8 @@ For example, say we have two sample sites, with initial levels at 10 and 100, an Their respective percentage increases are 100% and 10%, so the average percent increase is 55%. Contrast this with the average level, which goes from 55 to 65, for an 18% increase in the average[^2]. +TODO + `percentile` has a similar difficulty, although it is a more involved calculation that uses all other values in the time series, rather than just a 15-day window. ### Smoothing @@ -122,6 +128,8 @@ The NWSS is still expanding to get coverage nationwide, so it is currently an un Standard errors and sample sizes are not applicable to these signals. +TODO: cubic spline method may change over time + ## Missingness If a sample site has too few individuals, the NWSS does not provide the detailed data, so we cannot include it in our aggregations.