Skip to content

"mutation" mode for statistics #2982

Closed as not planned
Closed as not planned
@petrelharp

Description

@petrelharp

In #2948 @tforest is working on adding time windows to statistics. How this should work for mode="branch" is clear; however, it is not clear for mode="site". (It doesn't make sense to have time windows for mode="node", btw.) The reason is that site mode sums over all alleles (thus mimicing what happens with real data); and if there's been multiple mutations, there could be more than one mutation at different times that led to the same allele.

So, what we'd really like to do, and what people probably mostly imagine is happening with mode="site", is to sum over all mutations, rather than alleles. This would be essentially equivalent to mode="branch", but measuring branch area by counting how many mutations are on it instead of doing span times length. So - we're proposing adding a new mode, called "mutation" to statistics, that does this.

In a bit more detail. Recall that to compute a statistic we have some weights and a summary function, $f()$. For a node $n$ let $w_T(n)$ be the total weight of all samples below the node in tree $T$. Then the branch-mode statistic is
$$\sum_T |T| \sum_n \ell_T(n) f(w_T(n)) ,$$
where $|T|$ is the span of the tree and $\ell_T(n)$ is the length of the branch above $n$ in tree $T$. And, if we define $w_s(a)$ to be the total weight of all samples carrying allele $a$ at site $s$, then the site-mode stat is
$$\sum_s \sum_a f(w_s(a)) .$$
We can rewrite this as a sum over trees, pedantically, as
$$\sum_T \sum_{s \in T} \sum_{a \in s} f(\sum_{d_s(m) = a} w_T(n_m)),$$
where the sums are over, respectively, trees; sites in that tree; and alleles at that site; while "$\sum_{d_s(m) = a}$" means "the sum over all mutations at site $s$ whose derived allele is $a$"; and $n_m$ is the node for mutation $m$. This looks complicated but it's just what you'd imagine. So, we're proposing that mutation-mode just doesn't group together the distinct mutations producing the same allele:
$$\sum_T \sum_{s \in T} \sum_{a \in s} \sum_{d_s(m) = a} f(w_T(n_m)) .$$
If polarised=False, then the sum would also include "the root", i.e., include a term for $f(\bar w - \sum_m w_T(n_m))$, as if there were a mutation to the ancestral state at the root as well. (And, maybe the stat should be polarised by default, unlike other stats?)

There have been some other requests for (essentially) this, mostly around divergence. By trying to compute exactly-what-you-get-from-sequences (in mode="site") we have in a sense removed the advantage of the tree sequence that lets us distinguish one from multiple mutations (in principle).

Finally: the branch stat is the expected value of the site stat, given the trees, under infinite sites neutral mutations. This mode would remove that last caveat: the branch stat is the expected value of the mutation stat, as long as mutations are neutral (and Poisson).

Metadata

Metadata

Assignees

No one assigned

    Labels

    futureIssues that are closed as they are not planned in the medium-term, but which are still desirable.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions