Description
In #2948 @tforest is working on adding time windows to statistics. How this should work for `mode="branch"` is clear; however, it is not clear for `mode="site"`. (It doesn't make sense to have time windows for `mode="node"`, btw.) The reason is that site mode sums over all alleles (thus mimicking what happens with real data), and if there have been multiple mutations at a site, more than one mutation, at different times, could have led to the same allele.
So, what we'd really like to do, and what people probably mostly imagine is happening with `mode="site"`, is to sum over all mutations, rather than alleles. This would be essentially equivalent to `mode="branch"`, but measuring branch area by counting how many mutations are on the branch instead of multiplying span by length. So we're proposing adding a new mode to statistics, called `"mutation"`, that does this.
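To make the difference concrete, here is a minimal toy sketch in plain Python. It does not use tskit, and every name in it (`f`, `site_style`, `mutation_style`, the `mutations` list) is hypothetical. It assumes a polarised statistic, a diversity-like summary function, unit weights per sample, and two non-nested mutations at one site that both produce the same derived allele.

```python
# Toy illustration only: no tskit calls, all names here are hypothetical.

def f(x, n):
    """Diversity-like summary function: pairs split between x and n - x samples."""
    return x * (n - x)

n = 4  # number of samples

# One site with ancestral state "A". Two separate (non-nested) mutations
# produce the same derived allele "T": one above samples {0, 1} and a
# recurrent one above sample {3}.
mutations = [
    {"derived_state": "T", "samples_below": {0, 1}},
    {"derived_state": "T", "samples_below": {3}},
]

def site_style(mutations, n):
    """Sum f over derived alleles: recurrent mutations are pooled per allele."""
    carriers = {}
    for m in mutations:
        carriers.setdefault(m["derived_state"], set()).update(m["samples_below"])
    return sum(f(len(s), n) for s in carriers.values())

def mutation_style(mutations, n):
    """Sum f over mutations: each mutation contributes its own term."""
    return sum(f(len(m["samples_below"]), n) for m in mutations)

site_value = site_style(mutations, n)          # f(3) = 3 * 1 = 3
mutation_value = mutation_style(mutations, n)  # f(2) + f(1) = 4 + 3 = 7
```

In the allele-based sum the two mutations collapse into a single term with no single well-defined time, whereas in the mutation-based sum each term belongs to exactly one mutation, so it can be assigned to a time window.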
In a bit more detail. Recall that to compute a statistic we have some weights and a summary function,
where
We can rewrite this as a sum over trees, pedantically, as
where the sums are over, respectively, trees; sites in that tree; and alleles at that site; while "$\sum_{d_s(m) = a}$" means "the sum over all mutations at site
If `polarised=False`, then the sum would also include "the root", i.e., include a term for
There have been some other requests for (essentially) this, mostly around divergence. By trying to compute exactly what you would get from sequences (in `mode="site"`), we have in a sense given up the tree sequence's advantage of letting us distinguish, in principle, one mutation from several.
Finally: the branch stat is the expected value of the site stat, given the trees, under infinite-sites neutral mutations. This mode would remove the infinite-sites caveat: the branch stat is the expected value of the mutation stat, as long as mutations are neutral (and Poisson).
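A short sketch of why this should hold, using notation assumed here rather than taken from the text above: let $N_b$ be the number of mutations on branch $b$ of tree $T$, $\ell_b$ the length of that branch, $L_T$ the span of the tree, $w_b$ the weights below $b$, and $\mu$ the mutation rate per unit of sequence length and time. If mutations are neutral and Poisson, then given the trees $N_b$ has mean $\mu L_T \ell_b$, and by linearity of expectation

$$
\mathbb{E}\Big[\sum_T \sum_{b \in T} N_b \, f(w_b)\Big]
  = \sum_T \sum_{b \in T} \mathbb{E}[N_b] \, f(w_b)
  = \mu \sum_T L_T \sum_{b \in T} \ell_b \, f(w_b) .
$$

The right-hand side is $\mu$ times the (unnormalised) branch stat. No infinite-sites assumption is needed in this sketch, because each mutation contributes its own term regardless of whether other mutations hit the same site.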