Skip to content

Commit 4ef2d28

Browse files
authored
Merge pull request #106 from hyanwong/tskit-3-7
Some tutorial updates for tskit 0.3.7
2 parents f2c969a + bdc2b21 commit 4ef2d28

File tree

6 files changed

+136
-27
lines changed

6 files changed

+136
-27
lines changed

_config.yml

+2-2
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,9 @@
11
# Book settings
22
# Learn more at https://jupyterbook.org/customize/config.html
33

4-
title: Tskit Tutorials
4+
title: Tree Sequence Tutorials
55
author: Tskit Developers
6-
# logo: logo.png
6+
logo: _static/tskit_logo.svg
77

88
# Force re-execution of notebooks on each build.
99
# See https://jupyterbook.org/content/execute.html

_static/tskit_logo.svg

+30
Loading

basics.md

+7-7
Original file line numberDiff line numberDiff line change
@@ -119,9 +119,9 @@ SVG(ts.draw_svg(y_axis=True, y_gridlines=True, time_scale="rank"))
119119
Each tree records the lines of descent along which a piece of DNA has been
120120
inherited (ignore for the moment the red symbols, which represent a mutation).
121121
For example, the first tree tells us that DNA from ancestral genome 7 duplicated
122-
to produce two lineages, which ended up in sample genome 1 and sample genome 4 in the
123-
current population (since this pattern is seen in all trees, this must have happened
124-
for all the DNA along the entire 1000 base pair genome).
122+
to produce two lineages, which ended up in genomes 1 and 4, both of which exist in the
123+
current population. In fact, since this pattern is seen in all trees, these particular
124+
lines of inheritance were taken by all the DNA in this 1000 base pair genome.
125125

126126

127127
(sec_basics_terminology_nodes)=
@@ -274,17 +274,17 @@ for mutation in ts.mutations():
274274

275275
The mutation can have a {attr}`~Mutation.time` or if, as in this case, the times of
276276
mutations in the tree sequence are unknown, all mutations can have the special NaN value
277-
``tskit.UNKNOWN_TIME``. Notice that the genomic position of the mutation is not included.
278-
Instead, that is a property of the _site_ to which the mutation refers, in this case,
279-
site ID 0 (which happens to be at position 257):
277+
{data}`tskit.UNKNOWN_TIME`. Notice that the genomic position of the mutation is not
278+
included. Instead, that is a property of the _site_ to which the mutation refers, in
279+
this case, site ID 0 (which happens to be at position 751):
280280

281281
```{code-cell} ipython3
282282
print(ts.site(0)) # For convenience, the Python API also returns the mutations at the site
283283
```
284284

285285
In the plot above, since the the only mutation is above node 8 in the last tree, and has
286286
a {attr}`~Mutation.derived_state` of "G", we know that the samples descending from node
287-
8 in the first tree (sample genomes 2, 4, and 5) have a "G" at {attr}`~Site.position` 257,
287+
8 in the first tree (sample genomes 2, 3, and 5) have a "G" at {attr}`~Site.position` 751,
288288
while the others have the {attr}`~Site.ancestral_state` of "T". This means that Ada is
289289
homozygous for "T", Bob is homozygous for "G", and Cat is heterozygous "T/G".
290290
In other words the ancestral state and the details of any mutations at that site,

getting_started.md

+8-8
Original file line numberDiff line numberDiff line change
@@ -76,7 +76,7 @@ contains 40 *sample nodes*, one for each genome.
7676

7777
## Processing trees
7878

79-
A common idiom is to iterate over all the {class}`tree objects<Tree>` in a tree
79+
A common idiom is to iterate over all the {class}`Tree` objects in a tree
8080
sequence. This process underlies many tree sequence algorithms, including those we'll
8181
encounter later in this tutorial for calculating
8282
{ref}`population genetic statistics<tskit:sec_stats>`.
@@ -137,7 +137,7 @@ Now that we know all trees have coalesced, we know that at each position in the
137137
all the 40 sample nodes must have one most recent common ancestor (MRCA). Below, we
138138
iterate over the trees, finding the IDs of the root (MRCA) node for each tree. The
139139
time of this root node can be found via the {meth}`tskit.TreeSequence.node` method, which
140-
returns a {class}`node object<Node>` whose attributes include the node time:
140+
returns a {class}`Node` object whose attributes include the node time:
141141

142142
```{code-cell} ipython3
143143
import matplotlib.pyplot as plt
@@ -194,7 +194,6 @@ for {meth}`~TreeSequence.simplify`.
194194
It can often be helpful to slim down a tree sequence so that it represents the genealogy
195195
of a smaller subset of the original samples. This can be done using the powerful
196196
{meth}`TreeSequence.simplify` method.
197-
.
198197

199198
The {meth}`TreeSequence.draw_svg` method allows us to draw
200199
more than one tree: either the entire tree sequence, or
@@ -233,11 +232,11 @@ sites or mutations in your analyses.
233232
:::
234233

235234
For many purposes it may be better to focus on the genealogy of your samples, rather than
236-
the sites and
237-
mutations that
235+
the {ref}`sites<sec_data_model_definitions_site>` and
236+
{ref}`mutations<sec_data_model_definitions_mutation>` that
238237
{ref}`define <sec_what_is_dna_data>` the genome sequence itself. Nevertheless,
239-
{program}`tskit` also provides efficient ways to return {class}`site objects<Site>` and
240-
{class}`mutation objects<Mutation>` from a tree sequence.
238+
{program}`tskit` also provides efficient ways to return {class}`Site` object and
239+
{class}`Mutation` objects from a tree sequence.
241240
For instance, under the finite sites model of mutation that we used above, multiple mutations
242241
can occur at some sites, and we can identify them by iterating over the sites using the
243242
{meth}`TreeSequence.sites` method:
@@ -512,7 +511,8 @@ in rough order of importance:
512511
* {meth}`~TreeSequence.simplify()` reduces the number of sample nodes in the tree
513512
sequence to a specified subset
514513
* {meth}`~TreeSequence.keep_intervals()` (or its complement,
515-
{meth}`~TreeSequence.delete_intervals()`) deletes unwanted parts of the genome
514+
{meth}`~TreeSequence.delete_intervals()`) removes genetic information from
515+
specific regions of the genome
516516
* {meth}`~TreeSequence.draw_svg()` plots tree sequences (and {meth}`Tree.draw_svg()`
517517
plots trees)
518518
* {meth}`~TreeSequence.at()` returns a tree at a particular genomic position

intro.md

+1-3
Original file line numberDiff line numberDiff line change
@@ -13,9 +13,7 @@ kernelspec:
1313

1414
(sec_intro)=
1515

16-
# Tree sequence tutorials
17-
18-
**Welcome!**
16+
# Welcome!
1917

2018
This site contains a number of tutorials to develop your understanding of
2119
[tree sequences](https://tskit.dev/learn.html#what) and software programs,

viz.md

+88-7
Original file line numberDiff line numberDiff line change
@@ -36,9 +36,14 @@ def viz_ts():
3636
ts_small.dump("data/viz_ts_small.trees")
3737
3838
ts_small_mutated = msprime.sim_mutations(ts_small, rate=1e-7, random_seed=342)
39+
# 3rd tree should have first site with 2 muts
40+
first_site_tree_2 = next(ts_small_mutated.at_index(2).sites())
41+
assert len(first_site_tree_2.mutations) == 2
42+
# mutation 8 should be above node 16 in the 1st tree
43+
assert ts_small_mutated.site(8).mutations[0].id == 8
44+
assert ts_small_mutated.site(8).mutations[0].node == 16
3945
ts_small_mutated.dump("data/viz_ts_small_mutated.trees")
4046
41-
4247
def viz_root_mut():
4348
"""
4449
Taken from the drawing unit tests
@@ -758,7 +763,8 @@ y_ticks = np.delete(y_ticks, np.argwhere(np.ediff1d(y_ticks) <= 0.01))
758763
759764
svg = ts.draw_svg(size=(1000, 350), y_axis=True, y_gridlines=True, y_ticks=y_ticks, style=(
760765
".tree .lab {font-family: sans-serif}"
761-
".x-axis .tick .lab {text-anchor: start; transform: rotate(90deg) translate(8px)}"
766+
# Normal X axis tick labels have dominant baseline: hanging, but it needs centring when rotated
767+
".x-axis .tick .lab {text-anchor: start; dominant-baseline: central; transform: rotate(90deg)}"
762768
".y-axis .grid {stroke: #DDDDDD}"
763769
+ ".tree :not(.leaf).node > .lab {transform: translate(0,0); text-anchor:middle; fill: white}"
764770
+ ".tree :not(.leaf).node > .sym {transform: scale(3.5)}"
@@ -789,11 +795,12 @@ SVG(ts.draw_svg(style=css_string, time_scale="rank", x_lim=[0, 30]))
789795
```
790796

791797
#### Leaf, sample & isolated nodes
792-
By default, sample nodes are square and non-sample nodes circular. However, neither need
793-
to be at time 0. Moreover, leaves need not be samples, and samples need not be leaves.
794-
Here we change the previous tree sequence to make some leaves non-samples and some samples
795-
internal nodes. To highlight the change, we have plotted sample nodes in green, and
796-
leaf nodes (if not samples) in blue.
798+
By default, sample nodes are square and non-sample nodes circular (at the moment this
799+
can't easily be changed). However, neither need to be at specific times: sample nodes can
800+
be at times other than 0, and nonsample nodes can be at time 0. Moreover, leaves need not
801+
be samples, and samples need not be leaves. Here we change the previous tree sequence to
802+
make some leaves non-samples and some samples internal nodes. To highlight the change,
803+
we have plotted sample nodes in green, and leaf nodes (if not samples) in blue.
797804

798805
```{code-cell} ipython3
799806
:"tags": ["hide-input"]
@@ -812,6 +819,80 @@ means that there can be sample nodes which are "isolated" in a tree. These are d
812819
unconnected to the main topology in one or more trees (e.g. nodes 7 and 8 above).
813820
:::
814821

822+
#### Tick labels and gridlines
823+
824+
Y tick labels can be specified explicitly, which allows time scales to be plotted
825+
e.g. in years even if the tree sequence ticks in generations. The grid lines associated
826+
with each y tick can also be changed or even hidden individually using the CSS
827+
[nth-child pseudo-selector](https://www.w3.org/TR/2018/REC-selectors-3-20181106/#nth-child-pseudo),
828+
where tickmarks are indexed from the bottom. This is used in
829+
the {ref}`sec_msprime_introgression` tutorial to show lines behind the trees at specific,
830+
important times. Below we show a slightly simpler example than in that tutorial,
831+
keeping node and mutation symbols in black, but colouring gridlines instead:
832+
833+
```{code-cell} ipython3
834+
:"tags": ["hide-input"]
835+
# Function from the introgression tutorial - see there for justification
836+
import msprime
837+
time_units = 1000 / 25 # Conversion factor for kya to generations
838+
839+
def run_simulation(sequence_length, random_seed=None):
840+
demography = msprime.Demography()
841+
# The same size for all populations; highly unrealistic!
842+
Ne = 10**4
843+
demography.add_population(name="Africa", initial_size=Ne)
844+
demography.add_population(name="Eurasia", initial_size=Ne)
845+
demography.add_population(name="Neanderthal", initial_size=Ne)
846+
847+
# 2% introgression 50 kya
848+
demography.add_mass_migration(
849+
time=50 * time_units, source='Eurasia', dest='Neanderthal', proportion=0.02)
850+
# Eurasian 'merges' backwards in time into Africa population, 70 kya
851+
demography.add_mass_migration(
852+
time=70 * time_units, source='Eurasia', dest='Africa', proportion=1)
853+
# Neanderthal 'merges' backwards in time into African population, 300 kya
854+
demography.add_mass_migration(
855+
time=300 * time_units, source='Neanderthal', dest='Africa', proportion=1)
856+
857+
ts = msprime.sim_ancestry(
858+
recombination_rate=1e-8,
859+
sequence_length=sequence_length,
860+
samples=[
861+
msprime.SampleSet(1, ploidy=1, population='Africa'),
862+
msprime.SampleSet(1, ploidy=1, population='Eurasia'),
863+
# Neanderthal sample taken 30 kya
864+
msprime.SampleSet(1, ploidy=1, time=30 * time_units, population='Neanderthal'),
865+
],
866+
demography = demography,
867+
record_migrations=True, # Needed for tracking segments.
868+
random_seed=random_seed,
869+
)
870+
return ts
871+
872+
ts = run_simulation(20 * 10**6, 1)
873+
874+
css = ".y-axis .tick .lab {font-size: 85%}" # Multi-line labels unimplemented: use smaller font
875+
css += ".y-axis .tick .grid {stroke: lightgrey}" # Default gridline type
876+
css += ".y-axis .ticks .tick:nth-child(3) .grid {stroke-dasharray: 4}" # 3rd line from bottom
877+
css += ".y-axis .ticks .tick:nth-child(3) .grid {stroke: magenta}" # also 3rd line from bottom
878+
css += ".y-axis .ticks .tick:nth-child(4) .grid {stroke: blue}" # 4th line from bottom
879+
css += ".y-axis .ticks .tick:nth-child(5) .grid {stroke: darkgrey}" # 5th line from bottom
880+
y_ticks = {0: "0", 30: "30", 50: "Introgress", 70: "Eur origin", 300: "Nea origin", 1000: "1000"}
881+
y_ticks = {y * time_units: lab for y, lab in y_ticks.items()}
882+
SVG(ts.draw_svg(
883+
size=(1200, 500),
884+
x_lim=(0, 25_000),
885+
time_scale="log_time",
886+
node_labels = {0: "Afr", 1: "Eur", 2: "Neand"},
887+
y_axis=True,
888+
y_label="Time (kya)",
889+
x_label="Genomic position (bp)",
890+
y_ticks=y_ticks,
891+
y_gridlines=True,
892+
style=css))
893+
```
894+
895+
815896
#### Animation
816897

817898
The classes attached to the SVG also allow elements to be animated. Here's a

0 commit comments

Comments
 (0)