prihoda/usum

USUM: Plotting sequence similarity embeddings using USEARCH & UMAP

USUM uses USEARCH and UMAP (or t-SNE) to plot DNA 🧬 and protein 🧶 sequence similarity embeddings.

Installation

  1. Install the USEARCH dependency manually: https://drive5.com/usearch/download.html
    (consider supporting the author by buying the 64-bit license)

  2. Install usum using pip:

pip install usum

Usage

Use usum to plot input protein or DNA sequences in FASTA format.

Show all available options using usum --help

Minimal example

usum example.fa --maxdist 0.2 --termdist 0.3 --output example

Multiple input files with labels

usum first.fa second.fa --labels First Second --maxdist 0.2 --termdist 0.3 --output example

This will produce a PNG plot:

[Image: UMAP static example]

An interactive Bokeh HTML plot is also created:

[Image: UMAP Bokeh example]
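When running usum over several file sets from a script, it can help to assemble the command line in one place. The sketch below is only an illustration: `build_usum_command` is a hypothetical helper, not part of usum, and it uses only the flags shown above. It assumes one label per input file, in the same order.

```python
from typing import List, Sequence


def build_usum_command(
    inputs: Sequence[str],
    labels: Sequence[str],
    output: str,
    maxdist: float = 0.2,
    termdist: float = 0.3,
) -> List[str]:
    """Assemble an argument list for the usum CLI (hypothetical helper)."""
    # Assumption: each input file gets exactly one label, in the same order.
    if len(inputs) != len(labels):
        raise ValueError("expected one label per input file")
    return [
        "usum",
        *inputs,
        "--labels", *labels,
        "--maxdist", str(maxdist),
        "--termdist", str(termdist),
        "--output", output,
    ]


cmd = build_usum_command(["first.fa", "second.fa"], ["First", "Second"], "example")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once usum and USEARCH are installed.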

Using t-SNE instead of UMAP

You can also produce a t-SNE plot using the --tsne flag.

usum first.fa second.fa --labels First Second --maxdist 0.2 --termdist 0.3 --tsne --output example

This will produce a PNG plot:

[Image: UMAP static example]

Plotting random subset

You can use --limit to extract and plot a random subset of the input sequences.

# Plot 10k sequences from each input file
usum first.fa second.fa --labels First Second --limit 10000 --maxdist 0.2 --termdist 0.3 --output example

You can control randomness and reproducibility using the --seed option.
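To illustrate what a seeded random subset means, here is a minimal sketch of reproducible subsampling of FASTA records. It is not usum's actual implementation; `read_fasta` and `sample_records` are hypothetical helpers written for this example, and the exact subset usum picks for a given seed may differ.

```python
import random
from typing import List, Optional, Tuple

Record = Tuple[str, str]  # (header, sequence)


def read_fasta(text: str) -> List[Record]:
    """Parse FASTA-formatted text into (header, sequence) pairs."""
    records: List[Record] = []
    header: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sample_records(records: List[Record], limit: int, seed: Optional[int] = None) -> List[Record]:
    """Return up to `limit` records chosen at random without replacement."""
    if limit >= len(records):
        return list(records)
    # A dedicated generator seeded here makes the choice repeatable.
    rng = random.Random(seed)
    return rng.sample(records, limit)


fasta_text = ">s1\nACGT\n>s2\nACGA\n>s3\nTTGT\n>s4\nGGGT\n>s5\nACCC\n"
records = read_fasta(fasta_text)
subset = sample_records(records, limit=3, seed=42)
print([header for header, _ in subset])
```

Calling `sample_records` again with the same seed returns the same subset, which is the property that makes a seeded run reproducible.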

Plotting options

See usum --help for all plotting options.

See UMAP API Guide for more info about the UMAP options.

  • Use --limit to plot a random subset of records
  • Use --width and --height to control plot size in pixels
  • Use --resume to reuse the previous distance matrix from the output folder
  • Use --tsne to produce a t-SNE embedding instead of UMAP (you can use this with --resume)
  • Use --umap-spread to control how close together the embedded points are in the UMAP embedding
  • Use --umap-min-dist to control the minimum distance between points in the UMAP embedding
  • Use --neighbors to control the number of neighbors in the UMAP graph

Reusing previous results

When changing just the plot options, you can use --resume to reuse previous results from the output folder.

Warning: This will reuse the previous distance matrix, so changes to limits or USEARCH args won't take effect.

# Reuse results from the "example" output directory
usum --resume --output example --width 600 --height 600 --theme fire

Programmatic use

from usum import usum

# Show help
help(usum)

# Run USUM
usum(inputs=['input.fa'], output='usum', maxdist=0.2, termdist=0.3)

How it works

  • A sparse distance matrix is calculated using the USEARCH calc_distmx command.
  • The distances are based on % identity, so the method is agnostic to sequence type (DNA or protein).
  • The distance matrix is embedded as a precomputed metric using UMAP.
  • The embedding is plotted using umap.plot.
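The first two steps above can be sketched in plain Python. This is a simplified illustration, not what USEARCH does internally: USEARCH derives identity from sequence alignments, while this sketch assumes the sequences are already aligned and of equal length. The function names are hypothetical, and `maxdist` here simply drops pairs whose distance exceeds the cutoff, which is what keeps the matrix sparse.

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple


def identity_distance(a: str, b: str) -> float:
    """Distance = 1 - fraction of identical positions (equal-length input only)."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 1.0 - matches / len(a)


def sparse_distances(seqs: Sequence[str], maxdist: float) -> Dict[Tuple[int, int], float]:
    """Keep only pairs (i, j), i < j, whose distance is at most maxdist."""
    kept: Dict[Tuple[int, int], float] = {}
    for (i, a), (j, b) in combinations(enumerate(seqs), 2):
        d = identity_distance(a, b)
        if d <= maxdist:
            kept[(i, j)] = d
    return kept


seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]
pairs = sparse_distances(seqs, maxdist=0.2)
for (i, j), d in pairs.items():
    print(i, j, round(d, 3))
```

Because the comparison works character by character, the same code applies to DNA and protein strings alike. A matrix built from such pairs is the kind of input that umap-learn accepts through `metric="precomputed"`.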