GCEH is a lightweight analysis tool for exploring codon usage, GC/GC3 composition, and the overall structural features of a protein-coding nucleotide sequence ('complexity'). It provides descriptive statistics and visualizations, optionally comparing the input against a reference codon table.
-
Parses a DNA or RNA coding sequence (case-insensitive; accepts A/C/G/T/U).
-
Removes stop codons and truncates the sequence at the first encountered stop.
-
Generates a sorted codon table listing:
- codon triplet
- encoded amino acid
- relative frequency within the sequence
- Produces a stacked bar plot of codon usage per amino acid.
- Highlights the predominant codon for each amino acid using labels.
- Label size is adjustable; set to
0to hide labels entirely.
- The "k-mer k" field sets the k-mer size used for generating the raw complexity profile.
- "Smooth" parameter applies an adjustable sliding window to smooth the resulting curve.
- The GC-balance α slider lets you control how strongly GC content influences the complexity score (from no GC contribution at zero to GC-weighted behavior).
Note: The complexity score is a custom composite heuristic combining Shannon entropy, k-mer richness, GC balance, homopolymer run length, and periodicity penalty. It is not derived from or validated against any published sequence complexity measure (e.g., Lempel-Ziv, Wootton-Federhen, DUST). It is intended as a descriptive tool for exploring sequence composition, particularly in the context of synthetic construct design where balanced base composition is desirable.
_normalize_sequence(seq): uppercases, converts U→T, and replaces non-ACGT with N. _shannon_entropy_from_counts(counts): mono-nucleotide Shannon entropy (bits) over A/C/G/T. _gc_balance(counts): GC balance score peaking at 50% GC, 0 at extremes. _longest_run(seq): length of the longest homopolymer run (A/C/G/T only). _kmer_richness(seq, k): for each window, count how many distinct k-length words appear (ignoring any with N) and divide by the maximum possible count, min(4^k, number of valid k-mers). The score ranges from 0 (no diversity) to 1 (all possible k-mers observed). _periodicity_penalty(seq, max_shift): scan fixed offsets (shifts 1…max_shift), compute the fraction of positions where seq[i] equals seq[i+shift] (ignoring N/non-ACGT), and take the highest fraction as the penalty. Scores range 0–1; higher means more single-base periodicity/repeat signal. window_metrics(seq, k, max_n_frac, gcbal_alpha): per-window composite complexity score and components (entropy, GC balance, k-mer richness, homopolymer penalty, periodicity penalty, N fraction). compute_complexity_track(sequence, window_size, step, k, max_n_frac, gcbal_alpha, return_components, smooth): slides across a sequence to produce arrays of window start/end/mid coordinates and complexity scores (optionally smoothed) plus component tracks for plotting or downstream analysis.
- Computes GC and GC3 content as a sliding-window moving average.
- Window size is user-configurable.
- Cumulative GC3 plotting to assess compositional uniformity.
- Compare your sequence against any consensus codon-usage table (included: H. sapiens, E. coli).
- Highlights dominant codons in CDS.
Click to launch an executable notebook environment:
git clone https://github.com/prion-1/GCEH.git
cd GCEH
pip install -r requirements.txtThen run the Jupyter notebook:
jupyter notebook gceh.ipynbor use the complexity_analysis.py module directly in your own scripts.
- A valid coding sequence (DNA or RNA), containing letters A, C, G, T, U.
- Mixed case accepted; whitespace and newlines ignored.
- Sequence is automatically truncated at the first in-frame stop codon.
GCEH/
├── gceh.ipynb # Main interactive notebook
├── complexity_analysis.py # GC/GC3 sliding-window analysis module
├── requirements.txt # Python dependencies
├── README.md # This file
└── LICENSE # GPL-3.0 license
This project is released under the GPL-3.0 License.
See LICENSE for details.
Tip
Pasta.