This repository contains tools and scripts for analyzing sequence-dependent DNA methylation patterns using whole-genome bisulfite sequencing (WGBS) data.
This pipeline processes aligned methylation data (BAM) to extract sequence features and correlate them with observed fragment-level methylation patterns.
This pipeline depends on wgbstools (version >= 0.2.0); see the wgbstools project for installation instructions. Installation typically takes a few minutes, plus a few more minutes to download the reference genome FASTA files.
Additional dependencies:
- Command-line tools: moreutils, bedtools (v2.30.0), samtools (1.21), htslib (1.21), tabix (1.13+ds), blat (37x1), zsh 5.8.1 (or convert the scripts to bash)
- R packages: coloc (6.0.0), TwoSampleMR (0.6.29)
- Python packages: numpy (1.25.2), pandas (2.3.3), scipy (1.12.0), scikit-learn (1.1.3), pybedtools (0.9.0)
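Before running the pipeline, it can help to confirm that the command-line dependencies are on your PATH. The following is a minimal sketch, not part of this repository. The executable names in REQUIRED_TOOLS are assumptions (for example, sponge stands in for moreutils), and the check does not verify versions.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
# "sponge" is one of the programs shipped with moreutils.
REQUIRED_TOOLS = ["wgbstools", "bedtools", "samtools", "tabix", "blat", "zsh", "sponge"]


def find_missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```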
The analysis is based on the human DNA methylation atlas.
- Source: Loyfer et al., Nature 2023
- Dataset: The BAM files used in this analysis are obtained from the Loyfer Atlas via the European Genome-phenome Archive (EGA).
- EGA Study Accession: EGAS00001006791
This repository includes publicly available WGBS data for a demo, located in the "atlas_data" folder. To run the demo, follow the instructions below, in particular those in execute_pipeline.txt, on the demo data.
The expected outputs of the demo are the bimodal regions identified in the BAM files and the WGBS reads split by allele at the one SNP specified in sd_asm_analysis/homog/homog_aligned/all_snps/all_gnom_ad_in_bimodal.snps_file.txt.gz.
The expected run time of the demo is 10 minutes. On a large-scale atlas, the expected run time is ~4 hours per WGBS sample.
Executing the full pipeline on many SNPs and many WGBS files grouped by tissue then identifies tissue-specific SD-ASM.
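To illustrate what splitting reads by allele makes possible, here is a toy sketch that groups reads by the allele they carry at a SNP and computes the fraction of methylated CpG calls per allele. It is not the pipeline's implementation. The function name, the input layout, and the pattern encoding ("C" methylated, "T" unmethylated, "." no call) are assumptions made for this example.

```python
from collections import defaultdict


def methylation_by_allele(reads):
    """Compute the methylated fraction of CpG calls for each allele.

    `reads` is an iterable of (allele, pattern) pairs, where `pattern` is a
    string with one character per CpG: "C" (methylated), "T" (unmethylated),
    or "." (no call). This encoding is an assumption of the sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # allele -> [methylated, total calls]
    for allele, pattern in reads:
        for state in pattern:
            if state == "C":
                counts[allele][0] += 1
                counts[allele][1] += 1
            elif state == "T":
                counts[allele][1] += 1
            # Any other character is treated as a missing call and skipped.
    return {allele: m / t for allele, (m, t) in counts.items() if t > 0}


# Hypothetical reads covering one SNP with alleles A and G.
example_reads = [("A", "CCCC"), ("A", "CCTC"), ("G", "TTTT"), ("G", "T.TT")]
print(methylation_by_allele(example_reads))
```

A large difference between the per-allele fractions at the same locus is the kind of signal that allele-specific methylation analyses look for; a real analysis would also need a statistical test and sufficient read depth.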
This demo was tested using zsh 5.8.1 (x86_64-ubuntu-linux-gnu) and Python 3.10.12. For the full Python environment used in testing, see demo/requirements.txt. Cloning the git repository typically takes less than a minute.
All processing steps and command-line instructions are documented in the file execute_pipeline.txt. To reproduce the analysis or run it on new data, please follow the sequence of commands provided in that file.
- Clone the Repository
  - Use git clone https://github.com/yonniejon/sequence_dependent_methylation_analysis.git to download the project.
- Data Preparation
  - Download the required BAM files from the EGA study EGAS00001006791.
  - Ensure they are placed in the expected directory structure as defined in the scripts.
- Run the Pipeline
  - Open the file execute_pipeline.txt.
  - Execute the steps sequentially. This includes:
    - Pre-processing and filtering of BAM files.
    - Extracting methylation states at specific CpG sites.
    - Calculating sequence-dependent features.
    - Downstream statistical analysis.
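As a conceptual illustration of the step that extracts methylation states at CpG sites, the sketch below reads the state of each CpG from a single bisulfite-converted read. In bisulfite sequencing, an unmethylated cytosine is read as T while a methylated cytosine remains C. This is a simplified toy, not the code used by the pipeline or by wgbstools: the function name is hypothetical, and it assumes a forward-strand read aligned without insertions or deletions.

```python
def cpg_states(reference, read, read_start=0):
    """Return {reference_position: state} for each CpG covered by `read`.

    `state` is "C" (methylated), "T" (unmethylated), or "." (any other base).
    Assumes the read maps to the forward strand starting at `read_start`
    (0-based) with no indels; these are simplifications of the sketch.
    """
    states = {}
    for pos in range(len(reference) - 1):
        if reference[pos:pos + 2].upper() != "CG":
            continue  # Not a CpG site in the reference.
        offset = pos - read_start
        if not 0 <= offset < len(read):
            continue  # The read does not cover this CpG.
        base = read[offset].upper()
        states[pos] = base if base in ("C", "T") else "."
    return states


# Toy reference with CpGs at positions 1, 5, and 9.
reference = "ACGTTCGGACGA"
# The read covers positions 0-8: the first CpG is methylated (C),
# the second is unmethylated (converted to T).
print(cpg_states(reference, "ACGTTTGGA"))
```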
If you use this code or the provided data, please cite:
Rosenski, J., Sabag, O., et al. The genetic basis for DNA methylation variation across tissues and development. (2025). https://doi.org/10.1101/2025.09.15.675351