miRNA Experiment Search

A Python pipeline to search public NCBI metadata sources for experiments involving selected miRNAs, cell lines, species, assay types, perturbation types, and control conditions.

The tool currently searches GEO and SRA through the NCBI Entrez E-utilities API and produces clean candidate tables plus summary statistics.

Overview

This repository is designed for exploratory discovery of public experiments related to miRNAs. It supports searches such as:

miRNA perturbation RNA-seq experiments
AGO/CLIP/CLASH studies for miRNA-target interactions
qPCR or reporter-assay validation studies
experiments in specific cell lines
studies restricted to a species
studies with control-like metadata

The pipeline uses user-friendly canonical filter names, such as RNA-seq, CLIP, CLASH, qPCR, overexpression, or knockdown, and expands them internally into broader keyword dictionaries.

Key Features

Search by miRNA list, cell-line list, or both
Search GEO and/or SRA
Filter by species, for example Homo sapiens
Filter by experiment type:
- RNA-seq
- small_RNA-seq
- CLIP
- CLASH
- qPCR
- microarray
- proteomics
- reporter_assay
Filter by perturbation type:
- overexpression
- knockdown
- knockout
- perturbation_any
Optional control keyword requirement
Optional date filtering
Built-in exclusion of precursor/primary miRNA-focused studies (controlled by exclude_precursor_terms in the config)
Generates:
- candidate experiment table
- summary statistics
- summary by miRNA
- summary by accession

Repository Structure

mirna-experiment-search/
├── config/
│   └── search.yaml
├── data/
│   ├── mirnas.txt
│   └── cell_lines.txt
├── results/
├── src/
│   ├── annotate.py
│   ├── entrez.py
│   ├── io_utils.py
│   ├── query_builder.py
│   ├── summarize.py
│   └── vocab.py
├── run_search.py
├── pyproject.toml
├── uv.lock
└── README.md

Installation

This project uses uv for environment and dependency management.

From inside the cloned repository:

uv sync

To check that the environment works:

uv run python --version

Input Files

miRNA list

Provide a plain text file with one miRNA per line.

Example: data/mirnas.txt

hsa-miR-22-3p
hsa-miR-192-5p
hsa-miR-200c-3p

Cell-line list

Provide a plain text file with one cell line per line.

Example: data/cell_lines.txt

HEK293T
HEK293
HeLa

Both files are optional. The pipeline can search by miRNA only, cell line only, experiment type only, or any combination.
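
As a rough illustration of how such list files could be read, here is a minimal sketch. The function name `read_list_file`, the de-duplication, and the handling of `#` comment lines are assumptions for illustration only, not the behaviour of `src/io_utils.py`.

```python
from pathlib import Path
from typing import List, Optional


def read_list_file(path: Optional[str]) -> List[str]:
    """Return the non-empty, de-duplicated entries of a one-item-per-line file.

    Passing None returns an empty list, mirroring the fact that both
    input files are optional.
    """
    if path is None:
        return []
    seen = set()
    items: List[str] = []
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        name = raw.strip()
        # Skip blank lines and '#' comments (comment support is an assumption).
        if not name or name.startswith("#"):
            continue
        if name not in seen:
            seen.add(name)
            items.append(name)
    return items
```

With this sketch, a missing miRNA or cell-line file simply contributes no terms to the query rather than raising an error.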

Configuration

The main configuration file is:

config/search.yaml

Example:

mirnas_file: "data/mirnas.txt"
cell_lines_file: "data/cell_lines.txt"

sources:
  - GEO
  - SRA

species: "Homo sapiens"

experiment_types:
  - RNA-seq

perturbation_types: []

require_control: false

date_from: null
date_to: null

retmax: 20
sleep_seconds: 0.34

exclude_precursor_terms: true

outdir: "results"

The config file contains only high-level user choices. Synonyms and related keywords are defined internally in src/vocab.py.

For example:

experiment_types:
  - RNA-seq

is expanded internally to terms such as:

RNA-seq
RNA seq
RNA sequencing
mRNA-seq
transcriptome
transcriptomic
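
The expansion described above can be sketched as a dictionary lookup followed by an OR join. The names `EXPERIMENT_TYPE_TERMS`, `expand_terms`, and `or_clause` are hypothetical; the actual dictionaries in `src/vocab.py` may be structured differently and contain more terms.

```python
from typing import Dict, Iterable, List

# Hypothetical excerpt built from the terms listed in this README.
EXPERIMENT_TYPE_TERMS: Dict[str, List[str]] = {
    "RNA-seq": [
        "RNA-seq",
        "RNA seq",
        "RNA sequencing",
        "mRNA-seq",
        "transcriptome",
        "transcriptomic",
    ],
}


def expand_terms(canonical_names: Iterable[str], vocab: Dict[str, List[str]]) -> List[str]:
    """Expand canonical filter names into their synonym lists, keeping order."""
    terms: List[str] = []
    for name in canonical_names:
        # Names missing from the vocabulary fall back to themselves.
        for term in vocab.get(name, [name]):
            if term not in terms:
                terms.append(term)
    return terms


def or_clause(terms: List[str]) -> str:
    """Join terms into a parenthesised OR clause with quoted phrases."""
    if not terms:
        return ""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"
```

Keeping the synonyms in one place means the config can stay at the level of `RNA-seq` while the query still matches records that only say "transcriptome".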

Running the Pipeline

Run with the default config:

uv run python run_search.py --config config/search.yaml

Run with command-line overrides:

uv run python run_search.py \
  --config config/search.yaml \
  --sources GEO \
  --species "Homo sapiens" \
  --experiment-types RNA-seq CLIP \
  --perturbation-types overexpression knockdown \
  --retmax 10 \
  --outdir results/test_run

On Windows PowerShell:

uv run python run_search.py `
  --config config/search.yaml `
  --sources GEO `
  --species "Homo sapiens" `
  --experiment-types RNA-seq CLIP `
  --perturbation-types overexpression knockdown `
  --retmax 10 `
  --outdir results/test_run
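
One plausible way for command-line overrides like these to be layered on top of the YAML config is shown below. This is a sketch, not the code in `run_search.py`: the flag names are taken from the examples above, while `build_parser`, `apply_overrides`, and the rule "any flag that is given replaces the config value" are assumptions.

```python
import argparse
from typing import Any, Dict


def build_parser() -> argparse.ArgumentParser:
    """Declare a subset of the flags shown in this README; None means 'not given'."""
    parser = argparse.ArgumentParser(description="miRNA experiment search (sketch)")
    parser.add_argument("--config", default="config/search.yaml")
    parser.add_argument("--sources", nargs="+", default=None)
    parser.add_argument("--species", default=None)
    parser.add_argument("--experiment-types", nargs="+", default=None)
    parser.add_argument("--perturbation-types", nargs="+", default=None)
    parser.add_argument("--retmax", type=int, default=None)
    parser.add_argument("--outdir", default=None)
    return parser


def apply_overrides(config: Dict[str, Any], args: argparse.Namespace) -> Dict[str, Any]:
    """Return a copy of the config where every supplied flag wins."""
    merged = dict(config)
    for key, value in vars(args).items():
        # argparse maps --experiment-types to the attribute experiment_types,
        # which matches the key used in the YAML example.
        if key != "config" and value is not None:
            merged[key] = value
    return merged
```

Using `None` as the default for every override makes it possible to tell "flag not given" apart from an explicit value, so unspecified options keep whatever the config file says.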

Example Searches

Search human RNA-seq studies for selected miRNAs

uv run python run_search.py \
  --mirnas-file data/mirnas.txt \
  --species "Homo sapiens" \
  --experiment-types RNA-seq

Search HEK-related CLIP and CLASH experiments

uv run python run_search.py \
  --cell-lines-file data/cell_lines.txt \
  --species "Homo sapiens" \
  --experiment-types CLIP CLASH

Search miRNA overexpression experiments

uv run python run_search.py \
  --mirnas-file data/mirnas.txt \
  --species "Homo sapiens" \
  --perturbation-types overexpression

Require control-like metadata

uv run python run_search.py \
  --mirnas-file data/mirnas.txt \
  --species "Homo sapiens" \
  --experiment-types RNA-seq \
  --require-control

Search with date bounds

uv run python run_search.py \
  --mirnas-file data/mirnas.txt \
  --species "Homo sapiens" \
  --experiment-types RNA-seq \
  --date-from 2018 \
  --date-to 2026

Outputs

The pipeline writes output files to the configured output directory.

Default:

results/

`candidate_experiments.tsv`

Main output table with one row per candidate record.

Typical columns include:

| Column | Description |
| --- | --- |
| `query_mirna` | miRNA used in the query |
| `query_cell_line` | cell line used in the query |
| `source` | GEO or SRA |
| `uid` | Entrez UID |
| `accession` | GEO/SRA accession when available |
| `summary` | metadata summary, if available |
| `species_or_organism` | organism metadata |
| `matched_experiment_types` | inferred experiment categories |
| `matched_perturbation_types` | inferred perturbation categories |
| `has_perturbation_keyword` | whether perturbation terms were found |
| `has_overexpression_keyword` | whether overexpression terms were found |
| `has_knockdown_keyword` | whether knockdown terms were found |
| `has_knockout_keyword` | whether knockout terms were found |
| `has_control_keyword` | whether control-like terms were found |
| `inferred_cell_lines` | cell lines inferred from metadata |
| `url` | NCBI URL |
| `title` | record title |

The internal Entrez query is not included in the final candidate table.
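
To make the keyword-derived columns more concrete, the sketch below shows one way the `has_*_keyword` flags could be computed from a record's title and summary. The keyword lists, the `;` separator, and the function names are illustrative assumptions; the real logic lives in `src/annotate.py` and `src/vocab.py` and may differ.

```python
import re
from typing import Dict, List, Union

# Hypothetical keyword lists, much shorter than a real vocabulary would be.
PERTURBATION_TERMS: Dict[str, List[str]] = {
    "overexpression": ["overexpression", "overexpress", "mimic"],
    "knockdown": ["knockdown", "inhibitor", "antagomir"],
    "knockout": ["knockout", "CRISPR"],
}
CONTROL_TERMS: List[str] = ["control", "scramble", "mock"]


def contains_any(text: str, terms: List[str]) -> bool:
    """Case-insensitive match of any term at the start of a word."""
    lowered = text.lower()
    return any(re.search(r"\b" + re.escape(t.lower()), lowered) for t in terms)


def annotate(title: str, summary: str) -> Dict[str, Union[str, bool]]:
    """Derive perturbation and control flags from free-text metadata."""
    text = f"{title} {summary}"
    matched = [name for name, terms in PERTURBATION_TERMS.items() if contains_any(text, terms)]
    return {
        "matched_perturbation_types": ";".join(matched),
        "has_perturbation_keyword": bool(matched),
        "has_overexpression_keyword": "overexpression" in matched,
        "has_knockdown_keyword": "knockdown" in matched,
        "has_knockout_keyword": "knockout" in matched,
        "has_control_keyword": contains_any(text, CONTROL_TERMS),
    }
```

Because the flags come from plain keyword matching, a record that mentions "control" in an unrelated sense would still be flagged, which is why these columns are best treated as hints for triage.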

`summary_statistics.tsv`

Compact count table with overall statistics, such as:

total candidate records
unique accessions
records by source
records with controls
records by experiment type
records by perturbation type
records by inferred cell line

`summary_by_mirna.tsv`

One row per queried miRNA with:

number of records
unique accessions
GEO/SRA counts
control counts
inferred cell lines
matched experiment types
matched perturbation types

`summary_by_accession.tsv`

One row per accession with:

source
accession
number of matched rows
matched miRNAs
matched cell lines
inferred experiment and perturbation types
title
URL
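
A per-accession summary of this kind amounts to grouping candidate rows by source and accession. The following is a minimal standard-library sketch; the output field names `n_rows` and `matched_mirnas`, the `;` separator, and the function name are assumptions rather than the exact columns written by `src/summarize.py`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def summarize_by_accession(rows: Iterable[Dict[str, str]]) -> List[Dict[str, object]]:
    """Collapse candidate rows into one summary row per (source, accession)."""
    groups: Dict[Tuple[str, str], List[Dict[str, str]]] = defaultdict(list)
    for row in rows:
        groups[(row["source"], row["accession"])].append(row)

    summary: List[Dict[str, object]] = []
    for (source, accession), members in sorted(groups.items()):
        # Collect the distinct miRNAs whose queries hit this accession.
        mirnas = sorted({r["query_mirna"] for r in members if r.get("query_mirna")})
        summary.append(
            {
                "source": source,
                "accession": accession,
                "n_rows": len(members),
                "matched_mirnas": ";".join(mirnas),
                "title": members[0].get("title", ""),
            }
        )
    return summary
```

Accessions matched by several queried miRNAs end up with a higher row count, which can be a useful first signal when deciding which studies to inspect manually.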

Entrez Search Implementation

Searches are performed using the NCBI Entrez Programming Utilities, specifically:

esearch to retrieve matching record UIDs
esummary to retrieve metadata for those UIDs

For each combination of miRNA, cell line, source, and filters, the pipeline builds an Entrez query using:

miRNA aliases
optional cell-line terms
species terms
experiment-type synonyms
perturbation-type synonyms
optional control terms
optional date range
optional precursor/primary miRNA exclusion terms

The query is submitted to either:

GEO via Entrez database gds
SRA via Entrez database sra

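A short delay (sleep_seconds in the config, 0.34 s in the example above) is applied between requests so that the pipeline stays within NCBI's limit of three requests per second for clients without an API key.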
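
The query construction described in this section can be sketched as follows. This is an illustration under stated assumptions, not the contents of `src/query_builder.py` or `src/entrez.py`: the helper names, the exact clause order, the `[Organism]` field tag, and the use of publication date (`datetype=pdat`) for the date range are all choices made for the example. Only the E-utilities endpoint and its `db`, `term`, `retmax`, `retmode`, `datetype`, `mindate`, and `maxdate` parameters are standard.

```python
from typing import Dict, List, Optional, Union
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
SOURCE_TO_DB = {"GEO": "gds", "SRA": "sra"}


def or_group(terms: List[str]) -> str:
    """Quote each term and join the group with OR."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_term(
    mirna_aliases: List[str],
    species: Optional[str] = None,
    experiment_terms: Optional[List[str]] = None,
    cell_line_terms: Optional[List[str]] = None,
) -> str:
    """AND together the OR groups that are actually provided."""
    clauses = [or_group(mirna_aliases)]
    if cell_line_terms:
        clauses.append(or_group(cell_line_terms))
    if experiment_terms:
        clauses.append(or_group(experiment_terms))
    if species:
        clauses.append(f'"{species}"[Organism]')
    return " AND ".join(clauses)


def esearch_url(
    source: str,
    term: str,
    retmax: int = 20,
    date_from: Optional[str] = None,
    date_to: Optional[str] = None,
) -> str:
    """Build an esearch URL; the caller would fetch it and then pause briefly."""
    params: Dict[str, Union[str, int]] = {
        "db": SOURCE_TO_DB[source],
        "term": term,
        "retmax": retmax,
        "retmode": "json",
    }
    if date_from or date_to:
        # E-utilities expects mindate and maxdate together, so fill in a wide
        # bound for whichever side is missing (the bounds are an assumption).
        params.update(
            {"datetype": "pdat", "mindate": date_from or "1900", "maxdate": date_to or "3000"}
        )
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"
```

The UIDs returned by such an esearch request would then be passed to `esummary.fcgi` with the same `db` value to retrieve titles, summaries, and organism metadata.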

Built-in Vocabulary

The controlled vocabulary is defined in:

src/vocab.py

This includes dictionaries for:

experiment types
perturbation types
control terms
species terms
precursor-exclusion terms
common cell-line aliases

This design keeps the user config simple while allowing the code to expand high-level categories into comprehensive search terms.

Limitations

This tool searches metadata, not raw experimental data. Therefore:

GEO/SRA metadata can be incomplete or noisy
cell-line inference is keyword-based and approximate
control detection is keyword-based and approximate
a candidate hit does not guarantee the experiment is directly usable
manual curation is recommended before downstream biological analysis

Recommended Workflow

Prepare mirnas.txt and/or cell_lines.txt
Select filters in config/search.yaml
Run the pipeline
Inspect candidate_experiments.tsv
Use summary_statistics.tsv and summary_by_mirna.tsv for triage
Manually verify top candidate studies in GEO/SRA

Development

Run the pipeline:

uv run python run_search.py --config config/search.yaml
