From a13ddca7d89aa7969cd5ddde41738d760795234f Mon Sep 17 00:00:00 2001 From: Eric South Date: Tue, 20 Jan 2026 09:43:31 -0500 Subject: [PATCH 01/40] docs(cruncher): document FIMO-like scoring --- src/dnadesign/cruncher/README.md | 8 ++++++++ src/dnadesign/cruncher/docs/demos/demo_basics_two_tf.md | 7 +++++++ .../cruncher/docs/demos/demo_campaigns_multi_tf.md | 5 +++++ src/dnadesign/cruncher/docs/reference/config.md | 4 ++++ 4 files changed, 24 insertions(+) diff --git a/src/dnadesign/cruncher/README.md b/src/dnadesign/cruncher/README.md index 7d8fe983..677281bd 100644 --- a/src/dnadesign/cruncher/README.md +++ b/src/dnadesign/cruncher/README.md @@ -20,6 +20,14 @@ A typical workflow looks like: 3. Generate synthetic sequences (e.g., via [MCMC](https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo)) using the locked motifs. 4. Analyze / visualize / report from run artifacts. +Scoring is **FIMO-like**: cruncher builds log-odds PWMs against a 0‑order +background, scans each candidate sequence to find the best window per TF +(optionally bidirectional), and can scale that best hit to a p‑value using a +DP‑derived null distribution (`score_scale: logp`). For `logp`, the tail +probability for the best window is converted to a sequence‑level p via +`p_seq = 1 − (1 − p_win)^n_windows`. This is an internal implementation; cruncher +does not call the FIMO binary. + --- ### Quickstart (happy path) diff --git a/src/dnadesign/cruncher/docs/demos/demo_basics_two_tf.md b/src/dnadesign/cruncher/docs/demos/demo_basics_two_tf.md index bcf46b99..7aabeb4f 100644 --- a/src/dnadesign/cruncher/docs/demos/demo_basics_two_tf.md +++ b/src/dnadesign/cruncher/docs/demos/demo_basics_two_tf.md @@ -4,6 +4,13 @@ **cruncher** scores each TF by the best PWM match anywhere in the candidate sequence on either strand, then optimizes the min/soft‑min across TFs so the weakest TF improves. 
It explores sequence space with Gibbs + parallel tempering (MCMC) and returns a diverse elite set (unique up to reverse‑complement) plus diagnostics for stability/mixing. Motif overlap is allowed and treated as informative structure in analysis. +Scoring is **FIMO-like** (internal implementation): for each PWM, cruncher builds +log‑odds scores against a 0‑order background, scans all windows to find the best +hit (optionally bidirectional), and optionally converts that best hit to a +p‑value via a DP‑derived null distribution (`score_scale: logp`). For `logp`, +the tail probability for the best window becomes a sequence‑level p via +`p_seq = 1 − (1 − p_win)^n_windows`. + **Terminology:** - **sites** = training binding sequences diff --git a/src/dnadesign/cruncher/docs/demos/demo_campaigns_multi_tf.md b/src/dnadesign/cruncher/docs/demos/demo_campaigns_multi_tf.md index 3e8ea4f8..c6118f85 100644 --- a/src/dnadesign/cruncher/docs/demos/demo_campaigns_multi_tf.md +++ b/src/dnadesign/cruncher/docs/demos/demo_campaigns_multi_tf.md @@ -4,6 +4,11 @@ This demo walks through a process of running category-based sequence optimization campaigns, with a focus on campaign selection (site counts + PWM quality), derived configs, and multi-TF runs. +Scoring is **FIMO-like** (internal implementation): cruncher uses PWM log‑odds +scanning against a 0‑order background, takes the best window per TF (optionally +both strands), and can convert that best hit to a p‑value via a DP‑derived null +distribution (`score_scale: logp`, with `p_seq = 1 − (1 − p_win)^n_windows`). 
+ ### Demo instance - **Workspace**: `src/dnadesign/cruncher/workspaces/demo_campaigns_multi_tf/` diff --git a/src/dnadesign/cruncher/docs/reference/config.md b/src/dnadesign/cruncher/docs/reference/config.md index 27270e80..6022cff0 100644 --- a/src/dnadesign/cruncher/docs/reference/config.md +++ b/src/dnadesign/cruncher/docs/reference/config.md @@ -385,6 +385,10 @@ Notes: - `objective.bidirectional=true` scores both strands (reverse complement) when scanning PWMs. - `objective.combine` controls how per-TF scores are combined (`min` for weakest-TF optimization, `sum` for sum-based). - `objective.allow_unscaled_llr=true` allows `score_scale=llr` in multi-TF runs (otherwise validation fails). +- `objective.score_scale=logp` is FIMO‑like: it uses a DP‑derived null + distribution under a 0‑order background to compute a tail p‑value for the + best window, then converts to a sequence‑level p via + `p_seq = 1 − (1 − p_win)^n_windows` before reporting `−log10(p_seq)`. - `elites.min_hamming` is the Hamming-distance filter for elites (0 disables). If `output.trim.enabled=true` yields variable lengths, the distance is computed over the shared prefix plus the length difference. - `elites.k` controls how many sequences are retained before diversity filtering (0 = keep all). - `elites.dsDNA_canonicalize=true` treats reverse complements as identical when computing unique fractions and (optionally) stores `canonical_sequence` in elites. 
From 4cad6062be5ec21cb591b57ab9bcad91bf7d6f8a Mon Sep 17 00:00:00 2001 From: Eric South Date: Tue, 20 Jan 2026 09:43:44 -0500 Subject: [PATCH 02/40] build: add pixi task aliases for dense and cruncher --- pixi.toml | 1 + 1 file changed, 1 insertion(+) diff --git a/pixi.toml b/pixi.toml index af81839d..7b4672b4 100644 --- a/pixi.toml +++ b/pixi.toml @@ -6,6 +6,7 @@ platforms = ["osx-arm64", "osx-64", "linux-64"] [tasks] cruncher = "uv run cruncher" +dense = "uv run dense" [dependencies] meme = "*" From 71ea141680e921ea096f56ee929bbbd16d090f13 Mon Sep 17 00:00:00 2001 From: Eric South Date: Tue, 20 Jan 2026 09:43:55 -0500 Subject: [PATCH 03/40] densegen: improve FIMO sampling UX and audit metadata --- .../densegen/src/adapters/outputs/parquet.py | 13 + .../densegen/src/adapters/sources/base.py | 2 +- .../src/adapters/sources/binding_sites.py | 2 +- .../src/adapters/sources/pwm_artifact.py | 53 +- .../src/adapters/sources/pwm_artifact_set.py | 42 +- .../densegen/src/adapters/sources/pwm_fimo.py | 178 +++++++ .../src/adapters/sources/pwm_jaspar.py | 42 +- .../src/adapters/sources/pwm_matrix_csv.py | 49 +- .../densegen/src/adapters/sources/pwm_meme.py | 42 +- .../src/adapters/sources/pwm_meme_set.py | 42 +- .../src/adapters/sources/pwm_sampling.py | 464 ++++++++++++++++-- .../src/adapters/sources/sequence_library.py | 2 +- .../src/adapters/sources/usr_sequences.py | 2 +- src/dnadesign/densegen/src/cli.py | 34 +- src/dnadesign/densegen/src/config/__init__.py | 77 ++- src/dnadesign/densegen/src/core/metadata.py | 8 + .../densegen/src/core/metadata_schema.py | 26 + src/dnadesign/densegen/src/core/pipeline.py | 124 ++++- .../densegen/src/core/pvalue_bins.py | 32 ++ .../densegen/src/integrations/__init__.py | 3 + .../densegen/src/integrations/meme_suite.py | 41 ++ .../tests/test_cli_summarize_library.py | 8 + .../densegen/tests/test_outputs_parquet.py | 8 + .../tests/test_pipeline_library_index.py | 15 + .../densegen/tests/test_pwm_fimo_utils.py | 93 ++++ 
.../densegen/tests/test_pwm_sampling_bins.py | 74 +++ 26 files changed, 1377 insertions(+), 99 deletions(-) create mode 100644 src/dnadesign/densegen/src/adapters/sources/pwm_fimo.py create mode 100644 src/dnadesign/densegen/src/core/pvalue_bins.py create mode 100644 src/dnadesign/densegen/src/integrations/__init__.py create mode 100644 src/dnadesign/densegen/src/integrations/meme_suite.py create mode 100644 src/dnadesign/densegen/tests/test_pipeline_library_index.py create mode 100644 src/dnadesign/densegen/tests/test_pwm_fimo_utils.py create mode 100644 src/dnadesign/densegen/tests/test_pwm_sampling_bins.py diff --git a/src/dnadesign/densegen/src/adapters/outputs/parquet.py b/src/dnadesign/densegen/src/adapters/outputs/parquet.py index 35f533f2..a160098f 100644 --- a/src/dnadesign/densegen/src/adapters/outputs/parquet.py +++ b/src/dnadesign/densegen/src/adapters/outputs/parquet.py @@ -32,6 +32,12 @@ def _meta_arrow_type(name: str, pa): "input_pwm_ids", "required_regulators", } + list_float = { + "input_pwm_pvalue_bins", + } + list_int = { + "input_pwm_pvalue_bin_ids", + } int_fields = { "length", "random_seed", @@ -61,6 +67,7 @@ def _meta_arrow_type(name: str, pa): "compression_ratio", "input_pwm_score_threshold", "input_pwm_score_percentile", + "input_pwm_pvalue_threshold", "sampling_fraction", "sampling_fraction_pairs", "gap_fill_gc_min", @@ -79,10 +86,16 @@ def _meta_arrow_type(name: str, pa): "sampling_relaxed_cap", "gap_fill_used", "gap_fill_relaxed", + "input_pwm_keep_all_candidates_debug", + "input_pwm_include_matched_sequence", } if name in list_str: return pa.list_(pa.string()) + if name in list_float: + return pa.list_(pa.float64()) + if name in list_int: + return pa.list_(pa.int64()) if name == "used_tfbs_detail": return pa.list_( pa.struct( diff --git a/src/dnadesign/densegen/src/adapters/sources/base.py b/src/dnadesign/densegen/src/adapters/sources/base.py index d1f9758c..2d645291 100644 --- a/src/dnadesign/densegen/src/adapters/sources/base.py +++ 
b/src/dnadesign/densegen/src/adapters/sources/base.py @@ -39,7 +39,7 @@ def infer_format(path: Path) -> str | None: class BaseDataSource(abc.ABC): @abc.abstractmethod - def load_data(self, *, rng=None) -> Tuple[List, Optional[pd.DataFrame]]: + def load_data(self, *, rng=None, outputs_root: Path | None = None) -> Tuple[List, Optional[pd.DataFrame]]: """ Returns: (data_entries, meta_df) diff --git a/src/dnadesign/densegen/src/adapters/sources/binding_sites.py b/src/dnadesign/densegen/src/adapters/sources/binding_sites.py index 6bea5b56..6c74a022 100644 --- a/src/dnadesign/densegen/src/adapters/sources/binding_sites.py +++ b/src/dnadesign/densegen/src/adapters/sources/binding_sites.py @@ -57,7 +57,7 @@ def _load_table(self, path: Path, fmt: str) -> pd.DataFrame: return pd.read_excel(path) raise ValueError(f"Unsupported binding_sites.format: {fmt}") - def load_data(self, *, rng=None): + def load_data(self, *, rng=None, outputs_root: Path | None = None): data_path = resolve_path(self.cfg_path, self.path) if not (data_path.exists() and data_path.is_file()): raise FileNotFoundError(f"Binding sites file not found. 
Looked here:\n - {data_path}") diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_artifact.py b/src/dnadesign/densegen/src/adapters/sources/pwm_artifact.py index b2604800..446ca742 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_artifact.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_artifact.py @@ -152,7 +152,7 @@ class PWMArtifactDataSource(BaseDataSource): cfg_path: Path sampling: dict - def load_data(self, *, rng=None): + def load_data(self, *, rng=None, outputs_root: Path | None = None): if rng is None: raise ValueError("PWM sampling requires an RNG; pass the pipeline RNG explicitly.") artifact_path = resolve_path(self.cfg_path, self.path) @@ -173,8 +173,25 @@ def load_data(self, *, rng=None): length_range = sampling.get("length_range") trim_window_length = sampling.get("trim_window_length") trim_window_strategy = sampling.get("trim_window_strategy", "max_info") - - selected = sample_pwm_sites( + scoring_backend = str(sampling.get("scoring_backend", "densegen")).lower() + pvalue_threshold = sampling.get("pvalue_threshold") + pvalue_bins = sampling.get("pvalue_bins") + pvalue_bin_ids = sampling.get("pvalue_bin_ids") + bgfile = sampling.get("bgfile") + selection_policy = str(sampling.get("selection_policy", "random_uniform")) + keep_all_candidates_debug = bool(sampling.get("keep_all_candidates_debug", False)) + include_matched_sequence = bool(sampling.get("include_matched_sequence", False)) + bgfile_path: Path | None = None + if bgfile is not None: + bgfile_path = resolve_path(self.cfg_path, str(bgfile)) + if not (bgfile_path.exists() and bgfile_path.is_file()): + raise FileNotFoundError(f"PWM sampling bgfile not found. 
Looked here:\n - {bgfile_path}") + debug_output_dir: Path | None = None + if keep_all_candidates_debug and outputs_root is not None: + debug_output_dir = Path(outputs_root) / "meta" / "fimo" + + return_meta = scoring_backend == "fimo" + result = sample_pwm_sites( rng, motif, strategy=strategy, @@ -184,20 +201,36 @@ def load_data(self, *, rng=None): max_seconds=max_seconds, score_threshold=threshold, score_percentile=percentile, + scoring_backend=scoring_backend, + pvalue_threshold=pvalue_threshold, + pvalue_bins=pvalue_bins, + pvalue_bin_ids=pvalue_bin_ids, + bgfile=bgfile_path, + selection_policy=selection_policy, + keep_all_candidates_debug=keep_all_candidates_debug, + include_matched_sequence=include_matched_sequence, + debug_output_dir=debug_output_dir, + debug_label=f"{artifact_path.stem}__{motif.motif_id}", length_policy=length_policy, length_range=length_range, trim_window_length=trim_window_length, trim_window_strategy=str(trim_window_strategy), + return_metadata=return_meta, ) + if return_meta: + selected, meta_by_seq = result # type: ignore[misc] + else: + selected = result # type: ignore[assignment] + meta_by_seq = {} entries = [(motif.motif_id, seq, str(artifact_path)) for seq in selected] import pandas as pd - df_out = pd.DataFrame( - { - "tf": [motif.motif_id] * len(selected), - "tfbs": selected, - "source": [str(artifact_path)] * len(selected), - } - ) + rows = [] + for seq in selected: + row = {"tf": motif.motif_id, "tfbs": seq, "source": str(artifact_path)} + if meta_by_seq: + row.update(meta_by_seq.get(seq, {})) + rows.append(row) + df_out = pd.DataFrame(rows) return entries, df_out diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_artifact_set.py b/src/dnadesign/densegen/src/adapters/sources/pwm_artifact_set.py index 9ed3dae9..6fff70b3 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_artifact_set.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_artifact_set.py @@ -28,7 +28,7 @@ class 
PWMArtifactSetDataSource(BaseDataSource): sampling: dict overrides_by_motif_id: dict[str, dict] | None = None - def load_data(self, *, rng=None): + def load_data(self, *, rng=None, outputs_root: Path | None = None): if rng is None: raise ValueError("PWM sampling requires an RNG; pass the pipeline RNG explicitly.") @@ -69,7 +69,24 @@ def load_data(self, *, rng=None): length_range = sampling_cfg.get("length_range") trim_window_length = sampling_cfg.get("trim_window_length") trim_window_strategy = sampling_cfg.get("trim_window_strategy", "max_info") - selected = sample_pwm_sites( + scoring_backend = str(sampling_cfg.get("scoring_backend", "densegen")).lower() + pvalue_threshold = sampling_cfg.get("pvalue_threshold") + pvalue_bins = sampling_cfg.get("pvalue_bins") + pvalue_bin_ids = sampling_cfg.get("pvalue_bin_ids") + bgfile = sampling_cfg.get("bgfile") + selection_policy = str(sampling_cfg.get("selection_policy", "random_uniform")) + keep_all_candidates_debug = bool(sampling_cfg.get("keep_all_candidates_debug", False)) + include_matched_sequence = bool(sampling_cfg.get("include_matched_sequence", False)) + bgfile_path: Path | None = None + if bgfile is not None: + bgfile_path = resolve_path(self.cfg_path, str(bgfile)) + if not (bgfile_path.exists() and bgfile_path.is_file()): + raise FileNotFoundError(f"PWM sampling bgfile not found. 
Looked here:\n - {bgfile_path}") + debug_output_dir: Path | None = None + if keep_all_candidates_debug and outputs_root is not None: + debug_output_dir = Path(outputs_root) / "meta" / "fimo" + return_meta = scoring_backend == "fimo" + result = sample_pwm_sites( rng, motif, strategy=strategy, @@ -79,15 +96,34 @@ def load_data(self, *, rng=None): max_seconds=max_seconds, score_threshold=threshold, score_percentile=percentile, + scoring_backend=scoring_backend, + pvalue_threshold=pvalue_threshold, + pvalue_bins=pvalue_bins, + pvalue_bin_ids=pvalue_bin_ids, + bgfile=bgfile_path, + selection_policy=selection_policy, + keep_all_candidates_debug=keep_all_candidates_debug, + include_matched_sequence=include_matched_sequence, + debug_output_dir=debug_output_dir, + debug_label=f"{Path(path).stem}__{motif.motif_id}", length_policy=length_policy, length_range=length_range, trim_window_length=trim_window_length, trim_window_strategy=str(trim_window_strategy), + return_metadata=return_meta, ) + if return_meta: + selected, meta_by_seq = result # type: ignore[misc] + else: + selected = result # type: ignore[assignment] + meta_by_seq = {} for seq in selected: entries.append((motif.motif_id, seq, str(path))) - all_rows.append({"tf": motif.motif_id, "tfbs": seq, "source": str(path)}) + row = {"tf": motif.motif_id, "tfbs": seq, "source": str(path)} + if meta_by_seq: + row.update(meta_by_seq.get(seq, {})) + all_rows.append(row) import pandas as pd diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_fimo.py b/src/dnadesign/densegen/src/adapters/sources/pwm_fimo.py new file mode 100644 index 00000000..cbdc06c6 --- /dev/null +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_fimo.py @@ -0,0 +1,178 @@ +""" +-------------------------------------------------------------------------------- + +dnadesign/densegen/adapters/sources/pwm_fimo.py + +Helpers for MEME Suite FIMO-backed scoring of PWM-sampled candidates. + +Module Author(s): Eric J. 
South +Dunlop Lab +-------------------------------------------------------------------------------- +""" + +from __future__ import annotations + +import csv +import re +import subprocess +from dataclasses import dataclass +from pathlib import Path +from typing import Iterable, Sequence + +from ...integrations.meme_suite import resolve_executable +from .pwm_sampling import PWMMotif, normalize_background + +_HEADER_RE = re.compile(r"[\s\-]+") +_SAFE_ID_RE = re.compile(r"[^A-Za-z0-9_.-]+") + + +@dataclass(frozen=True) +class FimoHit: + sequence_name: str + start: int + stop: int + strand: str + score: float + pvalue: float + matched_sequence: str | None = None + + +def _normalize_header(name: str) -> str: + return _HEADER_RE.sub("_", str(name).strip().lower()) + + +def _sanitize_id(text: str) -> str: + cleaned = _SAFE_ID_RE.sub("_", str(text).strip()) + return cleaned or "motif" + + +def build_candidate_records(motif_id: str, sequences: Sequence[str]) -> list[tuple[str, str]]: + prefix = _sanitize_id(motif_id) + return [(f"{prefix}|cand{idx}", seq) for idx, seq in enumerate(sequences)] + + +def write_candidates_fasta(records: Sequence[tuple[str, str]], out_path: Path) -> None: + lines = [] + for rec_id, seq in records: + lines.append(f">{rec_id}") + lines.append(str(seq)) + out_path.write_text("\n".join(lines) + "\n") + + +def write_minimal_meme_motif(motif: PWMMotif, out_path: Path) -> str: + motif_id = _sanitize_id(motif.motif_id) + bg = normalize_background(motif.background) + lines = [ + "MEME version 4", + "", + "ALPHABET= ACGT", + "", + "strands: + -", + "", + "Background letter frequencies:", + f"A {bg['A']:.6g} C {bg['C']:.6g} G {bg['G']:.6g} T {bg['T']:.6g}", + "", + f"MOTIF {motif_id}", + f"letter-probability matrix: alength= 4 w= {len(motif.matrix)}", + ] + for row in motif.matrix: + lines.append( + f"{float(row.get('A', 0.0)):.6g} {float(row.get('C', 0.0)):.6g} " + f"{float(row.get('G', 0.0)):.6g} {float(row.get('T', 0.0)):.6g}" + ) + 
out_path.write_text("\n".join(lines) + "\n") + return motif_id + + +def parse_fimo_tsv(text: str) -> list[dict]: + lines = [ln for ln in text.splitlines() if ln.strip() and not ln.lstrip().startswith("#")] + if not lines: + return [] + reader = csv.reader(lines, delimiter="\t") + header = next(reader, None) + if header is None: + return [] + alias = {"pvalue": "p_value", "qvalue": "q_value", "sequence": "sequence_name"} + normalized = [alias.get(_normalize_header(h), _normalize_header(h)) for h in header] + idx = {name: i for i, name in enumerate(normalized)} + required = {"sequence_name", "start", "stop", "strand", "score", "p_value"} + if not required.issubset(idx): + raise ValueError(f"FIMO output missing required columns: {sorted(required - set(idx))}") + rows: list[dict] = [] + for row in reader: + if not row: + continue + seq_name = row[idx["sequence_name"]] + entry = { + "sequence_name": seq_name, + "start": int(row[idx["start"]]), + "stop": int(row[idx["stop"]]), + "strand": row[idx["strand"]], + "score": float(row[idx["score"]]), + "p_value": float(row[idx["p_value"]]), + } + if "q_value" in idx: + try: + entry["q_value"] = float(row[idx["q_value"]]) + except Exception: + entry["q_value"] = None + if "matched_sequence" in idx: + entry["matched_sequence"] = row[idx["matched_sequence"]] + rows.append(entry) + return rows + + +def aggregate_best_hits(rows: Iterable[dict]) -> dict[str, FimoHit]: + best: dict[str, FimoHit] = {} + for row in rows: + seq_name = row["sequence_name"] + pval = float(row["p_value"]) + score = float(row["score"]) + hit = FimoHit( + sequence_name=seq_name, + start=int(row["start"]), + stop=int(row["stop"]), + strand=str(row["strand"]), + score=score, + pvalue=pval, + matched_sequence=row.get("matched_sequence"), + ) + prev = best.get(seq_name) + if prev is None or pval < prev.pvalue or (pval == prev.pvalue and score > prev.score): + best[seq_name] = hit + return best + + +def run_fimo( + *, + meme_motif_path: Path, + fasta_path: Path, 
+ bgfile: Path | None = None, + norc: bool = False, + thresh: float | None = None, + include_matched_sequence: bool = False, + return_tsv: bool = False, +) -> tuple[list[dict], str | None]: + exe = resolve_executable("fimo", tool_path=None) + if exe is None: + raise FileNotFoundError( + "FIMO executable not found. Install MEME Suite and ensure `fimo` is on PATH, " + "or set MEME_BIN to the MEME bin directory (pixi users: `pixi run dense ...`)." + ) + cmd = [str(exe), "--text"] + if not include_matched_sequence: + cmd.append("--skip-matched-sequence") + if norc: + cmd.append("--norc") + if thresh is not None: + cmd.extend(["--thresh", str(thresh)]) + if bgfile is not None: + cmd.extend(["--bgfile", str(bgfile)]) + cmd.extend([str(meme_motif_path), str(fasta_path)]) + result = subprocess.run(cmd, capture_output=True, text=True, check=False) + if result.returncode != 0: + stderr = result.stderr.strip() + raise RuntimeError(f"FIMO failed (exit {result.returncode}). {stderr or 'No stderr output.'}") + tsv_text = result.stdout + rows = parse_fimo_tsv(tsv_text) + return rows, (tsv_text if return_tsv else None) diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_jaspar.py b/src/dnadesign/densegen/src/adapters/sources/pwm_jaspar.py index c052a008..75a73d19 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_jaspar.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_jaspar.py @@ -87,7 +87,7 @@ class PWMJasparDataSource(BaseDataSource): motif_ids: Optional[List[str]] sampling: dict - def load_data(self, *, rng=None): + def load_data(self, *, rng=None, outputs_root: Path | None = None): if rng is None: raise ValueError("PWM sampling requires an RNG; pass the pipeline RNG explicitly.") jaspar_path = resolve_path(self.cfg_path, self.path) @@ -113,11 +113,28 @@ def load_data(self, *, rng=None): length_range = sampling.get("length_range") trim_window_length = sampling.get("trim_window_length") trim_window_strategy = sampling.get("trim_window_strategy", 
"max_info") + scoring_backend = str(sampling.get("scoring_backend", "densegen")).lower() + pvalue_threshold = sampling.get("pvalue_threshold") + pvalue_bins = sampling.get("pvalue_bins") + pvalue_bin_ids = sampling.get("pvalue_bin_ids") + bgfile = sampling.get("bgfile") + selection_policy = str(sampling.get("selection_policy", "random_uniform")) + keep_all_candidates_debug = bool(sampling.get("keep_all_candidates_debug", False)) + include_matched_sequence = bool(sampling.get("include_matched_sequence", False)) + bgfile_path: Path | None = None + if bgfile is not None: + bgfile_path = resolve_path(self.cfg_path, str(bgfile)) + if not (bgfile_path.exists() and bgfile_path.is_file()): + raise FileNotFoundError(f"PWM sampling bgfile not found. Looked here:\n - {bgfile_path}") + debug_output_dir: Path | None = None + if keep_all_candidates_debug and outputs_root is not None: + debug_output_dir = Path(outputs_root) / "meta" / "fimo" entries = [] all_rows = [] for motif in motifs: - selected = sample_pwm_sites( + return_meta = scoring_backend == "fimo" + result = sample_pwm_sites( rng, motif, strategy=strategy, @@ -127,14 +144,33 @@ def load_data(self, *, rng=None): max_seconds=max_seconds, score_threshold=threshold, score_percentile=percentile, + scoring_backend=scoring_backend, + pvalue_threshold=pvalue_threshold, + pvalue_bins=pvalue_bins, + pvalue_bin_ids=pvalue_bin_ids, + bgfile=bgfile_path, + selection_policy=selection_policy, + keep_all_candidates_debug=keep_all_candidates_debug, + include_matched_sequence=include_matched_sequence, + debug_output_dir=debug_output_dir, + debug_label=f"{jaspar_path.stem}__{motif.motif_id}", length_policy=length_policy, length_range=length_range, trim_window_length=trim_window_length, trim_window_strategy=str(trim_window_strategy), + return_metadata=return_meta, ) + if return_meta: + selected, meta_by_seq = result # type: ignore[misc] + else: + selected = result # type: ignore[assignment] + meta_by_seq = {} for seq in selected: 
entries.append((motif.motif_id, seq, str(jaspar_path))) - all_rows.append({"tf": motif.motif_id, "tfbs": seq, "source": str(jaspar_path)}) + row = {"tf": motif.motif_id, "tfbs": seq, "source": str(jaspar_path)} + if meta_by_seq: + row.update(meta_by_seq.get(seq, {})) + all_rows.append(row) import pandas as pd diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_matrix_csv.py b/src/dnadesign/densegen/src/adapters/sources/pwm_matrix_csv.py index c34aa49c..5df2088c 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_matrix_csv.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_matrix_csv.py @@ -29,7 +29,7 @@ class PWMMatrixCSVDataSource(BaseDataSource): columns: dict[str, str] sampling: dict - def load_data(self, *, rng=None): + def load_data(self, *, rng=None, outputs_root: Path | None = None): if rng is None: raise ValueError("PWM sampling requires an RNG; pass the pipeline RNG explicitly.") if not self.motif_id or not str(self.motif_id).strip(): @@ -77,8 +77,25 @@ def load_data(self, *, rng=None): length_range = sampling.get("length_range") trim_window_length = sampling.get("trim_window_length") trim_window_strategy = sampling.get("trim_window_strategy", "max_info") - - selected = sample_pwm_sites( + scoring_backend = str(sampling.get("scoring_backend", "densegen")).lower() + pvalue_threshold = sampling.get("pvalue_threshold") + pvalue_bins = sampling.get("pvalue_bins") + pvalue_bin_ids = sampling.get("pvalue_bin_ids") + bgfile = sampling.get("bgfile") + selection_policy = str(sampling.get("selection_policy", "random_uniform")) + keep_all_candidates_debug = bool(sampling.get("keep_all_candidates_debug", False)) + include_matched_sequence = bool(sampling.get("include_matched_sequence", False)) + bgfile_path: Path | None = None + if bgfile is not None: + bgfile_path = resolve_path(self.cfg_path, str(bgfile)) + if not (bgfile_path.exists() and bgfile_path.is_file()): + raise FileNotFoundError(f"PWM sampling bgfile not found. 
Looked here:\n - {bgfile_path}") + debug_output_dir: Path | None = None + if keep_all_candidates_debug and outputs_root is not None: + debug_output_dir = Path(outputs_root) / "meta" / "fimo" + + return_meta = scoring_backend == "fimo" + result = sample_pwm_sites( rng, motif, strategy=strategy, @@ -88,14 +105,34 @@ def load_data(self, *, rng=None): max_seconds=max_seconds, score_threshold=threshold, score_percentile=percentile, + scoring_backend=scoring_backend, + pvalue_threshold=pvalue_threshold, + pvalue_bins=pvalue_bins, + pvalue_bin_ids=pvalue_bin_ids, + bgfile=bgfile_path, + selection_policy=selection_policy, + keep_all_candidates_debug=keep_all_candidates_debug, + include_matched_sequence=include_matched_sequence, + debug_output_dir=debug_output_dir, + debug_label=f"{csv_path.stem}__{motif.motif_id}", length_policy=length_policy, length_range=length_range, trim_window_length=trim_window_length, trim_window_strategy=str(trim_window_strategy), + return_metadata=return_meta, ) + if return_meta: + selected, meta_by_seq = result # type: ignore[misc] + else: + selected = result # type: ignore[assignment] + meta_by_seq = {} entries = [(motif.motif_id, seq, str(csv_path)) for seq in selected] - df_out = pd.DataFrame( - {"tf": [motif.motif_id] * len(selected), "tfbs": selected, "source": [str(csv_path)] * len(selected)} - ) + rows = [] + for seq in selected: + row = {"tf": motif.motif_id, "tfbs": seq, "source": str(csv_path)} + if meta_by_seq: + row.update(meta_by_seq.get(seq, {})) + rows.append(row) + df_out = pd.DataFrame(rows) return entries, df_out diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_meme.py b/src/dnadesign/densegen/src/adapters/sources/pwm_meme.py index e364c413..7f7193ac 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_meme.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_meme.py @@ -56,7 +56,7 @@ class PWMMemeDataSource(BaseDataSource): motif_ids: Optional[List[str]] sampling: dict - def load_data(self, *, rng=None): 
+ def load_data(self, *, rng=None, outputs_root: Path | None = None): if rng is None: raise ValueError("PWM sampling requires an RNG; pass the pipeline RNG explicitly.") meme_path = resolve_path(self.cfg_path, self.path) @@ -91,12 +91,29 @@ def load_data(self, *, rng=None): length_range = sampling.get("length_range") trim_window_length = sampling.get("trim_window_length") trim_window_strategy = sampling.get("trim_window_strategy", "max_info") + scoring_backend = str(sampling.get("scoring_backend", "densegen")).lower() + pvalue_threshold = sampling.get("pvalue_threshold") + pvalue_bins = sampling.get("pvalue_bins") + pvalue_bin_ids = sampling.get("pvalue_bin_ids") + bgfile = sampling.get("bgfile") + selection_policy = str(sampling.get("selection_policy", "random_uniform")) + keep_all_candidates_debug = bool(sampling.get("keep_all_candidates_debug", False)) + include_matched_sequence = bool(sampling.get("include_matched_sequence", False)) + bgfile_path: Path | None = None + if bgfile is not None: + bgfile_path = resolve_path(self.cfg_path, str(bgfile)) + if not (bgfile_path.exists() and bgfile_path.is_file()): + raise FileNotFoundError(f"PWM sampling bgfile not found. 
Looked here:\n - {bgfile_path}") + debug_output_dir: Path | None = None + if keep_all_candidates_debug and outputs_root is not None: + debug_output_dir = Path(outputs_root) / "meta" / "fimo" entries = [] all_rows = [] for motif in motifs: pwm = _motif_to_pwm(motif, background) - selected = sample_pwm_sites( + return_meta = scoring_backend == "fimo" + result = sample_pwm_sites( rng, pwm, strategy=strategy, @@ -106,15 +123,34 @@ def load_data(self, *, rng=None): max_seconds=max_seconds, score_threshold=threshold, score_percentile=percentile, + scoring_backend=scoring_backend, + pvalue_threshold=pvalue_threshold, + pvalue_bins=pvalue_bins, + pvalue_bin_ids=pvalue_bin_ids, + bgfile=bgfile_path, + selection_policy=selection_policy, + keep_all_candidates_debug=keep_all_candidates_debug, + include_matched_sequence=include_matched_sequence, + debug_output_dir=debug_output_dir, + debug_label=f"{meme_path.stem}__{pwm.motif_id}", length_policy=length_policy, length_range=length_range, trim_window_length=trim_window_length, trim_window_strategy=str(trim_window_strategy), + return_metadata=return_meta, ) + if return_meta: + selected, meta_by_seq = result # type: ignore[misc] + else: + selected = result # type: ignore[assignment] + meta_by_seq = {} for seq in selected: entries.append((pwm.motif_id, seq, str(meme_path))) - all_rows.append({"tf": pwm.motif_id, "tfbs": seq, "source": str(meme_path)}) + row = {"tf": pwm.motif_id, "tfbs": seq, "source": str(meme_path)} + if meta_by_seq: + row.update(meta_by_seq.get(seq, {})) + all_rows.append(row) import pandas as pd diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_meme_set.py b/src/dnadesign/densegen/src/adapters/sources/pwm_meme_set.py index f5434721..1e521914 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_meme_set.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_meme_set.py @@ -42,7 +42,7 @@ class PWMMemeSetDataSource(BaseDataSource): motif_ids: Optional[List[str]] sampling: dict - def 
load_data(self, *, rng=None): + def load_data(self, *, rng=None, outputs_root: Path | None = None): if rng is None: raise ValueError("PWM sampling requires an RNG; pass the pipeline RNG explicitly.") resolved = [resolve_path(self.cfg_path, path) for path in self.paths] @@ -85,12 +85,29 @@ def load_data(self, *, rng=None): length_range = sampling.get("length_range") trim_window_length = sampling.get("trim_window_length") trim_window_strategy = sampling.get("trim_window_strategy", "max_info") + scoring_backend = str(sampling.get("scoring_backend", "densegen")).lower() + pvalue_threshold = sampling.get("pvalue_threshold") + pvalue_bins = sampling.get("pvalue_bins") + pvalue_bin_ids = sampling.get("pvalue_bin_ids") + bgfile = sampling.get("bgfile") + selection_policy = str(sampling.get("selection_policy", "random_uniform")) + keep_all_candidates_debug = bool(sampling.get("keep_all_candidates_debug", False)) + include_matched_sequence = bool(sampling.get("include_matched_sequence", False)) + bgfile_path: Path | None = None + if bgfile is not None: + bgfile_path = resolve_path(self.cfg_path, str(bgfile)) + if not (bgfile_path.exists() and bgfile_path.is_file()): + raise FileNotFoundError(f"PWM sampling bgfile not found. 
Looked here:\n - {bgfile_path}") + debug_output_dir: Path | None = None + if keep_all_candidates_debug and outputs_root is not None: + debug_output_dir = Path(outputs_root) / "meta" / "fimo" entries = [] all_rows = [] for motif, background, path in motifs_payload: pwm = _motif_to_pwm(motif, background) - selected = sample_pwm_sites( + return_meta = scoring_backend == "fimo" + result = sample_pwm_sites( rng, pwm, strategy=strategy, @@ -100,14 +117,33 @@ def load_data(self, *, rng=None): max_seconds=max_seconds, score_threshold=threshold, score_percentile=percentile, + scoring_backend=scoring_backend, + pvalue_threshold=pvalue_threshold, + pvalue_bins=pvalue_bins, + pvalue_bin_ids=pvalue_bin_ids, + bgfile=bgfile_path, + selection_policy=selection_policy, + keep_all_candidates_debug=keep_all_candidates_debug, + include_matched_sequence=include_matched_sequence, + debug_output_dir=debug_output_dir, + debug_label=f"{Path(path).stem}__{pwm.motif_id}", length_policy=length_policy, length_range=length_range, trim_window_length=trim_window_length, trim_window_strategy=str(trim_window_strategy), + return_metadata=return_meta, ) + if return_meta: + selected, meta_by_seq = result # type: ignore[misc] + else: + selected = result # type: ignore[assignment] + meta_by_seq = {} for seq in selected: entries.append((pwm.motif_id, seq, str(path))) - all_rows.append({"tf": pwm.motif_id, "tfbs": seq, "source": str(path)}) + row = {"tf": pwm.motif_id, "tfbs": seq, "source": str(path)} + if meta_by_seq: + row.update(meta_by_seq.get(seq, {})) + all_rows.append(row) import pandas as pd diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_sampling.py b/src/dnadesign/densegen/src/adapters/sources/pwm_sampling.py index 6a17f905..5c3514f5 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_sampling.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_sampling.py @@ -15,12 +15,40 @@ import logging import time from dataclasses import dataclass +from pathlib import Path from 
typing import List, Optional, Sequence, Tuple import numpy as np +from ...core.pvalue_bins import resolve_pvalue_bins + SMOOTHING_ALPHA = 1e-6 log = logging.getLogger(__name__) +_SAFE_LABEL_RE = None + + +def _safe_label(text: str) -> str: + global _SAFE_LABEL_RE + if _SAFE_LABEL_RE is None: + import re + + _SAFE_LABEL_RE = re.compile(r"[^A-Za-z0-9_.-]+") + cleaned = _SAFE_LABEL_RE.sub("_", str(text).strip()) + return cleaned or "motif" + + +@dataclass(frozen=True) +class FimoCandidate: + seq: str + pvalue: float + score: float + bin_id: int + bin_low: float + bin_high: float + start: int + stop: int + strand: str + matched_sequence: Optional[str] = None @dataclass(frozen=True) @@ -221,6 +249,154 @@ def select_by_score( return unique[:n_sites] +def _resolve_pvalue_edges(pvalue_bins: Sequence[float] | None) -> list[float]: + edges = resolve_pvalue_bins(pvalue_bins) + if not edges: + raise ValueError("pvalue_bins must contain at least one edge.") + cleaned: list[float] = [] + prev = 0.0 + for edge in edges: + edge_val = float(edge) + if not (0.0 < edge_val <= 1.0): + raise ValueError("pvalue_bins values must be in (0, 1].") + if edge_val <= prev: + raise ValueError("pvalue_bins must be strictly increasing.") + cleaned.append(edge_val) + prev = edge_val + if abs(cleaned[-1] - 1.0) > 1e-12: + raise ValueError("pvalue_bins must end with 1.0.") + return cleaned + + +def _assign_pvalue_bin(pvalue: float, edges: Sequence[float]) -> tuple[int, float, float]: + low = 0.0 + for idx, edge in enumerate(edges): + if pvalue <= edge: + return idx, low, float(edge) + low = float(edge) + if not edges: + return 0, 0.0, 1.0 + if len(edges) == 1: + return 0, 0.0, float(edges[0]) + return len(edges) - 1, float(edges[-2]), float(edges[-1]) + + +def _format_pvalue_bins(edges: Sequence[float], counts: Sequence[int]) -> str: + if not edges or not counts: + return "-" + labels: list[str] = [] + low = 0.0 + for edge, count in zip(edges, counts): + 
labels.append(f"({low:.0e},{float(edge):.0e}]:{int(count)}") + low = float(edge) + return " ".join(labels) + + +def _stratified_sample( + candidates: List[FimoCandidate], + *, + n_sites: int, + rng: np.random.Generator, + n_bins: int, +) -> List[FimoCandidate]: + bins: list[list[FimoCandidate]] = [[] for _ in range(n_bins)] + for cand in candidates: + idx = max(0, min(int(cand.bin_id), n_bins - 1)) + bins[idx].append(cand) + for bucket in bins: + rng.shuffle(bucket) + picked: list[FimoCandidate] = [] + while len(picked) < n_sites: + progressed = False + for bucket in bins: + if bucket: + picked.append(bucket.pop()) + progressed = True + if len(picked) >= n_sites: + break + if not progressed: + break + return picked + + +def _select_fimo_candidates( + candidates: List[FimoCandidate], + *, + n_sites: int, + selection_policy: str, + rng: np.random.Generator, + pvalue_threshold: float, + keep_weak: bool, + n_bins: int, + context: dict, +) -> List[FimoCandidate]: + unique: list[FimoCandidate] = [] + seen: set[str] = set() + for cand in candidates: + if cand.seq in seen: + continue + seen.add(cand.seq) + unique.append(cand) + if len(unique) < n_sites: + msg_lines = [ + ( + "PWM sampling failed for motif " + f"'{context.get('motif_id')}' " + f"(width={context.get('width')}, strategy={context.get('strategy')}, " + f"length={context.get('length_label')}, window={context.get('window_label')}, " + f"backend=fimo, selection={selection_policy}, " + f"pvalue={context.get('pvalue_label')})." + ), + ( + f"Requested n_sites={context.get('n_sites')} oversample_factor={context.get('oversample_factor')} " + f"-> candidates requested={context.get('requested_candidates')} " + f"generated={context.get('generated_candidates')}" + f"{context.get('cap_label')}." 
+ ), + (f"Unique candidates after filtering={len(unique)} (need {n_sites})."), + ] + if context.get("length_observed"): + msg_lines.append(f"Observed candidate lengths={context.get('length_observed')}.") + if context.get("pvalue_bins_label") is not None: + msg_lines.append(f"P-value bins={context.get('pvalue_bins_label')}.") + if context.get("pvalue_bin_ids") is not None: + msg_lines.append(f"Selected bins={context.get('pvalue_bin_ids')}.") + suggestions = [ + "reduce n_sites", + "relax pvalue_threshold (e.g., 1e-4 → 1e-3)", + "increase oversample_factor", + ] + if context.get("pvalue_bin_ids") is not None: + suggestions.append("broaden pvalue_bin_ids (or remove bin filtering)") + if context.get("cap_applied"): + suggestions.append("increase max_candidates (cap was hit)") + if context.get("time_limited"): + suggestions.append("increase max_seconds (time limit was hit)") + if context.get("width") is not None and int(context.get("width")) <= 6: + suggestions.append("try length_policy=range with a longer length_range") + msg_lines.append("Try next: " + "; ".join(suggestions) + ".") + raise ValueError(" ".join(msg_lines)) + if selection_policy == "random_uniform": + if len(unique) == n_sites: + return unique + picks = rng.choice(len(unique), size=n_sites, replace=False) + return [unique[int(i)] for i in picks] + if selection_policy == "top_n": + if keep_weak: + ordered = sorted(unique, key=lambda c: (-c.pvalue, c.score)) + else: + ordered = sorted(unique, key=lambda c: (c.pvalue, -c.score)) + return ordered[:n_sites] + if selection_policy == "stratified": + return _stratified_sample( + unique, + n_sites=n_sites, + rng=rng, + n_bins=n_bins, + ) + raise ValueError(f"Unsupported pwm selection_policy: {selection_policy}") + + def sample_pwm_sites( rng: np.random.Generator, motif: PWMMotif, @@ -232,19 +408,53 @@ def sample_pwm_sites( max_seconds: Optional[float] = None, score_threshold: Optional[float], score_percentile: Optional[float], + scoring_backend: str = "densegen", 
+ pvalue_threshold: Optional[float] = None, + pvalue_bins: Optional[Sequence[float]] = None, + pvalue_bin_ids: Optional[Sequence[int]] = None, + bgfile: Optional[str | Path] = None, + selection_policy: str = "random_uniform", + keep_all_candidates_debug: bool = False, + include_matched_sequence: bool = False, + debug_output_dir: Optional[Path] = None, + debug_label: Optional[str] = None, length_policy: str = "exact", length_range: Optional[Sequence[int]] = None, trim_window_length: Optional[int] = None, trim_window_strategy: str = "max_info", -) -> List[str]: + return_metadata: bool = False, +) -> List[str] | Tuple[List[str], dict[str, dict]]: if n_sites <= 0: raise ValueError("n_sites must be > 0") if oversample_factor <= 0: raise ValueError("oversample_factor must be > 0") if max_seconds is not None and float(max_seconds) <= 0: raise ValueError("max_seconds must be > 0 when set") - if (score_threshold is None) == (score_percentile is None): - raise ValueError("PWM sampling requires exactly one of score_threshold or score_percentile") + scoring_backend = str(scoring_backend or "densegen").lower() + if scoring_backend not in {"densegen", "fimo"}: + raise ValueError(f"Unsupported pwm sampling scoring_backend: {scoring_backend}") + if scoring_backend == "densegen": + if (score_threshold is None) == (score_percentile is None): + raise ValueError("PWM sampling requires exactly one of score_threshold or score_percentile") + if pvalue_bins is not None: + raise ValueError("pvalue_bins is only valid when scoring_backend='fimo'") + if pvalue_bin_ids is not None: + raise ValueError("pvalue_bin_ids is only valid when scoring_backend='fimo'") + if include_matched_sequence: + raise ValueError("include_matched_sequence is only valid when scoring_backend='fimo'") + else: + if pvalue_threshold is None: + raise ValueError("PWM sampling requires pvalue_threshold when scoring_backend='fimo'") + pvalue_threshold = float(pvalue_threshold) + if not (0.0 < pvalue_threshold <= 1.0): + 
raise ValueError("pwm.sampling.pvalue_threshold must be between 0 and 1") + if selection_policy not in {"random_uniform", "top_n", "stratified"}: + raise ValueError(f"Unsupported pwm selection_policy: {selection_policy}") + if score_threshold is not None or score_percentile is not None: + log.warning( + "PWM sampling scoring_backend=fimo ignores score_threshold/score_percentile for motif %s.", + motif.motif_id, + ) if strategy == "consensus" and n_sites != 1: raise ValueError("PWM sampling strategy 'consensus' requires n_sites=1") @@ -276,10 +486,41 @@ def sample_pwm_sites( matrix = motif.matrix score_label = f"threshold={score_threshold}" if score_threshold is not None else f"percentile={score_percentile}" + pvalue_label = None + if scoring_backend == "fimo" and pvalue_threshold is not None: + comparator = ">=" if keep_low else "<=" + pvalue_label = f"{comparator}{pvalue_threshold:g}" length_label = str(length_policy) if length_policy == "range" and length_range is not None and len(length_range) == 2: length_label = f"{length_policy}({length_range[0]}..{length_range[1]})" + def _cap_label(cap_applied: bool, time_limited: bool) -> str: + cap_label = "" + if cap_applied and max_candidates is not None: + cap_label = f" (capped by max_candidates={max_candidates})" + if time_limited and max_seconds is not None: + cap_label = f"{cap_label}; max_seconds={max_seconds}" if cap_label else f" (max_seconds={max_seconds})" + return cap_label + + def _context(length_obs: str, cap_applied: bool, requested: int, generated: int, time_limited: bool) -> dict: + return { + "motif_id": motif.motif_id, + "width": width, + "strategy": strategy, + "length_label": length_label, + "window_label": window_label, + "length_observed": length_obs, + "score_label": score_label, + "pvalue_label": pvalue_label, + "n_sites": n_sites, + "oversample_factor": oversample_factor, + "requested_candidates": requested, + "generated_candidates": generated, + "cap_applied": cap_applied, + "cap_label": 
_cap_label(cap_applied, time_limited), + "time_limited": time_limited, + } + def _select( candidates: List[Tuple[str, float]], *, @@ -289,33 +530,13 @@ def _select( generated: int, time_limited: bool, ): - cap_label = "" - if cap_applied and max_candidates is not None: - cap_label = f" (capped by max_candidates={max_candidates})" - if time_limited and max_seconds is not None: - cap_label = f"{cap_label}; max_seconds={max_seconds}" if cap_label else f" (max_seconds={max_seconds})" return select_by_score( candidates, n_sites=n_sites, threshold=score_threshold, percentile=score_percentile, keep_low=keep_low, - context={ - "motif_id": motif.motif_id, - "width": width, - "strategy": strategy, - "length_label": length_label, - "window_label": window_label, - "length_observed": length_obs, - "score_label": score_label, - "n_sites": n_sites, - "oversample_factor": oversample_factor, - "requested_candidates": requested, - "generated_candidates": generated, - "cap_applied": cap_applied, - "cap_label": cap_label, - "time_limited": time_limited, - }, + context=_context(length_obs, cap_applied, requested, generated, time_limited), ) def _resolve_length() -> int: @@ -342,19 +563,180 @@ def _embed_with_background(seq: str, target_len: int) -> str: right = sample_sequence_from_background(rng, motif.background, right_len) return f"{left}{seq}{right}" + def _score_with_fimo( + sequences: List[str], + *, + length_obs: str, + cap_applied: bool, + requested: int, + generated: int, + time_limited: bool, + ) -> tuple[List[str], dict[str, dict]]: + import tempfile + + from .pwm_fimo import ( + aggregate_best_hits, + build_candidate_records, + run_fimo, + write_candidates_fasta, + write_minimal_meme_motif, + ) + + if pvalue_threshold is None: + raise ValueError("pvalue_threshold required for fimo backend") + resolved_bins = _resolve_pvalue_edges(pvalue_bins) + allowed_bins: Optional[set[int]] = None + if pvalue_bin_ids is not None: + allowed_bins = {int(idx) for idx in pvalue_bin_ids} + 
max_idx = len(resolved_bins) - 1 + if any(idx > max_idx for idx in allowed_bins): + raise ValueError(f"pvalue_bin_ids contains an index outside the available bins (max={max_idx}).") + keep_weak = keep_low + debug_path: Optional[Path] = None + debug_dir = debug_output_dir + if keep_all_candidates_debug: + if debug_dir is None: + tmp_dir = tempfile.mkdtemp(prefix="densegen-fimo-") + debug_dir = Path(tmp_dir) + log.warning( + "PWM sampling keep_all_candidates_debug enabled without outputs_root; " + "writing FIMO debug TSVs to %s", + debug_dir, + ) + debug_dir.mkdir(parents=True, exist_ok=True) + label = _safe_label(debug_label or motif.motif_id) + debug_path = debug_dir / f"{label}__fimo.tsv" + + with tempfile.TemporaryDirectory() as tmp: + tmp_path = Path(tmp) + meme_path = tmp_path / "motif.meme" + fasta_path = tmp_path / "candidates.fasta" + motif_for_fimo = PWMMotif(motif_id=motif.motif_id, matrix=matrix, background=motif.background) + write_minimal_meme_motif(motif_for_fimo, meme_path) + records = build_candidate_records(motif.motif_id, sequences) + write_candidates_fasta(records, fasta_path) + thresh = 1.0 if keep_all_candidates_debug or keep_weak else float(pvalue_threshold) + rows, raw_tsv = run_fimo( + meme_motif_path=meme_path, + fasta_path=fasta_path, + bgfile=Path(bgfile) if bgfile is not None else None, + thresh=thresh, + include_matched_sequence=include_matched_sequence or keep_all_candidates_debug, + return_tsv=debug_path is not None, + ) + if debug_path is not None and raw_tsv is not None: + debug_path.write_text(raw_tsv) + log.info("FIMO debug TSV written: %s", debug_path) + best_hits = aggregate_best_hits(rows) + + candidates: List[FimoCandidate] = [] + total_bin_counts = [0 for _ in resolved_bins] + accepted_bin_counts = [0 for _ in resolved_bins] + for rec_id, seq in records: + hit = best_hits.get(rec_id) + if hit is None: + continue + bin_id, bin_low, bin_high = _assign_pvalue_bin(hit.pvalue, resolved_bins) + total_bin_counts[bin_id] += 1 + if 
keep_weak: + accept = hit.pvalue >= float(pvalue_threshold) + else: + accept = hit.pvalue <= float(pvalue_threshold) + if allowed_bins is not None and bin_id not in allowed_bins: + continue + if not accept: + continue + accepted_bin_counts[bin_id] += 1 + candidates.append( + FimoCandidate( + seq=seq, + pvalue=hit.pvalue, + score=hit.score, + bin_id=bin_id, + bin_low=bin_low, + bin_high=bin_high, + start=hit.start, + stop=hit.stop, + strand=hit.strand, + matched_sequence=hit.matched_sequence, + ) + ) + + total_hits = sum(total_bin_counts) + accepted_hits = sum(accepted_bin_counts) + bins_label = _format_pvalue_bins(resolved_bins, total_bin_counts) + accepted_label = _format_pvalue_bins(resolved_bins, accepted_bin_counts) + + context = _context(length_obs, cap_applied, requested, generated, time_limited) + context["pvalue_bins_label"] = bins_label + context["pvalue_bin_ids"] = sorted(allowed_bins) if allowed_bins is not None else None + picked = _select_fimo_candidates( + candidates, + n_sites=n_sites, + selection_policy=selection_policy, + rng=rng, + pvalue_threshold=float(pvalue_threshold), + keep_weak=keep_weak, + n_bins=len(resolved_bins), + context=context, + ) + selected_bin_counts = [0 for _ in resolved_bins] + for cand in picked: + idx = max(0, min(int(cand.bin_id), len(resolved_bins) - 1)) + selected_bin_counts[idx] += 1 + selected_label = _format_pvalue_bins(resolved_bins, selected_bin_counts) + log.info( + "FIMO yield for motif %s: hits=%d accepted=%d selected=%d bins=%s accepted_bins=%s selected_bins=%s%s", + motif.motif_id, + total_hits, + accepted_hits, + len(picked), + bins_label, + accepted_label, + selected_label, + f" allowed_bins={sorted(allowed_bins)}" if allowed_bins is not None else "", + ) + meta_by_seq: dict[str, dict] = {} + for cand in picked: + meta = { + "fimo_score": cand.score, + "fimo_pvalue": cand.pvalue, + "fimo_bin_id": cand.bin_id, + "fimo_bin_low": cand.bin_low, + "fimo_bin_high": cand.bin_high, + "fimo_start": cand.start, + 
"fimo_stop": cand.stop, + "fimo_strand": cand.strand, + } + if cand.matched_sequence: + meta["fimo_matched_sequence"] = cand.matched_sequence + meta_by_seq[cand.seq] = meta + return [c.seq for c in picked], meta_by_seq + if strategy == "consensus": seq = "".join(max(row.items(), key=lambda kv: kv[1])[0] for row in matrix) target_len = _resolve_length() full_seq = _embed_with_background(seq, target_len) - score = score_sequence(seq, matrix, log_odds=log_odds, background=motif.background) - return _select( - [(full_seq, score)], + if scoring_backend == "densegen": + score = score_sequence(seq, matrix, log_odds=log_odds, background=motif.background) + selected = _select( + [(full_seq, score)], + length_obs=str(target_len), + cap_applied=False, + requested=1, + generated=1, + time_limited=False, + ) + return (selected, {}) if return_metadata else selected + selected, meta = _score_with_fimo( + [full_seq], length_obs=str(target_len), cap_applied=False, requested=1, generated=1, time_limited=False, ) + return (selected, meta) if return_metadata else selected requested_candidates = max(1, n_sites * oversample_factor) n_candidates = requested_candidates @@ -373,7 +755,7 @@ def _embed_with_background(seq: str, target_len: int) -> str: cap_val, ) n_candidates = max(1, n_candidates) - candidates: List[Tuple[str, float]] = [] + candidates: List[Tuple[str, str]] = [] lengths: List[int] = [] start = time.monotonic() time_limited = False @@ -389,12 +771,7 @@ def _embed_with_background(seq: str, target_len: int) -> str: else: core = sample_sequence_from_pwm(rng, matrix) full_seq = _embed_with_background(core, target_len) - candidates.append( - ( - full_seq, - score_sequence(core, matrix, log_odds=log_odds, background=motif.background), - ) - ) + candidates.append((full_seq, core)) if time_limited: log.warning( "PWM sampling hit max_seconds for motif %s: generated=%d requested=%d", @@ -405,11 +782,26 @@ def _embed_with_background(seq: str, target_len: int) -> str: length_obs = "-" 
if lengths: length_obs = f"{min(lengths)}..{max(lengths)}" if min(lengths) != max(lengths) else str(lengths[0]) - return _select( - candidates, + if scoring_backend == "densegen": + scored = [ + (full_seq, score_sequence(core, matrix, log_odds=log_odds, background=motif.background)) + for full_seq, core in candidates + ] + selected = _select( + scored, + length_obs=length_obs, + cap_applied=cap_applied, + requested=requested_candidates, + generated=len(candidates), + time_limited=time_limited, + ) + return (selected, {}) if return_metadata else selected + selected, meta = _score_with_fimo( + [full_seq for full_seq, _core in candidates], length_obs=length_obs, cap_applied=cap_applied, requested=requested_candidates, generated=len(candidates), time_limited=time_limited, ) + return (selected, meta) if return_metadata else selected diff --git a/src/dnadesign/densegen/src/adapters/sources/sequence_library.py b/src/dnadesign/densegen/src/adapters/sources/sequence_library.py index a6875fa9..d17bbf04 100644 --- a/src/dnadesign/densegen/src/adapters/sources/sequence_library.py +++ b/src/dnadesign/densegen/src/adapters/sources/sequence_library.py @@ -50,7 +50,7 @@ def _load_table(self, path: Path, fmt: str) -> pd.DataFrame: return pq.read_table(path).to_pandas() raise ValueError(f"Unsupported sequence_library.format: {fmt}") - def load_data(self, *, rng=None): + def load_data(self, *, rng=None, outputs_root: Path | None = None): data_path = resolve_path(self.cfg_path, self.path) if not (data_path.exists() and data_path.is_file()): raise FileNotFoundError(f"Sequence library file not found. 
Looked here:\n - {data_path}") diff --git a/src/dnadesign/densegen/src/adapters/sources/usr_sequences.py b/src/dnadesign/densegen/src/adapters/sources/usr_sequences.py index 32c07cb3..66606d5c 100644 --- a/src/dnadesign/densegen/src/adapters/sources/usr_sequences.py +++ b/src/dnadesign/densegen/src/adapters/sources/usr_sequences.py @@ -26,7 +26,7 @@ class USRSequencesDataSource(BaseDataSource): root: str limit: Optional[int] = None - def load_data(self, *, rng=None): + def load_data(self, *, rng=None, outputs_root: Path | None = None): try: from dnadesign.usr.src.dataset import Dataset as USRDataset # type: ignore except Exception as e: # pragma: no cover - depends on optional USR install diff --git a/src/dnadesign/densegen/src/cli.py b/src/dnadesign/densegen/src/cli.py index 5253f2c8..fbc6e201 100644 --- a/src/dnadesign/densegen/src/cli.py +++ b/src/dnadesign/densegen/src/cli.py @@ -254,9 +254,16 @@ def _warn_pwm_sampling_configs(loaded, cfg_path: Path) -> None: sampling = getattr(inp, "sampling", None) if sampling is None: continue + scoring_backend = getattr(sampling, "scoring_backend", "densegen") n_sites = getattr(sampling, "n_sites", None) oversample = getattr(sampling, "oversample_factor", None) max_candidates = getattr(sampling, "max_candidates", None) + score_threshold = getattr(sampling, "score_threshold", None) + score_percentile = getattr(sampling, "score_percentile", None) + if scoring_backend == "fimo" and (score_threshold is not None or score_percentile is not None): + warnings.append( + f"{getattr(inp, 'name', src_type)}: scoring_backend=fimo ignores score_threshold/score_percentile." 
+ ) if isinstance(n_sites, int) and isinstance(oversample, int) and max_candidates is not None: requested = n_sites * oversample if requested > int(max_candidates): @@ -949,7 +956,11 @@ def describe( "motifs", "n_sites", "strategy", + "backend", "score", + "selection", + "bins", + "bgfile", "oversample", "max_candidates", "max_seconds", @@ -971,11 +982,25 @@ def describe( motif_label = f"{len(getattr(inp, 'paths', []) or [])} artifacts" else: motif_label = "from artifact" + backend = getattr(sampling, "scoring_backend", "densegen") score_label = "-" - if sampling.score_threshold is not None: + if backend == "fimo" and sampling.pvalue_threshold is not None: + comparator = ">=" if sampling.strategy == "background" else "<=" + score_label = f"pvalue{comparator}{sampling.pvalue_threshold}" + elif sampling.score_threshold is not None: score_label = f"threshold={sampling.score_threshold}" elif sampling.score_percentile is not None: score_label = f"percentile={sampling.score_percentile}" + selection_label = "-" if backend != "fimo" else (getattr(sampling, "selection_policy", None) or "-") + bins_label = "-" + if backend == "fimo": + bins_label = "canonical" + if getattr(sampling, "pvalue_bins", None) is not None: + bins_label = "custom" + bin_ids = getattr(sampling, "pvalue_bin_ids", None) + if bin_ids: + bins_label = f"{bins_label} pick={bin_ids}" + bgfile_label = getattr(sampling, "bgfile", None) or "-" length_label = str(sampling.length_policy) if sampling.length_policy == "range" and sampling.length_range is not None: length_label = f"range({sampling.length_range[0]}..{sampling.length_range[1]})" @@ -984,7 +1009,11 @@ def describe( motif_label, str(sampling.n_sites), str(sampling.strategy), + str(backend), score_label, + str(selection_label), + str(bins_label), + str(bgfile_label), str(sampling.oversample_factor), str(sampling.max_candidates) if sampling.max_candidates is not None else "-", str(sampling.max_seconds) if sampling.max_seconds is not None else "-", @@ 
-1146,6 +1175,9 @@ def run( raise typer.Exit(code=1) console.print(":tada: [bold green]Run complete[/].") + console.print("[bold]Next steps[/]:") + console.print(f" - dense summarize --library -c {cfg_path}") + console.print(f" - dense report -c {cfg_path}") # Auto-plot if configured if not no_plot and root.plots: diff --git a/src/dnadesign/densegen/src/config/__init__.py b/src/dnadesign/densegen/src/config/__init__.py index 192d2f6b..beed783f 100644 --- a/src/dnadesign/densegen/src/config/__init__.py +++ b/src/dnadesign/densegen/src/config/__init__.py @@ -21,6 +21,8 @@ from pydantic import BaseModel, ConfigDict, Field, ValidationError, field_validator, model_validator from typing_extensions import Literal +from ..core.pvalue_bins import CANONICAL_PVALUE_BINS + # ---- Strict YAML loader (duplicate keys fail) ---- class _StrictLoader(yaml.SafeLoader): @@ -161,6 +163,14 @@ class PWMSamplingConfig(BaseModel): max_seconds: Optional[float] = None score_threshold: Optional[float] = None score_percentile: Optional[float] = None + scoring_backend: Literal["densegen", "fimo"] = "densegen" + pvalue_threshold: Optional[float] = None + pvalue_bins: Optional[List[float]] = None + pvalue_bin_ids: Optional[List[int]] = None + bgfile: Optional[str] = None + selection_policy: Literal["random_uniform", "top_n", "stratified"] = "random_uniform" + keep_all_candidates_debug: bool = False + include_matched_sequence: bool = False length_policy: Literal["exact", "range"] = "exact" length_range: Optional[tuple[int, int]] = None trim_window_length: Optional[int] = None @@ -219,15 +229,76 @@ def _trim_length_ok(cls, v: Optional[int]): raise ValueError("pwm.sampling.trim_window_length must be a positive integer") return v + @field_validator("bgfile") + @classmethod + def _bgfile_ok(cls, v: Optional[str]): + if v is None: + return v + if not str(v).strip(): + raise ValueError("pwm.sampling.bgfile must be a non-empty string when set") + return str(v).strip() + + @field_validator("pvalue_bins") 
+ @classmethod + def _pvalue_bins_ok(cls, v: Optional[List[float]]): + if v is None: + return v + if not v: + raise ValueError("pwm.sampling.pvalue_bins must be non-empty when set") + bins = [float(x) for x in v] + prev = 0.0 + for val in bins: + if not (0.0 < val <= 1.0): + raise ValueError("pwm.sampling.pvalue_bins values must be in (0, 1]") + if val <= prev: + raise ValueError("pwm.sampling.pvalue_bins must be strictly increasing") + prev = val + if abs(bins[-1] - 1.0) > 1e-12: + raise ValueError("pwm.sampling.pvalue_bins must end with 1.0") + return bins + + @field_validator("pvalue_bin_ids") + @classmethod + def _pvalue_bin_ids_ok(cls, v: Optional[List[int]]): + if v is None: + return v + if not v: + raise ValueError("pwm.sampling.pvalue_bin_ids must be non-empty when set") + ids = [int(x) for x in v] + if any(idx < 0 for idx in ids): + raise ValueError("pwm.sampling.pvalue_bin_ids values must be >= 0") + if len(set(ids)) != len(ids): + raise ValueError("pwm.sampling.pvalue_bin_ids must be unique") + return ids + @model_validator(mode="after") def _score_mode(self): has_thresh = self.score_threshold is not None has_pct = self.score_percentile is not None - if has_thresh == has_pct: - raise ValueError("pwm.sampling must set exactly one of score_threshold or score_percentile") + if self.scoring_backend == "densegen": + if has_thresh == has_pct: + raise ValueError("pwm.sampling must set exactly one of score_threshold or score_percentile") + if self.pvalue_threshold is not None: + raise ValueError("pwm.sampling.pvalue_threshold is only valid when scoring_backend='fimo'") + if self.pvalue_bins is not None: + raise ValueError("pwm.sampling.pvalue_bins is only valid when scoring_backend='fimo'") + if self.pvalue_bin_ids is not None: + raise ValueError("pwm.sampling.pvalue_bin_ids is only valid when scoring_backend='fimo'") + if self.include_matched_sequence: + raise ValueError("pwm.sampling.include_matched_sequence is only valid when scoring_backend='fimo'") + else: 
+ if self.pvalue_threshold is None: + raise ValueError("pwm.sampling.pvalue_threshold is required when scoring_backend='fimo'") + if not (0.0 < float(self.pvalue_threshold) <= 1.0): + raise ValueError("pwm.sampling.pvalue_threshold must be between 0 and 1") + if self.pvalue_bin_ids is not None: + bins = list(self.pvalue_bins) if self.pvalue_bins is not None else list(CANONICAL_PVALUE_BINS) + max_idx = len(bins) - 1 + if any(idx > max_idx for idx in self.pvalue_bin_ids): + raise ValueError("pwm.sampling.pvalue_bin_ids contains an index outside the available bins") if self.strategy == "consensus" and int(self.n_sites) != 1: raise ValueError("pwm.sampling.strategy=consensus requires n_sites=1") - if self.score_percentile is not None: + if self.scoring_backend == "densegen" and self.score_percentile is not None: if not (0.0 < float(self.score_percentile) < 100.0): raise ValueError("pwm.sampling.score_percentile must be between 0 and 100") if self.length_policy == "exact" and self.length_range is not None: diff --git a/src/dnadesign/densegen/src/core/metadata.py b/src/dnadesign/densegen/src/core/metadata.py index 263bb1d0..cb8d8ca6 100644 --- a/src/dnadesign/densegen/src/core/metadata.py +++ b/src/dnadesign/densegen/src/core/metadata.py @@ -141,8 +141,16 @@ def build_metadata( "sampling_fraction": sampling_fraction, "sampling_fraction_pairs": sampling_fraction_pairs, "input_pwm_strategy": input_meta.get("input_pwm_strategy"), + "input_pwm_scoring_backend": input_meta.get("input_pwm_scoring_backend"), "input_pwm_score_threshold": input_meta.get("input_pwm_score_threshold"), "input_pwm_score_percentile": input_meta.get("input_pwm_score_percentile"), + "input_pwm_pvalue_threshold": input_meta.get("input_pwm_pvalue_threshold"), + "input_pwm_pvalue_bins": input_meta.get("input_pwm_pvalue_bins"), + "input_pwm_pvalue_bin_ids": input_meta.get("input_pwm_pvalue_bin_ids"), + "input_pwm_selection_policy": input_meta.get("input_pwm_selection_policy"), + "input_pwm_bgfile": 
input_meta.get("input_pwm_bgfile"), + "input_pwm_keep_all_candidates_debug": input_meta.get("input_pwm_keep_all_candidates_debug"), + "input_pwm_include_matched_sequence": input_meta.get("input_pwm_include_matched_sequence"), "input_pwm_n_sites": input_meta.get("input_pwm_n_sites"), "input_pwm_oversample_factor": input_meta.get("input_pwm_oversample_factor"), "fixed_elements": fixed_elements_dump, diff --git a/src/dnadesign/densegen/src/core/metadata_schema.py b/src/dnadesign/densegen/src/core/metadata_schema.py index db26d5d8..70379bd3 100644 --- a/src/dnadesign/densegen/src/core/metadata_schema.py +++ b/src/dnadesign/densegen/src/core/metadata_schema.py @@ -93,8 +93,16 @@ class MetaField: allow_none=True, ), MetaField("input_pwm_strategy", (str,), "PWM sampling strategy.", allow_none=True), + MetaField("input_pwm_scoring_backend", (str,), "PWM scoring backend (densegen|fimo).", allow_none=True), MetaField("input_pwm_score_threshold", (numbers.Real,), "PWM score threshold.", allow_none=True), MetaField("input_pwm_score_percentile", (numbers.Real,), "PWM score percentile.", allow_none=True), + MetaField("input_pwm_pvalue_threshold", (numbers.Real,), "PWM p-value threshold (FIMO).", allow_none=True), + MetaField("input_pwm_pvalue_bins", (list,), "PWM p-value bins (FIMO).", allow_none=True), + MetaField("input_pwm_pvalue_bin_ids", (list,), "Selected p-value bin indices (FIMO).", allow_none=True), + MetaField("input_pwm_selection_policy", (str,), "PWM selection policy (FIMO).", allow_none=True), + MetaField("input_pwm_bgfile", (str,), "PWM background model path (FIMO).", allow_none=True), + MetaField("input_pwm_keep_all_candidates_debug", (bool,), "PWM FIMO debug TSV enabled.", allow_none=True), + MetaField("input_pwm_include_matched_sequence", (bool,), "PWM matched-sequence capture.", allow_none=True), MetaField("input_pwm_n_sites", (int,), "PWM sampling n_sites.", allow_none=True), MetaField("input_pwm_oversample_factor", (int,), "PWM sampling oversample factor.", 
allow_none=True),
     MetaField("fixed_elements", (dict,), "Fixed-element constraints (promoters + side biases)."),
@@ -198,6 +206,24 @@ def _validate_list_fields(meta: Mapping[str, Any]) -> None:
                 raise TypeError("Metadata field 'used_tf_counts' must contain dict entries")
             if "tf" not in item or "count" not in item:
                 raise ValueError("used_tf_counts entries must include 'tf' and 'count'")
             if not isinstance(item["tf"], str):
                 raise TypeError("used_tf_counts.tf must be a string")
             if not isinstance(item["count"], int):
                 raise TypeError("used_tf_counts.count must be an int")
+
+    if "input_pwm_pvalue_bins" in meta:
+        vals = meta["input_pwm_pvalue_bins"]
+        if vals is not None:
+            if isinstance(vals, (str, bytes)) or not isinstance(vals, Sequence):
+                raise TypeError("Metadata field 'input_pwm_pvalue_bins' must be a list of numbers")
+            for item in vals:
+                if not isinstance(item, numbers.Real):
+                    raise TypeError("Metadata field 'input_pwm_pvalue_bins' must contain only numbers")
+
+    if "input_pwm_pvalue_bin_ids" in meta:
+        vals = meta["input_pwm_pvalue_bin_ids"]
+        if vals is not None:
+            if isinstance(vals, (str, bytes)) or not isinstance(vals, Sequence):
+                raise TypeError("Metadata field 'input_pwm_pvalue_bin_ids' must be a list of integers")
+            for item in vals:
+                if not isinstance(item, int):
+                    raise TypeError("Metadata field 'input_pwm_pvalue_bin_ids' must contain only integers")
diff --git a/src/dnadesign/densegen/src/core/pipeline.py b/src/dnadesign/densegen/src/core/pipeline.py
index 05b4e8da..dea38e94 100644
--- a/src/dnadesign/densegen/src/core/pipeline.py
+++ b/src/dnadesign/densegen/src/core/pipeline.py
@@ -44,6 +44,7 @@
 )
 from .metadata import build_metadata
 from .postprocess import random_fill
+from .pvalue_bins import resolve_pvalue_bins
 from .run_manifest import PlanManifest, RunManifest
 from .run_paths import (
     ensure_run_meta_dir,
@@ -164,6 +165,18 @@ def _sampling_attr(sampling, name: str, default=None):
     return default
 
 
+def _resolve_pvalue_bins_meta(sampling) -> list[float] | None:
+    if sampling is None:
+        return None
+    
backend = str(_sampling_attr(sampling, "scoring_backend") or "densegen").lower() + bins = _sampling_attr(sampling, "pvalue_bins") + if backend == "fimo": + return resolve_pvalue_bins(bins) + if bins is None: + return None + return [float(v) for v in bins] + + def _extract_pwm_sampling_config(source_cfg) -> dict | None: sampling = getattr(source_cfg, "sampling", None) if sampling is None: @@ -190,6 +203,7 @@ def _extract_pwm_sampling_config(source_cfg) -> dict | None: length_range = list(length_range) return { "strategy": _sampling_attr(sampling, "strategy"), + "scoring_backend": _sampling_attr(sampling, "scoring_backend"), "n_sites": _sampling_attr(sampling, "n_sites"), "oversample_factor": _sampling_attr(sampling, "oversample_factor"), "max_candidates": _sampling_attr(sampling, "max_candidates"), @@ -199,6 +213,11 @@ def _extract_pwm_sampling_config(source_cfg) -> dict | None: "capped": capped, "score_threshold": _sampling_attr(sampling, "score_threshold"), "score_percentile": _sampling_attr(sampling, "score_percentile"), + "pvalue_threshold": _sampling_attr(sampling, "pvalue_threshold"), + "pvalue_bins": _resolve_pvalue_bins_meta(sampling), + "selection_policy": _sampling_attr(sampling, "selection_policy"), + "bgfile": _sampling_attr(sampling, "bgfile"), + "keep_all_candidates_debug": _sampling_attr(sampling, "keep_all_candidates_debug"), "length_policy": _sampling_attr(sampling, "length_policy"), "length_range": length_range, } @@ -452,8 +471,16 @@ def _input_metadata(source_cfg, cfg_path: Path) -> dict: sampling = getattr(source_cfg, "sampling", None) if sampling is not None: meta["input_pwm_strategy"] = getattr(sampling, "strategy", None) + meta["input_pwm_scoring_backend"] = getattr(sampling, "scoring_backend", None) meta["input_pwm_score_threshold"] = getattr(sampling, "score_threshold", None) meta["input_pwm_score_percentile"] = getattr(sampling, "score_percentile", None) + meta["input_pwm_pvalue_threshold"] = getattr(sampling, "pvalue_threshold", None) + 
meta["input_pwm_pvalue_bins"] = _resolve_pvalue_bins_meta(sampling) + meta["input_pwm_pvalue_bin_ids"] = getattr(sampling, "pvalue_bin_ids", None) + meta["input_pwm_selection_policy"] = getattr(sampling, "selection_policy", None) + meta["input_pwm_bgfile"] = getattr(sampling, "bgfile", None) + meta["input_pwm_keep_all_candidates_debug"] = getattr(sampling, "keep_all_candidates_debug", None) + meta["input_pwm_include_matched_sequence"] = getattr(sampling, "include_matched_sequence", None) meta["input_pwm_n_sites"] = getattr(sampling, "n_sites", None) meta["input_pwm_oversample_factor"] = getattr(sampling, "oversample_factor", None) meta["input_pwm_max_candidates"] = getattr(sampling, "max_candidates", None) @@ -1012,18 +1039,26 @@ def _load_failure_counts_from_attempts( def _load_existing_library_index(outputs_root: Path) -> int: attempts_path = outputs_root / "attempts.parquet" - if not attempts_path.exists(): - return 0 - try: - df = pd.read_parquet(attempts_path, columns=["sampling_library_index"]) - except Exception: - return 0 - if df.empty or "sampling_library_index" not in df.columns: - return 0 - try: - return int(pd.to_numeric(df["sampling_library_index"], errors="coerce").dropna().max() or 0) - except Exception: + paths: list[Path] = [] + if attempts_path.exists(): + paths.append(attempts_path) + paths.extend(sorted(outputs_root.glob("attempts_part-*.parquet"))) + if not paths: return 0 + max_idx = 0 + for path in paths: + try: + df = pd.read_parquet(path, columns=["sampling_library_index"]) + except Exception: + continue + if df.empty or "sampling_library_index" not in df.columns: + continue + try: + current = int(pd.to_numeric(df["sampling_library_index"], errors="coerce").dropna().max() or 0) + except Exception: + continue + max_idx = max(max_idx, current) + return max_idx def _append_attempt( @@ -1289,7 +1324,7 @@ def _process_plan_for_source( # Load source src_obj = deps.source_factory(source_cfg, cfg_path) - data_entries, meta_df = 
src_obj.load_data(rng=np_rng) + data_entries, meta_df = src_obj.load_data(rng=np_rng, outputs_root=outputs_root) input_meta = _input_metadata(source_cfg, cfg_path) input_tf_tfbs_pair_count: int | None = None if meta_df is not None and isinstance(meta_df, pd.DataFrame): @@ -1313,6 +1348,17 @@ def _process_plan_for_source( "sampling_fraction_pairs": None, } ) + pair_label = str(input_tf_tfbs_pair_count) if input_tf_tfbs_pair_count is not None else "-" + log.info( + "[%s/%s] Input summary: mode=%s rows=%d tfs=%d tfbs=%d pairs=%s", + source_label, + plan_name, + input_meta.get("input_mode"), + input_row_count, + input_tf_count, + input_tfbs_count, + pair_label, + ) source_type = getattr(source_cfg, "type", None) if source_type in PWM_INPUT_TYPES and meta_df is not None and "tf" in meta_df.columns: input_meta["input_pwm_ids"] = sorted(set(meta_df["tf"].tolist())) @@ -1325,15 +1371,27 @@ def _process_plan_for_source( max_seconds = _sampling_attr(input_sampling_cfg, "max_seconds") score_threshold = _sampling_attr(input_sampling_cfg, "score_threshold") score_percentile = _sampling_attr(input_sampling_cfg, "score_percentile") + scoring_backend = _sampling_attr(input_sampling_cfg, "scoring_backend") or "densegen" + pvalue_threshold = _sampling_attr(input_sampling_cfg, "pvalue_threshold") + selection_policy = _sampling_attr(input_sampling_cfg, "selection_policy") length_policy = _sampling_attr(input_sampling_cfg, "length_policy") length_range = _sampling_attr(input_sampling_cfg, "length_range") if length_range is not None: length_range = list(length_range) score_label = "-" - if score_threshold is not None: + if scoring_backend == "fimo" and pvalue_threshold is not None: + comparator = ">=" if str(strategy) == "background" else "<=" + score_label = f"pvalue{comparator}{pvalue_threshold}" + elif score_threshold is not None: score_label = f"threshold={score_threshold}" elif score_percentile is not None: score_label = f"percentile={score_percentile}" + bins_label = "-" + if 
scoring_backend == "fimo": + bins_label = "canonical" if _sampling_attr(input_sampling_cfg, "pvalue_bins") is None else "custom" + bin_ids = _sampling_attr(input_sampling_cfg, "pvalue_bin_ids") + if bin_ids: + bins_label = f"{bins_label} pick={sorted(list(bin_ids))}" length_label = str(length_policy) if length_policy == "range" and length_range: length_label = f"{length_policy}({length_range[0]}..{length_range[1]})" @@ -1345,14 +1403,18 @@ def _process_plan_for_source( if max_seconds is not None: cap_label = f"{cap_label}; max_seconds={max_seconds}" if cap_label != "-" else f"{max_seconds}s" counts_label = _summarize_tf_counts(meta_df["tf"].tolist()) + selection_label = selection_policy if scoring_backend == "fimo" else "-" log.info( - "PWM input sampling for %s: motifs=%d | sites=%s | strategy=%s | score=%s | " - "oversample=%s | max_candidates=%s | length=%s", + "PWM input sampling for %s: motifs=%d | sites=%s | strategy=%s | backend=%s | score=%s | " + "selection=%s | bins=%s | oversample=%s | max_candidates=%s | length=%s", source_label, len(input_meta.get("input_pwm_ids") or []), counts_label or "-", strategy, + scoring_backend, score_label, + selection_label, + bins_label, oversample, cap_label, length_label, @@ -1702,26 +1764,34 @@ def _record_site_failures(reason: str) -> None: input_meta["sampling_fraction_pairs"] = sampling_fraction_pairs # Library summary (succinct) tf_summary = _summarize_tf_counts(regulator_labels) + library_index = sampling_info.get("library_index") + strategy_label = sampling_info.get("library_sampling_strategy", library_sampling_strategy) + pool_label = sampling_info.get("pool_strategy") + target_len = sampling_info.get("target_length") + achieved_len = sampling_info.get("achieved_length") + header = f"Stage B library for {source_label}/{plan_name}" + if library_index is not None: + header = f"{header} (build {library_index})" if tf_summary: log.info( - "Library for %s/%s: %d motifs | TF counts: %s | target=%d achieved=%d pool=%s", 
- source_label, - plan_name, + "%s: %d motifs | TF counts: %s | target=%s achieved=%s pool=%s sampling=%s", + header, len(library_for_opt), tf_summary, - sampling_info.get("target_length"), - sampling_info.get("achieved_length"), - sampling_info.get("pool_strategy"), + target_len, + achieved_len, + pool_label, + strategy_label, ) else: log.info( - "Library for %s/%s: %d motifs | target=%d achieved=%d pool=%s", - source_label, - plan_name, + "%s: %d motifs | target=%s achieved=%s pool=%s sampling=%s", + header, len(library_for_opt), - sampling_info.get("target_length"), - sampling_info.get("achieved_length"), - sampling_info.get("pool_strategy"), + target_len, + achieved_len, + pool_label, + strategy_label, ) solver_min_counts: dict[str, int] | None = None diff --git a/src/dnadesign/densegen/src/core/pvalue_bins.py b/src/dnadesign/densegen/src/core/pvalue_bins.py new file mode 100644 index 00000000..69084a56 --- /dev/null +++ b/src/dnadesign/densegen/src/core/pvalue_bins.py @@ -0,0 +1,32 @@ +""" +-------------------------------------------------------------------------------- + +dnadesign/densegen/core/pvalue_bins.py + +Canonical p-value bin edges for FIMO-based PWM sampling. + +Module Author(s): Eric J. South +Dunlop Lab +-------------------------------------------------------------------------------- +""" + +from __future__ import annotations + +from typing import Sequence + +CANONICAL_PVALUE_BINS: tuple[float, ...] = ( + 1e-10, + 1e-8, + 1e-6, + 1e-4, + 1e-3, + 1e-2, + 1e-1, + 1.0, +) + + +def resolve_pvalue_bins(pvalue_bins: Sequence[float] | None) -> list[float]: + if pvalue_bins is None: + return list(CANONICAL_PVALUE_BINS) + return [float(v) for v in pvalue_bins] diff --git a/src/dnadesign/densegen/src/integrations/__init__.py b/src/dnadesign/densegen/src/integrations/__init__.py new file mode 100644 index 00000000..d3759fd2 --- /dev/null +++ b/src/dnadesign/densegen/src/integrations/__init__.py @@ -0,0 +1,3 @@ +""" +DenseGen external tool integrations. 
+""" diff --git a/src/dnadesign/densegen/src/integrations/meme_suite.py b/src/dnadesign/densegen/src/integrations/meme_suite.py new file mode 100644 index 00000000..9abdb34c --- /dev/null +++ b/src/dnadesign/densegen/src/integrations/meme_suite.py @@ -0,0 +1,41 @@ +""" +-------------------------------------------------------------------------------- + +dnadesign/densegen/integrations/meme_suite.py + +Lightweight MEME Suite tool resolution for DenseGen. + +Module Author(s): Eric J. South +Dunlop Lab +-------------------------------------------------------------------------------- +""" + +from __future__ import annotations + +import os +import shutil +from pathlib import Path + + +def resolve_executable(tool: str, *, tool_path: Path | None = None) -> Path | None: + if tool_path is not None: + resolved = tool_path.expanduser() + if resolved.is_dir(): + candidate = resolved / tool + else: + candidate = resolved + if candidate.name != tool: + raise FileNotFoundError( + f"Configured tool_path points to '{candidate.name}', expected '{tool}'. " + "Provide a bin directory or the correct executable." 
+ ) + if candidate.exists(): + return candidate + raise FileNotFoundError(f"Configured tool_path does not contain '{tool}': {candidate}") + env_dir = os.getenv("MEME_BIN") + if env_dir: + candidate = Path(env_dir).expanduser() / tool + if candidate.exists(): + return candidate + found = shutil.which(tool) + return Path(found) if found else None diff --git a/src/dnadesign/densegen/tests/test_cli_summarize_library.py b/src/dnadesign/densegen/tests/test_cli_summarize_library.py index e5ddc45b..49618288 100644 --- a/src/dnadesign/densegen/tests/test_cli_summarize_library.py +++ b/src/dnadesign/densegen/tests/test_cli_summarize_library.py @@ -59,8 +59,16 @@ def _base_meta(library_hash: str, library_index: int) -> dict: "sampling_fraction": 0.5, "sampling_fraction_pairs": 0.5, "input_pwm_strategy": None, + "input_pwm_scoring_backend": None, "input_pwm_score_threshold": None, "input_pwm_score_percentile": None, + "input_pwm_pvalue_threshold": None, + "input_pwm_pvalue_bins": None, + "input_pwm_pvalue_bin_ids": None, + "input_pwm_selection_policy": None, + "input_pwm_bgfile": None, + "input_pwm_keep_all_candidates_debug": None, + "input_pwm_include_matched_sequence": None, "input_pwm_n_sites": None, "input_pwm_oversample_factor": None, "fixed_elements": {"promoter_constraints": [], "side_biases": {"left": [], "right": []}}, diff --git a/src/dnadesign/densegen/tests/test_outputs_parquet.py b/src/dnadesign/densegen/tests/test_outputs_parquet.py index 502591af..606bd03f 100644 --- a/src/dnadesign/densegen/tests/test_outputs_parquet.py +++ b/src/dnadesign/densegen/tests/test_outputs_parquet.py @@ -54,8 +54,16 @@ def _dummy_meta() -> dict: "sampling_fraction": None, "sampling_fraction_pairs": 0.5, "input_pwm_strategy": None, + "input_pwm_scoring_backend": None, "input_pwm_score_threshold": None, "input_pwm_score_percentile": None, + "input_pwm_pvalue_threshold": None, + "input_pwm_pvalue_bins": None, + "input_pwm_pvalue_bin_ids": None, + "input_pwm_selection_policy": None, + 
"input_pwm_bgfile": None, + "input_pwm_keep_all_candidates_debug": None, + "input_pwm_include_matched_sequence": None, "input_pwm_n_sites": None, "input_pwm_oversample_factor": None, "fixed_elements": {"promoter_constraints": [], "side_biases": {"left": [], "right": []}}, diff --git a/src/dnadesign/densegen/tests/test_pipeline_library_index.py b/src/dnadesign/densegen/tests/test_pipeline_library_index.py new file mode 100644 index 00000000..97816a86 --- /dev/null +++ b/src/dnadesign/densegen/tests/test_pipeline_library_index.py @@ -0,0 +1,15 @@ +from __future__ import annotations + +from pathlib import Path + +import pandas as pd + +from dnadesign.densegen.src.core.pipeline import _load_existing_library_index + + +def test_load_existing_library_index_reads_parts(tmp_path: Path) -> None: + outputs = tmp_path + df = pd.DataFrame({"sampling_library_index": [1, 2, 5]}) + part = outputs / "attempts_part-000.parquet" + df.to_parquet(part) + assert _load_existing_library_index(outputs) == 5 diff --git a/src/dnadesign/densegen/tests/test_pwm_fimo_utils.py b/src/dnadesign/densegen/tests/test_pwm_fimo_utils.py new file mode 100644 index 00000000..95c2aa8b --- /dev/null +++ b/src/dnadesign/densegen/tests/test_pwm_fimo_utils.py @@ -0,0 +1,93 @@ +from __future__ import annotations + +from pathlib import Path + +import pytest + +from dnadesign.densegen.src.adapters.sources.pwm_fimo import ( + aggregate_best_hits, + build_candidate_records, + parse_fimo_tsv, + run_fimo, + write_candidates_fasta, + write_minimal_meme_motif, +) +from dnadesign.densegen.src.adapters.sources.pwm_sampling import PWMMotif +from dnadesign.densegen.src.integrations.meme_suite import resolve_executable + + +def test_write_minimal_meme_motif(tmp_path: Path) -> None: + motif = PWMMotif( + motif_id="M1", + matrix=[ + {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}, + {"A": 0.2, "C": 0.3, "G": 0.4, "T": 0.1}, + ], + background={"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}, + ) + out = tmp_path / "motif.meme" + 
motif_id = write_minimal_meme_motif(motif, out) + text = out.read_text() + assert "MEME version" in text + assert "Background letter frequencies" in text + assert f"MOTIF {motif_id}" in text + lines = [ln for ln in text.splitlines() if ln.strip()] + idx = next(i for i, ln in enumerate(lines) if ln.startswith("letter-probability matrix")) + matrix_lines = lines[idx + 1 : idx + 1 + len(motif.matrix)] + assert len(matrix_lines) == len(motif.matrix) + for row in matrix_lines: + vals = [float(x) for x in row.split()] + assert abs(sum(vals) - 1.0) < 1e-6 + + +def test_write_candidates_fasta(tmp_path: Path) -> None: + records = build_candidate_records("My Motif", ["ACG", "TTT"]) + out = tmp_path / "candidates.fasta" + write_candidates_fasta(records, out) + lines = out.read_text().splitlines() + assert lines[0].startswith(">") + assert lines[1] == "ACG" + assert lines[2].startswith(">") + assert lines[3] == "TTT" + assert records[0][0].endswith("|cand0") + assert records[1][0].endswith("|cand1") + + +def test_parse_fimo_tsv_and_best_hits() -> None: + tsv = "\n".join( + [ + "motif_id\tmotif_alt_id\tsequence_name\tstart\tstop\tstrand\tscore\tp-value\tq-value\tmatched_sequence", + "M1\t.\tcand0\t2\t4\t+\t5.2\t1e-4\t0.01\tACG", + "M1\t.\tcand0\t1\t3\t-\t4.0\t1e-3\t0.1\tTGC", + "M1\t.\tcand1\t1\t3\t+\t2.0\t0.5\t1.0\tAAA", + ] + ) + rows = parse_fimo_tsv(tsv) + best = aggregate_best_hits(rows) + assert best["cand0"].pvalue == pytest.approx(1e-4) + assert best["cand0"].score == pytest.approx(5.2) + assert best["cand0"].matched_sequence == "ACG" + assert best["cand1"].pvalue == pytest.approx(0.5) + + +@pytest.mark.skipif(resolve_executable("fimo", tool_path=None) is None, reason="fimo executable not available") +def test_run_fimo_smoke(tmp_path: Path) -> None: + motif = PWMMotif( + motif_id="M1", + matrix=[ + {"A": 0.8, "C": 0.1, "G": 0.05, "T": 0.05}, + {"A": 0.8, "C": 0.1, "G": 0.05, "T": 0.05}, + {"A": 0.8, "C": 0.1, "G": 0.05, "T": 0.05}, + ], + background={"A": 0.25, "C": 
0.25, "G": 0.25, "T": 0.25}, + ) + meme_path = tmp_path / "motif.meme" + fasta_path = tmp_path / "candidates.fasta" + write_minimal_meme_motif(motif, meme_path) + records = build_candidate_records("M1", ["AAA", "CCC"]) + write_candidates_fasta(records, fasta_path) + rows, _raw = run_fimo(meme_motif_path=meme_path, fasta_path=fasta_path, thresh=1.0) + assert rows + for row in rows: + pval = float(row["p_value"]) + assert 0.0 <= pval <= 1.0 diff --git a/src/dnadesign/densegen/tests/test_pwm_sampling_bins.py b/src/dnadesign/densegen/tests/test_pwm_sampling_bins.py new file mode 100644 index 00000000..c20bbeb9 --- /dev/null +++ b/src/dnadesign/densegen/tests/test_pwm_sampling_bins.py @@ -0,0 +1,74 @@ +from __future__ import annotations + +import numpy as np + +from dnadesign.densegen.src.adapters.sources.pwm_sampling import ( + FimoCandidate, + _assign_pvalue_bin, + _stratified_sample, +) + + +def test_assign_pvalue_bin_edges() -> None: + edges = [1e-4, 1e-2, 1.0] + assert _assign_pvalue_bin(1e-4, edges) == (0, 0.0, 1e-4) + assert _assign_pvalue_bin(5e-4, edges) == (1, 1e-4, 1e-2) + assert _assign_pvalue_bin(0.5, edges) == (2, 1e-2, 1.0) + + +def test_stratified_sample_balances_bins() -> None: + rng = np.random.default_rng(0) + candidates = [ + FimoCandidate( + seq="AAAA", + pvalue=1e-6, + score=10.0, + bin_id=0, + bin_low=0.0, + bin_high=1e-4, + start=0, + stop=3, + strand="+", + matched_sequence=None, + ), + FimoCandidate( + seq="AAAT", + pvalue=5e-6, + score=9.0, + bin_id=0, + bin_low=0.0, + bin_high=1e-4, + start=0, + stop=3, + strand="+", + matched_sequence=None, + ), + FimoCandidate( + seq="TTTT", + pvalue=5e-3, + score=6.0, + bin_id=1, + bin_low=1e-4, + bin_high=1e-2, + start=0, + stop=3, + strand="+", + matched_sequence=None, + ), + FimoCandidate( + seq="TTTA", + pvalue=8e-3, + score=5.0, + bin_id=1, + bin_low=1e-4, + bin_high=1e-2, + start=0, + stop=3, + strand="+", + matched_sequence=None, + ), + ] + + picked = _stratified_sample(candidates, n_sites=3, 
rng=rng, n_bins=2) + assert len(picked) == 3 + assert {int(c.bin_id) for c in picked} == {0, 1} From d803d834f980113a32b77fc73ace25611e391594 Mon Sep 17 00:00:00 2001 From: Eric South Date: Tue, 20 Jan 2026 09:44:16 -0500 Subject: [PATCH 04/40] docs(densegen): update stratified FIMO demo and mining workflow --- src/dnadesign/densegen/README.md | 5 +- .../densegen/docs/demo/demo_basic.md | 53 ++++++++++++-- src/dnadesign/densegen/docs/guide/inputs.md | 69 +++++++++++++++++-- .../densegen/docs/reference/config.md | 17 ++++- .../workspaces/demo_meme_two_tf/config.yaml | 9 +-- 5 files changed, 134 insertions(+), 19 deletions(-) diff --git a/src/dnadesign/densegen/README.md b/src/dnadesign/densegen/README.md index b3c511e7..be196836 100644 --- a/src/dnadesign/densegen/README.md +++ b/src/dnadesign/densegen/README.md @@ -17,11 +17,14 @@ Prerequisites include Python, dense-arrays, and a MILP solver. CBC is open-sourc Use the canonical demo config (small, Parquet-only). The demo uses MEME motif files copied from the Cruncher basic demo workspace (`inputs/local_motifs`) and parsed with Cruncher’s MEME parser for DRY, consistent parsing. +FIMO-backed PWM sampling is supported when MEME Suite is available (`fimo` on PATH via `pixi run`). +Stratified FIMO sampling uses canonical p‑value bins by default; see the guide for mining workflows. 
```bash uv run dense validate -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml uv run dense describe -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml -uv run dense run -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml --no-plot +pixi run dense run -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml --no-plot +uv run dense summarize -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml --library --top-per-tf 5 uv run dense plot -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml --only tf_usage,tf_coverage ``` diff --git a/src/dnadesign/densegen/docs/demo/demo_basic.md b/src/dnadesign/densegen/docs/demo/demo_basic.md index f229f84d..b2904916 100644 --- a/src/dnadesign/densegen/docs/demo/demo_basic.md +++ b/src/dnadesign/densegen/docs/demo/demo_basic.md @@ -27,6 +27,10 @@ If you have not synced dependencies yet: uv sync --locked ``` +This demo uses **FIMO** (MEME Suite) to adjudicate strong motif matches. Ensure `fimo` is on PATH +or set `MEME_BIN` to the MEME bin directory. If you use pixi, run commands via +`pixi run dense ...` so MEME tools are available (recommended for the run step). + All commands below assume you are at the repo root. We will write the demo run to a scratch directory; set a run root: @@ -49,7 +53,9 @@ src/dnadesign/densegen/workspaces/demo_meme_two_tf/inputs/cpxR.txt ``` These are MEME files parsed with Cruncher’s MEME parser (DenseGen reuses the same parsing -logic for DRY). The demo uses LexA + CpxR motifs and exercises PWM sampling bounds. +logic for DRY). The demo uses LexA + CpxR motifs and exercises PWM sampling bounds. Sampling +uses FIMO p-values to define “strong” matches and `selection_policy: stratified` to balance +across canonical p‑value bins (see the input-stage sampling table in `dense describe`). 
### 1b) (Optional) Rebuild inputs from Cruncher @@ -113,7 +119,7 @@ Example output: ┏━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ name ┃ quota ┃ has promoter_constraints ┃ ┡━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩ -│ meme_demo │ 6 │ no │ +│ meme_demo │ 50 │ no │ └──────┴───────┴──────────────────────────┘ ``` @@ -149,16 +155,17 @@ Solver-stage library sampling ## 6) Run generation ```bash -uv run dense run -c /private/tmp/densegen-demo-20260115-1405/demo_press/config.yaml --no-plot +pixi run dense run -c /private/tmp/densegen-demo-20260115-1405/demo_press/config.yaml --no-plot ``` Example output (abridged): ```text 2026-01-15 14:02:02 | INFO | dnadesign.densegen.src.utils.logging_utils | Logging initialized (level=INFO) -Quota plan: meme_demo=6 +Quota plan: meme_demo=50 2026-01-15 14:02:02 | INFO | dnadesign.densegen.src.adapters.optimizer.dense_arrays | Solver selected: CBC -2026-01-15 14:02:05 | INFO | dnadesign.densegen.src.core.pipeline | [demo/demo] 5/5 (100.00%) (local 5/5) CR=1.050 | seq ATTGACAGTAAACCTGCGGGAAATATAATTTACTCCGTATTTGCACATGGTTATCCACAG +2026-01-15 14:02:05 | INFO | dnadesign.densegen.src.adapters.sources.pwm_sampling | FIMO yield for motif lexA: hits=960 accepted=120 selected=80 bins=(0e+00,1e-10]:0 ... selected_bins=(0e+00,1e-10]:0 ... +2026-01-15 14:02:06 | INFO | dnadesign.densegen.src.core.pipeline | [demo/demo] 2/50 (4.00%) (local 2/2) CR=1.050 | seq ATTGACAGTAAACCTGCGGGAAATATAATTTACTCCGTATTTGCACATGGTTATCCACAG 2026-01-15 14:02:05 | INFO | dnadesign.densegen.src.core.pipeline | Inputs manifest written: /private/tmp/densegen-demo-20260115-1405/demo_press/outputs/meta/inputs_manifest.json 🎉 Run complete. 
``` @@ -182,7 +189,7 @@ Run: demo_press Root: /private/tmp/densegen-demo-20260115-1405/demo_press Sche ┏━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┓ ┃ input ┃ plan ┃ generated ┃ duplica… ┃ failed ┃ resamples ┃ librari… ┃ stalls ┃ ┡━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩ -│ lexA_cpxR_meme │ meme_demo │ 6 │ 0 │ 0 │ 0 │ 3 │ 0 │ +│ lexA_cpxR_meme │ meme_demo │ 50 │ 0 │ 0 │ 0 │ 3 │ 0 │ └──────────────┴──────┴───────────┴──────────┴────────┴───────────┴──────────┴────────┘ ``` @@ -190,9 +197,12 @@ Use `--verbose` for constraint-failure breakdowns and duplicate-solution counts. Use `--library` to print offered-vs-used summaries for quick debugging: ```bash -uv run dense summarize --run /private/tmp/densegen-demo-20260115-1405/demo_press --library +uv run dense summarize --run /private/tmp/densegen-demo-20260115-1405/demo_press --library --top-per-tf 5 ``` +This library summary is the quickest way to audit which TFBS were offered vs +used in the solver stage (Stage‑B sampling). + If any solutions are rejected, DenseGen writes them to `outputs/attempts.parquet` in the run root. @@ -317,6 +327,7 @@ inputs: motif_ids: [lexA] sampling: strategy: background + scoring_backend: densegen n_sites: 200 oversample_factor: 5 score_percentile: 10 @@ -324,6 +335,34 @@ inputs: Swap `type` and `path` to `pwm_jaspar` or `pwm_matrix_csv` with the same `sampling` block. 
+For **strong match** sampling with FIMO p-values: + +```yaml +inputs: + - name: lexA_meme + type: pwm_meme + path: inputs/lexA.txt + motif_ids: [lexA] + sampling: + strategy: stochastic + scoring_backend: fimo + pvalue_threshold: 1e-4 + selection_policy: top_n + n_sites: 80 + oversample_factor: 10 +``` + +To mine specific affinity strata, add canonical p‑value bins and select bins by index: + +```yaml + sampling: + scoring_backend: fimo + pvalue_threshold: 1e-3 + selection_policy: stratified + pvalue_bins: [1e-6, 1e-4, 1e-3, 1e-2, 1e-1, 1.0] + pvalue_bin_ids: [1, 2] # (1e-6..1e-4] and (1e-4..1e-3] +``` + ### Add USR output USR is an optional I/O adapter. To write both Parquet and USR: diff --git a/src/dnadesign/densegen/docs/guide/inputs.md b/src/dnadesign/densegen/docs/guide/inputs.md index f5fda12e..093a7ba0 100644 --- a/src/dnadesign/densegen/docs/guide/inputs.md +++ b/src/dnadesign/densegen/docs/guide/inputs.md @@ -95,21 +95,47 @@ Use a MEME-format PWM file and explicitly sample binding sites. 
Required sampling fields: - `strategy`: `consensus | stochastic | background` - `n_sites`: number of binding sites to generate per motif -- `score_threshold` or `score_percentile` (exactly one) +- `scoring_backend`: `densegen | fimo` (default: `densegen`) +- `score_threshold` or `score_percentile` (exactly one; densegen backend only) +- `pvalue_threshold` (float in (0, 1]; fimo backend only) - `oversample_factor`: oversampling multiplier for candidate generation - `max_candidates` (optional): cap on candidate generation; helps bound long motifs - `max_seconds` (optional): time limit for candidate generation (best-effort cap) +- `selection_policy`: `random_uniform | top_n | stratified` (default: `random_uniform`; fimo only) +- `pvalue_bins` (optional): list of p‑value bin edges (strictly increasing; must end with `1.0`) +- `pvalue_bin_ids` (optional): list of bin indices to keep (0‑based, using `pvalue_bins`) +- `bgfile` (optional): MEME bfile-format background model for FIMO +- `keep_all_candidates_debug` (optional): write raw FIMO TSVs to `outputs/meta/fimo/` for inspection +- `include_matched_sequence` (optional): include `fimo_matched_sequence` column in the TFBS table Notes: -- Sampling scores use PWM log-odds with the motif background (from MEME when available). -- `score_threshold` / `score_percentile` controls similarity to the PWM consensus - (higher percentiles or thresholds yield stronger matches). +- `densegen` scoring uses PWM log-odds with the motif background (from MEME when available). +- `fimo` scoring scans the entire emitted TFBS and uses a model-based p-value threshold. + `pvalue_threshold` controls match strength (smaller values are stronger). +- `fimo` backend requires the `fimo` executable on PATH (run via pixi). +- If `bgfile` is omitted, FIMO uses the motif background (or uniform if none provided). +- `background` selects low-scoring sequences (<= threshold/percentile; or pvalue >= threshold for fimo). 
+- `selection_policy: stratified` uses fixed p‑value bins to balance strong/weak matches. +- Canonical p‑value bins (default): `[1e-10, 1e-8, 1e-6, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]`. + Bin 0 is `(0, 1e-10]`, bin 1 is `(1e-10, 1e-8]`, etc. + +#### FIMO p-values (beginner-friendly) +- A **p-value** is the probability that a random sequence (under the background model) + would score **at least as well** as the observed match. +- Smaller p-values mean **stronger** motif matches; larger p-values mean **weaker** matches. +- As a rule of thumb: `1e-4` is a strong match, `1e-3` is moderate, `1e-2` is weak. +- DenseGen accepts a candidate if its **best hit** within the emitted TFBS passes the threshold. +- For `strategy: background`, DenseGen keeps **weak** matches where `pvalue >= pvalue_threshold`. +- If you set `pvalue_bin_ids`, DenseGen only keeps candidates in those bins (useful for mining + specific affinity ranges). +- FIMO adds per‑TFBS metadata columns: `fimo_score`, `fimo_pvalue`, `fimo_start`, `fimo_stop`, + `fimo_strand`, `fimo_bin_id`, `fimo_bin_low`, `fimo_bin_high`, and (optionally) + `fimo_matched_sequence`. - `length_policy` defaults to `exact`. Use `length_policy: range` with `length_range: [min, max]` to sample variable lengths (min must be >= motif length). - `trim_window_length` optionally trims the PWM to a max‑information window before sampling (useful for long motifs when you want shorter cores); `trim_window_strategy` currently supports `max_info`. - `consensus` requires `n_sites: 1`. -- `background` selects low-scoring sequences from the PWM. 
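The bin semantics above (half-open intervals with an inclusive upper edge) can be sketched in a few lines of Python. This is an illustrative helper, not DenseGen's internal implementation; the name `assign_pvalue_bin` is hypothetical.

```python
# Canonical p-value bin edges, as documented above; bin i covers (low, high]
# with an exclusive lower edge, so bin 0 is (0, 1e-10].
CANONICAL_BINS = [1e-10, 1e-8, 1e-6, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]

def assign_pvalue_bin(pvalue: float, edges=CANONICAL_BINS):
    """Return (bin_id, low, high) for the first bin whose upper edge
    is >= pvalue. FIMO p-values lie in (0, 1], so the final edge (1.0)
    always catches anything the earlier bins miss."""
    low = 0.0
    for idx, high in enumerate(edges):
        if pvalue <= high:
            return idx, low, high
        low = high
    return len(edges) - 1, edges[-2], edges[-1]
```

For example, a p-value of `5e-5` lands in bin 3, the `(1e-6, 1e-4]` stratum described above.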
Example: @@ -130,6 +156,39 @@ inputs: length_range: [22, 28] ``` +FIMO-backed example: + +```yaml +inputs: + - name: lexA_meme + type: pwm_meme + path: inputs/lexA.txt + motif_ids: [lexA] + sampling: + strategy: stochastic + scoring_backend: fimo + pvalue_threshold: 1e-4 + selection_policy: top_n + n_sites: 80 + oversample_factor: 12 + max_candidates: 50000 + max_seconds: 5 +``` + +#### Mining workflow (p‑value strata) +If you want to **mine** sequences across affinity strata, use `selection_policy: stratified` plus +canonical p‑value bins. A typical workflow: + +1) Oversample candidates (`oversample_factor`, `max_candidates`) and score with FIMO. +2) Accept candidates using `pvalue_threshold` (global strength cutoff). +3) Use `pvalue_bin_ids` to select one or more bins (e.g., moderate matches only). +4) Repeat runs to accumulate a deduplicated reservoir of sequences per bin. +5) Use `dense summarize --library` to inspect which TFBS were offered vs used in Stage‑B sampling. + +DenseGen reports per‑bin yield summaries (hits, accepted, selected) for every FIMO run, so you can +track how many candidates land in each bin and adjust thresholds or oversampling accordingly. With +`selection_policy: stratified`, the selected‑bin counts show how evenly the final pool spans strata. 
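The stratified behavior described above — balancing the selected pool across bins — can be approximated with a round-robin draw over per-bin pools. A hedged sketch, not DenseGen's implementation; the function name `stratified_select` is hypothetical:

```python
import random

def stratified_select(candidates, n_sites, retain_bin_ids, seed=0):
    """Sketch of stratified selection: `candidates` is a list of
    (sequence, bin_id) pairs. Draw round-robin across retained bins so
    the final pool spans strata as evenly as the pools allow."""
    rng = random.Random(seed)
    pools = {b: [] for b in retain_bin_ids}
    for seq, bin_id in candidates:
        if bin_id in pools:  # drop candidates outside the retained bins
            pools[bin_id].append(seq)
    for pool in pools.values():
        rng.shuffle(pool)  # sample without replacement in random order
    selected = []
    while len(selected) < n_sites and any(pools.values()):
        for b in retain_bin_ids:
            if pools[b] and len(selected) < n_sites:
                selected.append(pools[b].pop())
    return selected
```

With two equally populated retained bins and an even quota, this yields an exact 50/50 split; when a bin runs dry, the remaining bins absorb the balance.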
+ --- ### PWM MEME set (`type: pwm_meme_set`) diff --git a/src/dnadesign/densegen/docs/reference/config.md b/src/dnadesign/densegen/docs/reference/config.md index a31f6ce6..9b7c938a 100644 --- a/src/dnadesign/densegen/docs/reference/config.md +++ b/src/dnadesign/densegen/docs/reference/config.md @@ -62,13 +62,26 @@ PWM inputs perform **input sampling** (sampling sites from PWMs) via - `oversample_factor` (int > 0) - `max_candidates` (optional int > 0; caps candidate generation) - `max_seconds` (optional float > 0; time limit for candidate generation) - - `score_threshold` or `score_percentile` (exactly one) + - `scoring_backend`: `densegen | fimo` (default: `densegen`) + - `score_threshold` or `score_percentile` (exactly one; **densegen** backend only) + - `pvalue_threshold` (float in (0, 1]; **fimo** backend only) + - `selection_policy`: `random_uniform | top_n | stratified` (default: `random_uniform`; fimo only) + - `pvalue_bins` (optional list of floats; must end with `1.0`) - p‑value bin edges for stratified sampling + - `pvalue_bin_ids` (optional list of ints) - select specific p‑value bins (0‑based indices) + - `bgfile` (optional path) - MEME bfile-format background model for FIMO + - `keep_all_candidates_debug` (bool, default false) - write raw FIMO TSVs to `outputs/meta/fimo/` for inspection + - `include_matched_sequence` (bool, default false) - include `fimo_matched_sequence` in TFBS outputs - `length_policy`: `exact | range` (default: `exact`) - `length_range`: `[min, max]` (required when `length_policy=range`; `min` >= motif length) - `trim_window_length` (optional int > 0; trims PWM to a max‑information window before sampling) - `trim_window_strategy`: `max_info` (window selection strategy) - `consensus` requires `n_sites: 1` - - `background` selects low-scoring sequences (<= threshold/percentile) + - `background` selects low-scoring sequences (<= threshold/percentile; or pvalue >= threshold for fimo) + - FIMO resolves `fimo` via `MEME_BIN` or PATH; 
pixi users should run `pixi run dense ...` so it is available. + - Canonical p‑value bins (default): `[1e-10, 1e-8, 1e-6, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]` + (bin 0 is `(0, 1e-10]`, bin 1 is `(1e-10, 1e-8]`, etc.) + - FIMO runs log per‑bin yield summaries (hits, accepted, selected); `selection_policy: stratified` + makes the selected‑bin distribution explicit for mining workflows. - `type: pwm_meme_set` - `paths` - list of MEME PWM files (merged into a single TF pool) - `motif_ids` (optional list) - choose motifs by ID across files diff --git a/src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml b/src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml index b261c1a3..e5794dd6 100644 --- a/src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml +++ b/src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml @@ -23,8 +23,9 @@ densegen: n_sites: 80 oversample_factor: 12 max_candidates: 50000 # bounded candidate generation - score_threshold: null - score_percentile: 80 + scoring_backend: fimo + pvalue_threshold: 1e-4 + selection_policy: stratified length_policy: range length_range: [22, 28] @@ -40,7 +41,7 @@ densegen: generation: sequence_length: 60 - quota: 6 + quota: 50 sampling: pool_strategy: subsample library_size: 24 @@ -58,7 +59,7 @@ densegen: plan: - name: meme_demo - quota: 6 + quota: 50 required_regulators: [lexA, cpxR] solver: From 4700a9ce3d6c261a59840bd38eade31d9cee46a3 Mon Sep 17 00:00:00 2001 From: Eric South Date: Tue, 20 Jan 2026 10:39:44 -0500 Subject: [PATCH 05/40] densegen: add FIMO mining workflow and UX updates --- .../densegen/docs/demo/demo_basic.md | 9 +- .../densegen/docs/guide/generation.md | 11 + src/dnadesign/densegen/docs/guide/inputs.md | 52 ++- .../densegen/docs/reference/config.md | 26 +- .../densegen/src/adapters/outputs/parquet.py | 5 + .../src/adapters/sources/pwm_artifact.py | 2 + .../src/adapters/sources/pwm_artifact_set.py | 2 + .../densegen/src/adapters/sources/pwm_fimo.py | 9 +- 
.../src/adapters/sources/pwm_jaspar.py | 2 + .../src/adapters/sources/pwm_matrix_csv.py | 2 + .../densegen/src/adapters/sources/pwm_meme.py | 2 + .../src/adapters/sources/pwm_meme_set.py | 2 + .../src/adapters/sources/pwm_sampling.py | 387 +++++++++++++----- src/dnadesign/densegen/src/cli.py | 20 +- src/dnadesign/densegen/src/config/__init__.py | 95 ++++- src/dnadesign/densegen/src/core/metadata.py | 5 + .../densegen/src/core/metadata_schema.py | 35 +- src/dnadesign/densegen/src/core/pipeline.py | 277 +++++++++---- .../tests/test_cli_summarize_library.py | 5 + .../densegen/tests/test_outputs_parquet.py | 5 + .../densegen/tests/test_pwm_fimo_utils.py | 6 +- .../tests/test_pwm_sampling_mining.py | 80 ++++ .../workspaces/demo_meme_two_tf/config.yaml | 14 +- 23 files changed, 832 insertions(+), 221 deletions(-) create mode 100644 src/dnadesign/densegen/tests/test_pwm_sampling_mining.py diff --git a/src/dnadesign/densegen/docs/demo/demo_basic.md b/src/dnadesign/densegen/docs/demo/demo_basic.md index b2904916..d322242e 100644 --- a/src/dnadesign/densegen/docs/demo/demo_basic.md +++ b/src/dnadesign/densegen/docs/demo/demo_basic.md @@ -158,6 +158,10 @@ Solver-stage library sampling pixi run dense run -c /private/tmp/densegen-demo-20260115-1405/demo_press/config.yaml --no-plot ``` +The demo config sets `logging.progress_style: screen`, so in a TTY you will see a +refreshing dashboard (progress, leaderboards, last sequence). To see per‑sequence +logs, set `progress_style: stream` (and optionally tune `progress_every`). 
+ Example output (abridged): ```text @@ -360,7 +364,10 @@ To mine specific affinity strata, add canonical p‑value bins and select bins b pvalue_threshold: 1e-3 selection_policy: stratified pvalue_bins: [1e-6, 1e-4, 1e-3, 1e-2, 1e-1, 1.0] - pvalue_bin_ids: [1, 2] # (1e-6..1e-4] and (1e-4..1e-3] + mining: + batch_size: 5000 + max_batches: 4 + retain_bin_ids: [1, 2] # (1e-6..1e-4] and (1e-4..1e-3] ``` ### Add USR output diff --git a/src/dnadesign/densegen/docs/guide/generation.md b/src/dnadesign/densegen/docs/guide/generation.md index c8b3d220..e3f29a8f 100644 --- a/src/dnadesign/densegen/docs/guide/generation.md +++ b/src/dnadesign/densegen/docs/guide/generation.md @@ -111,6 +111,17 @@ Notes: - `coverage_weighted` dynamically boosts underused TFBS based on the run’s usage counts. - `avoid_failed_motifs: true` down-weights TFBS that repeatedly appear in failed solve attempts (tracked in attempts.parquet). +### Run scheduling (round‑robin) + +`runtime.round_robin` controls **scheduling**, not sampling. When enabled, DenseGen interleaves plan +items across inputs so each plan advances in turn (one subsample per pass). This is useful when you +have multiple constraint sets (e.g., different fixed sequences) and want a single run to progress +each design target in parallel. + +Round‑robin is **distinct from Stage‑B sampling** (`generation.sampling`): library sampling still +uses the same policy per plan, but round‑robin can trigger more frequent library rebuilds when +`pool_strategy: iterative_subsample` is used. Expect extra compute if many plans are active. 
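As a concrete illustration of the scheduling knobs above, a minimal runtime fragment might look like the following (field names are from this guide; the values are illustrative, not recommendations):

```yaml
densegen:
  runtime:
    round_robin: true                     # interleave plan items across inputs
    arrays_generated_before_resample: 10  # subsamples per plan before a rebuild
```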
+ --- ### Regulator constraints diff --git a/src/dnadesign/densegen/docs/guide/inputs.md b/src/dnadesign/densegen/docs/guide/inputs.md index 093a7ba0..63911198 100644 --- a/src/dnadesign/densegen/docs/guide/inputs.md +++ b/src/dnadesign/densegen/docs/guide/inputs.md @@ -100,10 +100,16 @@ Required sampling fields: - `pvalue_threshold` (float in (0, 1]; fimo backend only) - `oversample_factor`: oversampling multiplier for candidate generation - `max_candidates` (optional): cap on candidate generation; helps bound long motifs -- `max_seconds` (optional): time limit for candidate generation (best-effort cap) +- `max_seconds` (optional): time limit for candidate generation per batch (best-effort cap) - `selection_policy`: `random_uniform | top_n | stratified` (default: `random_uniform`; fimo only) - `pvalue_bins` (optional): list of p‑value bin edges (strictly increasing; must end with `1.0`) -- `pvalue_bin_ids` (optional): list of bin indices to keep (0‑based, using `pvalue_bins`) +- `pvalue_bin_ids` (deprecated; use `mining.retain_bin_ids`) +- `mining` (optional; fimo only): batch/time controls for mining with FIMO + - `batch_size` (int > 0): candidates per batch + - `max_batches` (optional int > 0): limit batches per motif + - `max_seconds` (optional float > 0): limit total mining time per motif + - `retain_bin_ids` (optional list of ints): keep only specific p‑value bins + - `log_every_batches` (int > 0): log yield summaries every N batches - `bgfile` (optional): MEME bfile-format background model for FIMO - `keep_all_candidates_debug` (optional): write raw FIMO TSVs to `outputs/meta/fimo/` for inspection - `include_matched_sequence` (optional): include `fimo_matched_sequence` column in the TFBS table @@ -126,11 +132,11 @@ Notes: - As a rule of thumb: `1e-4` is a strong match, `1e-3` is moderate, `1e-2` is weak. - DenseGen accepts a candidate if its **best hit** within the emitted TFBS passes the threshold. 
- For `strategy: background`, DenseGen keeps **weak** matches where `pvalue >= pvalue_threshold`. -- If you set `pvalue_bin_ids`, DenseGen only keeps candidates in those bins (useful for mining +- If you set `mining.retain_bin_ids`, DenseGen only keeps candidates in those bins (useful for mining specific affinity ranges). - FIMO adds per‑TFBS metadata columns: `fimo_score`, `fimo_pvalue`, `fimo_start`, `fimo_stop`, `fimo_strand`, `fimo_bin_id`, `fimo_bin_low`, `fimo_bin_high`, and (optionally) - `fimo_matched_sequence`. + `fimo_matched_sequence` (the best‑hit window within the TFBS). - `length_policy` defaults to `exact`. Use `length_policy: range` with `length_range: [min, max]` to sample variable lengths (min must be >= motif length). - `trim_window_length` optionally trims the PWM to a max‑information window before sampling (useful @@ -170,24 +176,42 @@ inputs: pvalue_threshold: 1e-4 selection_policy: top_n n_sites: 80 - oversample_factor: 12 - max_candidates: 50000 - max_seconds: 5 + oversample_factor: 200 + max_candidates: 20000 + mining: + batch_size: 5000 + max_batches: 4 + retain_bin_ids: [0, 1, 2, 3] + log_every_batches: 1 ``` #### Mining workflow (p‑value strata) If you want to **mine** sequences across affinity strata, use `selection_policy: stratified` plus -canonical p‑value bins. A typical workflow: +canonical p‑value bins and the `mining` block. A typical workflow: -1) Oversample candidates (`oversample_factor`, `max_candidates`) and score with FIMO. +1) Oversample candidates (`oversample_factor`, `max_candidates`) and score with FIMO in batches + (`mining.batch_size`). 2) Accept candidates using `pvalue_threshold` (global strength cutoff). -3) Use `pvalue_bin_ids` to select one or more bins (e.g., moderate matches only). -4) Repeat runs to accumulate a deduplicated reservoir of sequences per bin. +3) Use `mining.retain_bin_ids` to select one or more bins (e.g., moderate matches only). 
+4) Repeat runs (or increase `mining.max_batches` / `mining.max_seconds`) to accumulate a deduplicated + reservoir of sequences per bin. 5) Use `dense summarize --library` to inspect which TFBS were offered vs used in Stage‑B sampling. -DenseGen reports per‑bin yield summaries (hits, accepted, selected) for every FIMO run, so you can -track how many candidates land in each bin and adjust thresholds or oversampling accordingly. With -`selection_policy: stratified`, the selected‑bin counts show how evenly the final pool spans strata. +DenseGen reports per‑bin yield summaries (hits, accepted, selected) for retained bins only (or all +bins if `retain_bin_ids` is unset), so you can track how many candidates land in each stratum and +adjust thresholds or oversampling accordingly. With `selection_policy: stratified`, the selected‑bin +counts show how evenly the final pool spans strata. + +#### Stdout UX for long runs +DenseGen supports three logging styles so long runs stay readable: + +- `progress_style: stream` (default) logs per‑sequence updates; tune `progress_every` to reduce noise. +- `progress_style: summary` hides per‑sequence logs and only prints periodic leaderboard summaries. +- `progress_style: screen` clears and redraws a compact dashboard (progress, leaderboards, last sequence) + at `progress_refresh_seconds`. + +For iterative mining workflows, `screen` or `summary` modes are recommended to avoid log spam while still +seeing yield/leaderboard progress over time. 
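The three styles map onto the `logging` block documented in the config reference. An illustrative fragment for a long mining run (values are examples, not defaults):

```yaml
logging:
  level: INFO
  progress_style: summary        # stream | summary | screen
  progress_every: 25             # sequences between per-sequence logs (stream mode)
  progress_refresh_seconds: 2.0  # minimum seconds between screen redraws
```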
--- diff --git a/src/dnadesign/densegen/docs/reference/config.md b/src/dnadesign/densegen/docs/reference/config.md index 9b7c938a..0fd049a7 100644 --- a/src/dnadesign/densegen/docs/reference/config.md +++ b/src/dnadesign/densegen/docs/reference/config.md @@ -67,7 +67,14 @@ PWM inputs perform **input sampling** (sampling sites from PWMs) via - `pvalue_threshold` (float in (0, 1]; **fimo** backend only) - `selection_policy`: `random_uniform | top_n | stratified` (default: `random_uniform`; fimo only) - `pvalue_bins` (optional list of floats; must end with `1.0`) - p‑value bin edges for stratified sampling - - `pvalue_bin_ids` (optional list of ints) - select specific p‑value bins (0‑based indices) + - `pvalue_bin_ids` (deprecated; use `mining.retain_bin_ids`) + - `mining` (optional; fimo only) - batch/time controls for mining via FIMO: + - `batch_size` (int > 0; default 100000) - candidates per FIMO batch + - `max_batches` (optional int > 0) - max batches per motif + - `max_seconds` (optional float > 0) - max seconds per motif mining loop + - `retain_bin_ids` (optional list of ints) - select p‑value bins to retain (0‑based indices); + retained bins are the only bins reported in yield summaries + - `log_every_batches` (int > 0; default 1) - log per‑bin yield summaries every N batches - `bgfile` (optional path) - MEME bfile-format background model for FIMO - `keep_all_candidates_debug` (bool, default false) - write raw FIMO TSVs to `outputs/meta/fimo/` for inspection - `include_matched_sequence` (bool, default false) - include `fimo_matched_sequence` in TFBS outputs @@ -80,8 +87,11 @@ PWM inputs perform **input sampling** (sampling sites from PWMs) via - FIMO resolves `fimo` via `MEME_BIN` or PATH; pixi users should run `pixi run dense ...` so it is available. - Canonical p‑value bins (default): `[1e-10, 1e-8, 1e-6, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]` (bin 0 is `(0, 1e-10]`, bin 1 is `(1e-10, 1e-8]`, etc.) 
- - FIMO runs log per‑bin yield summaries (hits, accepted, selected); `selection_policy: stratified` + - FIMO runs log per‑bin yield summaries (hits, accepted, selected). If `retain_bin_ids` is set, + only those bins are reported; otherwise all bins are reported. `selection_policy: stratified` makes the selected‑bin distribution explicit for mining workflows. + - When `mining` is enabled, `max_seconds` caps per‑batch candidate generation while + `mining.max_seconds` caps the overall mining loop. - `type: pwm_meme_set` - `paths` - list of MEME PWM files (merged into a single TF pool) - `motif_ids` (optional list) - choose motifs by ID across files @@ -200,7 +210,11 @@ binding-site and PWM-sampled inputs. ### `densegen.runtime` -- `round_robin` (bool) +- `round_robin` (bool) - interleave plan items across inputs (one subsample per plan per pass). + Use this when you have multiple distinct constraint sets (e.g., different fixed sequences) and want + a single run to advance each plan in turn. This **does not** change Stage‑B sampling logic; it only + changes scheduling. With `pool_strategy: iterative_subsample`, round‑robin can increase how often + libraries are rebuilt, so expect additional compute if many plans are active. - `arrays_generated_before_resample` (int > 0) - `min_count_per_tf` (int >= 0) - `max_duplicate_solutions`, `stall_seconds_before_resample`, `stall_warning_every_seconds` @@ -226,6 +240,12 @@ binding-site and PWM-sampled inputs. 
- `level` (e.g., `INFO`) - `suppress_solver_stderr` (bool) - `print_visual` (bool) +- `progress_style`: `stream | summary | screen` (default `stream`) + - `stream`: per‑sequence logs (controlled by `progress_every`) + - `summary`: suppress per‑sequence logs; keep periodic leaderboard summaries + - `screen`: clear and redraw a compact dashboard at `progress_refresh_seconds` +- `progress_every` (int >= 0) - log/refresh interval in sequences (`0` disables per‑sequence logging) +- `progress_refresh_seconds` (float > 0) - minimum seconds between screen refreshes --- diff --git a/src/dnadesign/densegen/src/adapters/outputs/parquet.py b/src/dnadesign/densegen/src/adapters/outputs/parquet.py index a160098f..a45456f2 100644 --- a/src/dnadesign/densegen/src/adapters/outputs/parquet.py +++ b/src/dnadesign/densegen/src/adapters/outputs/parquet.py @@ -37,6 +37,7 @@ def _meta_arrow_type(name: str, pa): } list_int = { "input_pwm_pvalue_bin_ids", + "input_pwm_mining_retain_bin_ids", } int_fields = { "length", @@ -45,6 +46,9 @@ def _meta_arrow_type(name: str, pa): "min_required_regulators", "input_pwm_n_sites", "input_pwm_oversample_factor", + "input_pwm_mining_batch_size", + "input_pwm_mining_max_batches", + "input_pwm_mining_log_every_batches", "input_row_count", "input_tf_count", "input_tfbs_count", @@ -68,6 +72,7 @@ def _meta_arrow_type(name: str, pa): "input_pwm_score_threshold", "input_pwm_score_percentile", "input_pwm_pvalue_threshold", + "input_pwm_mining_max_seconds", "sampling_fraction", "sampling_fraction_pairs", "gap_fill_gc_min", diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_artifact.py b/src/dnadesign/densegen/src/adapters/sources/pwm_artifact.py index 446ca742..e193617d 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_artifact.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_artifact.py @@ -177,6 +177,7 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): pvalue_threshold = sampling.get("pvalue_threshold") 
pvalue_bins = sampling.get("pvalue_bins") pvalue_bin_ids = sampling.get("pvalue_bin_ids") + mining = sampling.get("mining") bgfile = sampling.get("bgfile") selection_policy = str(sampling.get("selection_policy", "random_uniform")) keep_all_candidates_debug = bool(sampling.get("keep_all_candidates_debug", False)) @@ -205,6 +206,7 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): pvalue_threshold=pvalue_threshold, pvalue_bins=pvalue_bins, pvalue_bin_ids=pvalue_bin_ids, + mining=mining, bgfile=bgfile_path, selection_policy=selection_policy, keep_all_candidates_debug=keep_all_candidates_debug, diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_artifact_set.py b/src/dnadesign/densegen/src/adapters/sources/pwm_artifact_set.py index 6fff70b3..6a87a1f0 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_artifact_set.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_artifact_set.py @@ -73,6 +73,7 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): pvalue_threshold = sampling_cfg.get("pvalue_threshold") pvalue_bins = sampling_cfg.get("pvalue_bins") pvalue_bin_ids = sampling_cfg.get("pvalue_bin_ids") + mining = sampling_cfg.get("mining") bgfile = sampling_cfg.get("bgfile") selection_policy = str(sampling_cfg.get("selection_policy", "random_uniform")) keep_all_candidates_debug = bool(sampling_cfg.get("keep_all_candidates_debug", False)) @@ -100,6 +101,7 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): pvalue_threshold=pvalue_threshold, pvalue_bins=pvalue_bins, pvalue_bin_ids=pvalue_bin_ids, + mining=mining, bgfile=bgfile_path, selection_policy=selection_policy, keep_all_candidates_debug=keep_all_candidates_debug, diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_fimo.py b/src/dnadesign/densegen/src/adapters/sources/pwm_fimo.py index cbdc06c6..1cb2fc4b 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_fimo.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_fimo.py 
@@ -46,9 +46,14 @@ def _sanitize_id(text: str) -> str: return cleaned or "motif" -def build_candidate_records(motif_id: str, sequences: Sequence[str]) -> list[tuple[str, str]]: +def build_candidate_records( + motif_id: str, + sequences: Sequence[str], + *, + start_index: int = 0, +) -> list[tuple[str, str]]: prefix = _sanitize_id(motif_id) - return [(f"{prefix}|cand{idx}", seq) for idx, seq in enumerate(sequences)] + return [(f"{prefix}|cand{start_index + idx}", seq) for idx, seq in enumerate(sequences)] def write_candidates_fasta(records: Sequence[tuple[str, str]], out_path: Path) -> None: diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_jaspar.py b/src/dnadesign/densegen/src/adapters/sources/pwm_jaspar.py index 75a73d19..4ce3594f 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_jaspar.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_jaspar.py @@ -117,6 +117,7 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): pvalue_threshold = sampling.get("pvalue_threshold") pvalue_bins = sampling.get("pvalue_bins") pvalue_bin_ids = sampling.get("pvalue_bin_ids") + mining = sampling.get("mining") bgfile = sampling.get("bgfile") selection_policy = str(sampling.get("selection_policy", "random_uniform")) keep_all_candidates_debug = bool(sampling.get("keep_all_candidates_debug", False)) @@ -148,6 +149,7 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): pvalue_threshold=pvalue_threshold, pvalue_bins=pvalue_bins, pvalue_bin_ids=pvalue_bin_ids, + mining=mining, bgfile=bgfile_path, selection_policy=selection_policy, keep_all_candidates_debug=keep_all_candidates_debug, diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_matrix_csv.py b/src/dnadesign/densegen/src/adapters/sources/pwm_matrix_csv.py index 5df2088c..049eecfd 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_matrix_csv.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_matrix_csv.py @@ -81,6 +81,7 @@ def load_data(self, *, 
rng=None, outputs_root: Path | None = None): pvalue_threshold = sampling.get("pvalue_threshold") pvalue_bins = sampling.get("pvalue_bins") pvalue_bin_ids = sampling.get("pvalue_bin_ids") + mining = sampling.get("mining") bgfile = sampling.get("bgfile") selection_policy = str(sampling.get("selection_policy", "random_uniform")) keep_all_candidates_debug = bool(sampling.get("keep_all_candidates_debug", False)) @@ -109,6 +110,7 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): pvalue_threshold=pvalue_threshold, pvalue_bins=pvalue_bins, pvalue_bin_ids=pvalue_bin_ids, + mining=mining, bgfile=bgfile_path, selection_policy=selection_policy, keep_all_candidates_debug=keep_all_candidates_debug, diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_meme.py b/src/dnadesign/densegen/src/adapters/sources/pwm_meme.py index 7f7193ac..dc3facb0 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_meme.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_meme.py @@ -95,6 +95,7 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): pvalue_threshold = sampling.get("pvalue_threshold") pvalue_bins = sampling.get("pvalue_bins") pvalue_bin_ids = sampling.get("pvalue_bin_ids") + mining = sampling.get("mining") bgfile = sampling.get("bgfile") selection_policy = str(sampling.get("selection_policy", "random_uniform")) keep_all_candidates_debug = bool(sampling.get("keep_all_candidates_debug", False)) @@ -127,6 +128,7 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): pvalue_threshold=pvalue_threshold, pvalue_bins=pvalue_bins, pvalue_bin_ids=pvalue_bin_ids, + mining=mining, bgfile=bgfile_path, selection_policy=selection_policy, keep_all_candidates_debug=keep_all_candidates_debug, diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_meme_set.py b/src/dnadesign/densegen/src/adapters/sources/pwm_meme_set.py index 1e521914..cafece29 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_meme_set.py +++ 
b/src/dnadesign/densegen/src/adapters/sources/pwm_meme_set.py @@ -89,6 +89,7 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): pvalue_threshold = sampling.get("pvalue_threshold") pvalue_bins = sampling.get("pvalue_bins") pvalue_bin_ids = sampling.get("pvalue_bin_ids") + mining = sampling.get("mining") bgfile = sampling.get("bgfile") selection_policy = str(sampling.get("selection_policy", "random_uniform")) keep_all_candidates_debug = bool(sampling.get("keep_all_candidates_debug", False)) @@ -121,6 +122,7 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): pvalue_threshold=pvalue_threshold, pvalue_bins=pvalue_bins, pvalue_bin_ids=pvalue_bin_ids, + mining=mining, bgfile=bgfile_path, selection_policy=selection_policy, keep_all_candidates_debug=keep_all_candidates_debug, diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_sampling.py b/src/dnadesign/densegen/src/adapters/sources/pwm_sampling.py index 5c3514f5..5b84db1a 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_sampling.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_sampling.py @@ -37,6 +37,16 @@ def _safe_label(text: str) -> str: return cleaned or "motif" +def _mining_attr(mining, name: str, default=None): + if mining is None: + return default + if hasattr(mining, name): + return getattr(mining, name) + if isinstance(mining, dict): + return mining.get(name, default) + return default + + @dataclass(frozen=True) class FimoCandidate: seq: str @@ -281,15 +291,24 @@ def _assign_pvalue_bin(pvalue: float, edges: Sequence[float]) -> tuple[int, floa return len(edges) - 1, float(edges[-2]), float(edges[-1]) -def _format_pvalue_bins(edges: Sequence[float], counts: Sequence[int]) -> str: +def _format_pvalue_bins( + edges: Sequence[float], + counts: Sequence[int], + *, + only_bins: Optional[Sequence[int]] = None, +) -> str: if not edges or not counts: return "-" + only_set = {int(idx) for idx in only_bins} if only_bins is not None else None labels: 
list[str] = [] low = 0.0 - for edge, count in zip(edges, counts): + for idx, (edge, count) in enumerate(zip(edges, counts)): + if only_set is not None and idx not in only_set: + low = float(edge) + continue labels.append(f"({low:.0e},{float(edge):.0e}]:{int(count)}") low = float(edge) - return " ".join(labels) + return " ".join(labels) if labels else "-" def _stratified_sample( @@ -360,18 +379,22 @@ def _select_fimo_candidates( if context.get("pvalue_bins_label") is not None: msg_lines.append(f"P-value bins={context.get('pvalue_bins_label')}.") if context.get("pvalue_bin_ids") is not None: - msg_lines.append(f"Selected bins={context.get('pvalue_bin_ids')}.") + msg_lines.append(f"Retained bins={context.get('pvalue_bin_ids')}.") suggestions = [ "reduce n_sites", "relax pvalue_threshold (e.g., 1e-4 → 1e-3)", "increase oversample_factor", ] if context.get("pvalue_bin_ids") is not None: - suggestions.append("broaden pvalue_bin_ids (or remove bin filtering)") + suggestions.append("broaden mining.retain_bin_ids (or remove bin filtering)") if context.get("cap_applied"): suggestions.append("increase max_candidates (cap was hit)") if context.get("time_limited"): suggestions.append("increase max_seconds (time limit was hit)") + if context.get("mining_max_batches") is not None and context.get("mining_batches_limited"): + suggestions.append("increase mining.max_batches") + if context.get("mining_max_seconds") is not None and context.get("mining_time_limited"): + suggestions.append("increase mining.max_seconds") if context.get("width") is not None and int(context.get("width")) <= 6: suggestions.append("try length_policy=range with a longer length_range") msg_lines.append("Try next: " + "; ".join(suggestions) + ".") @@ -412,6 +435,7 @@ def sample_pwm_sites( pvalue_threshold: Optional[float] = None, pvalue_bins: Optional[Sequence[float]] = None, pvalue_bin_ids: Optional[Sequence[int]] = None, + mining: Optional[object] = None, bgfile: Optional[str | Path] = None, selection_policy: 
str = "random_uniform", keep_all_candidates_debug: bool = False, @@ -440,6 +464,8 @@ def sample_pwm_sites( raise ValueError("pvalue_bins is only valid when scoring_backend='fimo'") if pvalue_bin_ids is not None: raise ValueError("pvalue_bin_ids is only valid when scoring_backend='fimo'") + if mining is not None: + raise ValueError("mining is only valid when scoring_backend='fimo'") if include_matched_sequence: raise ValueError("include_matched_sequence is only valid when scoring_backend='fimo'") else: @@ -450,6 +476,10 @@ def sample_pwm_sites( raise ValueError("pwm.sampling.pvalue_threshold must be between 0 and 1") if selection_policy not in {"random_uniform", "top_n", "stratified"}: raise ValueError(f"Unsupported pwm selection_policy: {selection_policy}") + if mining is not None: + retain_bins = _mining_attr(mining, "retain_bin_ids") + if retain_bins is not None and pvalue_bin_ids is not None: + raise ValueError("Provide retain_bin_ids in mining or pvalue_bin_ids, not both.") if score_threshold is not None or score_percentile is not None: log.warning( "PWM sampling scoring_backend=fimo ignores score_threshold/score_percentile for motif %s.", @@ -503,6 +533,7 @@ def _cap_label(cap_applied: bool, time_limited: bool) -> str: return cap_label def _context(length_obs: str, cap_applied: bool, requested: int, generated: int, time_limited: bool) -> dict: + mining_cfg = mining return { "motif_id": motif.motif_id, "width": width, @@ -519,6 +550,11 @@ def _context(length_obs: str, cap_applied: bool, requested: int, generated: int, "cap_applied": cap_applied, "cap_label": _cap_label(cap_applied, time_limited), "time_limited": time_limited, + "mining_batch_size": _mining_attr(mining_cfg, "batch_size"), + "mining_max_batches": _mining_attr(mining_cfg, "max_batches"), + "mining_max_seconds": _mining_attr(mining_cfg, "max_seconds"), + "mining_log_every_batches": _mining_attr(mining_cfg, "log_every_batches"), + "mining_retain_bin_ids": _mining_attr(mining_cfg, "retain_bin_ids"), 
} def _select( @@ -564,13 +600,11 @@ def _embed_with_background(seq: str, target_len: int) -> str: return f"{left}{seq}{right}" def _score_with_fimo( - sequences: List[str], *, - length_obs: str, + n_candidates: int, cap_applied: bool, requested: int, - generated: int, - time_limited: bool, + sequences: Optional[List[str]] = None, ) -> tuple[List[str], dict[str, dict]]: import tempfile @@ -585,13 +619,20 @@ def _score_with_fimo( if pvalue_threshold is None: raise ValueError("pvalue_threshold required for fimo backend") resolved_bins = _resolve_pvalue_edges(pvalue_bins) + retain_bins = _mining_attr(mining, "retain_bin_ids") + if retain_bins is None and pvalue_bin_ids is not None: + retain_bins = list(pvalue_bin_ids) allowed_bins: Optional[set[int]] = None - if pvalue_bin_ids is not None: - allowed_bins = {int(idx) for idx in pvalue_bin_ids} + if retain_bins is not None: + allowed_bins = {int(idx) for idx in retain_bins} max_idx = len(resolved_bins) - 1 if any(idx > max_idx for idx in allowed_bins): - raise ValueError(f"pvalue_bin_ids contains an index outside the available bins (max={max_idx}).") + raise ValueError(f"retain_bin_ids contains an index outside the available bins (max={max_idx}).") keep_weak = keep_low + mining_batch_size = int(_mining_attr(mining, "batch_size", n_candidates)) + mining_max_batches = _mining_attr(mining, "max_batches") + mining_max_seconds = _mining_attr(mining, "max_seconds") + mining_log_every = int(_mining_attr(mining, "log_every_batches", 1)) debug_path: Optional[Path] = None debug_dir = debug_output_dir if keep_all_candidates_debug: @@ -607,69 +648,219 @@ def _score_with_fimo( label = _safe_label(debug_label or motif.motif_id) debug_path = debug_dir / f"{label}__fimo.tsv" + def _merge_tsv(existing: list[str], text: str) -> None: + lines = [ln for ln in text.splitlines() if ln.strip()] + if not lines: + return + if not existing: + existing.extend(lines) + return + header_skipped = False + for ln in lines: + if 
ln.lstrip().startswith("#"): + continue + if not header_skipped: + header_skipped = True + continue + existing.append(ln) + + def _generate_batch(count: int) -> tuple[list[str], list[int], bool]: + batch_start = time.monotonic() + sequences: list[str] = [] + lengths: list[int] = [] + time_limited = False + for _ in range(count): + if max_seconds is not None and sequences: + if (time.monotonic() - batch_start) >= float(max_seconds): + time_limited = True + break + target_len = _resolve_length() + lengths.append(int(target_len)) + if strategy == "background": + core = sample_sequence_from_background(rng, motif.background, width) + else: + core = sample_sequence_from_pwm(rng, matrix) + full_seq = _embed_with_background(core, target_len) + sequences.append(full_seq) + return sequences, lengths, time_limited + + total_bin_counts = [0 for _ in resolved_bins] + accepted_bin_counts = [0 for _ in resolved_bins] + candidates: List[FimoCandidate] = [] + seen: set[str] = set() + lengths_all: list[int] = [] + generated_total = 0 + time_limited = False + mining_time_limited = False + mining_batches_limited = False + batches = 0 + tsv_lines: list[str] = [] + provided_sequences = sequences + with tempfile.TemporaryDirectory() as tmp: tmp_path = Path(tmp) meme_path = tmp_path / "motif.meme" - fasta_path = tmp_path / "candidates.fasta" motif_for_fimo = PWMMotif(motif_id=motif.motif_id, matrix=matrix, background=motif.background) write_minimal_meme_motif(motif_for_fimo, meme_path) - records = build_candidate_records(motif.motif_id, sequences) - write_candidates_fasta(records, fasta_path) - thresh = 1.0 if keep_all_candidates_debug or keep_weak else float(pvalue_threshold) - rows, raw_tsv = run_fimo( - meme_motif_path=meme_path, - fasta_path=fasta_path, - bgfile=Path(bgfile) if bgfile is not None else None, - thresh=thresh, - include_matched_sequence=include_matched_sequence or keep_all_candidates_debug, - return_tsv=debug_path is not None, - ) - if debug_path is not None and raw_tsv 
is not None: - debug_path.write_text(raw_tsv) - log.info("FIMO debug TSV written: %s", debug_path) - best_hits = aggregate_best_hits(rows) - - candidates: List[FimoCandidate] = [] - total_bin_counts = [0 for _ in resolved_bins] - accepted_bin_counts = [0 for _ in resolved_bins] - for rec_id, seq in records: - hit = best_hits.get(rec_id) - if hit is None: - continue - bin_id, bin_low, bin_high = _assign_pvalue_bin(hit.pvalue, resolved_bins) - total_bin_counts[bin_id] += 1 - if keep_weak: - accept = hit.pvalue >= float(pvalue_threshold) - else: - accept = hit.pvalue <= float(pvalue_threshold) - if allowed_bins is not None and bin_id not in allowed_bins: - continue - if not accept: - continue - accepted_bin_counts[bin_id] += 1 - candidates.append( - FimoCandidate( - seq=seq, - pvalue=hit.pvalue, - score=hit.score, - bin_id=bin_id, - bin_low=bin_low, - bin_high=bin_high, - start=hit.start, - stop=hit.stop, - strand=hit.strand, - matched_sequence=hit.matched_sequence, + if provided_sequences is not None: + lengths_all = [len(seq) for seq in provided_sequences] + fasta_path = tmp_path / "candidates.fasta" + records = build_candidate_records(motif.motif_id, provided_sequences, start_index=0) + write_candidates_fasta(records, fasta_path) + thresh = 1.0 if keep_all_candidates_debug or keep_weak else float(pvalue_threshold) + rows, raw_tsv = run_fimo( + meme_motif_path=meme_path, + fasta_path=fasta_path, + bgfile=Path(bgfile) if bgfile is not None else None, + thresh=thresh, + include_matched_sequence=include_matched_sequence or keep_all_candidates_debug, + return_tsv=debug_path is not None, ) - ) + if debug_path is not None and raw_tsv is not None: + _merge_tsv(tsv_lines, raw_tsv) + best_hits = aggregate_best_hits(rows) + for rec_id, seq in records: + hit = best_hits.get(rec_id) + if hit is None: + continue + bin_id, bin_low, bin_high = _assign_pvalue_bin(hit.pvalue, resolved_bins) + if allowed_bins is not None and bin_id not in allowed_bins: + continue + 
total_bin_counts[bin_id] += 1 + if keep_weak: + accept = hit.pvalue >= float(pvalue_threshold) + else: + accept = hit.pvalue <= float(pvalue_threshold) + if not accept: + continue + if seq in seen: + continue + seen.add(seq) + accepted_bin_counts[bin_id] += 1 + candidates.append( + FimoCandidate( + seq=seq, + pvalue=hit.pvalue, + score=hit.score, + bin_id=bin_id, + bin_low=bin_low, + bin_high=bin_high, + start=hit.start, + stop=hit.stop, + strand=hit.strand, + matched_sequence=hit.matched_sequence, + ) + ) + generated_total = len(provided_sequences) + batches = 1 + else: + mining_start = time.monotonic() + while generated_total < n_candidates: + if mining_max_batches is not None and batches >= int(mining_max_batches): + mining_batches_limited = True + break + if mining_max_seconds is not None and (time.monotonic() - mining_start) >= float( + mining_max_seconds + ): + mining_time_limited = True + break + remaining = int(n_candidates) - generated_total + if remaining <= 0: + break + batch_target = min(int(mining_batch_size), remaining) + sequences, lengths, batch_limited = _generate_batch(batch_target) + if batch_limited: + time_limited = True + if not sequences: + break + lengths_all.extend(lengths) + fasta_path = tmp_path / "candidates.fasta" + records = build_candidate_records(motif.motif_id, sequences, start_index=generated_total) + write_candidates_fasta(records, fasta_path) + thresh = 1.0 if keep_all_candidates_debug or keep_weak else float(pvalue_threshold) + rows, raw_tsv = run_fimo( + meme_motif_path=meme_path, + fasta_path=fasta_path, + bgfile=Path(bgfile) if bgfile is not None else None, + thresh=thresh, + include_matched_sequence=include_matched_sequence or keep_all_candidates_debug, + return_tsv=debug_path is not None, + ) + if debug_path is not None and raw_tsv is not None: + _merge_tsv(tsv_lines, raw_tsv) + best_hits = aggregate_best_hits(rows) + for rec_id, seq in records: + hit = best_hits.get(rec_id) + if hit is None: + continue + bin_id, bin_low, 
bin_high = _assign_pvalue_bin(hit.pvalue, resolved_bins) + if allowed_bins is not None and bin_id not in allowed_bins: + continue + total_bin_counts[bin_id] += 1 + if keep_weak: + accept = hit.pvalue >= float(pvalue_threshold) + else: + accept = hit.pvalue <= float(pvalue_threshold) + if not accept: + continue + if seq in seen: + continue + seen.add(seq) + accepted_bin_counts[bin_id] += 1 + candidates.append( + FimoCandidate( + seq=seq, + pvalue=hit.pvalue, + score=hit.score, + bin_id=bin_id, + bin_low=bin_low, + bin_high=bin_high, + start=hit.start, + stop=hit.stop, + strand=hit.strand, + matched_sequence=hit.matched_sequence, + ) + ) + generated_total += len(sequences) + batches += 1 + if mining_log_every > 0 and batches % mining_log_every == 0: + bins_label = _format_pvalue_bins(resolved_bins, total_bin_counts, only_bins=retain_bins) + accepted_label = _format_pvalue_bins(resolved_bins, accepted_bin_counts, only_bins=retain_bins) + log.info( + "FIMO mining %s batch %d/%s: generated=%d accepted=%d bins=%s accepted_bins=%s", + motif.motif_id, + batches, + str(mining_max_batches) if mining_max_batches is not None else "-", + generated_total, + len(candidates), + bins_label, + accepted_label, + ) + + if debug_path is not None and tsv_lines: + debug_path.write_text("\n".join(tsv_lines) + "\n") + log.info("FIMO debug TSV written: %s", debug_path) total_hits = sum(total_bin_counts) accepted_hits = sum(accepted_bin_counts) - bins_label = _format_pvalue_bins(resolved_bins, total_bin_counts) - accepted_label = _format_pvalue_bins(resolved_bins, accepted_bin_counts) + bins_label = _format_pvalue_bins(resolved_bins, total_bin_counts, only_bins=retain_bins) + accepted_label = _format_pvalue_bins(resolved_bins, accepted_bin_counts, only_bins=retain_bins) + length_obs = "-" + if lengths_all: + length_obs = ( + f"{min(lengths_all)}..{max(lengths_all)}" + if min(lengths_all) != max(lengths_all) + else str(lengths_all[0]) + ) - context = _context(length_obs, cap_applied, 
requested, generated, time_limited) + context = _context(length_obs, cap_applied, requested, generated_total, time_limited) context["pvalue_bins_label"] = bins_label context["pvalue_bin_ids"] = sorted(allowed_bins) if allowed_bins is not None else None + context["mining_batch_size"] = mining_batch_size + context["mining_max_batches"] = mining_max_batches + context["mining_max_seconds"] = mining_max_seconds + context["mining_time_limited"] = mining_time_limited + context["mining_batches_limited"] = mining_batches_limited picked = _select_fimo_candidates( candidates, n_sites=n_sites, @@ -684,9 +875,9 @@ def _score_with_fimo( for cand in picked: idx = max(0, min(int(cand.bin_id), len(resolved_bins) - 1)) selected_bin_counts[idx] += 1 - selected_label = _format_pvalue_bins(resolved_bins, selected_bin_counts) + selected_label = _format_pvalue_bins(resolved_bins, selected_bin_counts, only_bins=retain_bins) log.info( - "FIMO yield for motif %s: hits=%d accepted=%d selected=%d bins=%s accepted_bins=%s selected_bins=%s%s", + "FIMO yield for motif %s: hits=%d accepted=%d selected=%d bins=%s accepted_bins=%s selected_bins=%s", motif.motif_id, total_hits, accepted_hits, @@ -694,7 +885,6 @@ def _score_with_fimo( bins_label, accepted_label, selected_label, - f" allowed_bins={sorted(allowed_bins)}" if allowed_bins is not None else "", ) meta_by_seq: dict[str, dict] = {} for cand in picked: @@ -729,12 +919,10 @@ def _score_with_fimo( ) return (selected, {}) if return_metadata else selected selected, meta = _score_with_fimo( - [full_seq], - length_obs=str(target_len), + n_candidates=1, cap_applied=False, requested=1, - generated=1, - time_limited=False, + sequences=[full_seq], ) return (selected, meta) if return_metadata else selected @@ -755,34 +943,34 @@ def _score_with_fimo( cap_val, ) n_candidates = max(1, n_candidates) - candidates: List[Tuple[str, str]] = [] - lengths: List[int] = [] - start = time.monotonic() - time_limited = False - for _ in range(n_candidates): - if 
max_seconds is not None and candidates: - if (time.monotonic() - start) >= float(max_seconds): - time_limited = True - break - target_len = _resolve_length() - lengths.append(int(target_len)) - if strategy == "background": - core = sample_sequence_from_background(rng, motif.background, width) - else: - core = sample_sequence_from_pwm(rng, matrix) - full_seq = _embed_with_background(core, target_len) - candidates.append((full_seq, core)) - if time_limited: - log.warning( - "PWM sampling hit max_seconds for motif %s: generated=%d requested=%d", - motif.motif_id, - len(candidates), - requested_candidates, - ) - length_obs = "-" - if lengths: - length_obs = f"{min(lengths)}..{max(lengths)}" if min(lengths) != max(lengths) else str(lengths[0]) if scoring_backend == "densegen": + candidates: List[Tuple[str, str]] = [] + lengths: List[int] = [] + start = time.monotonic() + time_limited = False + for _ in range(n_candidates): + if max_seconds is not None and candidates: + if (time.monotonic() - start) >= float(max_seconds): + time_limited = True + break + target_len = _resolve_length() + lengths.append(int(target_len)) + if strategy == "background": + core = sample_sequence_from_background(rng, motif.background, width) + else: + core = sample_sequence_from_pwm(rng, matrix) + full_seq = _embed_with_background(core, target_len) + candidates.append((full_seq, core)) + if time_limited: + log.warning( + "PWM sampling hit max_seconds for motif %s: generated=%d requested=%d", + motif.motif_id, + len(candidates), + requested_candidates, + ) + length_obs = "-" + if lengths: + length_obs = f"{min(lengths)}..{max(lengths)}" if min(lengths) != max(lengths) else str(lengths[0]) scored = [ (full_seq, score_sequence(core, matrix, log_odds=log_odds, background=motif.background)) for full_seq, core in candidates @@ -797,11 +985,8 @@ def _score_with_fimo( ) return (selected, {}) if return_metadata else selected selected, meta = _score_with_fimo( - [full_seq for full_seq, _core in 
candidates], - length_obs=length_obs, cap_applied=cap_applied, requested=requested_candidates, - generated=len(candidates), - time_limited=time_limited, + n_candidates=n_candidates, ) return (selected, meta) if return_metadata else selected diff --git a/src/dnadesign/densegen/src/cli.py b/src/dnadesign/densegen/src/cli.py index fbc6e201..9fbd1195 100644 --- a/src/dnadesign/densegen/src/cli.py +++ b/src/dnadesign/densegen/src/cli.py @@ -960,6 +960,7 @@ def describe( "score", "selection", "bins", + "mining", "bgfile", "oversample", "max_candidates", @@ -997,9 +998,23 @@ def describe( bins_label = "canonical" if getattr(sampling, "pvalue_bins", None) is not None: bins_label = "custom" - bin_ids = getattr(sampling, "pvalue_bin_ids", None) + mining_cfg = getattr(sampling, "mining", None) + bin_ids = getattr(mining_cfg, "retain_bin_ids", None) + if bin_ids is None: + bin_ids = getattr(sampling, "pvalue_bin_ids", None) if bin_ids: - bins_label = f"{bins_label} pick={bin_ids}" + bins_label = f"{bins_label} retain={bin_ids}" + mining_label = "-" + mining_cfg = getattr(sampling, "mining", None) + if backend == "fimo" and mining_cfg is not None: + parts = [f"batch={mining_cfg.batch_size}"] + if mining_cfg.max_batches is not None: + parts.append(f"max_batches={mining_cfg.max_batches}") + if mining_cfg.max_seconds is not None: + parts.append(f"max_seconds={mining_cfg.max_seconds}s") + if mining_cfg.retain_bin_ids: + parts.append(f"retain={mining_cfg.retain_bin_ids}") + mining_label = ", ".join(parts) bgfile_label = getattr(sampling, "bgfile", None) or "-" length_label = str(sampling.length_policy) if sampling.length_policy == "range" and sampling.length_range is not None: @@ -1013,6 +1028,7 @@ def describe( score_label, str(selection_label), str(bins_label), + str(mining_label), str(bgfile_label), str(sampling.oversample_factor), str(sampling.max_candidates) if sampling.max_candidates is not None else "-", diff --git a/src/dnadesign/densegen/src/config/__init__.py 
b/src/dnadesign/densegen/src/config/__init__.py index beed783f..fb882183 100644 --- a/src/dnadesign/densegen/src/config/__init__.py +++ b/src/dnadesign/densegen/src/config/__init__.py @@ -13,6 +13,7 @@ from __future__ import annotations import os +import warnings from dataclasses import dataclass from pathlib import Path from typing import Annotated, Any, Dict, List, Optional, Union @@ -154,6 +155,59 @@ class SequenceLibraryInput(BaseModel): sequence_column: str = "sequence" +class PWMMiningConfig(BaseModel): + model_config = ConfigDict(extra="forbid") + batch_size: int = 100000 + max_batches: Optional[int] = None + max_seconds: Optional[float] = None + retain_bin_ids: Optional[List[int]] = None + log_every_batches: int = 1 + + @field_validator("batch_size") + @classmethod + def _batch_size_ok(cls, v: int): + if v <= 0: + raise ValueError("pwm.sampling.mining.batch_size must be > 0") + return v + + @field_validator("max_batches") + @classmethod + def _max_batches_ok(cls, v: Optional[int]): + if v is not None and v <= 0: + raise ValueError("pwm.sampling.mining.max_batches must be > 0 when set") + return v + + @field_validator("max_seconds") + @classmethod + def _max_seconds_ok(cls, v: Optional[float]): + if v is None: + return v + if not isinstance(v, (int, float)) or float(v) <= 0: + raise ValueError("pwm.sampling.mining.max_seconds must be > 0 when set") + return float(v) + + @field_validator("retain_bin_ids") + @classmethod + def _retain_bin_ids_ok(cls, v: Optional[List[int]]): + if v is None: + return v + if not v: + raise ValueError("pwm.sampling.mining.retain_bin_ids must be non-empty when set") + ids = [int(x) for x in v] + if any(idx < 0 for idx in ids): + raise ValueError("pwm.sampling.mining.retain_bin_ids values must be >= 0") + if len(set(ids)) != len(ids): + raise ValueError("pwm.sampling.mining.retain_bin_ids must be unique") + return ids + + @field_validator("log_every_batches") + @classmethod + def _log_every_batches_ok(cls, v: int): + if v <= 0: + 
raise ValueError("pwm.sampling.mining.log_every_batches must be > 0") + return v + + class PWMSamplingConfig(BaseModel): model_config = ConfigDict(extra="forbid") strategy: Literal["consensus", "stochastic", "background"] = "stochastic" @@ -167,6 +221,7 @@ class PWMSamplingConfig(BaseModel): pvalue_threshold: Optional[float] = None pvalue_bins: Optional[List[float]] = None pvalue_bin_ids: Optional[List[int]] = None + mining: Optional[PWMMiningConfig] = None bgfile: Optional[str] = None selection_policy: Literal["random_uniform", "top_n", "stratified"] = "random_uniform" keep_all_candidates_debug: bool = False @@ -284,6 +339,8 @@ def _score_mode(self): raise ValueError("pwm.sampling.pvalue_bins is only valid when scoring_backend='fimo'") if self.pvalue_bin_ids is not None: raise ValueError("pwm.sampling.pvalue_bin_ids is only valid when scoring_backend='fimo'") + if self.mining is not None: + raise ValueError("pwm.sampling.mining is only valid when scoring_backend='fimo'") if self.include_matched_sequence: raise ValueError("pwm.sampling.include_matched_sequence is only valid when scoring_backend='fimo'") else: @@ -291,11 +348,26 @@ def _score_mode(self): raise ValueError("pwm.sampling.pvalue_threshold is required when scoring_backend='fimo'") if not (0.0 < float(self.pvalue_threshold) <= 1.0): raise ValueError("pwm.sampling.pvalue_threshold must be between 0 and 1") - if self.pvalue_bin_ids is not None: + if self.pvalue_bin_ids is not None and self.mining is not None: + raise ValueError( + "pwm.sampling.pvalue_bin_ids is deprecated; use pwm.sampling.mining.retain_bin_ids instead." 
+ ) + if self.pvalue_bin_ids is not None and self.mining is None: + warnings.warn( + "pwm.sampling.pvalue_bin_ids is deprecated; use pwm.sampling.mining.retain_bin_ids.", + stacklevel=2, + ) + self.mining = PWMMiningConfig(retain_bin_ids=list(self.pvalue_bin_ids)) + bin_ids = None + if self.mining is not None and self.mining.retain_bin_ids is not None: + bin_ids = list(self.mining.retain_bin_ids) + elif self.pvalue_bin_ids is not None: + bin_ids = list(self.pvalue_bin_ids) + if bin_ids is not None: bins = list(self.pvalue_bins) if self.pvalue_bins is not None else list(CANONICAL_PVALUE_BINS) max_idx = len(bins) - 1 - if any(idx > max_idx for idx in self.pvalue_bin_ids): - raise ValueError("pwm.sampling.pvalue_bin_ids contains an index outside the available bins") + if any(idx > max_idx for idx in bin_ids): + raise ValueError("pwm.sampling.mining.retain_bin_ids contains an index outside the available bins") if self.strategy == "consensus" and int(self.n_sites) != 1: raise ValueError("pwm.sampling.strategy=consensus requires n_sites=1") if self.scoring_backend == "densegen" and self.score_percentile is not None: @@ -945,6 +1017,9 @@ class LoggingConfig(BaseModel): level: str = "INFO" suppress_solver_stderr: bool = True print_visual: bool = True + progress_style: Literal["stream", "summary", "screen"] = "stream" + progress_every: int = 1 + progress_refresh_seconds: float = 1.0 @field_validator("log_dir") @classmethod @@ -962,6 +1037,20 @@ def _level_ok(cls, v: str): raise ValueError(f"logging.level must be one of {sorted(allowed)}") return lv + @field_validator("progress_every") + @classmethod + def _progress_every_ok(cls, v: int): + if v < 0: + raise ValueError("logging.progress_every must be >= 0") + return int(v) + + @field_validator("progress_refresh_seconds") + @classmethod + def _progress_refresh_ok(cls, v: float): + if not isinstance(v, (int, float)) or float(v) <= 0: + raise ValueError("logging.progress_refresh_seconds must be > 0") + return float(v) + # ---- 
Plots ---- class PlotConfig(BaseModel): diff --git a/src/dnadesign/densegen/src/core/metadata.py b/src/dnadesign/densegen/src/core/metadata.py index cb8d8ca6..861de3ed 100644 --- a/src/dnadesign/densegen/src/core/metadata.py +++ b/src/dnadesign/densegen/src/core/metadata.py @@ -147,6 +147,11 @@ def build_metadata( "input_pwm_pvalue_threshold": input_meta.get("input_pwm_pvalue_threshold"), "input_pwm_pvalue_bins": input_meta.get("input_pwm_pvalue_bins"), "input_pwm_pvalue_bin_ids": input_meta.get("input_pwm_pvalue_bin_ids"), + "input_pwm_mining_batch_size": input_meta.get("input_pwm_mining_batch_size"), + "input_pwm_mining_max_batches": input_meta.get("input_pwm_mining_max_batches"), + "input_pwm_mining_max_seconds": input_meta.get("input_pwm_mining_max_seconds"), + "input_pwm_mining_retain_bin_ids": input_meta.get("input_pwm_mining_retain_bin_ids"), + "input_pwm_mining_log_every_batches": input_meta.get("input_pwm_mining_log_every_batches"), "input_pwm_selection_policy": input_meta.get("input_pwm_selection_policy"), "input_pwm_bgfile": input_meta.get("input_pwm_bgfile"), "input_pwm_keep_all_candidates_debug": input_meta.get("input_pwm_keep_all_candidates_debug"), diff --git a/src/dnadesign/densegen/src/core/metadata_schema.py b/src/dnadesign/densegen/src/core/metadata_schema.py index 70379bd3..ca0c2736 100644 --- a/src/dnadesign/densegen/src/core/metadata_schema.py +++ b/src/dnadesign/densegen/src/core/metadata_schema.py @@ -98,7 +98,27 @@ class MetaField: MetaField("input_pwm_score_percentile", (numbers.Real,), "PWM score percentile.", allow_none=True), MetaField("input_pwm_pvalue_threshold", (numbers.Real,), "PWM p-value threshold (FIMO).", allow_none=True), MetaField("input_pwm_pvalue_bins", (list,), "PWM p-value bins (FIMO).", allow_none=True), - MetaField("input_pwm_pvalue_bin_ids", (list,), "Selected p-value bin indices (FIMO).", allow_none=True), + MetaField( + "input_pwm_pvalue_bin_ids", + (list,), + "Deprecated: selected p-value bin indices (use 
input_pwm_mining_retain_bin_ids).", + allow_none=True, + ), + MetaField("input_pwm_mining_batch_size", (int,), "PWM mining batch size (FIMO).", allow_none=True), + MetaField("input_pwm_mining_max_batches", (int,), "PWM mining max batches (FIMO).", allow_none=True), + MetaField("input_pwm_mining_max_seconds", (numbers.Real,), "PWM mining max seconds (FIMO).", allow_none=True), + MetaField( + "input_pwm_mining_retain_bin_ids", + (list,), + "PWM mining retained p-value bin indices (FIMO).", + allow_none=True, + ), + MetaField( + "input_pwm_mining_log_every_batches", + (int,), + "PWM mining log frequency (batches).", + allow_none=True, + ), MetaField("input_pwm_selection_policy", (str,), "PWM selection policy (FIMO).", allow_none=True), MetaField("input_pwm_bgfile", (str,), "PWM background model path (FIMO).", allow_none=True), MetaField("input_pwm_keep_all_candidates_debug", (bool,), "PWM FIMO debug TSV enabled.", allow_none=True), @@ -224,10 +244,15 @@ def _validate_list_fields(meta: Mapping[str, Any]) -> None: for item in vals: if not isinstance(item, int): raise TypeError("Metadata field 'input_pwm_pvalue_bin_ids' must contain only integers") - if not isinstance(item["tf"], str): - raise TypeError("used_tf_counts.tf must be a string") - if not isinstance(item["count"], int): - raise TypeError("used_tf_counts.count must be an int") + + if "input_pwm_mining_retain_bin_ids" in meta: + vals = meta["input_pwm_mining_retain_bin_ids"] + if vals is not None: + if isinstance(vals, (str, bytes)) or not isinstance(vals, Sequence): + raise TypeError("Metadata field 'input_pwm_mining_retain_bin_ids' must be a list of integers") + for item in vals: + if not isinstance(item, int): + raise TypeError("Metadata field 'input_pwm_mining_retain_bin_ids' must contain only integers") if "min_count_by_regulator" in meta: vals = meta["min_count_by_regulator"] diff --git a/src/dnadesign/densegen/src/core/pipeline.py b/src/dnadesign/densegen/src/core/pipeline.py index dea38e94..df88519c 
100644 --- a/src/dnadesign/densegen/src/core/pipeline.py +++ b/src/dnadesign/densegen/src/core/pipeline.py @@ -29,6 +29,7 @@ import numpy as np import pandas as pd +from rich.console import Console from ..adapters.optimizer import DenseArraysAdapter, OptimizerAdapter from ..adapters.outputs import OutputRecord, SinkBase, build_sinks, load_records_from_config, resolve_bio_alphabet @@ -165,6 +166,16 @@ def _sampling_attr(sampling, name: str, default=None): return default +def _mining_attr(mining, name: str, default=None): + if mining is None: + return default + if hasattr(mining, name): + return getattr(mining, name) + if isinstance(mining, dict): + return mining.get(name, default) + return default + + def _resolve_pvalue_bins_meta(sampling) -> list[float] | None: if sampling is None: return None @@ -201,6 +212,15 @@ def _extract_pwm_sampling_config(source_cfg) -> dict | None: length_range = _sampling_attr(sampling, "length_range") if length_range is not None: length_range = list(length_range) + mining = _sampling_attr(sampling, "mining") + mining_batch_size = _mining_attr(mining, "batch_size") + mining_max_batches = _mining_attr(mining, "max_batches") + mining_max_seconds = _mining_attr(mining, "max_seconds") + mining_retain_bin_ids = _mining_attr(mining, "retain_bin_ids") + legacy_bin_ids = _sampling_attr(sampling, "pvalue_bin_ids") + if mining_retain_bin_ids is None: + mining_retain_bin_ids = legacy_bin_ids + mining_log_every_batches = _mining_attr(mining, "log_every_batches") return { "strategy": _sampling_attr(sampling, "strategy"), "scoring_backend": _sampling_attr(sampling, "scoring_backend"), @@ -215,11 +235,21 @@ def _extract_pwm_sampling_config(source_cfg) -> dict | None: "score_percentile": _sampling_attr(sampling, "score_percentile"), "pvalue_threshold": _sampling_attr(sampling, "pvalue_threshold"), "pvalue_bins": _resolve_pvalue_bins_meta(sampling), + "pvalue_bin_ids": legacy_bin_ids, "selection_policy": _sampling_attr(sampling, "selection_policy"), 
"bgfile": _sampling_attr(sampling, "bgfile"), "keep_all_candidates_debug": _sampling_attr(sampling, "keep_all_candidates_debug"), "length_policy": _sampling_attr(sampling, "length_policy"), "length_range": length_range, + "mining": { + "batch_size": mining_batch_size, + "max_batches": mining_max_batches, + "max_seconds": mining_max_seconds, + "retain_bin_ids": mining_retain_bin_ids, + "log_every_batches": mining_log_every_batches, + } + if mining is not None + else None, } @@ -476,7 +506,15 @@ def _input_metadata(source_cfg, cfg_path: Path) -> dict: meta["input_pwm_score_percentile"] = getattr(sampling, "score_percentile", None) meta["input_pwm_pvalue_threshold"] = getattr(sampling, "pvalue_threshold", None) meta["input_pwm_pvalue_bins"] = _resolve_pvalue_bins_meta(sampling) - meta["input_pwm_pvalue_bin_ids"] = getattr(sampling, "pvalue_bin_ids", None) + mining_cfg = getattr(sampling, "mining", None) + retained_bins = _mining_attr(mining_cfg, "retain_bin_ids") + legacy_bin_ids = getattr(sampling, "pvalue_bin_ids", None) + meta["input_pwm_pvalue_bin_ids"] = legacy_bin_ids if legacy_bin_ids is not None else retained_bins + meta["input_pwm_mining_batch_size"] = _mining_attr(mining_cfg, "batch_size") + meta["input_pwm_mining_max_batches"] = _mining_attr(mining_cfg, "max_batches") + meta["input_pwm_mining_max_seconds"] = _mining_attr(mining_cfg, "max_seconds") + meta["input_pwm_mining_retain_bin_ids"] = retained_bins + meta["input_pwm_mining_log_every_batches"] = _mining_attr(mining_cfg, "log_every_batches") meta["input_pwm_selection_policy"] = getattr(sampling, "selection_policy", None) meta["input_pwm_bgfile"] = getattr(sampling, "bgfile", None) meta["input_pwm_keep_all_candidates_debug"] = getattr(sampling, "keep_all_candidates_debug", None) @@ -1296,6 +1334,12 @@ def _process_plan_for_source( log_cfg = global_cfg.logging print_visual = bool(log_cfg.print_visual) + progress_style = str(getattr(log_cfg, "progress_style", "stream")) + progress_every = 
int(getattr(log_cfg, "progress_every", 1)) + progress_refresh_seconds = float(getattr(log_cfg, "progress_refresh_seconds", 1.0)) + screen_console = Console() if progress_style == "screen" else None + last_screen_refresh = 0.0 + latest_failure_totals: str | None = None policy_gc_fill = str(fill_mode) policy_sampling = pool_strategy @@ -1376,6 +1420,11 @@ def _process_plan_for_source( selection_policy = _sampling_attr(input_sampling_cfg, "selection_policy") length_policy = _sampling_attr(input_sampling_cfg, "length_policy") length_range = _sampling_attr(input_sampling_cfg, "length_range") + mining_cfg = _sampling_attr(input_sampling_cfg, "mining") + mining_batch_size = _mining_attr(mining_cfg, "batch_size") + mining_max_batches = _mining_attr(mining_cfg, "max_batches") + mining_max_seconds = _mining_attr(mining_cfg, "max_seconds") + mining_retain_bins = _mining_attr(mining_cfg, "retain_bin_ids") if length_range is not None: length_range = list(length_range) score_label = "-" @@ -1389,9 +1438,13 @@ def _process_plan_for_source( bins_label = "-" if scoring_backend == "fimo": bins_label = "canonical" if _sampling_attr(input_sampling_cfg, "pvalue_bins") is None else "custom" - bin_ids = _sampling_attr(input_sampling_cfg, "pvalue_bin_ids") + bin_ids = ( + mining_retain_bins + if mining_retain_bins is not None + else _sampling_attr(input_sampling_cfg, "pvalue_bin_ids") + ) if bin_ids: - bins_label = f"{bins_label} pick={sorted(list(bin_ids))}" + bins_label = f"{bins_label} retain={sorted(list(bin_ids))}" length_label = str(length_policy) if length_policy == "range" and length_range: length_label = f"{length_policy}({length_range[0]}..{length_range[1]})" @@ -1404,9 +1457,19 @@ def _process_plan_for_source( cap_label = f"{cap_label}; max_seconds={max_seconds}" if cap_label != "-" else f"{max_seconds}s" counts_label = _summarize_tf_counts(meta_df["tf"].tolist()) selection_label = selection_policy if scoring_backend == "fimo" else "-" + mining_label = "-" + if scoring_backend 
== "fimo" and mining_cfg is not None: + parts = [] + if mining_batch_size is not None: + parts.append(f"batch={mining_batch_size}") + if mining_max_batches is not None: + parts.append(f"max_batches={mining_max_batches}") + if mining_max_seconds is not None: + parts.append(f"max_seconds={mining_max_seconds}s") + mining_label = ", ".join(parts) if parts else "enabled" log.info( "PWM input sampling for %s: motifs=%d | sites=%s | strategy=%s | backend=%s | score=%s | " - "selection=%s | bins=%s | oversample=%s | max_candidates=%s | length=%s", + "selection=%s | bins=%s | mining=%s | oversample=%s | max_candidates=%s | length=%s", source_label, len(input_meta.get("input_pwm_ids") or []), counts_label or "-", @@ -1415,6 +1478,7 @@ def _process_plan_for_source( score_label, selection_label, bins_label, + mining_label, oversample, cap_label, length_label, @@ -2301,37 +2365,75 @@ def _make_generator(_library_for_opt: List[str], _regulator_labels: List[str]): pct = 100.0 * (global_generated / max(1, quota)) bar = _format_progress_bar(global_generated, quota, width=24) cr = getattr(sol, "compression_ratio", float("nan")) - if print_visual: - log.info( - "╭─ %s/%s %s %d/%d (%.2f%%) — local %d/%d — CR=%.3f\n" - "%s\nsequence %s\n" - "╰────────────────────────────────────────────────────────", - source_label, - plan_name, - bar, - global_generated, - quota, - pct, - local_generated, - max_per_subsample, - cr, - derived["visual"], - final_seq, - ) + should_log = progress_every > 0 and global_generated % max(1, progress_every) == 0 + if progress_style == "screen": + if should_log and screen_console is not None: + now = time.monotonic() + if (now - last_screen_refresh) >= progress_refresh_seconds: + screen_console.clear() + seq_preview = final_seq if len(final_seq) <= 120 else f"{final_seq[:117]}..." 
+ screen_console.print( + f"[bold]{source_label}/{plan_name}[/] {bar} {global_generated}/{quota} ({pct:.2f}%)" + ) + screen_console.print( + f"local {local_generated}/{max_per_subsample} | CR={cr:.3f} | " + f"resamples={total_resamples} dup_out={duplicate_records} " + f"dup_sol={duplicate_solutions} fails={failed_solutions} stalls={stall_events}" + ) + if latest_failure_totals: + screen_console.print(f"failures: {latest_failure_totals}") + if tf_usage_counts: + screen_console.print( + f"TF leaderboard: {_summarize_leaderboard(tf_usage_counts, top=5)}" + ) + if usage_counts: + screen_console.print(f"TFBS leaderboard: {_summarize_leaderboard(usage_counts, top=5)}") + diversity_label = _summarize_diversity( + usage_counts, + tf_usage_counts, + library_tfs=library_tfs, + library_tfbs=library_tfbs, + ) + screen_console.print(f"Diversity: {diversity_label}") + if print_visual: + screen_console.print(derived["visual"]) + screen_console.print(f"sequence {seq_preview}") + last_screen_refresh = now + elif progress_style == "summary": + pass else: - log.info( - "[%s/%s] %s %d/%d (%.2f%%) (local %d/%d) CR=%.3f | seq %s", - source_label, - plan_name, - bar, - global_generated, - quota, - pct, - local_generated, - max_per_subsample, - cr, - final_seq, - ) + if should_log: + if print_visual: + log.info( + "╭─ %s/%s %s %d/%d (%.2f%%) — local %d/%d — CR=%.3f\n" + "%s\nsequence %s\n" + "╰────────────────────────────────────────────────────────", + source_label, + plan_name, + bar, + global_generated, + quota, + pct, + local_generated, + max_per_subsample, + cr, + derived["visual"], + final_seq, + ) + else: + log.info( + "[%s/%s] %s %d/%d (%.2f%%) (local %d/%d) CR=%.3f | seq %s", + source_label, + plan_name, + bar, + global_generated, + quota, + pct, + local_generated, + max_per_subsample, + cr, + final_seq, + ) if leaderboard_every > 0 and global_generated % max(1, leaderboard_every) == 0: failure_totals = _summarize_failure_totals( @@ -2339,56 +2441,58 @@ def 
_make_generator(_library_for_opt: List[str], _regulator_labels: List[str]): input_name=source_label, plan_name=plan_name, ) - log.info( - "[%s/%s] Progress %s %d/%d (%.2f%%) | resamples=%d dup_out=%d " - "dup_sol=%d fails=%d stalls=%d | %s", - source_label, - plan_name, - bar, - global_generated, - quota, - pct, - total_resamples, - duplicate_records, - duplicate_solutions, - failed_solutions, - stall_events, - failure_totals, - ) - log.info( - "[%s/%s] Leaderboard (TF): %s", - source_label, - plan_name, - _summarize_leaderboard(tf_usage_counts, top=5), - ) - log.info( - "[%s/%s] Leaderboard (TFBS): %s", - source_label, - plan_name, - _summarize_leaderboard(usage_counts, top=5), - ) - log.info( - "[%s/%s] Failed TFBS: %s", - source_label, - plan_name, - _summarize_failure_leaderboard( - failure_counts, - input_name=source_label, - plan_name=plan_name, - top=5, - ), - ) - log.info( - "[%s/%s] Diversity: %s", - source_label, - plan_name, - _summarize_diversity( - usage_counts, - tf_usage_counts, - library_tfs=library_tfs, - library_tfbs=library_tfbs, - ), - ) + latest_failure_totals = failure_totals + if progress_style != "screen": + log.info( + "[%s/%s] Progress %s %d/%d (%.2f%%) | resamples=%d dup_out=%d " + "dup_sol=%d fails=%d stalls=%d | %s", + source_label, + plan_name, + bar, + global_generated, + quota, + pct, + total_resamples, + duplicate_records, + duplicate_solutions, + failed_solutions, + stall_events, + failure_totals, + ) + log.info( + "[%s/%s] Leaderboard (TF): %s", + source_label, + plan_name, + _summarize_leaderboard(tf_usage_counts, top=5), + ) + log.info( + "[%s/%s] Leaderboard (TFBS): %s", + source_label, + plan_name, + _summarize_leaderboard(usage_counts, top=5), + ) + log.info( + "[%s/%s] Failed TFBS: %s", + source_label, + plan_name, + _summarize_failure_leaderboard( + failure_counts, + input_name=source_label, + plan_name=plan_name, + top=5, + ), + ) + log.info( + "[%s/%s] Diversity: %s", + source_label, + plan_name, + _summarize_diversity( + 
usage_counts, + tf_usage_counts, + library_tfs=library_tfs, + library_tfbs=library_tfbs, + ), + ) log.info( "[%s/%s] Example: %s", source_label, @@ -2702,6 +2806,11 @@ def _accumulate_stats(key: tuple[str, str], stats: dict) -> None: # Round-robin scheduler round_robin = bool(cfg.runtime.round_robin) + if round_robin and str(cfg.generation.sampling.pool_strategy) == "iterative_subsample": + log.warning( + "round_robin=true with pool_strategy=iterative_subsample will rebuild libraries more frequently; " + "expect higher runtime for multi-plan runs." + ) inputs = cfg.inputs checkpoint_every = int(cfg.runtime.checkpoint_every) state_counts: dict[tuple[str, str], int] = {} diff --git a/src/dnadesign/densegen/tests/test_cli_summarize_library.py b/src/dnadesign/densegen/tests/test_cli_summarize_library.py index 49618288..5445905a 100644 --- a/src/dnadesign/densegen/tests/test_cli_summarize_library.py +++ b/src/dnadesign/densegen/tests/test_cli_summarize_library.py @@ -65,6 +65,11 @@ def _base_meta(library_hash: str, library_index: int) -> dict: "input_pwm_pvalue_threshold": None, "input_pwm_pvalue_bins": None, "input_pwm_pvalue_bin_ids": None, + "input_pwm_mining_batch_size": None, + "input_pwm_mining_max_batches": None, + "input_pwm_mining_max_seconds": None, + "input_pwm_mining_retain_bin_ids": None, + "input_pwm_mining_log_every_batches": None, "input_pwm_selection_policy": None, "input_pwm_bgfile": None, "input_pwm_keep_all_candidates_debug": None, diff --git a/src/dnadesign/densegen/tests/test_outputs_parquet.py b/src/dnadesign/densegen/tests/test_outputs_parquet.py index 606bd03f..83d5c7da 100644 --- a/src/dnadesign/densegen/tests/test_outputs_parquet.py +++ b/src/dnadesign/densegen/tests/test_outputs_parquet.py @@ -60,6 +60,11 @@ def _dummy_meta() -> dict: "input_pwm_pvalue_threshold": None, "input_pwm_pvalue_bins": None, "input_pwm_pvalue_bin_ids": None, + "input_pwm_mining_batch_size": None, + "input_pwm_mining_max_batches": None, + 
"input_pwm_mining_max_seconds": None, + "input_pwm_mining_retain_bin_ids": None, + "input_pwm_mining_log_every_batches": None, "input_pwm_selection_policy": None, "input_pwm_bgfile": None, "input_pwm_keep_all_candidates_debug": None, diff --git a/src/dnadesign/densegen/tests/test_pwm_fimo_utils.py b/src/dnadesign/densegen/tests/test_pwm_fimo_utils.py index 95c2aa8b..6195073f 100644 --- a/src/dnadesign/densegen/tests/test_pwm_fimo_utils.py +++ b/src/dnadesign/densegen/tests/test_pwm_fimo_utils.py @@ -41,7 +41,7 @@ def test_write_minimal_meme_motif(tmp_path: Path) -> None: def test_write_candidates_fasta(tmp_path: Path) -> None: - records = build_candidate_records("My Motif", ["ACG", "TTT"]) + records = build_candidate_records("My Motif", ["ACG", "TTT"], start_index=5) out = tmp_path / "candidates.fasta" write_candidates_fasta(records, out) lines = out.read_text().splitlines() @@ -49,8 +49,8 @@ def test_write_candidates_fasta(tmp_path: Path) -> None: assert lines[1] == "ACG" assert lines[2].startswith(">") assert lines[3] == "TTT" - assert records[0][0].endswith("|cand0") - assert records[1][0].endswith("|cand1") + assert records[0][0].endswith("|cand5") + assert records[1][0].endswith("|cand6") def test_parse_fimo_tsv_and_best_hits() -> None: diff --git a/src/dnadesign/densegen/tests/test_pwm_sampling_mining.py b/src/dnadesign/densegen/tests/test_pwm_sampling_mining.py new file mode 100644 index 00000000..9895fdcb --- /dev/null +++ b/src/dnadesign/densegen/tests/test_pwm_sampling_mining.py @@ -0,0 +1,80 @@ +from __future__ import annotations + +from pathlib import Path + +import numpy as np + +from dnadesign.densegen.src.adapters.sources import pwm_fimo +from dnadesign.densegen.src.adapters.sources.pwm_sampling import PWMMotif, sample_pwm_sites + + +def _parse_fasta(path: Path) -> list[str]: + ids: list[str] = [] + with path.open() as handle: + for line in handle: + if line.startswith(">"): + ids.append(line.strip().lstrip(">")) + return ids + + +def 
test_pwm_sampling_fimo_mining_retain_bins(monkeypatch) -> None: + motif = PWMMotif( + motif_id="M1", + matrix=[ + {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}, + {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}, + {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}, + ], + background={"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}, + ) + + def fake_run_fimo(*, meme_motif_path, fasta_path, **_kwargs): # type: ignore[override] + ids = _parse_fasta(Path(fasta_path)) + rows = [] + for idx, rec_id in enumerate(ids): + pval = 1e-6 if idx % 2 == 0 else 1e-2 + rows.append( + { + "sequence_name": rec_id, + "start": 1, + "stop": 3, + "strand": "+", + "score": 5.0, + "p_value": pval, + "matched_sequence": "AAA", + } + ) + return rows, None + + monkeypatch.setattr(pwm_fimo, "run_fimo", fake_run_fimo) + + rng = np.random.default_rng(0) + selected, meta = sample_pwm_sites( + rng, + motif, + strategy="stochastic", + n_sites=2, + oversample_factor=2, + max_candidates=None, + max_seconds=None, + score_threshold=None, + score_percentile=None, + scoring_backend="fimo", + pvalue_threshold=1e-1, + pvalue_bins=[1e-5, 1e-3, 1.0], + selection_policy="random_uniform", + mining={ + "batch_size": 2, + "max_batches": 2, + "retain_bin_ids": [0], + "log_every_batches": 1, + }, + include_matched_sequence=True, + return_metadata=True, + ) + + assert len(selected) == 2 + for seq in selected: + info = meta[seq] + assert info["fimo_bin_id"] == 0 + assert info["fimo_matched_sequence"] == "AAA" diff --git a/src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml b/src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml index e5794dd6..cd0bcb45 100644 --- a/src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml +++ b/src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml @@ -21,11 +21,16 @@ densegen: sampling: strategy: stochastic n_sites: 80 - oversample_factor: 12 - max_candidates: 50000 # bounded candidate generation + oversample_factor: 200 + max_candidates: 20000 # bounded 
candidate generation (cap across mining batches)
         scoring_backend: fimo
         pvalue_threshold: 1e-4
         selection_policy: stratified
+        mining:
+          batch_size: 5000
+          max_batches: 4
+          retain_bin_ids: [0, 1, 2, 3]
+          log_every_batches: 1
         length_policy: range
         length_range: [22, 28]
@@ -92,7 +97,10 @@ densegen:
     log_dir: outputs/logs
     level: INFO
     suppress_solver_stderr: true
-    print_visual: false
+    print_visual: true
+    progress_style: screen
+    progress_every: 1
+    progress_refresh_seconds: 1.0
 
   plots:
     source: parquet

From 14279f0ece5f23a9bbf3c71454708056d805c1b7 Mon Sep 17 00:00:00 2001
From: Eric South
Date: Tue, 20 Jan 2026 11:04:27 -0500
Subject: [PATCH 06/40] densegen: cache input sampling and improve run UX

---
 .../densegen/docs/demo/demo_basic.md          |   4 +
 .../densegen/docs/guide/generation.md         |   3 +
 src/dnadesign/densegen/docs/reference/cli.md  |  10 +
 src/dnadesign/densegen/src/cli.py             |  21 ++
 src/dnadesign/densegen/src/core/pipeline.py   |  17 +-
 .../densegen/tests/test_source_cache.py       | 212 ++++++++++++++++++
 6 files changed, 264 insertions(+), 3 deletions(-)
 create mode 100644 src/dnadesign/densegen/tests/test_source_cache.py

diff --git a/src/dnadesign/densegen/docs/demo/demo_basic.md b/src/dnadesign/densegen/docs/demo/demo_basic.md
index d322242e..15c4377e 100644
--- a/src/dnadesign/densegen/docs/demo/demo_basic.md
+++ b/src/dnadesign/densegen/docs/demo/demo_basic.md
@@ -95,6 +95,10 @@ Example output:
 ✨ Run staged: /private/tmp/densegen-demo-20260115-1405/demo_press/config.yaml
 ```
 
+If you re-run the demo in the same run root and DenseGen’s schema has changed, you may see a
+Parquet schema mismatch. Either delete `outputs/dense_arrays.parquet` and
+`outputs/_densegen_ids.sqlite`, or stage a fresh workspace.
+
 ## 3) Validate config
 
 ```bash
diff --git a/src/dnadesign/densegen/docs/guide/generation.md b/src/dnadesign/densegen/docs/guide/generation.md
index e3f29a8f..ab7f6cba 100644
--- a/src/dnadesign/densegen/docs/guide/generation.md
+++ b/src/dnadesign/densegen/docs/guide/generation.md
@@ -122,6 +122,9 @@ Round‑robin is **distinct from Stage‑B sampling** (`generation.sampling`): l
 uses the same policy per plan, but round‑robin can trigger more frequent library rebuilds when
 `pool_strategy: iterative_subsample` is used. Expect extra compute if many plans are active.
 
+Input PWM sampling is performed **once per run** and cached across round‑robin passes. If you
+need a fresh PWM sample, start a new run (or stage a new workspace).
+
 ---
 
 ### Regulator constraints
diff --git a/src/dnadesign/densegen/docs/reference/cli.md b/src/dnadesign/densegen/docs/reference/cli.md
index 2c139a4b..bc1a9b69 100644
--- a/src/dnadesign/densegen/docs/reference/cli.md
+++ b/src/dnadesign/densegen/docs/reference/cli.md
@@ -74,6 +74,8 @@ Options:
 - `--log-file PATH` - override the log file path. Otherwise DenseGen writes to
   `logging.log_dir/.log` inside the workspace. The override path must still resolve inside `densegen.run.root`.
+Notes:
+- If you enable `scoring_backend: fimo`, run via `pixi run dense ...` (or ensure `fimo` is on PATH).
 
@@ -114,6 +116,8 @@ Options:
 - `--by-library/--no-by-library` - group library summaries per build attempt.
 - `--top-per-tf` - limit TFBS rows per TF when summarizing.
 - `--show-library-hash/--short-library-hash` - toggle full vs short library hashes.
+Tip:
+- For large runs, prefer `--no-by-library` or lower `--top`/`--top-per-tf` to keep output readable.
 
 ---
 
@@ -143,6 +147,12 @@ Demo run (small, Parquet-only config):
 uv run dense run -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml --no-plot
 ```
 
+FIMO-backed sampling (pixi):
+
+```bash
+pixi run dense run -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml --no-plot
+```
+
 ---
 
 @e-south
diff --git a/src/dnadesign/densegen/src/cli.py b/src/dnadesign/densegen/src/cli.py
index 9fbd1195..04a58201 100644
--- a/src/dnadesign/densegen/src/cli.py
+++ b/src/dnadesign/densegen/src/cli.py
@@ -247,6 +247,23 @@ def _render_missing_input_hint(cfg_path: Path, loaded, exc: Exception) -> None:
         console.print(f" - {hint}")
 
 
+def _render_output_schema_hint(exc: Exception) -> bool:
+    msg = str(exc)
+    if "Existing Parquet schema does not match the current DenseGen schema" in msg:
+        console.print(f"[bold red]Output schema mismatch:[/] {msg}")
+        console.print("[bold]Next steps[/]:")
+        console.print(" - Remove outputs/dense_arrays.parquet and outputs/_densegen_ids.sqlite, or")
+        console.print(" - Stage a fresh workspace with `dense stage --copy-inputs` and re-run.")
+        return True
+    if "Output sinks are out of sync before run" in msg:
+        console.print(f"[bold red]Output sink mismatch:[/] {msg}")
+        console.print("[bold]Next steps[/]:")
+        console.print(" - Remove stale outputs so sinks align, or")
+        console.print(" - Run with a single output target to rebuild from scratch.")
+        return True
+    return False
+
+
 def _warn_pwm_sampling_configs(loaded, cfg_path: Path) -> None:
     warnings: list[str] = []
     for inp in loaded.root.densegen.inputs:
@@ -1189,6 +1206,10 @@ def run(
     except FileNotFoundError as exc:
         _render_missing_input_hint(cfg_path, loaded, exc)
         raise typer.Exit(code=1)
+    except RuntimeError as exc:
+        if _render_output_schema_hint(exc):
+            raise typer.Exit(code=1)
+        raise
 
     console.print(":tada: [bold green]Run complete[/].")
     console.print("[bold]Next steps[/]:")
diff --git a/src/dnadesign/densegen/src/core/pipeline.py b/src/dnadesign/densegen/src/core/pipeline.py
index df88519c..4b997aa7 100644 --- a/src/dnadesign/densegen/src/core/pipeline.py +++ b/src/dnadesign/densegen/src/core/pipeline.py @@ -1283,6 +1283,7 @@ def _process_plan_for_source( checkpoint_every: int = 0, write_state: Callable[[], None] | None = None, site_failure_counts: dict[tuple[str, str, str, str, str | None], dict[str, int]] | None = None, + source_cache: dict[str, tuple[list, pd.DataFrame | None]] | None = None, ) -> tuple[int, dict]: source_label = source_cfg.name plan_name = plan_item.name @@ -1366,9 +1367,16 @@ def _process_plan_for_source( outputs_root = run_root_path / "outputs" existing_library_builds = _load_existing_library_index(outputs_root) - # Load source - src_obj = deps.source_factory(source_cfg, cfg_path) - data_entries, meta_df = src_obj.load_data(rng=np_rng, outputs_root=outputs_root) + # Load source (cache PWM sampling results across round-robin passes). + cache_key = source_label + cached = source_cache.get(cache_key) if source_cache is not None else None + if cached is None: + src_obj = deps.source_factory(source_cfg, cfg_path) + data_entries, meta_df = src_obj.load_data(rng=np_rng, outputs_root=outputs_root) + if source_cache is not None: + source_cache[cache_key] = (data_entries, meta_df) + else: + data_entries, meta_df = cached input_meta = _input_metadata(source_cfg, cfg_path) input_tf_tfbs_pair_count: int | None = None if meta_df is not None and isinstance(meta_df, pd.DataFrame): @@ -2705,6 +2713,7 @@ def run_pipeline(loaded: LoadedConfig, *, deps: PipelineDeps | None = None) -> R plan_order: list[tuple[str, str]] = [] plan_leaderboards: dict[tuple[str, str], dict] = {} inputs_manifest_entries: dict[str, dict] = {} + source_cache: dict[str, tuple[list, pd.DataFrame | None]] = {} outputs_root = run_outputs_root(run_root) outputs_root.mkdir(parents=True, exist_ok=True) ensure_run_meta_dir(run_root) @@ -2889,6 +2898,7 @@ def _write_state() -> None: checkpoint_every=checkpoint_every, write_state=_write_state, 
site_failure_counts=site_failure_counts, + source_cache=source_cache, ) per_plan[(s.name, item.name)] = per_plan.get((s.name, item.name), 0) + produced total += produced @@ -2936,6 +2946,7 @@ def _write_state() -> None: checkpoint_every=checkpoint_every, write_state=_write_state, site_failure_counts=site_failure_counts, + source_cache=source_cache, ) produced_counts[key] = current + produced leaderboard_latest = stats.get("leaderboard_latest") diff --git a/src/dnadesign/densegen/tests/test_source_cache.py b/src/dnadesign/densegen/tests/test_source_cache.py new file mode 100644 index 00000000..9c83e950 --- /dev/null +++ b/src/dnadesign/densegen/tests/test_source_cache.py @@ -0,0 +1,212 @@ +from __future__ import annotations + +import random +from pathlib import Path + +import numpy as np +import yaml + +from dnadesign.densegen.src.adapters.optimizer import OptimizerRun +from dnadesign.densegen.src.adapters.outputs.base import SinkBase +from dnadesign.densegen.src.config import load_config +from dnadesign.densegen.src.core.pipeline import PipelineDeps, _process_plan_for_source + + +class _DummySink(SinkBase): + def __init__(self) -> None: + self.records = [] + + def add(self, record): + self.records.append(record) + return True + + def flush(self) -> None: + return None + + +class _DummyOpt: + def forbid(self, _sol) -> None: + return None + + +class _DummySol: + def __init__(self, sequence: str, library: list[str]) -> None: + self.sequence = sequence + self.library = library + self._indices = [0] + self.compression_ratio = 1.0 + + def offset_indices_in_order(self): + return [(0, idx) for idx in self._indices] + + +class _DummyAdapter: + def probe_solver(self, backend: str, *, test_length: int = 10) -> None: + return None + + def build( + self, + *, + library, + sequence_length, + solver, + strategy, + solver_options, + fixed_elements, + strands="double", + regulator_by_index=None, + required_regulators=None, + min_count_by_regulator=None, + 
min_required_regulators=None, + ): + opt = _DummyOpt() + seqs = ["AAA", "CCC"] + + def _gen(): + for seq in seqs: + yield _DummySol(sequence=seq, library=library) + + return OptimizerRun(optimizer=opt, generator=_gen()) + + +class _DummySource: + def __init__(self, entries: list[str]) -> None: + self.entries = entries + self.calls = 0 + + def load_data(self, *, rng, outputs_root): + self.calls += 1 + return self.entries, None + + +def test_source_cache_reuses_loaded_inputs(tmp_path: Path) -> None: + run_dir = tmp_path / "run" + run_dir.mkdir() + (run_dir / "outputs" / "parquet").mkdir(parents=True) + (run_dir / "logs").mkdir() + + seq_path = run_dir / "seqs.csv" + seq_path.write_text("sequence\nAAA\nCCC\nGGG\nTTT\n") + + cfg = { + "densegen": { + "schema_version": "2.2", + "run": {"id": "demo", "root": "."}, + "inputs": [ + { + "name": "demo", + "type": "sequence_library", + "path": str(seq_path), + "format": "csv", + "sequence_column": "sequence", + } + ], + "output": { + "targets": ["parquet"], + "schema": {"bio_type": "dna", "alphabet": "dna_4"}, + "parquet": {"path": "outputs/dense_arrays.parquet"}, + }, + "generation": { + "sequence_length": 3, + "quota": 2, + "sampling": { + "pool_strategy": "subsample", + "library_size": 2, + "subsample_over_length_budget_by": 0, + "library_sampling_strategy": "tf_balanced", + "cover_all_regulators": False, + "unique_binding_sites": True, + "max_sites_per_regulator": None, + "relax_on_exhaustion": False, + "allow_incomplete_coverage": False, + "iterative_max_libraries": 2, + "iterative_min_new_solutions": 0, + }, + "plan": [{"name": "default", "quota": 2}], + }, + "solver": {"backend": "CBC", "strategy": "iterate", "options": []}, + "runtime": { + "round_robin": True, + "arrays_generated_before_resample": 1, + "min_count_per_tf": 0, + "max_duplicate_solutions": 5, + "stall_seconds_before_resample": 10, + "stall_warning_every_seconds": 10, + "max_resample_attempts": 1, + "max_total_resamples": 1, + "max_seconds_per_plan": 0, 
+ "max_failed_solutions": 0, + "random_seed": 1, + }, + "postprocess": {"gap_fill": {"mode": "off", "end": "5prime", "gc_min": 0.4, "gc_max": 0.6}}, + "logging": {"log_dir": "logs", "level": "INFO"}, + } + } + + cfg_path = run_dir / "config.yaml" + cfg_path.write_text(yaml.safe_dump(cfg)) + loaded = load_config(cfg_path) + + dummy_source = _DummySource(entries=["AAA", "CCC", "GGG", "TTT"]) + sink = _DummySink() + deps = PipelineDeps( + source_factory=lambda _cfg, _path: dummy_source, + sink_factory=lambda _cfg, _path: [sink], + optimizer=_DummyAdapter(), + gap_fill=lambda *args, **kwargs: "", + ) + + plan_item = loaded.root.densegen.generation.resolve_plan()[0] + source_cache: dict[str, tuple[list, None]] = {} + + _process_plan_for_source( + loaded.root.densegen.inputs[0], + plan_item, + loaded.root.densegen, + [sink], + chosen_solver="CBC", + deps=deps, + rng=random.Random(1), + np_rng=np.random.default_rng(1), + cfg_path=loaded.path, + run_id=loaded.root.densegen.run.id, + run_root=str(run_dir), + run_config_path="config.yaml", + run_config_sha256="sha", + random_seed=1, + dense_arrays_version=None, + dense_arrays_version_source="test", + output_bio_type="dna", + output_alphabet="dna_4", + one_subsample_only=True, + already_generated=0, + inputs_manifest={}, + source_cache=source_cache, + ) + + _process_plan_for_source( + loaded.root.densegen.inputs[0], + plan_item, + loaded.root.densegen, + [sink], + chosen_solver="CBC", + deps=deps, + rng=random.Random(1), + np_rng=np.random.default_rng(1), + cfg_path=loaded.path, + run_id=loaded.root.densegen.run.id, + run_root=str(run_dir), + run_config_path="config.yaml", + run_config_sha256="sha", + random_seed=1, + dense_arrays_version=None, + dense_arrays_version_source="test", + output_bio_type="dna", + output_alphabet="dna_4", + one_subsample_only=True, + already_generated=0, + inputs_manifest={}, + source_cache=source_cache, + ) + + assert dummy_source.calls == 1 From 176c4d1caa1d62b1f2c66361dc22e63ce2a32acf Mon Sep 17 
00:00:00 2001
From: Eric South
Date: Tue, 20 Jan 2026 13:06:47 -0500
Subject: [PATCH 07/40] densegen: tighten FIMO mining config and preflight checks

---
 .../densegen/src/adapters/outputs/parquet.py  |   2 +-
 .../src/adapters/sources/pwm_artifact.py      |   2 -
 .../src/adapters/sources/pwm_artifact_set.py  |   2 -
 .../densegen/src/adapters/sources/pwm_fimo.py |   9 +-
 .../src/adapters/sources/pwm_jaspar.py        |   2 -
 .../src/adapters/sources/pwm_matrix_csv.py    |   2 -
 .../densegen/src/adapters/sources/pwm_meme.py |   2 -
 .../src/adapters/sources/pwm_meme_set.py      |   2 -
 .../src/adapters/sources/pwm_sampling.py      | 110 ++-
 src/dnadesign/densegen/src/cli.py             | 703 ++++++++++++++----
 src/dnadesign/densegen/src/config/__init__.py |  57 +-
 src/dnadesign/densegen/src/core/metadata.py   |   2 +-
 .../densegen/src/core/metadata_schema.py      |  16 +-
 src/dnadesign/densegen/src/core/pipeline.py   | 567 ++++++++------
 src/dnadesign/densegen/src/core/reporting.py  |  49 +-
 .../densegen/src/integrations/meme_suite.py   |  10 +
 .../densegen/tests/test_cli_config_option.py  |   4 +-
 .../densegen/tests/test_cli_describe.py       |   2 +-
 .../tests/test_cli_summarize_library.py       |   4 +-
 .../densegen/tests/test_config_strict.py      |  23 +
 .../densegen/tests/test_outputs_parquet.py    |   2 +-
 .../densegen/tests/test_pwm_fimo_utils.py     |   5 +-
 .../tests/test_pwm_sampling_mining.py         |  29 +
 .../workspaces/demo_meme_two_tf/config.yaml   |   2 +-
 24 files changed, 1107 insertions(+), 501 deletions(-)

diff --git a/src/dnadesign/densegen/src/adapters/outputs/parquet.py b/src/dnadesign/densegen/src/adapters/outputs/parquet.py
index a45456f2..0751907b 100644
--- a/src/dnadesign/densegen/src/adapters/outputs/parquet.py
+++ b/src/dnadesign/densegen/src/adapters/outputs/parquet.py
@@ -36,7 +36,6 @@ def _meta_arrow_type(name: str, pa):
         "input_pwm_pvalue_bins",
     }
     list_int = {
-        "input_pwm_pvalue_bin_ids",
         "input_pwm_mining_retain_bin_ids",
     }
     int_fields = {
@@ -48,6 +47,7 @@ def _meta_arrow_type(name: str, pa):
         "input_pwm_oversample_factor",
         "input_pwm_mining_batch_size",
"input_pwm_mining_max_batches", + "input_pwm_mining_max_candidates", "input_pwm_mining_log_every_batches", "input_row_count", "input_tf_count", diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_artifact.py b/src/dnadesign/densegen/src/adapters/sources/pwm_artifact.py index e193617d..1339aa68 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_artifact.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_artifact.py @@ -176,7 +176,6 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): scoring_backend = str(sampling.get("scoring_backend", "densegen")).lower() pvalue_threshold = sampling.get("pvalue_threshold") pvalue_bins = sampling.get("pvalue_bins") - pvalue_bin_ids = sampling.get("pvalue_bin_ids") mining = sampling.get("mining") bgfile = sampling.get("bgfile") selection_policy = str(sampling.get("selection_policy", "random_uniform")) @@ -205,7 +204,6 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): scoring_backend=scoring_backend, pvalue_threshold=pvalue_threshold, pvalue_bins=pvalue_bins, - pvalue_bin_ids=pvalue_bin_ids, mining=mining, bgfile=bgfile_path, selection_policy=selection_policy, diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_artifact_set.py b/src/dnadesign/densegen/src/adapters/sources/pwm_artifact_set.py index 6a87a1f0..9a9353af 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_artifact_set.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_artifact_set.py @@ -72,7 +72,6 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): scoring_backend = str(sampling_cfg.get("scoring_backend", "densegen")).lower() pvalue_threshold = sampling_cfg.get("pvalue_threshold") pvalue_bins = sampling_cfg.get("pvalue_bins") - pvalue_bin_ids = sampling_cfg.get("pvalue_bin_ids") mining = sampling_cfg.get("mining") bgfile = sampling_cfg.get("bgfile") selection_policy = str(sampling_cfg.get("selection_policy", "random_uniform")) @@ -100,7 +99,6 @@ def load_data(self, 
*, rng=None, outputs_root: Path | None = None): scoring_backend=scoring_backend, pvalue_threshold=pvalue_threshold, pvalue_bins=pvalue_bins, - pvalue_bin_ids=pvalue_bin_ids, mining=mining, bgfile=bgfile_path, selection_policy=selection_policy, diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_fimo.py b/src/dnadesign/densegen/src/adapters/sources/pwm_fimo.py index 1cb2fc4b..353e6900 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_fimo.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_fimo.py @@ -19,7 +19,7 @@ from pathlib import Path from typing import Iterable, Sequence -from ...integrations.meme_suite import resolve_executable +from ...integrations.meme_suite import require_executable from .pwm_sampling import PWMMotif, normalize_background _HEADER_RE = re.compile(r"[\s\-]+") @@ -158,12 +158,7 @@ def run_fimo( include_matched_sequence: bool = False, return_tsv: bool = False, ) -> tuple[list[dict], str | None]: - exe = resolve_executable("fimo", tool_path=None) - if exe is None: - raise FileNotFoundError( - "FIMO executable not found. Install MEME Suite and ensure `fimo` is on PATH, " - "or set MEME_BIN to the MEME bin directory (pixi users: `pixi run dense ...`)." 
- ) + exe = require_executable("fimo", tool_path=None) cmd = [str(exe), "--text"] if not include_matched_sequence: cmd.append("--skip-matched-sequence") diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_jaspar.py b/src/dnadesign/densegen/src/adapters/sources/pwm_jaspar.py index 4ce3594f..bb08ba6d 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_jaspar.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_jaspar.py @@ -116,7 +116,6 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): scoring_backend = str(sampling.get("scoring_backend", "densegen")).lower() pvalue_threshold = sampling.get("pvalue_threshold") pvalue_bins = sampling.get("pvalue_bins") - pvalue_bin_ids = sampling.get("pvalue_bin_ids") mining = sampling.get("mining") bgfile = sampling.get("bgfile") selection_policy = str(sampling.get("selection_policy", "random_uniform")) @@ -148,7 +147,6 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): scoring_backend=scoring_backend, pvalue_threshold=pvalue_threshold, pvalue_bins=pvalue_bins, - pvalue_bin_ids=pvalue_bin_ids, mining=mining, bgfile=bgfile_path, selection_policy=selection_policy, diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_matrix_csv.py b/src/dnadesign/densegen/src/adapters/sources/pwm_matrix_csv.py index 049eecfd..7e313dad 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_matrix_csv.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_matrix_csv.py @@ -80,7 +80,6 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): scoring_backend = str(sampling.get("scoring_backend", "densegen")).lower() pvalue_threshold = sampling.get("pvalue_threshold") pvalue_bins = sampling.get("pvalue_bins") - pvalue_bin_ids = sampling.get("pvalue_bin_ids") mining = sampling.get("mining") bgfile = sampling.get("bgfile") selection_policy = str(sampling.get("selection_policy", "random_uniform")) @@ -109,7 +108,6 @@ def load_data(self, *, rng=None, outputs_root: Path | 
None = None): scoring_backend=scoring_backend, pvalue_threshold=pvalue_threshold, pvalue_bins=pvalue_bins, - pvalue_bin_ids=pvalue_bin_ids, mining=mining, bgfile=bgfile_path, selection_policy=selection_policy, diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_meme.py b/src/dnadesign/densegen/src/adapters/sources/pwm_meme.py index dc3facb0..bce0a6fe 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_meme.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_meme.py @@ -94,7 +94,6 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): scoring_backend = str(sampling.get("scoring_backend", "densegen")).lower() pvalue_threshold = sampling.get("pvalue_threshold") pvalue_bins = sampling.get("pvalue_bins") - pvalue_bin_ids = sampling.get("pvalue_bin_ids") mining = sampling.get("mining") bgfile = sampling.get("bgfile") selection_policy = str(sampling.get("selection_policy", "random_uniform")) @@ -127,7 +126,6 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): scoring_backend=scoring_backend, pvalue_threshold=pvalue_threshold, pvalue_bins=pvalue_bins, - pvalue_bin_ids=pvalue_bin_ids, mining=mining, bgfile=bgfile_path, selection_policy=selection_policy, diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_meme_set.py b/src/dnadesign/densegen/src/adapters/sources/pwm_meme_set.py index cafece29..c081095b 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_meme_set.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_meme_set.py @@ -88,7 +88,6 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): scoring_backend = str(sampling.get("scoring_backend", "densegen")).lower() pvalue_threshold = sampling.get("pvalue_threshold") pvalue_bins = sampling.get("pvalue_bins") - pvalue_bin_ids = sampling.get("pvalue_bin_ids") mining = sampling.get("mining") bgfile = sampling.get("bgfile") selection_policy = str(sampling.get("selection_policy", "random_uniform")) @@ -121,7 +120,6 @@ def 
load_data(self, *, rng=None, outputs_root: Path | None = None): scoring_backend=scoring_backend, pvalue_threshold=pvalue_threshold, pvalue_bins=pvalue_bins, - pvalue_bin_ids=pvalue_bin_ids, mining=mining, bgfile=bgfile_path, selection_policy=selection_policy, diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_sampling.py b/src/dnadesign/densegen/src/adapters/sources/pwm_sampling.py index 5b84db1a..630b5291 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_sampling.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_sampling.py @@ -249,7 +249,10 @@ def select_by_score( "increase oversample_factor", ] if context.get("cap_applied"): - suggestions.append("increase max_candidates (cap was hit)") + if context.get("mining_max_candidates") is not None: + suggestions.append("increase mining.max_candidates (cap was hit)") + else: + suggestions.append("increase max_candidates (cap was hit)") if context.get("time_limited"): suggestions.append("increase max_seconds (time limit was hit)") if context.get("width") is not None and int(context.get("width")) <= 6: @@ -378,19 +381,21 @@ def _select_fimo_candidates( msg_lines.append(f"Observed candidate lengths={context.get('length_observed')}.") if context.get("pvalue_bins_label") is not None: msg_lines.append(f"P-value bins={context.get('pvalue_bins_label')}.") - if context.get("pvalue_bin_ids") is not None: - msg_lines.append(f"Retained bins={context.get('pvalue_bin_ids')}.") + if context.get("retain_bin_ids") is not None: + msg_lines.append(f"Retained bins={context.get('retain_bin_ids')}.") suggestions = [ "reduce n_sites", "relax pvalue_threshold (e.g., 1e-4 → 1e-3)", "increase oversample_factor", ] - if context.get("pvalue_bin_ids") is not None: + if context.get("retain_bin_ids") is not None: suggestions.append("broaden mining.retain_bin_ids (or remove bin filtering)") if context.get("cap_applied"): suggestions.append("increase max_candidates (cap was hit)") if context.get("time_limited"): 
suggestions.append("increase max_seconds (time limit was hit)") + if context.get("mining_max_candidates") is not None and context.get("mining_candidates_limited"): + suggestions.append("increase mining.max_candidates") if context.get("mining_max_batches") is not None and context.get("mining_batches_limited"): suggestions.append("increase mining.max_batches") if context.get("mining_max_seconds") is not None and context.get("mining_time_limited"): @@ -434,7 +439,6 @@ def sample_pwm_sites( scoring_backend: str = "densegen", pvalue_threshold: Optional[float] = None, pvalue_bins: Optional[Sequence[float]] = None, - pvalue_bin_ids: Optional[Sequence[int]] = None, mining: Optional[object] = None, bgfile: Optional[str | Path] = None, selection_policy: str = "random_uniform", @@ -462,8 +466,6 @@ def sample_pwm_sites( raise ValueError("PWM sampling requires exactly one of score_threshold or score_percentile") if pvalue_bins is not None: raise ValueError("pvalue_bins is only valid when scoring_backend='fimo'") - if pvalue_bin_ids is not None: - raise ValueError("pvalue_bin_ids is only valid when scoring_backend='fimo'") if mining is not None: raise ValueError("mining is only valid when scoring_backend='fimo'") if include_matched_sequence: @@ -474,12 +476,13 @@ def sample_pwm_sites( pvalue_threshold = float(pvalue_threshold) if not (0.0 < pvalue_threshold <= 1.0): raise ValueError("pwm.sampling.pvalue_threshold must be between 0 and 1") + if max_candidates is not None or max_seconds is not None: + raise ValueError( + "max_candidates/max_seconds are only supported for densegen scoring; " + "use mining.max_candidates or mining.max_seconds for fimo." 
+ ) if selection_policy not in {"random_uniform", "top_n", "stratified"}: raise ValueError(f"Unsupported pwm selection_policy: {selection_policy}") - if mining is not None: - retain_bins = _mining_attr(mining, "retain_bin_ids") - if retain_bins is not None and pvalue_bin_ids is not None: - raise ValueError("Provide retain_bin_ids in mining or pvalue_bin_ids, not both.") if score_threshold is not None or score_percentile is not None: log.warning( "PWM sampling scoring_backend=fimo ignores score_threshold/score_percentile for motif %s.", @@ -524,16 +527,25 @@ def sample_pwm_sites( if length_policy == "range" and length_range is not None and len(length_range) == 2: length_label = f"{length_policy}({length_range[0]}..{length_range[1]})" - def _cap_label(cap_applied: bool, time_limited: bool) -> str: + def _cap_label( + cap_applied: bool, + time_limited: bool, + *, + mining_max_candidates: Optional[int] = None, + ) -> str: cap_label = "" - if cap_applied and max_candidates is not None: - cap_label = f" (capped by max_candidates={max_candidates})" + if cap_applied: + if mining_max_candidates is not None: + cap_label = f" (capped by mining.max_candidates={mining_max_candidates})" + elif max_candidates is not None: + cap_label = f" (capped by max_candidates={max_candidates})" if time_limited and max_seconds is not None: cap_label = f"{cap_label}; max_seconds={max_seconds}" if cap_label else f" (max_seconds={max_seconds})" return cap_label def _context(length_obs: str, cap_applied: bool, requested: int, generated: int, time_limited: bool) -> dict: mining_cfg = mining + mining_max_candidates = _mining_attr(mining_cfg, "max_candidates") return { "motif_id": motif.motif_id, "width": width, @@ -548,13 +560,14 @@ def _context(length_obs: str, cap_applied: bool, requested: int, generated: int, "requested_candidates": requested, "generated_candidates": generated, "cap_applied": cap_applied, - "cap_label": _cap_label(cap_applied, time_limited), + "cap_label": 
_cap_label(cap_applied, time_limited, mining_max_candidates=mining_max_candidates), "time_limited": time_limited, "mining_batch_size": _mining_attr(mining_cfg, "batch_size"), "mining_max_batches": _mining_attr(mining_cfg, "max_batches"), "mining_max_seconds": _mining_attr(mining_cfg, "max_seconds"), "mining_log_every_batches": _mining_attr(mining_cfg, "log_every_batches"), "mining_retain_bin_ids": _mining_attr(mining_cfg, "retain_bin_ids"), + "mining_max_candidates": mining_max_candidates, } def _select( @@ -620,8 +633,6 @@ def _score_with_fimo( raise ValueError("pvalue_threshold required for fimo backend") resolved_bins = _resolve_pvalue_edges(pvalue_bins) retain_bins = _mining_attr(mining, "retain_bin_ids") - if retain_bins is None and pvalue_bin_ids is not None: - retain_bins = list(pvalue_bin_ids) allowed_bins: Optional[set[int]] = None if retain_bins is not None: allowed_bins = {int(idx) for idx in retain_bins} @@ -631,8 +642,20 @@ def _score_with_fimo( keep_weak = keep_low mining_batch_size = int(_mining_attr(mining, "batch_size", n_candidates)) mining_max_batches = _mining_attr(mining, "max_batches") + mining_max_candidates = _mining_attr(mining, "max_candidates") mining_max_seconds = _mining_attr(mining, "max_seconds") mining_log_every = int(_mining_attr(mining, "log_every_batches", 1)) + log.info( + "FIMO mining config for %s: target=%d batch=%d " + "max_batches=%s max_candidates=%s max_seconds=%s retain_bins=%s", + motif.motif_id, + n_candidates, + mining_batch_size, + str(mining_max_batches) if mining_max_batches is not None else "-", + str(mining_max_candidates) if mining_max_candidates is not None else "-", + str(mining_max_seconds) if mining_max_seconds is not None else "-", + str(sorted(allowed_bins)) if allowed_bins is not None else "all", + ) debug_path: Optional[Path] = None debug_dir = debug_output_dir if keep_all_candidates_debug: @@ -693,6 +716,7 @@ def _generate_batch(count: int) -> tuple[list[str], list[int], bool]: time_limited = False 
mining_time_limited = False mining_batches_limited = False + mining_candidates_limited = False batches = 0 tsv_lines: list[str] = [] provided_sequences = sequences @@ -759,6 +783,9 @@ def _generate_batch(count: int) -> tuple[list[str], list[int], bool]: if mining_max_batches is not None and batches >= int(mining_max_batches): mining_batches_limited = True break + if mining_max_candidates is not None and generated_total >= int(mining_max_candidates): + mining_candidates_limited = True + break if mining_max_seconds is not None and (time.monotonic() - mining_start) >= float( mining_max_seconds ): @@ -827,11 +854,12 @@ def _generate_batch(count: int) -> tuple[list[str], list[int], bool]: bins_label = _format_pvalue_bins(resolved_bins, total_bin_counts, only_bins=retain_bins) accepted_label = _format_pvalue_bins(resolved_bins, accepted_bin_counts, only_bins=retain_bins) log.info( - "FIMO mining %s batch %d/%s: generated=%d accepted=%d bins=%s accepted_bins=%s", + "FIMO mining %s batch %d/%s: generated=%d/%d accepted=%d bins=%s accepted_bins=%s", motif.motif_id, batches, str(mining_max_batches) if mining_max_batches is not None else "-", generated_total, + n_candidates, len(candidates), bins_label, accepted_label, @@ -855,12 +883,14 @@ def _generate_batch(count: int) -> tuple[list[str], list[int], bool]: context = _context(length_obs, cap_applied, requested, generated_total, time_limited) context["pvalue_bins_label"] = bins_label - context["pvalue_bin_ids"] = sorted(allowed_bins) if allowed_bins is not None else None + context["retain_bin_ids"] = sorted(allowed_bins) if allowed_bins is not None else None context["mining_batch_size"] = mining_batch_size context["mining_max_batches"] = mining_max_batches + context["mining_max_candidates"] = mining_max_candidates context["mining_max_seconds"] = mining_max_seconds context["mining_time_limited"] = mining_time_limited context["mining_batches_limited"] = mining_batches_limited + context["mining_candidates_limited"] = 
mining_candidates_limited picked = _select_fimo_candidates( candidates, n_sites=n_sites, @@ -929,19 +959,35 @@ def _generate_batch(count: int) -> tuple[list[str], list[int], bool]: requested_candidates = max(1, n_sites * oversample_factor) n_candidates = requested_candidates cap_applied = False - if max_candidates is not None: - cap_val = int(max_candidates) - if cap_val <= 0: - raise ValueError("max_candidates must be > 0 when set") - if requested_candidates > cap_val: - n_candidates = cap_val - cap_applied = True - log.warning( - "PWM sampling capped candidate generation for motif %s: requested=%d max_candidates=%d", - motif.motif_id, - requested_candidates, - cap_val, - ) + mining_max_candidates = _mining_attr(mining, "max_candidates") + if scoring_backend == "densegen": + if max_candidates is not None: + cap_val = int(max_candidates) + if cap_val <= 0: + raise ValueError("max_candidates must be > 0 when set") + if requested_candidates > cap_val: + n_candidates = cap_val + cap_applied = True + log.warning( + "PWM sampling capped candidate generation for motif %s: requested=%d max_candidates=%d", + motif.motif_id, + requested_candidates, + cap_val, + ) + else: + if mining_max_candidates is not None: + mining_cap = int(mining_max_candidates) + if mining_cap < n_sites: + raise ValueError("pwm.sampling.mining.max_candidates must be >= n_sites") + if mining_cap != requested_candidates: + cap_applied = mining_cap < requested_candidates + n_candidates = mining_cap + log.info( + "PWM mining candidate target for motif %s: requested=%d mining.max_candidates=%d", + motif.motif_id, + requested_candidates, + mining_cap, + ) n_candidates = max(1, n_candidates) if scoring_backend == "densegen": candidates: List[Tuple[str, str]] = [] diff --git a/src/dnadesign/densegen/src/cli.py b/src/dnadesign/densegen/src/cli.py index 04a58201..807b26d9 100644 --- a/src/dnadesign/densegen/src/cli.py +++ b/src/dnadesign/densegen/src/cli.py @@ -6,14 +6,18 @@ Typer/Rich CLI entrypoint for 
DenseGen. Commands: - - validate : Validate YAML config (schema + sanity). - - plan : Show resolved per-constraint quota plan. - - stage : Scaffold a new workspace with config.yaml + subfolders. - - run : Execute generation pipeline; optionally auto-plot. - - plot : Generate plots from outputs using config YAML. - - ls-plots : List available plot names and descriptions. - - summarize : Print an outputs/meta/run_manifest.json summary table. - - report : Generate audit-grade report tables for a run. + - validate-config : Validate YAML config (schema + sanity). + - inspect inputs : Show resolved inputs + PWM sampling. + - inspect plan : Show resolved per-constraint quota plan. + - inspect config : Describe resolved config (inputs/outputs/solver). + - inspect run : Summarize run manifest or list workspaces. + - workspace init : Scaffold a new workspace with config.yaml + subfolders. + - stage-a build-pool : Build Stage-A TFBS pools from inputs. + - stage-b build-libraries : Build Stage-B libraries from pools/inputs. + - run : Execute generation pipeline; optionally auto-plot. + - plot : Generate plots from outputs using config YAML. + - ls-plots : List available plot names and descriptions. + - report : Generate audit-grade report tables for a run. 
Run: python -m dnadesign.densegen.src.cli --help @@ -27,15 +31,20 @@ import contextlib import io +import json +import logging import os import platform +import random import re import shutil import sys import tempfile +from datetime import datetime, timezone from pathlib import Path from typing import Iterator, Optional +import numpy as np import pandas as pd import typer import yaml @@ -50,17 +59,27 @@ resolve_relative_path, resolve_run_root, resolve_run_scoped_path, + schema_version_at_least, +) +from .core.pipeline import ( + _load_existing_library_index, + _load_failure_counts_from_attempts, + build_library_for_plan, + default_deps, + resolve_plan, + run_pipeline, ) -from .core.pipeline import resolve_plan, run_pipeline from .core.reporting import collect_report_data, write_report from .core.run_manifest import load_run_manifest from .core.run_paths import run_manifest_path, run_state_path from .core.run_state import load_run_state +from .integrations.meme_suite import require_executable from .utils.logging_utils import install_native_stderr_filters, setup_logging rich_traceback(show_locals=False) console = Console() _PYARROW_SYSCTL_PATTERN = re.compile(r"sysctlbyname failed for 'hw\.") +log = logging.getLogger(__name__) @contextlib.contextmanager @@ -102,6 +121,36 @@ def _densegen_root_from(file_path: Path) -> Path: DEFAULT_WORKSPACES_ROOT = DENSEGEN_ROOT / "workspaces" +def _input_uses_fimo(input_cfg) -> bool: + sampling = getattr(input_cfg, "sampling", None) + backend = str(getattr(sampling, "scoring_backend", "densegen")).lower() if sampling is not None else "" + if backend == "fimo": + return True + overrides = getattr(input_cfg, "overrides_by_motif_id", None) + if isinstance(overrides, dict): + for override in overrides.values(): + try: + override_backend = str(override.get("scoring_backend", "")).lower() + except Exception: + continue + if override_backend == "fimo": + return True + return False + + +def _ensure_fimo_available(cfg, *, strict: bool = 
True) -> None: + if not any(_input_uses_fimo(inp) for inp in cfg.inputs): + return + try: + require_executable("fimo", tool_path=None) + except FileNotFoundError as exc: + msg = f"FIMO is required for this config but was not found. {exc}" + if strict: + console.print(f"[bold red]{msg}[/]") + raise typer.Exit(code=1) + log.warning(msg) + + def _default_config_path() -> Path: # Prefer a realistic, self-contained MEME demo config inside the package tree. return DENSEGEN_ROOT / "workspaces" / "demo_meme_two_tf" / "config.yaml" @@ -180,6 +229,141 @@ def _short_hash(val: str, *, n: int = 8) -> str: return val[:n] +def _print_inputs_summary(loaded) -> None: + cfg = loaded.root.densegen + inputs = Table("name", "type", "source") + for inp in cfg.inputs: + if hasattr(inp, "path"): + src = str(resolve_relative_path(loaded.path, inp.path)) + elif hasattr(inp, "paths"): + resolved = [str(resolve_relative_path(loaded.path, p)) for p in getattr(inp, "paths") or []] + src = f"{len(resolved)} files" + if resolved: + src = f"{len(resolved)} files ({resolved[0]})" + elif hasattr(inp, "dataset"): + src = f"{inp.dataset} (root={resolve_relative_path(loaded.path, inp.root)})" + else: + src = "-" + inputs.add_row(inp.name, inp.type, src) + console.print(inputs) + + pwm_inputs = [ + inp + for inp in cfg.inputs + if getattr(inp, "type", "") + in { + "pwm_meme", + "pwm_meme_set", + "pwm_jaspar", + "pwm_matrix_csv", + "pwm_artifact", + "pwm_artifact_set", + } + ] + if not pwm_inputs: + return + pwm_table = Table( + "name", + "motifs", + "n_sites", + "strategy", + "backend", + "score", + "selection", + "bins", + "mining", + "bgfile", + "oversample", + "max_candidates", + "max_seconds", + "length", + ) + for inp in pwm_inputs: + sampling = getattr(inp, "sampling", None) + if sampling is None: + continue + if inp.type == "pwm_matrix_csv": + motif_label = str(getattr(inp, "motif_id", "-")) + elif inp.type in {"pwm_meme", "pwm_meme_set", "pwm_jaspar"}: + motif_ids = getattr(inp, "motif_ids", 
None) or [] + motif_label = ", ".join(motif_ids) if motif_ids else "all" + if inp.type == "pwm_meme_set": + file_count = len(getattr(inp, "paths", []) or []) + motif_label = f"{motif_label} ({file_count} files)" + elif inp.type == "pwm_artifact_set": + motif_label = f"{len(getattr(inp, 'paths', []) or [])} artifacts" + else: + motif_label = "from artifact" + backend = getattr(sampling, "scoring_backend", "densegen") + score_label = "-" + if backend == "fimo" and sampling.pvalue_threshold is not None: + comparator = ">=" if sampling.strategy == "background" else "<=" + score_label = f"pvalue{comparator}{sampling.pvalue_threshold}" + elif sampling.score_threshold is not None: + score_label = f"threshold={sampling.score_threshold}" + elif sampling.score_percentile is not None: + score_label = f"percentile={sampling.score_percentile}" + selection_label = "-" if backend != "fimo" else (getattr(sampling, "selection_policy", None) or "-") + bins_label = "-" + if backend == "fimo": + bins_label = "canonical" + if getattr(sampling, "pvalue_bins", None) is not None: + bins_label = "custom" + mining_cfg = getattr(sampling, "mining", None) + bin_ids = getattr(mining_cfg, "retain_bin_ids", None) + if bin_ids: + bins_label = f"{bins_label} retain={bin_ids}" + mining_label = "-" + mining_cfg = getattr(sampling, "mining", None) + if backend == "fimo" and mining_cfg is not None: + parts = [f"batch={mining_cfg.batch_size}"] + if mining_cfg.max_batches is not None: + parts.append(f"max_batches={mining_cfg.max_batches}") + if getattr(mining_cfg, "max_candidates", None) is not None: + parts.append(f"max_candidates={mining_cfg.max_candidates}") + if mining_cfg.max_seconds is not None: + parts.append(f"max_seconds={mining_cfg.max_seconds}s") + if mining_cfg.retain_bin_ids: + parts.append(f"retain={mining_cfg.retain_bin_ids}") + mining_label = ", ".join(parts) + bgfile_label = getattr(sampling, "bgfile", None) or "-" + length_label = str(sampling.length_policy) + if sampling.length_policy 
== "range" and sampling.length_range is not None: + length_label = f"range({sampling.length_range[0]}..{sampling.length_range[1]})" + pwm_table.add_row( + inp.name, + motif_label, + str(sampling.n_sites), + str(sampling.strategy), + str(backend), + score_label, + str(selection_label), + str(bins_label), + str(mining_label), + str(bgfile_label), + str(sampling.oversample_factor), + str(sampling.max_candidates) if sampling.max_candidates is not None else "-", + str(sampling.max_seconds) if sampling.max_seconds is not None else "-", + length_label, + ) + console.print("[bold]Input-stage PWM sampling[/]") + console.print(pwm_table) + console.print( + " -> Produces the realized TFBS pool (input_tfbs_count), captured in inputs_manifest.json after runs." + ) + + +def _pool_manifest_path(out_dir: Path) -> Path: + return out_dir / "pool_manifest.json" + + +def _load_pool_manifest(out_dir: Path) -> dict: + manifest_path = _pool_manifest_path(out_dir) + if not manifest_path.exists(): + raise FileNotFoundError(f"Pool manifest not found: {manifest_path}") + return json.loads(manifest_path.read_text()) + + def _list_dir_entries(path: Path, *, limit: int = 10) -> list[str]: if not path.exists() or not path.is_dir(): return [] @@ -229,7 +413,9 @@ def _render_missing_input_hint(cfg_path: Path, loaded, exc: Exception) -> None: hints = [] if (cfg_path.parent / "inputs").exists(): - hints.append("If this is a staged run dir, use `dense stage --copy-inputs` or copy files into run/inputs.") + hints.append( + "If this is a staged run dir, use `dense workspace init --copy-inputs` or copy files into run/inputs." 
+ ) missing_str = " ".join(str(p) for p in missing) demo_paths = ( "cruncher/workspaces/demo_basics_two_tf", @@ -253,7 +439,7 @@ def _render_output_schema_hint(exc: Exception) -> bool: console.print(f"[bold red]Output schema mismatch:[/] {msg}") console.print("[bold]Next steps[/]:") console.print(" - Remove outputs/dense_arrays.parquet and outputs/_densegen_ids.sqlite, or") - console.print(" - Stage a fresh workspace with `dense stage --copy-inputs` and re-run.") + console.print(" - Stage a fresh workspace with `dense workspace init --copy-inputs` and re-run.") return True if "Output sinks are out of sync before run" in msg: console.print(f"[bold red]Output sink mismatch:[/] {msg}") @@ -425,6 +611,15 @@ def _list_workspaces_table(workspaces_root: Path, *, limit: int, show_all: bool) no_args_is_help=True, help="DenseGen — Dense Array Generator (Typer/Rich CLI)", ) +inspect_app = typer.Typer(add_completion=False, no_args_is_help=True, help="Inspect configs, inputs, and runs.") +stage_a_app = typer.Typer(add_completion=False, no_args_is_help=True, help="Stage A helpers (input TFBS pools).") +stage_b_app = typer.Typer(add_completion=False, no_args_is_help=True, help="Stage B helpers (library sampling).") +workspace_app = typer.Typer(add_completion=False, no_args_is_help=True, help="Workspace scaffolding.") + +app.add_typer(inspect_app, name="inspect") +app.add_typer(stage_a_app, name="stage-a") +app.add_typer(stage_b_app, name="stage-b") +app.add_typer(workspace_app, name="workspace") @app.callback() @@ -440,8 +635,8 @@ def _root( ctx.obj = {"config_path": config} -@app.command(help="Validate the config YAML (schema + sanity).") -def validate( +@app.command("validate-config", help="Validate the config YAML (schema + sanity).") +def validate_config( ctx: typer.Context, probe_solver: bool = typer.Option(False, help="Also probe the solver backend."), config: Optional[Path] = typer.Option(None, "--config", "-c", help="Path to config YAML."), @@ -450,6 +645,7 @@ def 
validate( loaded = _load_config_or_exit(cfg_path) _warn_pwm_sampling_configs(loaded, cfg_path) _warn_full_pool_strategy(loaded) + _ensure_fimo_available(loaded.root.densegen, strict=True) if probe_solver: from .adapters.optimizer import DenseArraysAdapter from .core.pipeline import select_solver_strict @@ -473,8 +669,8 @@ def ls_plots(): console.print(table) -@app.command(help="Stage a new workspace with config.yaml and standard subfolders.") -def stage( +@workspace_app.command("init", help="Stage a new workspace with config.yaml and standard subfolders.") +def workspace_init( run_id: str = typer.Option(..., "--id", "-i", help="Run identifier (directory name)."), root: Path = typer.Option(DEFAULT_WORKSPACES_ROOT, "--root", help="Workspaces root directory."), template: Optional[Path] = typer.Option(None, "--template", help="Template config YAML to copy."), @@ -560,8 +756,8 @@ def stage( console.print(f":sparkles: [bold green]Workspace staged[/]: {config_path}") -@app.command(help="Summarize a run manifest.") -def summarize( +@inspect_app.command("run", help="Summarize a run manifest or list workspaces.") +def inspect_run( ctx: typer.Context, run: Optional[Path] = typer.Option(None, "--run", "-r", help="Run directory (defaults to config run root)."), root: Optional[Path] = typer.Option(None, "--root", help="Workspaces root directory (lists workspaces)."), @@ -602,7 +798,7 @@ def summarize( if not cfg_path.exists(): console.print( f"[bold red]Config not found for --library:[/] {cfg_path}. " - "Provide --config or run summarize without --library." + "Provide --config or run inspect run without --library." 
) raise typer.Exit(code=1) loaded = _load_config_or_exit(cfg_path) @@ -872,18 +1068,40 @@ def _render_tfbs_tables(lib_hash: str) -> None: @app.command(help="Generate audit-grade report summary for a run.") def report( ctx: typer.Context, + run: Optional[Path] = typer.Option(None, "--run", "-r", help="Run directory (defaults to config run root)."), config: Optional[Path] = typer.Option(None, "--config", "-c", help="Path to config YAML."), out: str = typer.Option("outputs", "--out", help="Output directory (relative to run root)."), + format: str = typer.Option( + "all", + "--format", + "-f", + help="Report format: json, md, html, or all (comma-separated allowed).", + ), ): - cfg_path = _resolve_config_path(ctx, config) + if run is not None and config is not None: + console.print("[bold red]Choose either --run or --config, not both.[/]") + raise typer.Exit(code=1) + if run is not None: + cfg_path = Path(run) / "config.yaml" + if not cfg_path.exists(): + console.print(f"[bold red]Config not found under run:[/] {cfg_path}") + raise typer.Exit(code=1) + else: + cfg_path = _resolve_config_path(ctx, config) loaded = _load_config_or_exit(cfg_path) + raw_formats = {f.strip().lower() for f in format.split(",") if f.strip()} + if not raw_formats: + raw_formats = {"all"} + allowed_formats = {"json", "md", "html", "all"} + unknown = sorted(raw_formats - allowed_formats) + if unknown: + console.print(f"[bold red]Unknown report format(s):[/] {', '.join(unknown)}") + console.print("Allowed: json, md, html, all.") + raise typer.Exit(code=1) + formats_used = {"json", "md", "html"} if "all" in raw_formats else raw_formats try: with _suppress_pyarrow_sysctl_warnings(): - write_report( - loaded.root, - cfg_path, - out_dir=out, - ) + write_report(loaded.root, cfg_path, out_dir=out, formats=raw_formats) except FileNotFoundError as exc: console.print(f"[bold red]Report failed:[/] {exc}") run_root = _run_root_for(loaded) @@ -896,11 +1114,18 @@ def report( run_root = _run_root_for(loaded) 
out_dir = resolve_run_scoped_path(cfg_path, run_root, out, label="report.out") console.print(f":sparkles: [bold green]Report written[/]: {out_dir}") - console.print("[bold]Outputs[/]: report.json, report.md") - - -@app.command(help="Show the resolved per-constraint quota plan.") -def plan( + outputs = [] + if "json" in formats_used: + outputs.append("report.json") + if "md" in formats_used: + outputs.append("report.md") + if "html" in formats_used: + outputs.append("report.html") + console.print(f"[bold]Outputs[/]: {', '.join(outputs) if outputs else '-'}") + + +@inspect_app.command("plan", help="Show the resolved per-constraint quota plan.") +def inspect_plan( ctx: typer.Context, config: Optional[Path] = typer.Option(None, "--config", "-c", help="Path to config YAML."), ): @@ -915,8 +1140,8 @@ def plan( console.print(table) -@app.command(help="Describe resolved config, inputs, outputs, and solver details.") -def describe( +@inspect_app.command("config", help="Describe resolved config, inputs, outputs, and solver details.") +def inspect_config( ctx: typer.Context, show_constraints: bool = typer.Option(False, help="Print full fixed elements per plan item."), probe_solver: bool = typer.Option(False, help="Probe the solver backend before reporting."), @@ -926,6 +1151,7 @@ def describe( loaded = _load_config_or_exit(cfg_path) root = loaded.root cfg = root.densegen + _ensure_fimo_available(cfg, strict=True) run_root = _run_root_for(loaded) if probe_solver: @@ -937,126 +1163,7 @@ def describe( console.print(f"[bold]Config[/]: {cfg_path}") console.print(f"[bold]Run[/]: id={cfg.run.id} root={run_root}") - inputs = Table("name", "type", "source") - for inp in cfg.inputs: - if hasattr(inp, "path"): - src = str(resolve_relative_path(loaded.path, inp.path)) - elif hasattr(inp, "paths"): - resolved = [str(resolve_relative_path(loaded.path, p)) for p in getattr(inp, "paths") or []] - src = f"{len(resolved)} files" - if resolved: - src = f"{len(resolved)} files ({resolved[0]})" - 
elif hasattr(inp, "dataset"): - src = f"{inp.dataset} (root={resolve_relative_path(loaded.path, inp.root)})" - else: - src = "-" - inputs.add_row(inp.name, inp.type, src) - console.print(inputs) - - # Alignment (8): make two-stage sampling explicit in CLI describe output. - pwm_inputs = [ - inp - for inp in cfg.inputs - if getattr(inp, "type", "") - in { - "pwm_meme", - "pwm_meme_set", - "pwm_jaspar", - "pwm_matrix_csv", - "pwm_artifact", - "pwm_artifact_set", - } - ] - if pwm_inputs: - pwm_table = Table( - "name", - "motifs", - "n_sites", - "strategy", - "backend", - "score", - "selection", - "bins", - "mining", - "bgfile", - "oversample", - "max_candidates", - "max_seconds", - "length", - ) - for inp in pwm_inputs: - sampling = getattr(inp, "sampling", None) - if sampling is None: - continue - if inp.type == "pwm_matrix_csv": - motif_label = str(getattr(inp, "motif_id", "-")) - elif inp.type in {"pwm_meme", "pwm_meme_set", "pwm_jaspar"}: - motif_ids = getattr(inp, "motif_ids", None) or [] - motif_label = ", ".join(motif_ids) if motif_ids else "all" - if inp.type == "pwm_meme_set": - file_count = len(getattr(inp, "paths", []) or []) - motif_label = f"{motif_label} ({file_count} files)" - elif inp.type == "pwm_artifact_set": - motif_label = f"{len(getattr(inp, 'paths', []) or [])} artifacts" - else: - motif_label = "from artifact" - backend = getattr(sampling, "scoring_backend", "densegen") - score_label = "-" - if backend == "fimo" and sampling.pvalue_threshold is not None: - comparator = ">=" if sampling.strategy == "background" else "<=" - score_label = f"pvalue{comparator}{sampling.pvalue_threshold}" - elif sampling.score_threshold is not None: - score_label = f"threshold={sampling.score_threshold}" - elif sampling.score_percentile is not None: - score_label = f"percentile={sampling.score_percentile}" - selection_label = "-" if backend != "fimo" else (getattr(sampling, "selection_policy", None) or "-") - bins_label = "-" - if backend == "fimo": - bins_label = 
"canonical" - if getattr(sampling, "pvalue_bins", None) is not None: - bins_label = "custom" - mining_cfg = getattr(sampling, "mining", None) - bin_ids = getattr(mining_cfg, "retain_bin_ids", None) - if bin_ids is None: - bin_ids = getattr(sampling, "pvalue_bin_ids", None) - if bin_ids: - bins_label = f"{bins_label} retain={bin_ids}" - mining_label = "-" - mining_cfg = getattr(sampling, "mining", None) - if backend == "fimo" and mining_cfg is not None: - parts = [f"batch={mining_cfg.batch_size}"] - if mining_cfg.max_batches is not None: - parts.append(f"max_batches={mining_cfg.max_batches}") - if mining_cfg.max_seconds is not None: - parts.append(f"max_seconds={mining_cfg.max_seconds}s") - if mining_cfg.retain_bin_ids: - parts.append(f"retain={mining_cfg.retain_bin_ids}") - mining_label = ", ".join(parts) - bgfile_label = getattr(sampling, "bgfile", None) or "-" - length_label = str(sampling.length_policy) - if sampling.length_policy == "range" and sampling.length_range is not None: - length_label = f"range({sampling.length_range[0]}..{sampling.length_range[1]})" - pwm_table.add_row( - inp.name, - motif_label, - str(sampling.n_sites), - str(sampling.strategy), - str(backend), - score_label, - str(selection_label), - str(bins_label), - str(mining_label), - str(bgfile_label), - str(sampling.oversample_factor), - str(sampling.max_candidates) if sampling.max_candidates is not None else "-", - str(sampling.max_seconds) if sampling.max_seconds is not None else "-", - length_label, - ) - console.print("[bold]Input-stage PWM sampling[/]") - console.print(pwm_table) - console.print( - " -> Produces the realized TFBS pool (input_tfbs_count), captured in inputs_manifest.json after runs." 
-    )
+    _print_inputs_summary(loaded)
 
     plan_table = Table(
         "name",
@@ -1171,6 +1278,298 @@ def describe(
     console.print("[bold]Plots[/]: none")
 
 
+@inspect_app.command("inputs", help="Show resolved inputs and PWM sampling summary.")
+def inspect_inputs(
+    ctx: typer.Context,
+    config: Optional[Path] = typer.Option(None, "--config", "-c", help="Path to config YAML."),
+):
+    cfg_path = _resolve_config_path(ctx, config)
+    loaded = _load_config_or_exit(cfg_path)
+    console.print(f"[bold]Config[/]: {cfg_path}")
+    _ensure_fimo_available(loaded.root.densegen, strict=False)
+    _print_inputs_summary(loaded)
+
+
+@stage_a_app.command("build-pool", help="Build Stage-A TFBS pools from inputs.")
+def stage_a_build_pool(
+    ctx: typer.Context,
+    out: str = typer.Option("outputs/pools", "--out", help="Output directory (relative to run root)."),
+    input_name: Optional[list[str]] = typer.Option(
+        None,
+        "--input",
+        "-i",
+        help="Input name(s) to build (defaults to all inputs).",
+    ),
+    overwrite: bool = typer.Option(False, help="Overwrite existing pool files."),
+    config: Optional[Path] = typer.Option(None, "--config", "-c", help="Path to config YAML."),
+):
+    cfg_path = _resolve_config_path(ctx, config)
+    loaded = _load_config_or_exit(cfg_path)
+    cfg = loaded.root.densegen
+    _ensure_fimo_available(cfg, strict=True)
+    run_root = _run_root_for(loaded)
+    out_dir = resolve_run_scoped_path(cfg_path, run_root, out, label="stage-a.out")
+    out_dir.mkdir(parents=True, exist_ok=True)
+
+    selected = {name for name in (input_name or [])}
+    if selected:
+        available = {inp.name for inp in cfg.inputs}
+        missing = sorted(selected - available)
+        if missing:
+            raise typer.BadParameter(f"Unknown input name(s): {', '.join(missing)}")
+
+    rng = np.random.default_rng(int(cfg.runtime.random_seed))
+    deps = default_deps()
+    outputs_root = run_root / "outputs"
+    outputs_root.mkdir(parents=True, exist_ok=True)
+
+    rows = []
+    manifest_inputs: list[dict] = []
+    for inp in cfg.inputs:
+        if selected and inp.name not in selected:
+            continue
+        src = deps.source_factory(inp, cfg_path)
+        data_entries, meta_df = src.load_data(rng=rng, outputs_root=outputs_root)
+        if meta_df is None:
+            df = pd.DataFrame({"sequence": [str(s) for s in data_entries]})
+        else:
+            df = meta_df.copy()
+        df.insert(0, "input_name", inp.name)
+        filename = f"{_sanitize_filename(inp.name)}__pool.parquet"
+        dest = out_dir / filename
+        if dest.exists() and not overwrite:
+            console.print(f"[bold red]Pool already exists:[/] {dest}")
+            raise typer.Exit(code=1)
+        df.to_parquet(dest, index=False)
+        if "fimo_bin_id" in df.columns:
+            bin_counts = df["fimo_bin_id"].value_counts().sort_index()
+            bin_table = Table("bin_id", "pvalue_range", "count")
+            for bin_id, count in bin_counts.items():
+                low = None
+                high = None
+                if "fimo_bin_low" in df.columns:
+                    low_vals = df.loc[df["fimo_bin_id"] == bin_id, "fimo_bin_low"]
+                    if not low_vals.empty:
+                        low = float(low_vals.iloc[0])
+                if "fimo_bin_high" in df.columns:
+                    high_vals = df.loc[df["fimo_bin_id"] == bin_id, "fimo_bin_high"]
+                    if not high_vals.empty:
+                        high = float(high_vals.iloc[0])
+                if low is not None and high is not None:
+                    range_label = f"({low:g}, {high:g}]"
+                else:
+                    range_label = "-"
+                bin_table.add_row(str(bin_id), range_label, str(int(count)))
+            console.print(f"[bold]FIMO p-value bins for {inp.name}[/]")
+            console.print(bin_table)
+        manifest_inputs.append(
+            {
+                "name": inp.name,
+                "type": inp.type,
+                "pool_path": dest.name,
+                "rows": int(len(df)),
+                "columns": list(df.columns),
+            }
+        )
+        rows.append((inp.name, inp.type, str(len(df)), dest.name))
+
+    if not rows:
+        console.print("[yellow]No pools built (no matching inputs).[/]")
+        raise typer.Exit(code=1)
+
+    manifest = {
+        "schema_version": "1.0",
+        "created_at": datetime.now(timezone.utc).isoformat(),
+        "run_id": cfg.run.id,
+        "run_root": str(run_root),
+        "config_path": str(cfg_path),
+        "inputs": manifest_inputs,
+    }
+    manifest_path = _pool_manifest_path(out_dir)
+    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
+
+    table = Table("input", "type", "rows", "pool_file")
+    for row in rows:
+        table.add_row(*row)
+    console.print(table)
+    console.print(f":sparkles: [bold green]Pool manifest written[/]: {manifest_path}")
+
+
+@stage_b_app.command("build-libraries", help="Build Stage-B libraries from pools or inputs.")
+def stage_b_build_libraries(
+    ctx: typer.Context,
+    out: str = typer.Option("outputs/libraries", "--out", help="Output directory (relative to run root)."),
+    pool: Optional[Path] = typer.Option(
+        None,
+        "--pool",
+        help="Optional pool directory from `stage-a build-pool` (defaults to reading inputs).",
+    ),
+    input_name: Optional[list[str]] = typer.Option(
+        None,
+        "--input",
+        "-i",
+        help="Input name(s) to build (defaults to all inputs).",
+    ),
+    plan: Optional[list[str]] = typer.Option(
+        None,
+        "--plan",
+        "-p",
+        help="Plan item name(s) to build (defaults to all plans).",
+    ),
+    overwrite: bool = typer.Option(False, help="Overwrite existing library_builds.parquet."),
+    config: Optional[Path] = typer.Option(None, "--config", "-c", help="Path to config YAML."),
+):
+    cfg_path = _resolve_config_path(ctx, config)
+    loaded = _load_config_or_exit(cfg_path)
+    cfg = loaded.root.densegen
+    if pool is None:
+        _ensure_fimo_available(cfg, strict=True)
+    run_root = _run_root_for(loaded)
+    out_dir = resolve_run_scoped_path(cfg_path, run_root, out, label="stage-b.out")
+    out_dir.mkdir(parents=True, exist_ok=True)
+    out_path = out_dir / "library_builds.parquet"
+    if out_path.exists() and not overwrite:
+        console.print(f"[bold red]library_builds.parquet already exists:[/] {out_path}")
+        raise typer.Exit(code=1)
+
+    selected_inputs = {name for name in (input_name or [])}
+    if selected_inputs:
+        available = {inp.name for inp in cfg.inputs}
+        missing = sorted(selected_inputs - available)
+        if missing:
+            raise typer.BadParameter(f"Unknown input name(s): {', '.join(missing)}")
+
+    selected_plans = {name for name in (plan or [])}
+    resolved_plan = resolve_plan(loaded)
+    if selected_plans:
+        available_plans = {p.name for p in resolved_plan}
+        missing = sorted(selected_plans - available_plans)
+        if missing:
+            raise typer.BadParameter(f"Unknown plan name(s): {', '.join(missing)}")
+
+    deps = default_deps()
+    seed = int(cfg.runtime.random_seed)
+    rng = random.Random(seed)
+    np_rng = np.random.default_rng(seed)
+    sampling_cfg = cfg.generation.sampling
+    schema_is_22 = schema_version_at_least(cfg.schema_version, major=2, minor=2)
+    outputs_root = run_root / "outputs"
+    failure_counts = _load_failure_counts_from_attempts(outputs_root)
+    libraries_built = _load_existing_library_index(outputs_root) if outputs_root.exists() else 0
+
+    pool_manifest = None
+    pool_dir = None
+    if pool is not None:
+        pool_dir = resolve_relative_path(cfg_path, pool)
+        if not pool_dir.exists() or not pool_dir.is_dir():
+            raise typer.BadParameter(f"Pool directory not found: {pool_dir}")
+        pool_manifest = _load_pool_manifest(pool_dir)
+
+    rows = []
+    table = Table("input", "plan", "library_index", "library_hash", "size", "achieved/target", "pool", "sampling")
+    for inp in cfg.inputs:
+        if selected_inputs and inp.name not in selected_inputs:
+            continue
+        if pool_manifest is not None and pool_dir is not None:
+            entry = next((e for e in pool_manifest.get("inputs", []) if e.get("name") == inp.name), None)
+            if entry is None:
+                raise typer.BadParameter(f"Pool manifest missing input: {inp.name}")
+            pool_path = pool_dir / str(entry.get("pool_path") or "")
+            if not pool_path.exists():
+                raise typer.BadParameter(f"Pool file not found for input {inp.name}: {pool_path}")
+            df = pd.read_parquet(pool_path)
+            if "tf" in df.columns and "tfbs" in df.columns:
+                meta_df = df
+                data_entries = df["tfbs"].tolist()
+            elif "sequence" in df.columns:
+                meta_df = None
+                data_entries = df["sequence"].tolist()
+            else:
+                raise typer.BadParameter(
+                    f"Pool file for {inp.name} must contain tf/tfbs or sequence columns: {pool_path}"
+                )
+        else:
+            src = deps.source_factory(inp, cfg_path)
+            data_entries, meta_df = src.load_data(rng=np_rng, outputs_root=outputs_root)
+
+        for plan_item in resolved_plan:
+            if selected_plans and plan_item.name not in selected_plans:
+                continue
+            library, _parts, reg_labels, info = build_library_for_plan(
+                source_label=inp.name,
+                plan_item=plan_item,
+                data_entries=data_entries,
+                meta_df=meta_df,
+                sampling_cfg=sampling_cfg,
+                seq_len=int(cfg.generation.sequence_length),
+                min_count_per_tf=int(cfg.runtime.min_count_per_tf),
+                usage_counts={},
+                failure_counts=failure_counts if failure_counts else None,
+                rng=rng,
+                np_rng=np_rng,
+                schema_is_22=schema_is_22,
+                library_index_start=libraries_built,
+            )
+            libraries_built = int(info.get("library_index", libraries_built))
+            library_hash = str(info.get("library_hash") or "")
+            target_len = int(info.get("target_length") or 0)
+            achieved_len = int(info.get("achieved_length") or 0)
+            pool_strategy = str(info.get("pool_strategy") or sampling_cfg.pool_strategy)
+            sampling_strategy = str(info.get("library_sampling_strategy") or sampling_cfg.library_sampling_strategy)
+            row = {
+                "created_at": datetime.now(timezone.utc).isoformat(),
+                "input_name": inp.name,
+                "input_type": inp.type,
+                "plan_name": plan_item.name,
+                "library_index": int(info.get("library_index") or 0),
+                "library_hash": library_hash,
+                "library_tfbs": list(library),
+                "library_tfs": list(reg_labels) if reg_labels else [],
+                "library_site_ids": list(info.get("site_id_by_index") or []),
+                "library_sources": list(info.get("source_by_index") or []),
+                "pool_strategy": pool_strategy,
+                "library_sampling_strategy": sampling_strategy,
+                "library_size": int(info.get("library_size") or len(library)),
+                "target_length": target_len,
+                "achieved_length": achieved_len,
+                "relaxed_cap": bool(info.get("relaxed_cap") or False),
+                "final_cap": info.get("final_cap"),
+                "iterative_max_libraries": int(info.get("iterative_max_libraries") or 0),
+                "iterative_min_new_solutions": int(info.get("iterative_min_new_solutions") or 0),
+                "required_regulators_selected": info.get("required_regulators_selected"),
+            }
+            rows.append(row)
+            table.add_row(
+                inp.name,
+                plan_item.name,
+                str(row["library_index"]),
+                _short_hash(library_hash),
+                str(len(library)),
+                f"{achieved_len}/{target_len}",
+                pool_strategy,
+                sampling_strategy,
+            )
+
+    if not rows:
+        console.print("[yellow]No libraries built (no matching inputs/plans).[/]")
+        raise typer.Exit(code=1)
+
+    df_out = pd.DataFrame(rows)
+    df_out.to_parquet(out_path, index=False)
+    manifest = {
+        "schema_version": "1.0",
+        "created_at": datetime.now(timezone.utc).isoformat(),
+        "run_id": cfg.run.id,
+        "run_root": str(run_root),
+        "config_path": str(cfg_path),
+        "library_builds_path": str(out_path),
+    }
+    manifest_path = out_dir / "library_manifest.json"
+    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
+    console.print(table)
+    console.print(f":sparkles: [bold green]Library builds written[/]: {out_path}")
+
+
 @app.command(help="Run generation for the job. Optionally auto-run plots declared in YAML.")
 def run(
     ctx: typer.Context,
@@ -1213,7 +1612,7 @@ def run(
 
     console.print(":tada: [bold green]Run complete[/].")
     console.print("[bold]Next steps[/]:")
-    console.print(f"  - dense summarize --library -c {cfg_path}")
+    console.print(f"  - dense inspect run --library -c {cfg_path}")
     console.print(f"  - dense report -c {cfg_path}")
 
     # Auto-plot if configured
diff --git a/src/dnadesign/densegen/src/config/__init__.py b/src/dnadesign/densegen/src/config/__init__.py
index fb882183..e8f7b920 100644
--- a/src/dnadesign/densegen/src/config/__init__.py
+++ b/src/dnadesign/densegen/src/config/__init__.py
@@ -13,7 +13,6 @@
 from __future__ import annotations
 
 import os
-import warnings
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Annotated, Any, Dict, List, Optional, Union
@@ -159,7 +158,8 @@ class PWMMiningConfig(BaseModel):
     model_config = ConfigDict(extra="forbid")
 
     batch_size: int = 100000
     max_batches: Optional[int] = None
-    max_seconds: Optional[float] = None
+    max_candidates: Optional[int] = None
+    max_seconds: Optional[float] = 60.0
     retain_bin_ids: Optional[List[int]] = None
     log_every_batches: int = 1
@@ -177,6 +177,13 @@ def _max_batches_ok(cls, v: Optional[int]):
             raise ValueError("pwm.sampling.mining.max_batches must be > 0 when set")
         return v
 
+    @field_validator("max_candidates")
+    @classmethod
+    def _max_candidates_ok(cls, v: Optional[int]):
+        if v is not None and v <= 0:
+            raise ValueError("pwm.sampling.mining.max_candidates must be > 0 when set")
+        return v
+
     @field_validator("max_seconds")
     @classmethod
     def _max_seconds_ok(cls, v: Optional[float]):
@@ -220,7 +227,6 @@ class PWMSamplingConfig(BaseModel):
     scoring_backend: Literal["densegen", "fimo"] = "densegen"
     pvalue_threshold: Optional[float] = None
     pvalue_bins: Optional[List[float]] = None
-    pvalue_bin_ids: Optional[List[int]] = None
     mining: Optional[PWMMiningConfig] = None
     bgfile: Optional[str] = None
     selection_policy: Literal["random_uniform", "top_n", "stratified"] = "random_uniform"
@@ -312,20 +318,6 @@ def _pvalue_bins_ok(cls, v: Optional[List[float]]):
             raise ValueError("pwm.sampling.pvalue_bins must end with 1.0")
         return bins
 
-    @field_validator("pvalue_bin_ids")
-    @classmethod
-    def _pvalue_bin_ids_ok(cls, v: Optional[List[int]]):
-        if v is None:
-            return v
-        if not v:
-            raise ValueError("pwm.sampling.pvalue_bin_ids must be non-empty when set")
-        ids = [int(x) for x in v]
-        if any(idx < 0 for idx in ids):
-            raise ValueError("pwm.sampling.pvalue_bin_ids values must be >= 0")
-        if len(set(ids)) != len(ids):
-            raise ValueError("pwm.sampling.pvalue_bin_ids must be unique")
-        return ids
-
     @model_validator(mode="after")
     def _score_mode(self):
         has_thresh = self.score_threshold is not None
@@ -337,8 +329,6 @@ def _score_mode(self):
                 raise ValueError("pwm.sampling.pvalue_threshold is only valid when scoring_backend='fimo'")
             if self.pvalue_bins is not None:
                 raise ValueError("pwm.sampling.pvalue_bins is only valid when scoring_backend='fimo'")
-            if self.pvalue_bin_ids is not None:
-                raise ValueError("pwm.sampling.pvalue_bin_ids is only valid when scoring_backend='fimo'")
             if self.mining is not None:
                 raise ValueError("pwm.sampling.mining is only valid when scoring_backend='fimo'")
             if self.include_matched_sequence:
@@ -348,25 +338,28 @@ def _score_mode(self):
                 raise ValueError("pwm.sampling.pvalue_threshold is required when scoring_backend='fimo'")
             if not (0.0 < float(self.pvalue_threshold) <= 1.0):
                 raise ValueError("pwm.sampling.pvalue_threshold must be between 0 and 1")
-            if self.pvalue_bin_ids is not None and self.mining is not None:
+            if "max_candidates" in self.model_fields_set and self.max_candidates is not None:
                 raise ValueError(
-                    "pwm.sampling.pvalue_bin_ids is deprecated; use pwm.sampling.mining.retain_bin_ids instead."
+                    "pwm.sampling.max_candidates is not used with scoring_backend='fimo'. "
+                    "Use pwm.sampling.mining.max_candidates instead."
                 )
-            if self.pvalue_bin_ids is not None and self.mining is None:
-                warnings.warn(
-                    "pwm.sampling.pvalue_bin_ids is deprecated; use pwm.sampling.mining.retain_bin_ids.",
-                    stacklevel=2,
+            if "max_seconds" in self.model_fields_set and self.max_seconds is not None:
+                raise ValueError(
+                    "pwm.sampling.max_seconds is not used with scoring_backend='fimo'. "
+                    "Use pwm.sampling.mining.max_seconds instead."
                 )
-                self.mining = PWMMiningConfig(retain_bin_ids=list(self.pvalue_bin_ids))
-            bin_ids = None
+            if "max_candidates" not in self.model_fields_set:
+                self.max_candidates = None
+            if "max_seconds" not in self.model_fields_set:
+                self.max_seconds = None
+            if self.mining is None:
+                self.mining = PWMMiningConfig()
+            if self.pvalue_bins is None:
+                self.pvalue_bins = list(CANONICAL_PVALUE_BINS)
             if self.mining is not None and self.mining.retain_bin_ids is not None:
-                bin_ids = list(self.mining.retain_bin_ids)
-            elif self.pvalue_bin_ids is not None:
-                bin_ids = list(self.pvalue_bin_ids)
-            if bin_ids is not None:
                 bins = list(self.pvalue_bins) if self.pvalue_bins is not None else list(CANONICAL_PVALUE_BINS)
                 max_idx = len(bins) - 1
-                if any(idx > max_idx for idx in bin_ids):
+                if any(idx > max_idx for idx in self.mining.retain_bin_ids):
                     raise ValueError("pwm.sampling.mining.retain_bin_ids contains an index outside the available bins")
         if self.strategy == "consensus" and int(self.n_sites) != 1:
             raise ValueError("pwm.sampling.strategy=consensus requires n_sites=1")
diff --git a/src/dnadesign/densegen/src/core/metadata.py b/src/dnadesign/densegen/src/core/metadata.py
index 861de3ed..728fe697 100644
--- a/src/dnadesign/densegen/src/core/metadata.py
+++ b/src/dnadesign/densegen/src/core/metadata.py
@@ -146,9 +146,9 @@ def build_metadata(
         "input_pwm_score_percentile": input_meta.get("input_pwm_score_percentile"),
         "input_pwm_pvalue_threshold": input_meta.get("input_pwm_pvalue_threshold"),
         "input_pwm_pvalue_bins": input_meta.get("input_pwm_pvalue_bins"),
-        "input_pwm_pvalue_bin_ids": input_meta.get("input_pwm_pvalue_bin_ids"),
         "input_pwm_mining_batch_size": input_meta.get("input_pwm_mining_batch_size"),
         "input_pwm_mining_max_batches": input_meta.get("input_pwm_mining_max_batches"),
+        "input_pwm_mining_max_candidates": input_meta.get("input_pwm_mining_max_candidates"),
         "input_pwm_mining_max_seconds": input_meta.get("input_pwm_mining_max_seconds"),
         "input_pwm_mining_retain_bin_ids": input_meta.get("input_pwm_mining_retain_bin_ids"),
         "input_pwm_mining_log_every_batches": input_meta.get("input_pwm_mining_log_every_batches"),
diff --git a/src/dnadesign/densegen/src/core/metadata_schema.py b/src/dnadesign/densegen/src/core/metadata_schema.py
index ca0c2736..ea568c9d 100644
--- a/src/dnadesign/densegen/src/core/metadata_schema.py
+++ b/src/dnadesign/densegen/src/core/metadata_schema.py
@@ -98,14 +98,9 @@ class MetaField:
     MetaField("input_pwm_score_percentile", (numbers.Real,), "PWM score percentile.", allow_none=True),
     MetaField("input_pwm_pvalue_threshold", (numbers.Real,), "PWM p-value threshold (FIMO).", allow_none=True),
     MetaField("input_pwm_pvalue_bins", (list,), "PWM p-value bins (FIMO).", allow_none=True),
-    MetaField(
-        "input_pwm_pvalue_bin_ids",
-        (list,),
-        "Deprecated: selected p-value bin indices (use input_pwm_mining_retain_bin_ids).",
-        allow_none=True,
-    ),
     MetaField("input_pwm_mining_batch_size", (int,), "PWM mining batch size (FIMO).", allow_none=True),
     MetaField("input_pwm_mining_max_batches", (int,), "PWM mining max batches (FIMO).", allow_none=True),
+    MetaField("input_pwm_mining_max_candidates", (int,), "PWM mining max candidates (FIMO).", allow_none=True),
     MetaField("input_pwm_mining_max_seconds", (numbers.Real,), "PWM mining max seconds (FIMO).", allow_none=True),
     MetaField(
         "input_pwm_mining_retain_bin_ids",
@@ -236,15 +231,6 @@ def _validate_list_fields(meta: Mapping[str, Any]) -> None:
             if not isinstance(item, numbers.Real):
                 raise TypeError("Metadata field 'input_pwm_pvalue_bins' must contain only numbers")
 
-    if "input_pwm_pvalue_bin_ids" in meta:
-        vals = meta["input_pwm_pvalue_bin_ids"]
-        if vals is not None:
-            if isinstance(vals, (str, bytes)) or not isinstance(vals, Sequence):
-                raise TypeError("Metadata field 'input_pwm_pvalue_bin_ids' must be a list of integers")
-            for item in vals:
-                if not isinstance(item, int):
-                    raise TypeError("Metadata field 'input_pwm_pvalue_bin_ids' must contain only integers")
-
     if "input_pwm_mining_retain_bin_ids" in meta:
         vals = meta["input_pwm_mining_retain_bin_ids"]
         if vals is not None:
diff --git a/src/dnadesign/densegen/src/core/pipeline.py b/src/dnadesign/densegen/src/core/pipeline.py
index 4b997aa7..07a2012e 100644
--- a/src/dnadesign/densegen/src/core/pipeline.py
+++ b/src/dnadesign/densegen/src/core/pipeline.py
@@ -198,28 +198,39 @@ def _extract_pwm_sampling_config(source_cfg) -> dict | None:
     requested = None
     generated = None
     capped = False
+    backend = str(_sampling_attr(sampling, "scoring_backend") or "densegen").lower()
     if isinstance(n_sites, int) and isinstance(oversample, int):
         requested = int(n_sites) * int(oversample)
         generated = requested
-        if max_candidates is not None:
-            try:
-                cap_val = int(max_candidates)
-            except Exception:
-                cap_val = None
-            if cap_val is not None:
-                generated = min(requested, cap_val)
-                capped = generated < requested
+        if backend == "fimo":
+            mining_cfg = _sampling_attr(sampling, "mining")
+            mining_max_candidates = _mining_attr(mining_cfg, "max_candidates")
+            if mining_max_candidates is not None:
+                try:
+                    cap_val = int(mining_max_candidates)
+                except Exception:
+                    cap_val = None
+                if cap_val is not None:
+                    generated = min(requested, cap_val)
+                    capped = generated < requested
+        else:
+            if max_candidates is not None:
+                try:
+                    cap_val = int(max_candidates)
+                except Exception:
+                    cap_val = None
+                if cap_val is not None:
+                    generated = min(requested, cap_val)
+                    capped = generated < requested
     length_range = _sampling_attr(sampling, "length_range")
     if length_range is not None:
         length_range = list(length_range)
     mining = _sampling_attr(sampling, "mining")
     mining_batch_size = _mining_attr(mining, "batch_size")
     mining_max_batches = _mining_attr(mining, "max_batches")
+    mining_max_candidates = _mining_attr(mining, "max_candidates")
     mining_max_seconds = _mining_attr(mining, "max_seconds")
     mining_retain_bin_ids = _mining_attr(mining, "retain_bin_ids")
-    legacy_bin_ids = _sampling_attr(sampling, "pvalue_bin_ids")
-    if mining_retain_bin_ids is None:
-        mining_retain_bin_ids = legacy_bin_ids
     mining_log_every_batches = _mining_attr(mining, "log_every_batches")
     return {
         "strategy": _sampling_attr(sampling, "strategy"),
@@ -235,7 +246,6 @@ def _extract_pwm_sampling_config(source_cfg) -> dict | None:
         "score_percentile": _sampling_attr(sampling, "score_percentile"),
         "pvalue_threshold": _sampling_attr(sampling, "pvalue_threshold"),
         "pvalue_bins": _resolve_pvalue_bins_meta(sampling),
-        "pvalue_bin_ids": legacy_bin_ids,
         "selection_policy": _sampling_attr(sampling, "selection_policy"),
         "bgfile": _sampling_attr(sampling, "bgfile"),
         "keep_all_candidates_debug": _sampling_attr(sampling, "keep_all_candidates_debug"),
@@ -244,6 +254,7 @@ def _extract_pwm_sampling_config(source_cfg) -> dict | None:
         "mining": {
             "batch_size": mining_batch_size,
             "max_batches": mining_max_batches,
+            "max_candidates": mining_max_candidates,
             "max_seconds": mining_max_seconds,
             "retain_bin_ids": mining_retain_bin_ids,
             "log_every_batches": mining_log_every_batches,
@@ -508,10 +519,9 @@ def _input_metadata(source_cfg, cfg_path: Path) -> dict:
         meta["input_pwm_pvalue_bins"] = _resolve_pvalue_bins_meta(sampling)
         mining_cfg = getattr(sampling, "mining", None)
         retained_bins = _mining_attr(mining_cfg, "retain_bin_ids")
-        legacy_bin_ids = getattr(sampling, "pvalue_bin_ids", None)
-        meta["input_pwm_pvalue_bin_ids"] = legacy_bin_ids if legacy_bin_ids is not None else retained_bins
         meta["input_pwm_mining_batch_size"] = _mining_attr(mining_cfg, "batch_size")
         meta["input_pwm_mining_max_batches"] = _mining_attr(mining_cfg, "max_batches")
+        meta["input_pwm_mining_max_candidates"] = _mining_attr(mining_cfg, "max_candidates")
         meta["input_pwm_mining_max_seconds"] = _mining_attr(mining_cfg, "max_seconds")
         meta["input_pwm_mining_retain_bin_ids"] = retained_bins
         meta["input_pwm_mining_log_every_batches"] = _mining_attr(mining_cfg, "log_every_batches")
@@ -521,7 +531,6 @@ def _input_metadata(source_cfg, cfg_path: Path) -> dict:
         meta["input_pwm_include_matched_sequence"] = getattr(sampling, "include_matched_sequence", None)
         meta["input_pwm_n_sites"] = getattr(sampling, "n_sites", None)
         meta["input_pwm_oversample_factor"] = getattr(sampling, "oversample_factor", None)
-        meta["input_pwm_max_candidates"] = getattr(sampling, "max_candidates", None)
     else:
         meta["input_mode"] = source_type
         meta["input_pwm_ids"] = []
@@ -919,6 +928,264 @@ def _hash_library(
     return digest
 
 
+def build_library_for_plan(
+    *,
+    source_label: str,
+    plan_item: ResolvedPlanItem,
+    data_entries: list,
+    meta_df: pd.DataFrame | None,
+    sampling_cfg: object,
+    seq_len: int,
+    min_count_per_tf: int,
+    usage_counts: dict[tuple[str, str], int],
+    failure_counts: dict[tuple[str, str, str, str, str | None], dict[str, int]] | None,
+    rng: random.Random,
+    np_rng: np.random.Generator,
+    schema_is_22: bool,
+    library_index_start: int,
+) -> tuple[list[str], list[str], list[str], dict]:
+    pool_strategy = str(getattr(sampling_cfg, "pool_strategy", "subsample"))
+    library_size = int(getattr(sampling_cfg, "library_size", 0))
+    subsample_over = int(getattr(sampling_cfg, "subsample_over_length_budget_by", 0))
+    library_sampling_strategy = str(getattr(sampling_cfg, "library_sampling_strategy", "tf_balanced"))
+    cover_all_tfs = bool(getattr(sampling_cfg, "cover_all_regulators", True))
+    unique_binding_sites = bool(getattr(sampling_cfg, "unique_binding_sites", True))
+    max_sites_per_tf = getattr(sampling_cfg, "max_sites_per_regulator", None)
+    relax_on_exhaustion = bool(getattr(sampling_cfg, "relax_on_exhaustion", False))
+    allow_incomplete_coverage = bool(getattr(sampling_cfg, "allow_incomplete_coverage", False))
+    iterative_max_libraries = int(getattr(sampling_cfg, "iterative_max_libraries", 0))
+    iterative_min_new_solutions = int(getattr(sampling_cfg, "iterative_min_new_solutions", 0))
+
+    fixed_elements = plan_item.fixed_elements
+    required_regulators = list(dict.fromkeys(plan_item.required_regulators or []))
+    min_required_regulators = plan_item.min_required_regulators
+    plan_min_count_by_regulator = dict(plan_item.min_count_by_regulator or {})
+    k_required = int(min_required_regulators) if min_required_regulators is not None else None
+    k_of_required = bool(required_regulators) and k_required is not None
+    if k_of_required and k_required > len(required_regulators):
+        raise ValueError(
+            "min_required_regulators cannot exceed required_regulators size "
+            f"({k_required} > {len(required_regulators)})."
+        )
+    side_left, side_right = _extract_side_biases(fixed_elements)
+    required_bias_motifs = list(dict.fromkeys([*side_left, *side_right]))
+
+    libraries_built = int(library_index_start)
+
+    def _finalize(
+        library: list[str],
+        parts: list[str],
+        reg_labels: list[str],
+        info: dict,
+        *,
+        site_id_by_index: list[str | None] | None,
+        source_by_index: list[str | None] | None,
+    ) -> tuple[list[str], list[str], list[str], dict]:
+        nonlocal libraries_built
+        libraries_built += 1
+        info["library_index"] = libraries_built
+        info["library_hash"] = _hash_library(library, reg_labels, site_id_by_index, source_by_index)
+        info["site_id_by_index"] = site_id_by_index
+        info["source_by_index"] = source_by_index
+        return library, parts, reg_labels, info
+
+    if meta_df is not None and isinstance(meta_df, pd.DataFrame):
+        available_tfs = set(meta_df["tf"].tolist())
+        missing = [t for t in required_regulators if t not in available_tfs]
+        if missing:
+            preview = ", ".join(missing[:10])
+            raise ValueError(f"Required regulators not found in input: {preview}")
+        if plan_min_count_by_regulator:
+            missing_counts = [t for t in plan_min_count_by_regulator if t not in available_tfs]
+            if missing_counts:
+                preview = ", ".join(missing_counts[:10])
+                raise ValueError(f"min_count_by_regulator TFs not found in input: {preview}")
+        if min_required_regulators is not None:
+            if not required_regulators and min_required_regulators > len(available_tfs):
+                raise ValueError(
+                    f"min_required_regulators={min_required_regulators} exceeds available regulators "
+                    f"({len(available_tfs)})."
+                )
+
+        if pool_strategy == "full":
+            lib_df = meta_df.copy()
+            if unique_binding_sites:
+                lib_df = lib_df.drop_duplicates(["tf", "tfbs"])
+            if required_bias_motifs:
+                missing_bias = [m for m in required_bias_motifs if m not in set(lib_df["tfbs"])]
+                if missing_bias:
+                    preview = ", ".join(missing_bias[:10])
+                    raise ValueError(f"Required side-bias motifs not found in input: {preview}")
+            lib_df = lib_df.reset_index(drop=True)
+            library = lib_df["tfbs"].tolist()
+            reg_labels = lib_df["tf"].tolist()
+            parts = [f"{tf}:{tfbs}" for tf, tfbs in zip(reg_labels, lib_df["tfbs"].tolist())]
+            site_id_by_index = lib_df["site_id"].tolist() if "site_id" in lib_df.columns else None
+            source_by_index = lib_df["source"].tolist() if "source" in lib_df.columns else None
+            info = {
+                "target_length": seq_len + subsample_over,
+                "achieved_length": sum(len(s) for s in library),
+                "relaxed_cap": False,
+                "final_cap": None,
+                "pool_strategy": pool_strategy,
+                "library_size": len(library),
+                "iterative_max_libraries": iterative_max_libraries,
+                "iterative_min_new_solutions": iterative_min_new_solutions,
+            }
+            return _finalize(
+                library,
+                parts,
+                reg_labels,
+                info,
+                site_id_by_index=site_id_by_index,
+                source_by_index=source_by_index,
+            )
+
+        sampler = TFSampler(meta_df, np_rng)
+        required_regulators_selected = required_regulators
+        if k_of_required:
+            candidates = sorted(required_regulators)
+            if k_required is not None and k_required < len(candidates):
+                chosen = np_rng.choice(len(candidates), size=k_required, replace=False)
+                required_regulators_selected = sorted([candidates[int(i)] for i in chosen])
+            else:
+                required_regulators_selected = candidates
+        required_tfs_for_library = list(
+            dict.fromkeys([*required_regulators_selected, *plan_min_count_by_regulator.keys()])
+        )
+        if min_required_regulators is not None and not required_regulators:
+            if pool_strategy in {"subsample", "iterative_subsample"}:
+                if library_size < int(min_required_regulators):
+                    raise ValueError(
+                        "library_size is too small to satisfy min_required_regulators when "
+                        f"required_regulators is empty. library_size={library_size} "
+                        f"min_required_regulators={min_required_regulators}. "
+                        "Increase library_size or lower min_required_regulators."
+                    )
+        if pool_strategy in {"subsample", "iterative_subsample"}:
+            required_slots = len(required_bias_motifs) + len(required_tfs_for_library)
+            if library_size < required_slots:
+                raise ValueError(
+                    "library_size is too small for required motifs. "
+                    f"library_size={library_size} but required_tfbs={len(required_bias_motifs)} "
+                    f"+ required_tfs={len(required_tfs_for_library)} "
+                    f"(min_required_regulators={min_required_regulators}). "
+                    "Increase library_size or relax required constraints."
+                )
+        if schema_is_22 and pool_strategy in {"subsample", "iterative_subsample"}:
+            failure_counts_by_tfbs: dict[tuple[str, str], int] | None = None
+            if library_sampling_strategy == "coverage_weighted" and getattr(sampling_cfg, "avoid_failed_motifs", False):
+                failure_counts_by_tfbs = _aggregate_failure_counts_for_sampling(
+                    failure_counts,
+                    input_name=source_label,
+                    plan_name=plan_item.name,
+                )
+            library, parts, reg_labels, info = sampler.generate_binding_site_library(
+                library_size,
+                sequence_length=seq_len,
+                budget_overhead=subsample_over,
+                required_tfbs=required_bias_motifs,
+                required_tfs=required_tfs_for_library,
+                cover_all_tfs=cover_all_tfs,
+                unique_binding_sites=unique_binding_sites,
+                max_sites_per_tf=max_sites_per_tf,
+                relax_on_exhaustion=relax_on_exhaustion,
+                allow_incomplete_coverage=allow_incomplete_coverage,
+                sampling_strategy=library_sampling_strategy,
+                usage_counts=usage_counts if library_sampling_strategy == "coverage_weighted" else None,
+                coverage_boost_alpha=float(getattr(sampling_cfg, "coverage_boost_alpha", 0.15)),
+                coverage_boost_power=float(getattr(sampling_cfg, "coverage_boost_power", 1.0)),
+                failure_counts=failure_counts_by_tfbs,
+                avoid_failed_motifs=bool(getattr(sampling_cfg, "avoid_failed_motifs", False)),
+                failure_penalty_alpha=float(getattr(sampling_cfg, "failure_penalty_alpha", 0.5)),
+                failure_penalty_power=float(getattr(sampling_cfg, "failure_penalty_power", 1.0)),
+            )
+        else:
+            library, parts, reg_labels, info = sampler.generate_binding_site_subsample(
+                seq_len,
+                subsample_over,
+                required_tfbs=required_bias_motifs,
+                required_tfs=required_tfs_for_library,
+                cover_all_tfs=cover_all_tfs,
+                unique_binding_sites=unique_binding_sites,
+                max_sites_per_tf=max_sites_per_tf,
+                relax_on_exhaustion=relax_on_exhaustion,
+                allow_incomplete_coverage=allow_incomplete_coverage,
+            )
+        info.update(
+            {
+                "pool_strategy": pool_strategy,
+                "library_size": library_size,
+                "library_sampling_strategy": library_sampling_strategy,
+                "coverage_boost_alpha": float(getattr(sampling_cfg, "coverage_boost_alpha", 0.15)),
+                "coverage_boost_power": float(getattr(sampling_cfg, "coverage_boost_power", 1.0)),
+                "iterative_max_libraries": iterative_max_libraries,
+                "iterative_min_new_solutions": iterative_min_new_solutions,
+                "required_regulators_selected": required_regulators_selected if k_of_required else None,
+            }
+        )
+        site_id_by_index = info.get("site_id_by_index")
+        source_by_index = info.get("source_by_index")
+        return _finalize(
+            library,
+            parts,
+            reg_labels,
+            info,
+            site_id_by_index=site_id_by_index,
+            source_by_index=source_by_index,
+        )
+
+    if required_regulators or plan_min_count_by_regulator or min_required_regulators is not None:
+        preview = ", ".join(required_regulators[:10]) if required_regulators else "n/a"
+        raise ValueError(
+            "Regulator constraints are set (required/min_count/min_required) "
+            "but the input does not provide regulators. "
+            f"required_regulators={preview}."
+        )
+    all_sequences = [s for s in data_entries]
+    if not all_sequences:
+        raise ValueError(f"No sequences found for source {source_label}")
+    pool = list(dict.fromkeys(all_sequences)) if unique_binding_sites else list(all_sequences)
+    if pool_strategy == "full":
+        if required_bias_motifs:
+            missing = [m for m in required_bias_motifs if m not in pool]
+            if missing:
+                preview = ", ".join(missing[:10])
+                raise ValueError(f"Required side-bias motifs not found in sequences input: {preview}")
+        library = pool
+    else:
+        if library_size > len(pool):
+            raise ValueError(f"library_size={library_size} exceeds available unique sequences ({len(pool)}).")
+        take = min(max(1, int(library_size)), len(pool))
+        if required_bias_motifs:
+            missing = [m for m in required_bias_motifs if m not in pool]
+            if missing:
+                preview = ", ".join(missing[:10])
+                raise ValueError(f"Required side-bias motifs not found in sequences input: {preview}")
+            if take < len(required_bias_motifs):
+                raise ValueError(
+                    f"library_size={take} is smaller than required side_biases ({len(required_bias_motifs)})."
+                )
+            required_set = set(required_bias_motifs)
+            remaining = [s for s in pool if s not in required_set]
+            library = list(required_bias_motifs) + rng.sample(remaining, take - len(required_bias_motifs))
+        else:
+            library = rng.sample(pool, take)
+    tf_parts: list[str] = []
+    reg_labels: list[str] = []
+    info = {
+        "target_length": seq_len + subsample_over,
+        "achieved_length": sum(len(s) for s in library),
+        "relaxed_cap": False,
+        "final_cap": None,
+        "pool_strategy": pool_strategy,
+        "library_size": len(library) if pool_strategy == "full" else library_size,
+        "iterative_max_libraries": iterative_max_libraries,
+        "iterative_min_new_solutions": iterative_min_new_solutions,
+    }
+    return _finalize(library, tf_parts, reg_labels, info, site_id_by_index=None, source_by_index=None)
+
+
 def _compute_sampling_fraction(
     library: list[str],
     *,
@@ -1294,14 +1561,7 @@ def _process_plan_for_source(
     sampling_cfg = gen.sampling
     pool_strategy = str(sampling_cfg.pool_strategy)
-    library_size = int(sampling_cfg.library_size)
-    subsample_over = int(sampling_cfg.subsample_over_length_budget_by)
     library_sampling_strategy = str(sampling_cfg.library_sampling_strategy)
-    cover_all_tfs = bool(sampling_cfg.cover_all_regulators)
-    unique_binding_sites = bool(sampling_cfg.unique_binding_sites)
-    max_sites_per_tf = sampling_cfg.max_sites_per_regulator
-    relax_on_exhaustion = bool(sampling_cfg.relax_on_exhaustion)
-    allow_incomplete_coverage = bool(sampling_cfg.allow_incomplete_coverage)
     iterative_max_libraries = int(sampling_cfg.iterative_max_libraries)
     iterative_min_new_solutions = int(sampling_cfg.iterative_min_new_solutions)
     schema_is_22 = schema_version_at_least(global_cfg.schema_version, major=2, minor=2)
@@ -1431,6 +1691,7 @@ def _process_plan_for_source(
     mining_cfg = _sampling_attr(input_sampling_cfg, "mining")
     mining_batch_size = _mining_attr(mining_cfg, "batch_size")
     mining_max_batches = _mining_attr(mining_cfg, "max_batches")
+    mining_max_candidates = _mining_attr(mining_cfg, "max_candidates")
     mining_max_seconds = _mining_attr(mining_cfg, "max_seconds")
     mining_retain_bins = _mining_attr(mining_cfg, "retain_bin_ids")
     if length_range is not None:
@@ -1446,11 +1707,7 @@ def _process_plan_for_source(
     bins_label = "-"
     if scoring_backend == "fimo":
         bins_label = "canonical" if _sampling_attr(input_sampling_cfg, "pvalue_bins") is None else "custom"
-        bin_ids = (
-            mining_retain_bins
-            if mining_retain_bins is not None
-            else _sampling_attr(input_sampling_cfg, "pvalue_bin_ids")
-        )
+        bin_ids = mining_retain_bins
        if bin_ids:
            bins_label = f"{bins_label} retain={sorted(list(bin_ids))}"
     length_label = str(length_policy)
@@ -1459,10 +1716,20 @@ def _process_plan_for_source(
     cap_label = "-"
     if isinstance(n_sites, int) and isinstance(oversample, int):
         requested = n_sites * oversample
-        if max_candidates is not None:
-            cap_label = f"{max_candidates} (requested={requested})"
-        if max_seconds is not None:
-            cap_label = f"{cap_label}; max_seconds={max_seconds}" if cap_label != "-" else f"{max_seconds}s"
+        if scoring_backend == "fimo":
+            if mining_max_candidates is not None:
+                cap_label = f"{mining_max_candidates} (requested={requested})"
+            if mining_max_seconds is not None:
+                cap_label = (
+                    f"{cap_label}; max_seconds={mining_max_seconds}s"
+                    if cap_label != "-"
+                    else f"{mining_max_seconds}s"
+                )
+        else:
+            if max_candidates is not None:
+                cap_label = f"{max_candidates} (requested={requested})"
+            if max_seconds is not None:
+                cap_label = f"{cap_label}; max_seconds={max_seconds}" if cap_label != "-" else f"{max_seconds}s"
     counts_label = _summarize_tf_counts(meta_df["tf"].tolist())
     selection_label = selection_policy if scoring_backend == "fimo" else "-"
     mining_label = "-"
@@ -1472,6 +1739,8 @@ def _process_plan_for_source(
         parts.append(f"batch={mining_batch_size}")
         if mining_max_batches is not None:
             parts.append(f"max_batches={mining_max_batches}")
+        if mining_max_candidates is not None:
+
parts.append(f"max_candidates={mining_max_candidates}") if mining_max_seconds is not None: parts.append(f"max_seconds={mining_max_seconds}s") mining_label = ", ".join(parts) if parts else "enabled" @@ -1524,8 +1793,6 @@ def _process_plan_for_source( f"({k_required} > {len(required_regulators)})." ) metadata_min_counts = {tf: max(min_count_per_tf, int(val)) for tf, val in plan_min_count_by_regulator.items()} - side_left, side_right = _extract_side_biases(fixed_elements) - required_bias_motifs = list(dict.fromkeys([*side_left, *side_right])) fixed_elements_dump = _fixed_elements_dump(fixed_elements) fixed_elements_max_len = _max_fixed_element_len(fixed_elements_dump) @@ -1537,207 +1804,22 @@ def _process_plan_for_source( if pool_strategy != "iterative_subsample" and not one_subsample_only: max_per_subsample = quota - - def _build_library() -> tuple[list[str], list[str], list[str], dict]: - nonlocal libraries_built - if meta_df is not None and isinstance(meta_df, pd.DataFrame): - available_tfs = set(meta_df["tf"].tolist()) - missing = [t for t in required_regulators if t not in available_tfs] - if missing: - preview = ", ".join(missing[:10]) - raise ValueError(f"Required regulators not found in input: {preview}") - if plan_min_count_by_regulator: - missing_counts = [t for t in plan_min_count_by_regulator if t not in available_tfs] - if missing_counts: - preview = ", ".join(missing_counts[:10]) - raise ValueError(f"min_count_by_regulator TFs not found in input: {preview}") - if min_required_regulators is not None: - if not required_regulators and min_required_regulators > len(available_tfs): - raise ValueError( - f"min_required_regulators={min_required_regulators} exceeds available regulators " - f"({len(available_tfs)})." 
- ) - - if pool_strategy == "full": - lib_df = meta_df.copy() - if unique_binding_sites: - lib_df = lib_df.drop_duplicates(["tf", "tfbs"]) - if required_bias_motifs: - missing_bias = [m for m in required_bias_motifs if m not in set(lib_df["tfbs"])] - if missing_bias: - preview = ", ".join(missing_bias[:10]) - raise ValueError(f"Required side-bias motifs not found in input: {preview}") - lib_df = lib_df.reset_index(drop=True) - library = lib_df["tfbs"].tolist() - reg_labels = lib_df["tf"].tolist() - parts = [f"{tf}:{tfbs}" for tf, tfbs in zip(reg_labels, lib_df["tfbs"].tolist())] - site_id_by_index = lib_df["site_id"].tolist() if "site_id" in lib_df.columns else None - source_by_index = lib_df["source"].tolist() if "source" in lib_df.columns else None - info = { - "target_length": seq_len + subsample_over, - "achieved_length": sum(len(s) for s in library), - "relaxed_cap": False, - "final_cap": None, - "pool_strategy": pool_strategy, - "library_size": len(library), - "iterative_max_libraries": iterative_max_libraries, - "iterative_min_new_solutions": iterative_min_new_solutions, - } - libraries_built += 1 - info["library_index"] = libraries_built - info["library_hash"] = _hash_library(library, reg_labels, site_id_by_index, source_by_index) - info["site_id_by_index"] = site_id_by_index - info["source_by_index"] = source_by_index - return library, parts, reg_labels, info - - sampler = TFSampler(meta_df, np_rng) - required_regulators_selected = required_regulators - if k_of_required: - candidates = sorted(required_regulators) - if k_required is not None and k_required < len(candidates): - chosen = np_rng.choice(len(candidates), size=k_required, replace=False) - required_regulators_selected = sorted([candidates[int(i)] for i in chosen]) - else: - required_regulators_selected = candidates - required_tfs_for_library = list( - dict.fromkeys([*required_regulators_selected, *plan_min_count_by_regulator.keys()]) - ) - if min_required_regulators is not None and not 
required_regulators: - if pool_strategy in {"subsample", "iterative_subsample"}: - if library_size < int(min_required_regulators): - raise ValueError( - "library_size is too small to satisfy min_required_regulators when " - f"required_regulators is empty. library_size={library_size} " - f"min_required_regulators={min_required_regulators}. " - "Increase library_size or lower min_required_regulators." - ) - if pool_strategy in {"subsample", "iterative_subsample"}: - required_slots = len(required_bias_motifs) + len(required_tfs_for_library) - if library_size < required_slots: - raise ValueError( - "library_size is too small for required motifs. " - f"library_size={library_size} but required_tfbs={len(required_bias_motifs)} " - f"+ required_tfs={len(required_tfs_for_library)} " - f"(min_required_regulators={min_required_regulators}). " - "Increase library_size or relax required constraints." - ) - # Alignment (1,4): count-based library sizing with explicit sampling strategy under schema>=2.2. 
- if schema_is_22 and pool_strategy in {"subsample", "iterative_subsample"}: - failure_counts_by_tfbs: dict[tuple[str, str], int] | None = None - if library_sampling_strategy == "coverage_weighted" and sampling_cfg.avoid_failed_motifs: - failure_counts_by_tfbs = _aggregate_failure_counts_for_sampling( - failure_counts, - input_name=source_label, - plan_name=plan_name, - ) - library, parts, reg_labels, info = sampler.generate_binding_site_library( - library_size, - sequence_length=seq_len, - budget_overhead=subsample_over, - required_tfbs=required_bias_motifs, - required_tfs=required_tfs_for_library, - cover_all_tfs=cover_all_tfs, - unique_binding_sites=unique_binding_sites, - max_sites_per_tf=max_sites_per_tf, - relax_on_exhaustion=relax_on_exhaustion, - allow_incomplete_coverage=allow_incomplete_coverage, - sampling_strategy=library_sampling_strategy, - usage_counts=usage_counts if library_sampling_strategy == "coverage_weighted" else None, - coverage_boost_alpha=float(sampling_cfg.coverage_boost_alpha), - coverage_boost_power=float(sampling_cfg.coverage_boost_power), - failure_counts=failure_counts_by_tfbs, - avoid_failed_motifs=bool(sampling_cfg.avoid_failed_motifs), - failure_penalty_alpha=float(sampling_cfg.failure_penalty_alpha), - failure_penalty_power=float(sampling_cfg.failure_penalty_power), - ) - else: - library, parts, reg_labels, info = sampler.generate_binding_site_subsample( - seq_len, - subsample_over, - required_tfbs=required_bias_motifs, - required_tfs=required_tfs_for_library, - cover_all_tfs=cover_all_tfs, - unique_binding_sites=unique_binding_sites, - max_sites_per_tf=max_sites_per_tf, - relax_on_exhaustion=relax_on_exhaustion, - allow_incomplete_coverage=allow_incomplete_coverage, - ) - info.update( - { - "pool_strategy": pool_strategy, - "library_size": library_size, - "library_sampling_strategy": library_sampling_strategy, - "coverage_boost_alpha": float(sampling_cfg.coverage_boost_alpha), - "coverage_boost_power": 
float(sampling_cfg.coverage_boost_power), - "iterative_max_libraries": iterative_max_libraries, - "iterative_min_new_solutions": iterative_min_new_solutions, - "required_regulators_selected": required_regulators_selected if k_of_required else None, - } - ) - libraries_built += 1 - info["library_index"] = libraries_built - site_id_by_index = info.get("site_id_by_index") - source_by_index = info.get("source_by_index") - info["library_hash"] = _hash_library(library, reg_labels, site_id_by_index, source_by_index) - return library, parts, reg_labels, info - - # Sequence library (no regulators) - if required_regulators or plan_min_count_by_regulator or min_required_regulators is not None: - preview = ", ".join(required_regulators[:10]) if required_regulators else "n/a" - raise ValueError( - "Regulator constraints are set (required/min_count/min_required) " - "but the input does not provide regulators. " - f"required_regulators={preview}." - ) - all_sequences = [s for s in data_entries] - if not all_sequences: - raise ValueError(f"No sequences found for source {source_label}") - pool = list(dict.fromkeys(all_sequences)) if unique_binding_sites else list(all_sequences) - if pool_strategy == "full": - if required_bias_motifs: - missing = [m for m in required_bias_motifs if m not in pool] - if missing: - preview = ", ".join(missing[:10]) - raise ValueError(f"Required side-bias motifs not found in sequences input: {preview}") - library = pool - else: - if library_size > len(pool): - raise ValueError(f"library_size={library_size} exceeds available unique sequences ({len(pool)}).") - take = min(max(1, int(library_size)), len(pool)) - if required_bias_motifs: - missing = [m for m in required_bias_motifs if m not in pool] - if missing: - preview = ", ".join(missing[:10]) - raise ValueError(f"Required side-bias motifs not found in sequences input: {preview}") - if take < len(required_bias_motifs): - raise ValueError( - f"library_size={take} is smaller than required side_biases 
({len(required_bias_motifs)})." - ) - required_set = set(required_bias_motifs) - remaining = [s for s in pool if s not in required_set] - library = list(required_bias_motifs) + rng.sample(remaining, take - len(required_bias_motifs)) - else: - library = rng.sample(pool, take) - tf_parts: list[str] = [] - reg_labels: list[str] = [] - info = { - "target_length": seq_len + subsample_over, - "achieved_length": sum(len(s) for s in library), - "relaxed_cap": False, - "final_cap": None, - "pool_strategy": pool_strategy, - "library_size": len(library) if pool_strategy == "full" else library_size, - "iterative_max_libraries": iterative_max_libraries, - "iterative_min_new_solutions": iterative_min_new_solutions, - } - libraries_built += 1 - info["library_index"] = libraries_built - info["library_hash"] = _hash_library(library, reg_labels, None, None) - info["site_id_by_index"] = None - info["source_by_index"] = None - return library, tf_parts, reg_labels, info - - library_for_opt, tfbs_parts, regulator_labels, sampling_info = _build_library() + library_for_opt, tfbs_parts, regulator_labels, sampling_info = build_library_for_plan( + source_label=source_label, + plan_item=plan_item, + data_entries=data_entries, + meta_df=meta_df, + sampling_cfg=sampling_cfg, + seq_len=seq_len, + min_count_per_tf=min_count_per_tf, + usage_counts=usage_counts, + failure_counts=failure_counts if failure_counts else None, + rng=rng, + np_rng=np_rng, + schema_is_22=schema_is_22, + library_index_start=libraries_built, + ) + libraries_built = int(sampling_info.get("library_index", libraries_built)) site_id_by_index = sampling_info.get("site_id_by_index") source_by_index = sampling_info.get("source_by_index") sampling_library_index = sampling_info.get("library_index", 0) @@ -2583,7 +2665,22 @@ def _make_generator(_library_for_opt: List[str], _regulator_labels: List[str]): ) # New library - library_for_opt, tfbs_parts, regulator_labels, sampling_info = _build_library() + library_for_opt, tfbs_parts, 
regulator_labels, sampling_info = build_library_for_plan( + source_label=source_label, + plan_item=plan_item, + data_entries=data_entries, + meta_df=meta_df, + sampling_cfg=sampling_cfg, + seq_len=seq_len, + min_count_per_tf=min_count_per_tf, + usage_counts=usage_counts, + failure_counts=failure_counts if failure_counts else None, + rng=rng, + np_rng=np_rng, + schema_is_22=schema_is_22, + library_index_start=libraries_built, + ) + libraries_built = int(sampling_info.get("library_index", libraries_built)) site_id_by_index = sampling_info.get("site_id_by_index") source_by_index = sampling_info.get("source_by_index") sampling_library_index = sampling_info.get("library_index", sampling_library_index) diff --git a/src/dnadesign/densegen/src/core/reporting.py b/src/dnadesign/densegen/src/core/reporting.py index 1a35fa8b..c4049d75 100644 --- a/src/dnadesign/densegen/src/core/reporting.py +++ b/src/dnadesign/densegen/src/core/reporting.py @@ -565,20 +565,29 @@ def write_report( *, out_dir: str | Path = "outputs", include_combinatorics: bool = False, + formats: set[str] | None = None, ) -> ReportBundle: run_root = resolve_run_root(cfg_path, root_cfg.densegen.run.root) out_path = resolve_run_scoped_path(cfg_path, run_root, str(out_dir), label="report.out") out_path.mkdir(parents=True, exist_ok=True) bundle = collect_report_data(root_cfg, cfg_path, include_combinatorics=include_combinatorics) - report_path = out_path / "report.json" - report_path.write_text(json.dumps(bundle.run_report, indent=2, sort_keys=True)) - report_md = out_path / "report.md" - _write_report_md(report_md, bundle) + formats = {f.lower() for f in (formats or {"json", "md"})} + if "all" in formats: + formats = {"json", "md", "html"} + if "json" in formats: + report_path = out_path / "report.json" + report_path.write_text(json.dumps(bundle.run_report, indent=2, sort_keys=True)) + if "md" in formats: + report_md = out_path / "report.md" + _write_report_md(report_md, bundle) + if "html" in formats: + 
report_html = out_path / "report.html"
+        _write_report_html(report_html, bundle)
     return bundle
 
 
-def _write_report_md(path: Path, bundle: ReportBundle) -> None:
+def _render_report_md(bundle: ReportBundle) -> str:
     report = bundle.run_report
     lines = [
         "# DenseGen Report",
@@ -622,4 +631,32 @@ def _write_report_md(path: Path, bundle: ReportBundle) -> None:
             label = f"{tf}:{tfbs}" if tf else tfbs
             reason_suffix = f" (top reason: {reason})" if reason else ""
             lines.append(f"- {label} — failures={failures}{reason_suffix}")
-    path.write_text("\n".join(lines) + "\n")
+    return "\n".join(lines) + "\n"
+
+
+def _write_report_md(path: Path, bundle: ReportBundle) -> None:
+    path.write_text(_render_report_md(bundle))
+
+
+def _write_report_html(path: Path, bundle: ReportBundle) -> None:
+    md = _render_report_md(bundle)
+    body = md.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
+    html = "\n".join(
+        [
+            "<!DOCTYPE html>",
+            "<html>",
+            "<head>",
+            '<meta charset="utf-8">',
+            "<title>DenseGen Report</title>",
+            "<style>pre { white-space: pre-wrap; }</style>",
+            "</head>",
+            "<body>",
+            "<pre>",
+            body,
+            "</pre>",
+            "</body>",
+            "</html>",
+        ]
+    )
+    path.write_text(html)
diff --git a/src/dnadesign/densegen/src/integrations/meme_suite.py b/src/dnadesign/densegen/src/integrations/meme_suite.py
index 9abdb34c..bbc1579f 100644
--- a/src/dnadesign/densegen/src/integrations/meme_suite.py
+++ b/src/dnadesign/densegen/src/integrations/meme_suite.py
@@ -39,3 +39,13 @@ def resolve_executable(tool: str, *, tool_path: Path | None = None) -> Path | No
         return candidate
     found = shutil.which(tool)
     return Path(found) if found else None
+
+
+def require_executable(tool: str, *, tool_path: Path | None = None) -> Path:
+    exe = resolve_executable(tool, tool_path=tool_path)
+    if exe is None:
+        raise FileNotFoundError(
+            f"{tool} executable not found. Install MEME Suite and ensure `{tool}` is on PATH, "
+            "or set MEME_BIN to the MEME bin directory (pixi users: `pixi run dense ...`)."
+        )
+    return exe
diff --git a/src/dnadesign/densegen/tests/test_cli_config_option.py b/src/dnadesign/densegen/tests/test_cli_config_option.py
index 2f231cfe..44be5935 100644
--- a/src/dnadesign/densegen/tests/test_cli_config_option.py
+++ b/src/dnadesign/densegen/tests/test_cli_config_option.py
@@ -53,7 +53,7 @@ def test_validate_accepts_config_after_command(tmp_path: Path) -> None:
     cfg_path = tmp_path / "config.yaml"
     _write_min_config(cfg_path)
     runner = CliRunner()
-    result = runner.invoke(app, ["validate", "-c", str(cfg_path)])
+    result = runner.invoke(app, ["validate-config", "-c", str(cfg_path)])
     assert result.exit_code == 0, result.output
     assert "Config is valid" in result.output
 
@@ -62,6 +62,6 @@ def test_validate_reports_invalid_config(tmp_path: Path) -> None:
     cfg_path = tmp_path / "config.yaml"
     cfg_path.write_text("densegen:\n  inputs: []\n")
     runner = CliRunner()
-    result = runner.invoke(app, ["validate", "-c", str(cfg_path)])
+    result = runner.invoke(app, ["validate-config", "-c", str(cfg_path)])
     assert result.exit_code != 0, result.output
     assert "Config error" in result.output
diff --git 
a/src/dnadesign/densegen/tests/test_cli_describe.py b/src/dnadesign/densegen/tests/test_cli_describe.py index 6b2c9d1f..e0971164 100644 --- a/src/dnadesign/densegen/tests/test_cli_describe.py +++ b/src/dnadesign/densegen/tests/test_cli_describe.py @@ -53,7 +53,7 @@ def test_describe_outputs_summary(tmp_path: Path) -> None: cfg_path = tmp_path / "config.yaml" _write_min_config(cfg_path) runner = CliRunner() - result = runner.invoke(app, ["describe", "-c", str(cfg_path)]) + result = runner.invoke(app, ["inspect", "config", "-c", str(cfg_path)]) assert result.exit_code == 0, result.output assert "Config" in result.output assert "Gap fill" in result.output diff --git a/src/dnadesign/densegen/tests/test_cli_summarize_library.py b/src/dnadesign/densegen/tests/test_cli_summarize_library.py index 5445905a..670f0772 100644 --- a/src/dnadesign/densegen/tests/test_cli_summarize_library.py +++ b/src/dnadesign/densegen/tests/test_cli_summarize_library.py @@ -64,9 +64,9 @@ def _base_meta(library_hash: str, library_index: int) -> dict: "input_pwm_score_percentile": None, "input_pwm_pvalue_threshold": None, "input_pwm_pvalue_bins": None, - "input_pwm_pvalue_bin_ids": None, "input_pwm_mining_batch_size": None, "input_pwm_mining_max_batches": None, + "input_pwm_mining_max_candidates": None, "input_pwm_mining_max_seconds": None, "input_pwm_mining_retain_bin_ids": None, "input_pwm_mining_log_every_batches": None, @@ -234,7 +234,7 @@ def test_summarize_library_grouping(tmp_path: Path) -> None: manifest.write_json(run_manifest_path(run_root)) runner = CliRunner() - result = runner.invoke(app, ["summarize", "--run", str(run_root), "--library"]) + result = runner.invoke(app, ["inspect", "run", "--run", str(run_root), "--library"]) assert result.exit_code == 0, result.output assert "Library build summary" in result.output assert "abc123" in result.output diff --git a/src/dnadesign/densegen/tests/test_config_strict.py b/src/dnadesign/densegen/tests/test_config_strict.py index 
fe234a8e..1e437d9f 100644 --- a/src/dnadesign/densegen/tests/test_config_strict.py +++ b/src/dnadesign/densegen/tests/test_config_strict.py @@ -120,6 +120,29 @@ def test_promoter_constraint_motif_validation(tmp_path: Path) -> None: load_config(cfg_path) +def test_fimo_rejects_max_candidates(tmp_path: Path) -> None: + cfg = copy.deepcopy(MIN_CONFIG) + cfg["densegen"]["inputs"] = [ + { + "name": "motifs", + "type": "pwm_meme", + "path": "inputs.meme", + "sampling": { + "strategy": "stochastic", + "n_sites": 2, + "oversample_factor": 2, + "scoring_backend": "fimo", + "pvalue_threshold": 1e-4, + "max_candidates": 100, + "mining": {"batch_size": 10}, + }, + } + ] + cfg_path = _write(cfg, tmp_path / "cfg.yaml") + with pytest.raises(ConfigError, match="max_candidates is not used"): + load_config(cfg_path) + + def test_promoter_constraint_range_non_negative(tmp_path: Path) -> None: cfg = copy.deepcopy(MIN_CONFIG) cfg["densegen"]["generation"]["plan"] = [ diff --git a/src/dnadesign/densegen/tests/test_outputs_parquet.py b/src/dnadesign/densegen/tests/test_outputs_parquet.py index 83d5c7da..d3e53caf 100644 --- a/src/dnadesign/densegen/tests/test_outputs_parquet.py +++ b/src/dnadesign/densegen/tests/test_outputs_parquet.py @@ -59,9 +59,9 @@ def _dummy_meta() -> dict: "input_pwm_score_percentile": None, "input_pwm_pvalue_threshold": None, "input_pwm_pvalue_bins": None, - "input_pwm_pvalue_bin_ids": None, "input_pwm_mining_batch_size": None, "input_pwm_mining_max_batches": None, + "input_pwm_mining_max_candidates": None, "input_pwm_mining_max_seconds": None, "input_pwm_mining_retain_bin_ids": None, "input_pwm_mining_log_every_batches": None, diff --git a/src/dnadesign/densegen/tests/test_pwm_fimo_utils.py b/src/dnadesign/densegen/tests/test_pwm_fimo_utils.py index 6195073f..8b778a0d 100644 --- a/src/dnadesign/densegen/tests/test_pwm_fimo_utils.py +++ b/src/dnadesign/densegen/tests/test_pwm_fimo_utils.py @@ -70,7 +70,10 @@ def test_parse_fimo_tsv_and_best_hits() -> None: assert 
best["cand1"].pvalue == pytest.approx(0.5) -@pytest.mark.skipif(resolve_executable("fimo", tool_path=None) is None, reason="fimo executable not available") +@pytest.mark.skipif( + resolve_executable("fimo", tool_path=None) is None, + reason="fimo executable not available (run tests via `pixi run pytest` or set MEME_BIN).", +) def test_run_fimo_smoke(tmp_path: Path) -> None: motif = PWMMotif( motif_id="M1", diff --git a/src/dnadesign/densegen/tests/test_pwm_sampling_mining.py b/src/dnadesign/densegen/tests/test_pwm_sampling_mining.py index 9895fdcb..4b653038 100644 --- a/src/dnadesign/densegen/tests/test_pwm_sampling_mining.py +++ b/src/dnadesign/densegen/tests/test_pwm_sampling_mining.py @@ -3,6 +3,7 @@ from pathlib import Path import numpy as np +import pytest from dnadesign.densegen.src.adapters.sources import pwm_fimo from dnadesign.densegen.src.adapters.sources.pwm_sampling import PWMMotif, sample_pwm_sites @@ -78,3 +79,31 @@ def fake_run_fimo(*, meme_motif_path, fasta_path, **_kwargs): # type: ignore[ov info = meta[seq] assert info["fimo_bin_id"] == 0 assert info["fimo_matched_sequence"] == "AAA" + + +def test_pwm_sampling_fimo_mining_max_candidates_guard() -> None: + motif = PWMMotif( + motif_id="M2", + matrix=[ + {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}, + {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}, + ], + background={"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}, + ) + rng = np.random.default_rng(0) + with pytest.raises(ValueError, match="mining.max_candidates must be >= n_sites"): + sample_pwm_sites( + rng, + motif, + strategy="stochastic", + n_sites=5, + oversample_factor=1, + max_candidates=None, + max_seconds=None, + score_threshold=None, + score_percentile=None, + scoring_backend="fimo", + pvalue_threshold=1e-2, + mining={"batch_size": 2, "max_candidates": 2}, + selection_policy="random_uniform", + ) diff --git a/src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml b/src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml index 
cd0bcb45..3fed86a2 100644 --- a/src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml +++ b/src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml @@ -22,12 +22,12 @@ densegen: strategy: stochastic n_sites: 80 oversample_factor: 200 - max_candidates: 20000 # bounded candidate generation (cap across mining batches) scoring_backend: fimo pvalue_threshold: 1e-4 selection_policy: stratified mining: batch_size: 5000 + max_candidates: 20000 max_batches: 4 retain_bin_ids: [0, 1, 2, 3] log_every_batches: 1 From 4d5eed71a0e2ba8c64a780d3febf65ed3da83c19 Mon Sep 17 00:00:00 2001 From: Eric South Date: Tue, 20 Jan 2026 13:07:02 -0500 Subject: [PATCH 08/40] densegen docs: clarify FIMO mining workflow --- src/dnadesign/densegen/README.md | 6 +- .../densegen/docs/demo/demo_basic.md | 68 ++++++--- .../densegen/docs/dev/improvements.md | 2 +- .../densegen/docs/guide/generation.md | 29 +++- src/dnadesign/densegen/docs/guide/inputs.md | 35 +++-- .../densegen/docs/guide/outputs-metadata.md | 2 +- .../densegen/docs/guide/workspace.md | 4 +- src/dnadesign/densegen/docs/reference/cli.md | 144 ++++++++++++------ .../densegen/docs/reference/config.md | 16 +- .../densegen/docs/reference/outputs.md | 17 +++ .../docs/workflows/cruncher_pwm_pipeline.md | 4 +- src/dnadesign/densegen/workspaces/README.md | 2 +- 12 files changed, 227 insertions(+), 102 deletions(-) diff --git a/src/dnadesign/densegen/README.md b/src/dnadesign/densegen/README.md index be196836..228eed89 100644 --- a/src/dnadesign/densegen/README.md +++ b/src/dnadesign/densegen/README.md @@ -21,10 +21,10 @@ FIMO-backed PWM sampling is supported when MEME Suite is available (`fimo` on PA Stratified FIMO sampling uses canonical p‑value bins by default; see the guide for mining workflows. 
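The README context above leans on FIMO p-value bins; the sequence-level conversion behind them, `p_seq = 1 − (1 − p_win)^n_windows` (stated in the cruncher docs earlier in this series), can be sketched as follows. This is an illustrative helper only — the function name is invented and it is not code from these patches:

```python
def sequence_pvalue(p_win: float, n_windows: int) -> float:
    """Convert a best-window p-value to a sequence-level p-value.

    Illustrative only: treats the n scanned windows as independent trials,
    per the formula in the cruncher docs.
    """
    if not (0.0 <= p_win <= 1.0) or n_windows < 1:
        raise ValueError("p_win must be in [0, 1] and n_windows >= 1")
    # P(at least one window beats the threshold) = 1 - P(no window does)
    return 1.0 - (1.0 - p_win) ** n_windows
```

For small `p_win` this is approximately `n_windows * p_win` (a Bonferroni-style bound), so longer sequences inflate the sequence-level p unless the per-window threshold is tightened.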
```bash -uv run dense validate -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml -uv run dense describe -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml +pixi run dense validate-config -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml +uv run dense inspect inputs -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml pixi run dense run -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml --no-plot -uv run dense summarize -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml --library --top-per-tf 5 +uv run dense inspect run -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml --library --top-per-tf 5 uv run dense plot -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml --only tf_usage,tf_coverage ``` diff --git a/src/dnadesign/densegen/docs/demo/demo_basic.md b/src/dnadesign/densegen/docs/demo/demo_basic.md index 15c4377e..91bd3ad3 100644 --- a/src/dnadesign/densegen/docs/demo/demo_basic.md +++ b/src/dnadesign/densegen/docs/demo/demo_basic.md @@ -11,12 +11,13 @@ and uses the dense-arrays CBC backend. All paths are explicit; missing files fai - [2) Stage a workspace](#2-stage-a-workspace) - copy inputs and rewrite paths. - [3) Validate config](#3-validate-config) - schema and sanity checks. - [4) Plan constraints](#4-plan-constraints) - see resolved quotas and constraint buckets. -- [5) Describe the resolved run](#5-describe-the-resolved-run) - verify inputs, outputs, solver. -- [6) Run generation](#6-run-generation) - produce sequences and metadata. -- [7) Summarize the run](#7-summarize-the-run) - review run-level counts. -- [8) Audit report](#8-audit-report) - build offered-vs-used tables. -- [9) Inspect outputs](#9-inspect-outputs) - list Parquet artifacts. -- [10) Plot analysis](#10-plot-analysis) - render tf_usage and tf_coverage. +- [5) Inspect the resolved run config](#5-inspect-the-resolved-run-config) - verify inputs, outputs, solver. 
+- [6) (Optional) Stage‑A + Stage‑B previews](#6-optional-stagea--stageb-previews) - preview pools and libraries. +- [7) Run generation](#7-run-generation) - produce sequences and metadata. +- [8) Inspect run summary](#8-inspect-run-summary) - review run-level counts. +- [9) Audit report](#9-audit-report) - build offered-vs-used tables. +- [10) Inspect outputs](#10-inspect-outputs) - list Parquet artifacts. +- [11) Plot analysis](#11-plot-analysis) - render tf_usage and tf_coverage. - [Appendix (optional)](#appendix-optional) - PWM sampling + USR output. ## 0) Prereqs @@ -29,7 +30,7 @@ uv sync --locked This demo uses **FIMO** (MEME Suite) to adjudicate strong motif matches. Ensure `fimo` is on PATH or set `MEME_BIN` to the MEME bin directory. If you use pixi, run commands via -`pixi run dense ...` so MEME tools are available (recommended for the run step). +`pixi run dense ...` so MEME tools are available (recommended for validation + run steps). All commands below assume you are at the repo root. We will write the demo run to a scratch directory; set a run root: @@ -55,7 +56,13 @@ src/dnadesign/densegen/workspaces/demo_meme_two_tf/inputs/cpxR.txt These are MEME files parsed with Cruncher’s MEME parser (DenseGen reuses the same parsing logic for DRY). The demo uses LexA + CpxR motifs and exercises PWM sampling bounds. Sampling uses FIMO p-values to define “strong” matches and `selection_policy: stratified` to balance -across canonical p‑value bins (see the input-stage sampling table in `dense describe`). +across canonical p‑value bins (see the input-stage sampling table in `dense inspect inputs`). 
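As an aside on `selection_policy: stratified` mentioned above: stratified selection can be pictured as round-robin draws across the p-value bins until the quota is filled, so no single bin dominates. A minimal sketch under that assumption — names and behavior here are illustrative, not DenseGen's implementation:

```python
import random


def stratified_select(bins: dict[int, list[str]], k: int, seed: int = 0) -> list[str]:
    """Draw up to k sites round-robin across p-value bins.

    Illustrative only: `bins` maps bin id -> candidate sites, with bin 0
    holding the smallest (strongest) p-values.
    """
    rng = random.Random(seed)
    # Copy and shuffle each bin so draws within a bin are random.
    pools = {b: list(sites) for b, sites in bins.items() if sites}
    for pool in pools.values():
        rng.shuffle(pool)
    selected: list[str] = []
    while pools and len(selected) < k:
        for b in sorted(pools):  # visit strongest bin first each pass
            if len(selected) >= k:
                break
            selected.append(pools[b].pop())
            if not pools[b]:
                del pools[b]
    return selected
```

If one bin is exhausted, later passes simply keep drawing from the remaining bins, which matches the intent of balancing across bins rather than enforcing exact per-bin quotas.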
+ +Inspect the resolved inputs + Stage‑A sampling table: + +```bash +pixi run dense inspect inputs -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml +``` ### 1b) (Optional) Rebuild inputs from Cruncher @@ -84,7 +91,7 @@ Stage a self-contained workspace from the demo template (this copies inputs and paths): ```bash -uv run dense stage --id demo_press --root "$RUN_ROOT" \ +uv run dense workspace init --id demo_press --root "$RUN_ROOT" \ --template src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml \ --copy-inputs ``` @@ -102,7 +109,7 @@ Parquet schema mismatch. Either delete `outputs/dense_arrays.parquet` + ## 3) Validate config ```bash -uv run dense validate -c /private/tmp/densegen-demo-20260115-1405/demo_press/config.yaml +pixi run dense validate-config -c /private/tmp/densegen-demo-20260115-1405/demo_press/config.yaml ``` Example output: @@ -114,7 +121,7 @@ Example output: ## 4) Plan constraints ```bash -uv run dense plan -c /private/tmp/densegen-demo-20260115-1405/demo_press/config.yaml +uv run dense inspect plan -c /private/tmp/densegen-demo-20260115-1405/demo_press/config.yaml ``` Example output: @@ -127,12 +134,12 @@ Example output: └──────┴───────┴──────────────────────────┘ ``` -## 5) Describe the resolved run +## 5) Inspect the resolved run config This step shows the resolved inputs, outputs, solver selection, and the two-stage sampling knobs. ```bash -uv run dense describe -c /private/tmp/densegen-demo-20260115-1405/demo_press/config.yaml +uv run dense inspect config -c /private/tmp/densegen-demo-20260115-1405/demo_press/config.yaml ``` Example output (abridged): @@ -156,7 +163,22 @@ Solver-stage library sampling ... ``` -## 6) Run generation +## 6) (Optional) Stage‑A + Stage‑B previews + +Stage‑A: materialize the TFBS pool (FIMO mining + stratified selection). 
This is useful when +you want to inspect mining yields per p‑value bin before running the solver: + +```bash +pixi run dense stage-a build-pool -c /private/tmp/densegen-demo-20260115-1405/demo_press/config.yaml +``` + +Stage‑B: build a solver library from the pool without running the solver: + +```bash +pixi run dense stage-b build-libraries -c /private/tmp/densegen-demo-20260115-1405/demo_press/config.yaml +``` + +## 7) Run generation ```bash pixi run dense run -c /private/tmp/densegen-demo-20260115-1405/demo_press/config.yaml --no-plot @@ -172,7 +194,7 @@ Example output (abridged): 2026-01-15 14:02:02 | INFO | dnadesign.densegen.src.utils.logging_utils | Logging initialized (level=INFO) Quota plan: meme_demo=50 2026-01-15 14:02:02 | INFO | dnadesign.densegen.src.adapters.optimizer.dense_arrays | Solver selected: CBC -2026-01-15 14:02:05 | INFO | dnadesign.densegen.src.adapters.sources.pwm_sampling | FIMO yield for motif lexA: hits=960 accepted=120 selected=80 bins=(0e+00,1e-10]:0 ... selected_bins=(0e+00,1e-10]:0 ... +2026-01-15 14:02:05 | INFO | dnadesign.densegen.src.adapters.sources.pwm_sampling | FIMO yield for motif lexA: hits=120 accepted=120 selected=80 bins=(0e+00,1e-10]:40 (1e-10,1e-08]:35 ... selected_bins=(0e+00,1e-10]:26 ... 2026-01-15 14:02:06 | INFO | dnadesign.densegen.src.core.pipeline | [demo/demo] 2/50 (4.00%) (local 2/2) CR=1.050 | seq ATTGACAGTAAACCTGCGGGAAATATAATTTACTCCGTATTTGCACATGGTTATCCACAG 2026-01-15 14:02:05 | INFO | dnadesign.densegen.src.core.pipeline | Inputs manifest written: /private/tmp/densegen-demo-20260115-1405/demo_press/outputs/meta/inputs_manifest.json 🎉 Run complete. @@ -181,13 +203,13 @@ Quota plan: meme_demo=50 On macOS you may see Arrow sysctl warnings after generation; they are emitted by pyarrow and do not indicate a DenseGen failure. -## 7) Summarize the run +## 8) Inspect run summary DenseGen writes `outputs/meta/run_manifest.json` and `outputs/meta/inputs_manifest.json`. 
Summarize the run manifest: ```bash -uv run dense summarize --run /private/tmp/densegen-demo-20260115-1405/demo_press +uv run dense inspect run --run /private/tmp/densegen-demo-20260115-1405/demo_press ``` Example output: @@ -205,7 +227,7 @@ Use `--verbose` for constraint-failure breakdowns and duplicate-solution counts. Use `--library` to print offered-vs-used summaries for quick debugging: ```bash -uv run dense summarize --run /private/tmp/densegen-demo-20260115-1405/demo_press --library --top-per-tf 5 +uv run dense inspect run --run /private/tmp/densegen-demo-20260115-1405/demo_press --library --top-per-tf 5 ``` This library summary is the quickest way to audit which TFBS were offered vs @@ -214,17 +236,17 @@ used in the solver stage (Stage‑B sampling). If any solutions are rejected, DenseGen writes them to `outputs/attempts.parquet` in the run root. -## 8) Audit report +## 9) Audit report Generate an audit-grade summary of the run: ```bash -uv run dense report -c /private/tmp/densegen-demo-20260115-1405/demo_press/config.yaml +uv run dense report -c /private/tmp/densegen-demo-20260115-1405/demo_press/config.yaml --format all ``` -This writes `outputs/report.json` and `outputs/report.md`. +This writes `outputs/report.json`, `outputs/report.md`, and `outputs/report.html`. -## 9) Inspect outputs +## 10) Inspect outputs List the generated Parquet artifacts: @@ -251,7 +273,7 @@ Example output: attempts.parquet ``` -## 10) Plot analysis +## 11) Plot analysis First, list the available plots: diff --git a/src/dnadesign/densegen/docs/dev/improvements.md b/src/dnadesign/densegen/docs/dev/improvements.md index bacf195a..a6f86d33 100644 --- a/src/dnadesign/densegen/docs/dev/improvements.md +++ b/src/dnadesign/densegen/docs/dev/improvements.md @@ -121,7 +121,7 @@ Implement in phases aligned with impact and merge risk. - solver backend/strategy/options summary - optional histograms (compression_ratio, gc_total) -10. 
Add a CLI command like dense summarize (or extend workspace listing) to read and pretty-print the manifest. +10. Add a CLI command like dense inspect run (or extend workspace listing) to read and pretty-print the manifest. ### Phase 4 - Performance / resilience (optional but worthwhile) diff --git a/src/dnadesign/densegen/docs/guide/generation.md b/src/dnadesign/densegen/docs/guide/generation.md index ab7f6cba..041c458c 100644 --- a/src/dnadesign/densegen/docs/guide/generation.md +++ b/src/dnadesign/densegen/docs/guide/generation.md @@ -103,7 +103,7 @@ Key fields: Notes: - `pool_strategy: full` uses a single library (no resampling) and ignores `library_size`, `subsample_over_length_budget_by`, - and related sampling caps/strategies (DenseGen warns in `dense validate`/`dense plan`). + and related sampling caps/strategies (DenseGen warns in `dense validate-config`/`dense inspect plan`). - Under schema `2.2+`, `subsample` can resample reactively on stalls/duplicate guards. - `iterative_subsample` resamples proactively after `arrays_generated_before_resample` or when a library under-produces. @@ -111,6 +111,21 @@ Notes: - `coverage_weighted` dynamically boosts underused TFBS based on the run’s usage counts. - `avoid_failed_motifs: true` down-weights TFBS that repeatedly appear in failed solve attempts (tracked in attempts.parquet). +### Stage‑A vs Stage‑B sampling (mental model) + +**Stage‑A (input sampling)** lives under `densegen.inputs[].sampling` and defines how TFBS pools +are generated from PWMs (e.g., DenseGen log‑odds vs FIMO p‑values, thresholds, mining limits, +length policy). Stage‑A produces the realized TFBS pool (`input_tfbs_count`), which is cached +once per run and reused across round‑robin passes. + +**Stage‑B (library sampling)** lives under `densegen.generation.sampling` and selects a **solver +library** from the Stage‑A pool (or from a binding‑site table / sequence library). 
This is where
+`pool_strategy`, `library_size`, and sampling strategies (tf‑balanced, uniform over pairs,
+coverage‑weighted) apply. Stage‑B is the only stage where resampling happens.
+
+Use `dense stage-a build-pool` to materialize pools and `dense stage-b build-libraries` to preview
+solver libraries without running the solver.
+
 ### Run scheduling (round‑robin)
 
 `runtime.round_robin` controls **scheduling**, not sampling. When enabled, DenseGen interleaves plan
@@ -125,6 +140,18 @@ uses the same policy per plan, but round‑robin can trigger more frequent libra
 Input PWM sampling is performed **once per run** and cached across round‑robin passes. If you need
 a fresh PWM sample, start a new run (or stage a new workspace).
 
+### Runtime policy knobs (resampling + stop conditions)
+
+Key `runtime.*` controls:
+- `arrays_generated_before_resample` — number of successful arrays to emit before forcing a new
+  library (for iterative subsampling).
+- `stall_seconds_before_resample` — idle time with no new solutions before resampling.
+- `stall_warning_every_seconds` — how often to log stall warnings.
+- `max_resample_attempts` / `max_total_resamples` — caps on resample retries.
+- `max_seconds_per_plan` — time budget per plan item (0 = no limit).
+- `max_failed_solutions` / `max_duplicate_solutions` — guardrails to stop when failure/duplicate
+  counts are too high.
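Taken together, the knobs above might be set like this. This is a sketch: the key names come from this guide, while the nesting under `densegen.runtime` and the specific values are illustrative, not recommended defaults:

```yaml
densegen:
  runtime:
    arrays_generated_before_resample: 5
    stall_seconds_before_resample: 30
    stall_warning_every_seconds: 10
    max_resample_attempts: 3
    max_total_resamples: 20
    max_seconds_per_plan: 0        # 0 = no time limit
    max_failed_solutions: 100
    max_duplicate_solutions: 100
```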
+ --- ### Regulator constraints diff --git a/src/dnadesign/densegen/docs/guide/inputs.md b/src/dnadesign/densegen/docs/guide/inputs.md index 63911198..0d73925d 100644 --- a/src/dnadesign/densegen/docs/guide/inputs.md +++ b/src/dnadesign/densegen/docs/guide/inputs.md @@ -99,15 +99,15 @@ Required sampling fields: - `score_threshold` or `score_percentile` (exactly one; densegen backend only) - `pvalue_threshold` (float in (0, 1]; fimo backend only) - `oversample_factor`: oversampling multiplier for candidate generation -- `max_candidates` (optional): cap on candidate generation; helps bound long motifs -- `max_seconds` (optional): time limit for candidate generation per batch (best-effort cap) +- `max_candidates` (optional): cap on candidate generation; helps bound long motifs (**densegen** backend only) +- `max_seconds` (optional): time limit for candidate generation per batch (best-effort cap; **densegen** backend only) - `selection_policy`: `random_uniform | top_n | stratified` (default: `random_uniform`; fimo only) - `pvalue_bins` (optional): list of p‑value bin edges (strictly increasing; must end with `1.0`) -- `pvalue_bin_ids` (deprecated; use `mining.retain_bin_ids`) -- `mining` (optional; fimo only): batch/time controls for mining with FIMO +- `mining` (fimo only): batch/time controls for mining with FIMO - `batch_size` (int > 0): candidates per batch - - `max_batches` (optional int > 0): limit batches per motif - - `max_seconds` (optional float > 0): limit total mining time per motif + - `max_batches` (optional int > 0): limit batches per motif (quota-style) + - `max_candidates` (optional int > 0): total candidates per motif (quota-style) + - `max_seconds` (optional float > 0; default 60s): limit total mining time per motif - `retain_bin_ids` (optional list of ints): keep only specific p‑value bins - `log_every_batches` (int > 0): log yield summaries every N batches - `bgfile` (optional): MEME bfile-format background model for FIMO @@ -124,6 +124,12 @@ 
Notes: - `selection_policy: stratified` uses fixed p‑value bins to balance strong/weak matches. - Canonical p‑value bins (default): `[1e-10, 1e-8, 1e-6, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]`. Bin 0 is `(0, 1e-10]`, bin 1 is `(1e-10, 1e-8]`, etc. +- FIMO mining defaults to **time-based** limits (`mining.max_seconds: 60`). To switch to a quota, + set `mining.max_seconds: null` and use `mining.max_candidates` or `mining.max_batches` + (with `mining.batch_size`) as the primary cap. +- `mining.max_candidates` must be >= `n_sites`; DenseGen fails fast otherwise. +- If you omit `mining` entirely, DenseGen uses the default mining settings (batch size + time cap) + for FIMO-backed sampling. #### FIMO p-values (beginner-friendly) - A **p-value** is the probability that a random sequence (under the background model) @@ -136,7 +142,7 @@ Notes: specific affinity ranges). - FIMO adds per‑TFBS metadata columns: `fimo_score`, `fimo_pvalue`, `fimo_start`, `fimo_stop`, `fimo_strand`, `fimo_bin_id`, `fimo_bin_low`, `fimo_bin_high`, and (optionally) - `fimo_matched_sequence` (the best‑hit window within the TFBS). + `fimo_matched_sequence` (the best‑hit window within the TFBS; includes strand-aware match). - `length_policy` defaults to `exact`. Use `length_policy: range` with `length_range: [min, max]` to sample variable lengths (min must be >= motif length). - `trim_window_length` optionally trims the PWM to a max‑information window before sampling (useful @@ -177,9 +183,9 @@ inputs: selection_policy: top_n n_sites: 80 oversample_factor: 200 - max_candidates: 20000 mining: batch_size: 5000 + max_candidates: 20000 max_batches: 4 retain_bin_ids: [0, 1, 2, 3] log_every_batches: 1 @@ -189,13 +195,16 @@ inputs: If you want to **mine** sequences across affinity strata, use `selection_policy: stratified` plus canonical p‑value bins and the `mining` block. A typical workflow: -1) Oversample candidates (`oversample_factor`, `max_candidates`) and score with FIMO in batches - (`mining.batch_size`). 
+1) Oversample candidates (`oversample_factor`) or set a direct quota (`mining.max_candidates`), + then score with FIMO in batches (`mining.batch_size`). 2) Accept candidates using `pvalue_threshold` (global strength cutoff). 3) Use `mining.retain_bin_ids` to select one or more bins (e.g., moderate matches only). -4) Repeat runs (or increase `mining.max_batches` / `mining.max_seconds`) to accumulate a deduplicated - reservoir of sequences per bin. -5) Use `dense summarize --library` to inspect which TFBS were offered vs used in Stage‑B sampling. +4) Repeat runs (or increase `mining.max_candidates` / `mining.max_batches` / `mining.max_seconds`) + to accumulate a deduplicated reservoir of sequences per bin. By default mining runs for 60 + seconds per motif; set `mining.max_seconds: null` to make quotas the primary cap. +5) Use `dense stage-a build-pool` to materialize the pool, then `dense stage-b build-libraries` + to preview Stage‑B library sampling without running the solver. +6) Use `dense inspect run --library` to inspect which TFBS were offered vs used in Stage‑B sampling. DenseGen reports per‑bin yield summaries (hits, accepted, selected) for retained bins only (or all bins if `retain_bin_ids` is unset), so you can track how many candidates land in each stratum and diff --git a/src/dnadesign/densegen/docs/guide/outputs-metadata.md b/src/dnadesign/densegen/docs/guide/outputs-metadata.md index 2aec0c0e..69c9eb23 100644 --- a/src/dnadesign/densegen/docs/guide/outputs-metadata.md +++ b/src/dnadesign/densegen/docs/guide/outputs-metadata.md @@ -36,7 +36,7 @@ the full outputs. 
Use the CLI to summarize a run: ``` -uv run dense summarize --run path/to/run +uv run dense inspect run --run path/to/run ``` --- diff --git a/src/dnadesign/densegen/docs/guide/workspace.md b/src/dnadesign/densegen/docs/guide/workspace.md index 0f2a30bd..d7048b48 100644 --- a/src/dnadesign/densegen/docs/guide/workspace.md +++ b/src/dnadesign/densegen/docs/guide/workspace.md @@ -63,7 +63,7 @@ plots: When a run is complete, archive or sync the workspace as a unit. -Tip: use `dense stage --id ` to scaffold a new workspace. Use -`dense summarize --root workspaces/_archive` to inspect archived workspaces. +Tip: use `dense workspace init --id ` to scaffold a new workspace. Use +`dense inspect run --root workspaces/_archive` to inspect archived workspaces. @e-south diff --git a/src/dnadesign/densegen/docs/reference/cli.md b/src/dnadesign/densegen/docs/reference/cli.md index bc1a9b69..a1f64a55 100644 --- a/src/dnadesign/densegen/docs/reference/cli.md +++ b/src/dnadesign/densegen/docs/reference/cli.md @@ -7,15 +7,18 @@ the run root. USR is optional and is only imported when configured. ### Contents - [Invocation](#invocation) - how to call the CLI. - [Config option](#config-option) - global or per-command config path. -- [Commands](#commands) - validate, plan, describe, run, plot, and utilities. -- [`dense validate`](#dense-validate) - schema and sanity checks. -- [`dense plan`](#dense-plan) - resolved quota plan. -- [`dense describe`](#dense-describe) - resolved inputs, outputs, and solver. +- [Commands](#commands) - validate, inspect, stage helpers, run, plot, report. +- [`dense validate-config`](#dense-validate-config) - schema and sanity checks. +- [`dense inspect inputs`](#dense-inspect-inputs) - resolved inputs + PWM sampling summary. +- [`dense inspect plan`](#dense-inspect-plan) - resolved quota plan. +- [`dense inspect config`](#dense-inspect-config) - resolved inputs/outputs/solver details. 
+- [`dense inspect run`](#dense-inspect-run) - summarize run manifests or list workspaces. +- [`dense stage-a build-pool`](#dense-stage-a-build-pool) - build TFBS pools (Stage‑A). +- [`dense stage-b build-libraries`](#dense-stage-b-build-libraries) - build solver libraries (Stage‑B). +- [`dense workspace init`](#dense-workspace-init) - scaffold a workspace. - [`dense run`](#dense-run) - end-to-end generation. - [`dense plot`](#dense-plot) - render plots from outputs. - [`dense ls-plots`](#dense-ls-plots) - list available plots. -- [`dense stage`](#dense-stage) - scaffold a workspace. -- [`dense summarize`](#dense-summarize) - summarize outputs/meta/run_manifest.json or list workspaces. - [`dense report`](#dense-report) - write audit-grade report summary. - [Examples](#examples) - common command sequences. @@ -35,8 +38,8 @@ python -m dnadesign.densegen --help - `-c, --config PATH` - config YAML path. Defaults to `src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml` inside the package. - - May be passed globally (`dense -c path validate`) or per command - (`dense validate -c path`). + - May be passed globally (`dense -c path inspect inputs`) or per command + (`dense inspect inputs -c path`). Input paths resolve against the config file directory. Outputs and logs must resolve inside `densegen.run.root` (run-scoped I/O). Config files must include `densegen.schema_version` @@ -46,54 +49,87 @@ inside `densegen.run.root` (run-scoped I/O). Config files must include `densegen ### Commands -### `dense validate` +### `dense validate-config` Validate the config YAML (schema + sanity checks). Fails fast on unknown keys or invalid values. + Options: - `--probe-solver` - also probe the solver backend (fails fast if unavailable). --- -#### `dense plan` +#### `dense inspect inputs` +Print resolved inputs plus a PWM sampling summary (Stage‑A details). + +--- + +#### `dense inspect plan` Print the resolved quota plan per constraint bucket. 
--- -#### `dense describe` +#### `dense inspect config` Summarize resolved inputs, outputs, plan items, and solver settings. + Options: - `--show-constraints` - print full fixed elements per plan item. - `--probe-solver` - verify the solver backend before reporting. --- -#### `dense run` -Run the full generation pipeline. +#### `dense inspect run` +Summarize a run manifest (`outputs/meta/run_manifest.json`) or list workspaces. Options: -- `--no-plot` - skip auto-plotting even if `plots` is configured in YAML. -- `--log-file PATH` - override the log file path. Otherwise DenseGen writes - to `logging.log_dir/.log` inside the workspace. The override path - must still resolve inside `densegen.run.root`. -Notes: -- If you enable `scoring_backend: fimo`, run via `pixi run dense ...` (or ensure `fimo` is on PATH). +- `--run` - workspace directory (defaults to `densegen.run.root` from config). +- `--root` - list workspaces under a root directory. +- `--limit` - limit workspaces displayed when using `--root`. +- `--all` - include directories without `config.yaml` when using `--root`. +- `--config` - config path (used to resolve run root when `--run` is not set). +- `--verbose` - show failure breakdown columns (constraint filters + duplicate solutions). +- `--library` - include offered-vs-used summaries (TF/TFBS usage). +- `--top` - number of rows to show in library summaries. +- `--by-library/--no-by-library` - group library summaries per build attempt. +- `--top-per-tf` - limit TFBS rows per TF when summarizing. +- `--show-library-hash/--short-library-hash` - toggle full vs short library hashes. + +Tip: +- For large runs, prefer `--no-by-library` or lower `--top`/`--top-per-tf` to keep output readable. --- -#### `dense plot` -Generate plots from existing outputs. +#### `dense stage-a build-pool` +Build Stage‑A TFBS pools from inputs and write a pool manifest. Options: -- `--only NAME1,NAME2` - run a subset of plots by name. 
+- `--out` - output directory relative to run root (default: `outputs/pools`). +- `--input/-i` - input name(s) to build (defaults to all). +- `--overwrite` - overwrite existing pool files. + +Outputs: +- `pool_manifest.json` +- `__pool.parquet` per input --- -#### `dense ls-plots` -List available plot names and descriptions. +#### `dense stage-b build-libraries` +Build Stage‑B libraries (one per input + plan) from pools or inputs. + +Options: +- `--out` - output directory relative to run root (default: `outputs/libraries`). +- `--pool` - optional pool directory from `stage-a build-pool` (defaults to reading inputs). +- `--input/-i` - input name(s) to build (defaults to all). +- `--plan/-p` - plan item name(s) to build (defaults to all). +- `--overwrite` - overwrite existing `library_builds.parquet`. + +Outputs: +- `library_builds.parquet` +- `library_manifest.json` --- -#### `dense stage` +#### `dense workspace init` Stage a new workspace with `config.yaml`, `inputs/`, `outputs/`, plus `outputs/logs/` and `outputs/meta/`. + Options: - `--id` - run identifier (directory name). - `--root` - workspaces root directory (default: package `workspaces/` directory). @@ -102,43 +138,55 @@ Options: --- -#### `dense summarize` -Summarize a run manifest (`outputs/meta/run_manifest.json`). +#### `dense run` +Run the full generation pipeline. + Options: -- `--run` - workspace directory (defaults to `densegen.run.root` from config). -- `--root` - list workspaces under a root directory. -- `--limit` - limit workspaces displayed when using `--root`. -- `--all` - include directories without `config.yaml` when using `--root`. -- `--config` - config path (used to resolve run root when `--run` is not set). -- `--verbose` - show failure breakdown columns (constraint filters + duplicate solutions). -- `--library` - include offered-vs-used summaries (TF/TFBS usage). -- `--top` - number of rows to show in library summaries. 
-- `--by-library/--no-by-library` - group library summaries per build attempt. -- `--top-per-tf` - limit TFBS rows per TF when summarizing. -- `--show-library-hash/--short-library-hash` - toggle full vs short library hashes. -Tip: -- For large runs, prefer `--no-by-library` or lower `--top`/`--top-per-tf` to keep output readable. +- `--no-plot` - skip auto-plotting even if `plots` is configured in YAML. +- `--log-file PATH` - override the log file path. Otherwise DenseGen writes + to `logging.log_dir/.log` inside the workspace. The override path + must still resolve inside `densegen.run.root`. + +Notes: +- If you enable `scoring_backend: fimo`, run via `pixi run dense ...` (or ensure `fimo` is on PATH). + +--- + +#### `dense plot` +Generate plots from existing outputs. + +Options: +- `--only NAME1,NAME2` - run a subset of plots by name. + +--- + +#### `dense ls-plots` +List available plot names and descriptions. --- #### `dense report` Generate an audit-grade report summary for a run. Outputs are run-scoped under `outputs/` by default. + Options: +- `--run` - run directory (defaults to config run root). - `--out` - output directory relative to run root (default: `outputs`). +- `--format` - `json`, `md`, `html`, or `all` (comma-separated allowed). 
--- ### Examples ```bash -uv run dense validate -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml -uv run dense plan -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml -uv run dense describe -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml -uv run dense run -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml -uv run dense plot -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml --only tf_usage,tf_coverage,tfbs_positional_histogram,diversity_health -uv run dense summarize --run src/dnadesign/densegen/workspaces/demo_meme_two_tf -uv run dense summarize --root src/dnadesign/densegen/workspaces -uv run dense report -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml +pixi run dense validate-config -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml +uv run dense inspect inputs -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml +uv run dense inspect plan -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml +uv run dense inspect config -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml +uv run dense run -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml +uv run dense plot -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml --only tf_usage,tf_coverage,tfbs_positional_histogram,diversity_health +uv run dense inspect run --run src/dnadesign/densegen/workspaces/demo_meme_two_tf +uv run dense inspect run --root src/dnadesign/densegen/workspaces +uv run dense report -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml --format all ``` Demo run (small, Parquet-only config): diff --git a/src/dnadesign/densegen/docs/reference/config.md b/src/dnadesign/densegen/docs/reference/config.md index 0fd049a7..3ed9f2a7 100644 --- a/src/dnadesign/densegen/docs/reference/config.md +++ b/src/dnadesign/densegen/docs/reference/config.md @@ -60,18 +60,19 @@ PWM inputs perform **input sampling** (sampling sites 
from PWMs) via - `strategy`: `consensus | stochastic | background` - `n_sites` (int > 0) - `oversample_factor` (int > 0) - - `max_candidates` (optional int > 0; caps candidate generation) - - `max_seconds` (optional float > 0; time limit for candidate generation) + - `max_candidates` (optional int > 0; caps candidate generation; **densegen** backend only) + - `max_seconds` (optional float > 0; time limit for candidate generation; **densegen** backend only) - `scoring_backend`: `densegen | fimo` (default: `densegen`) - `score_threshold` or `score_percentile` (exactly one; **densegen** backend only) - `pvalue_threshold` (float in (0, 1]; **fimo** backend only) - `selection_policy`: `random_uniform | top_n | stratified` (default: `random_uniform`; fimo only) - `pvalue_bins` (optional list of floats; must end with `1.0`) - p‑value bin edges for stratified sampling - - `pvalue_bin_ids` (deprecated; use `mining.retain_bin_ids`) - - `mining` (optional; fimo only) - batch/time controls for mining via FIMO: + - `mining` (fimo only) - batch/time controls for mining via FIMO: - `batch_size` (int > 0; default 100000) - candidates per FIMO batch - `max_batches` (optional int > 0) - max batches per motif - - `max_seconds` (optional float > 0) - max seconds per motif mining loop + - `max_candidates` (optional int > 0) - total candidates to generate per motif (quota mode) + (must be >= `n_sites`) + - `max_seconds` (optional float > 0; default 60s) - max seconds per motif mining loop - `retain_bin_ids` (optional list of ints) - select p‑value bins to retain (0‑based indices); retained bins are the only bins reported in yield summaries - `log_every_batches` (int > 0; default 1) - log per‑bin yield summaries every N batches @@ -90,8 +91,9 @@ PWM inputs perform **input sampling** (sampling sites from PWMs) via - FIMO runs log per‑bin yield summaries (hits, accepted, selected). If `retain_bin_ids` is set, only those bins are reported; otherwise all bins are reported. 
`selection_policy: stratified` makes the selected‑bin distribution explicit for mining workflows. - - When `mining` is enabled, `max_seconds` caps per‑batch candidate generation while - `mining.max_seconds` caps the overall mining loop. + - For `scoring_backend: fimo`, use `mining.max_seconds` (time mode) or + `mining.max_candidates`/`mining.max_batches` (quota mode). The default is + `mining.max_seconds: 60`. Set `mining.max_seconds: null` to make quotas the primary cap. - `type: pwm_meme_set` - `paths` - list of MEME PWM files (merged into a single TF pool) - `motif_ids` (optional list) - choose motifs by ID across files diff --git a/src/dnadesign/densegen/docs/reference/outputs.md b/src/dnadesign/densegen/docs/reference/outputs.md index 23cef834..994603a9 100644 --- a/src/dnadesign/densegen/docs/reference/outputs.md +++ b/src/dnadesign/densegen/docs/reference/outputs.md @@ -98,9 +98,26 @@ The `dense report` command writes a compact audit summary under `outputs/`: - `outputs/report.json` - `outputs/report.md` +- `outputs/report.html` (basic HTML wrapper for quick sharing) These summarize run scope and link to the canonical outputs (`dense_arrays.parquet` and `attempts.parquet`). +Use `dense report --format json|md|html|all` to control which files are emitted. + +--- + +### Stage helper outputs (optional) + +DenseGen can materialize Stage‑A/Stage‑B artifacts without running the solver: + +- `dense stage-a build-pool` writes: + - `outputs/pools/pool_manifest.json` + - `outputs/pools/__pool.parquet` +- `dense stage-b build-libraries` writes: + - `outputs/libraries/library_builds.parquet` + - `outputs/libraries/library_manifest.json` + +These are optional inspection artifacts and are not required for a normal `dense run`. 
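For quick scripting against these helper outputs, a manifest walker might look like the following. Note this is a guess at the manifest's shape — the `pools` and `path` field names are assumptions, not a documented schema:

```python
import json
from pathlib import Path

def pool_paths(manifest_path: Path) -> list[Path]:
    """List pool Parquet files named by a pool manifest (hypothetical schema).

    Assumed shape: {"pools": [{"input_name": ..., "path": ...}, ...]},
    with paths relative to the manifest's directory.
    """
    manifest = json.loads(manifest_path.read_text())
    root = manifest_path.parent
    return [root / entry["path"] for entry in manifest.get("pools", [])]
```

From there, each pool file can be loaded with any Parquet reader (e.g. `pandas.read_parquet`) for inspection.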
--- diff --git a/src/dnadesign/densegen/docs/workflows/cruncher_pwm_pipeline.md b/src/dnadesign/densegen/docs/workflows/cruncher_pwm_pipeline.md index 78d10cbc..151a97ae 100644 --- a/src/dnadesign/densegen/docs/workflows/cruncher_pwm_pipeline.md +++ b/src/dnadesign/densegen/docs/workflows/cruncher_pwm_pipeline.md @@ -52,8 +52,8 @@ after `arrays_generated_before_resample` or when a library under-produces. ### 4) Run DenseGen ```bash -uv run dense validate -c path/to/config.yaml -uv run dense describe -c path/to/config.yaml +pixi run dense validate-config -c path/to/config.yaml +uv run dense inspect config -c path/to/config.yaml uv run dense run -c path/to/config.yaml --no-plot ``` diff --git a/src/dnadesign/densegen/workspaces/README.md b/src/dnadesign/densegen/workspaces/README.md index a1c6e85b..13e80a8f 100644 --- a/src/dnadesign/densegen/workspaces/README.md +++ b/src/dnadesign/densegen/workspaces/README.md @@ -12,4 +12,4 @@ Archived or legacy artifacts live under `_archive/` so the active workspace list The canonical demo lives under `demo_meme_two_tf/` and uses MEME motif files copied from the basic Cruncher demo workspace (`inputs/local_motifs`). DenseGen reads these with the shared Cruncher MEME parser to keep parsing DRY and consistent. -Use `dense summarize --root workspaces/_archive` if you want to inspect archived workspaces. +Use `dense inspect run --root workspaces/_archive` if you want to inspect archived workspaces. 
From 59b5e535ab8850b2f22513aa00b2f00af0eec225 Mon Sep 17 00:00:00 2001 From: Eric South Date: Tue, 20 Jan 2026 15:14:07 -0500 Subject: [PATCH 09/40] densegen: add pool/library artifacts and audit reporting --- .../densegen/docs/demo/demo_basic.md | 29 +- src/dnadesign/densegen/docs/guide/inputs.md | 3 + .../densegen/docs/guide/outputs-metadata.md | 41 ++ .../densegen/docs/guide/workspace.md | 2 + src/dnadesign/densegen/docs/reference/cli.md | 33 +- .../densegen/docs/reference/config.md | 4 +- .../densegen/docs/reference/outputs.md | 25 +- .../densegen/src/adapters/outputs/parquet.py | 2 + .../src/adapters/sources/binding_sites.py | 12 + .../src/adapters/sources/pwm_artifact.py | 31 +- .../src/adapters/sources/pwm_artifact_set.py | 31 +- .../src/adapters/sources/pwm_jaspar.py | 31 +- .../src/adapters/sources/pwm_matrix_csv.py | 31 +- .../densegen/src/adapters/sources/pwm_meme.py | 31 +- .../src/adapters/sources/pwm_meme_set.py | 31 +- src/dnadesign/densegen/src/cli.py | 354 ++++++++++-------- src/dnadesign/densegen/src/config/__init__.py | 7 +- .../densegen/src/core/artifacts/ids.py | 76 ++++ .../densegen/src/core/artifacts/library.py | 97 +++++ .../densegen/src/core/artifacts/pool.py | 228 +++++++++++ .../densegen/src/core/metadata_schema.py | 4 +- src/dnadesign/densegen/src/core/pipeline.py | 335 ++++++++++++++++- src/dnadesign/densegen/src/core/reporting.py | 73 ++++ src/dnadesign/densegen/src/core/sampler.py | 16 + .../densegen/tests/test_artifacts_ids.py | 33 ++ .../densegen/tests/test_artifacts_library.py | 61 +++ .../densegen/tests/test_artifacts_pool.py | 83 ++++ .../densegen/tests/test_cli_workspace_init.py | 112 ++++++ .../densegen/tests/test_used_tfbs_offsets.py | 2 + .../workspaces/demo_meme_two_tf/config.yaml | 2 +- 30 files changed, 1597 insertions(+), 223 deletions(-) create mode 100644 src/dnadesign/densegen/src/core/artifacts/ids.py create mode 100644 src/dnadesign/densegen/src/core/artifacts/library.py create mode 100644 
src/dnadesign/densegen/src/core/artifacts/pool.py create mode 100644 src/dnadesign/densegen/tests/test_artifacts_ids.py create mode 100644 src/dnadesign/densegen/tests/test_artifacts_library.py create mode 100644 src/dnadesign/densegen/tests/test_artifacts_pool.py create mode 100644 src/dnadesign/densegen/tests/test_cli_workspace_init.py diff --git a/src/dnadesign/densegen/docs/demo/demo_basic.md b/src/dnadesign/densegen/docs/demo/demo_basic.md index 91bd3ad3..9adf7e50 100644 --- a/src/dnadesign/densegen/docs/demo/demo_basic.md +++ b/src/dnadesign/densegen/docs/demo/demo_basic.md @@ -99,7 +99,7 @@ uv run dense workspace init --id demo_press --root "$RUN_ROOT" \ Example output: ```text -✨ Run staged: /private/tmp/densegen-demo-20260115-1405/demo_press/config.yaml +✨ Workspace staged: /private/tmp/densegen-demo-20260115-1405/demo_press/config.yaml ``` If you re-run the demo in the same run root and DenseGen’s schema has changed, you may see a @@ -215,7 +215,7 @@ uv run dense inspect run --run /private/tmp/densegen-demo-20260115-1405/demo_pre Example output: ```text -Run: demo_press Root: /private/tmp/densegen-demo-20260115-1405/demo_press Schema: 2.3 dense-arrays: () +Run: demo_press Root: /private/tmp/densegen-demo-20260115-1405/demo_press Schema: 2.4 dense-arrays: () ┏━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┓ ┃ input ┃ plan ┃ generated ┃ duplica… ┃ failed ┃ resamples ┃ librari… ┃ stalls ┃ ┡━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩ @@ -248,29 +248,30 @@ This writes `outputs/report.json`, `outputs/report.md`, and `outputs/report.html ## 10) Inspect outputs -List the generated Parquet artifacts: +List the generated Parquet artifacts and manifests: ```bash -ls /private/tmp/densegen-demo-20260115-1405/demo_press/outputs/dense_arrays.parquet +ls /private/tmp/densegen-demo-20260115-1405/demo_press/outputs ``` Example output: ```text -_densegen_ids.sqlite 
-part-10ca57ae0c1d410d8b88206d194a2ff1.parquet +attempts.parquet +composition.parquet +dense_arrays.parquet +libraries +pools +report.html +report.json +report.md ``` -Inspect the library manifests: +Inspect Stage‑A pools and Stage‑B libraries: ```bash -ls /private/tmp/densegen-demo-20260115-1405/demo_press/outputs -``` - -Example output: - -```text -attempts.parquet +ls /private/tmp/densegen-demo-20260115-1405/demo_press/outputs/pools +ls /private/tmp/densegen-demo-20260115-1405/demo_press/outputs/libraries ``` ## 11) Plot analysis diff --git a/src/dnadesign/densegen/docs/guide/inputs.md b/src/dnadesign/densegen/docs/guide/inputs.md index 0d73925d..3ad6b50d 100644 --- a/src/dnadesign/densegen/docs/guide/inputs.md +++ b/src/dnadesign/densegen/docs/guide/inputs.md @@ -124,6 +124,9 @@ Notes: - `selection_policy: stratified` uses fixed p‑value bins to balance strong/weak matches. - Canonical p‑value bins (default): `[1e-10, 1e-8, 1e-6, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]`. Bin 0 is `(0, 1e-10]`, bin 1 is `(1e-10, 1e-8]`, etc. +- For FIMO, the candidate target is `n_sites * oversample_factor`, but mining caps or time limits + can stop early. Expect fewer candidates if `mining.max_seconds`, `mining.max_batches`, or + `mining.max_candidates` are binding. - FIMO mining defaults to **time-based** limits (`mining.max_seconds: 60`). To switch to a quota, set `mining.max_seconds: null` and use `mining.max_candidates` or `mining.max_batches` (with `mining.batch_size`) as the primary cap. diff --git a/src/dnadesign/densegen/docs/guide/outputs-metadata.md b/src/dnadesign/densegen/docs/guide/outputs-metadata.md index 69c9eb23..4f7e0065 100644 --- a/src/dnadesign/densegen/docs/guide/outputs-metadata.md +++ b/src/dnadesign/densegen/docs/guide/outputs-metadata.md @@ -50,6 +50,38 @@ per-motif site counts to make sampling behavior explicit. 
 ---
 
+### Stage‑A pools (TFBS pool artifacts)
+
+DenseGen materializes Stage‑A pools under `outputs/pools/`:
+
+- `outputs/pools/pool_manifest.json` — manifest of pool files by input.
+- `outputs/pools/<input_name>__pool.parquet` — TFBS pools (or sequence pools).
+
+TFBS pools include stable `motif_id` and `tfbs_id` hashes plus optional FIMO metadata
+(`fimo_pvalue`, `fimo_bin_id`, etc.). Sequence pools include `tfbs_id` for joinability.
+
+---
+
+### Library artifacts (Stage‑B)
+
+DenseGen writes Stage‑B libraries under `outputs/libraries/`:
+
+- `library_builds.parquet` — one row per library build (index, hash, size, strategy).
+- `library_members.parquet` — normalized membership table (one row per TFBS in each library).
+- `library_manifest.json` — manifest + schema version.
+
+These artifacts provide a stable join path from solver attempts to the exact library contents.
+
+---
+
+### Composition table
+
+DenseGen writes `outputs/composition.parquet`, one row per TFBS placement in each accepted
+sequence. Columns include `sequence_id`, `input_name`, `plan_name`, `library_index`,
+`tf`, `tfbs`, `motif_id`, `tfbs_id`, and placement offsets.
+
+---
+
 ### Run state (checkpoint)
 
 DenseGen writes `outputs/meta/run_state.json` during execution. This checkpoint captures
@@ -57,6 +89,14 @@ per-input/plan progress so long runs can resume safely after interruption.
 
 ---
 
+### Events log
+
+DenseGen writes `outputs/meta/events.jsonl` (JSON lines) with structured events:
+`POOL_BUILT`, `LIBRARY_BUILT`, `STALL_DETECTED`, and `RESAMPLE_TRIGGERED`.
+This is a lightweight, machine-readable trace of the run’s control flow.
+
+---
+
 ### Attempts log
 
 DenseGen writes `outputs/attempts.parquet`, a consolidated log of solver attempts (success,
@@ -85,6 +125,7 @@ source = densegen:{input_name}:{plan_name}
 
 This is always present and is separate from metadata. Detailed placement provenance lives in
 `densegen__used_tfbs_detail` and the run-scoped library manifests.
+`densegen__used_tfbs_detail` includes `motif_id` and `tfbs_id` when available. --- diff --git a/src/dnadesign/densegen/docs/guide/workspace.md b/src/dnadesign/densegen/docs/guide/workspace.md index d7048b48..cabad8c2 100644 --- a/src/dnadesign/densegen/docs/guide/workspace.md +++ b/src/dnadesign/densegen/docs/guide/workspace.md @@ -65,5 +65,7 @@ When a run is complete, archive or sync the workspace as a unit. Tip: use `dense workspace init --id ` to scaffold a new workspace. Use `dense inspect run --root workspaces/_archive` to inspect archived workspaces. +If your config references local motif files, add `--copy-inputs` so the workspace +remains self-contained (or update paths in `config.yaml` after staging). @e-south diff --git a/src/dnadesign/densegen/docs/reference/cli.md b/src/dnadesign/densegen/docs/reference/cli.md index a1f64a55..b17b1c18 100644 --- a/src/dnadesign/densegen/docs/reference/cli.md +++ b/src/dnadesign/densegen/docs/reference/cli.md @@ -43,7 +43,7 @@ python -m dnadesign.densegen --help Input paths resolve against the config file directory. Outputs and logs must resolve inside `densegen.run.root` (run-scoped I/O). Config files must include `densegen.schema_version` -(currently `2.3`) and `densegen.run`. +(currently `2.4`) and `densegen.run`. --- @@ -112,17 +112,18 @@ Outputs: --- #### `dense stage-b build-libraries` -Build Stage‑B libraries (one per input + plan) from pools or inputs. +Build Stage‑B libraries (one per input + plan) from Stage‑A pools. Options: - `--out` - output directory relative to run root (default: `outputs/libraries`). -- `--pool` - optional pool directory from `stage-a build-pool` (defaults to reading inputs). +- `--pool` - pool directory from `stage-a build-pool` (defaults to `outputs/pools` in the workspace). - `--input/-i` - input name(s) to build (defaults to all). - `--plan/-p` - plan item name(s) to build (defaults to all). -- `--overwrite` - overwrite existing `library_builds.parquet`. 
+- `--overwrite` - overwrite existing library artifacts. Outputs: - `library_builds.parquet` +- `library_members.parquet` - `library_manifest.json` --- @@ -178,15 +179,21 @@ Options: ### Examples ```bash -pixi run dense validate-config -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml -uv run dense inspect inputs -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml -uv run dense inspect plan -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml -uv run dense inspect config -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml -uv run dense run -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml -uv run dense plot -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml --only tf_usage,tf_coverage,tfbs_positional_histogram,diversity_health -uv run dense inspect run --run src/dnadesign/densegen/workspaces/demo_meme_two_tf -uv run dense inspect run --root src/dnadesign/densegen/workspaces -uv run dense report -c src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml --format all +RUN_ROOT=/tmp/densegen-demo-$(date +%Y%m%d-%H%M) +uv run dense workspace init --id demo_press --root "$RUN_ROOT" \ + --template src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml \ + --copy-inputs +CFG="$RUN_ROOT/demo_press/config.yaml" + +pixi run dense validate-config -c "$CFG" +uv run dense inspect inputs -c "$CFG" +uv run dense inspect plan -c "$CFG" +uv run dense inspect config -c "$CFG" +uv run dense run -c "$CFG" +uv run dense plot -c "$CFG" --only tf_usage,tf_coverage,tfbs_positional_histogram,diversity_health +uv run dense inspect run --run "$RUN_ROOT/demo_press" +uv run dense inspect run --root "$RUN_ROOT" +uv run dense report -c "$CFG" --format all ``` Demo run (small, Parquet-only config): diff --git a/src/dnadesign/densegen/docs/reference/config.md b/src/dnadesign/densegen/docs/reference/config.md index 3ed9f2a7..554f23d8 100644 --- a/src/dnadesign/densegen/docs/reference/config.md +++ 
b/src/dnadesign/densegen/docs/reference/config.md @@ -23,7 +23,7 @@ for conceptual flow. ### Top-level - `densegen` (required) -- `densegen.schema_version` (required; supported: `2.1`, `2.2`, `2.3`) +- `densegen.schema_version` (required; supported: `2.1`, `2.2`, `2.3`, `2.4`) - `densegen.run` (required; run-scoped I/O root) - `plots` (optional; required `source` when `output.targets` has multiple sinks) @@ -268,7 +268,7 @@ binding-site and PWM-sampled inputs. ```yaml densegen: - schema_version: "2.3" + schema_version: "2.4" run: id: demo root: "." diff --git a/src/dnadesign/densegen/docs/reference/outputs.md b/src/dnadesign/densegen/docs/reference/outputs.md index 994603a9..544ac5ad 100644 --- a/src/dnadesign/densegen/docs/reference/outputs.md +++ b/src/dnadesign/densegen/docs/reference/outputs.md @@ -83,12 +83,24 @@ These are produced alongside Parquet/USR outputs and provide a compact audit tra --- -### Library provenance (attempts log) +### Events log + +DenseGen writes `outputs/meta/events.jsonl` (JSON lines) with structured events +for pool builds, library builds, stalls, and resamples. This is a lightweight +machine-readable trace of runtime control flow. + +--- + +### Library provenance (library artifacts + attempts) + +DenseGen records solver library provenance in two places: + +- `outputs/libraries/library_builds.parquet` + `library_members.parquet` (canonical library artifacts). +- `outputs/attempts.parquet` (attempt-level audit log with offered library lists). -DenseGen now records solver library provenance exclusively in `outputs/attempts.parquet`. Each attempt row stores the full library offered to the solver (`library_tfbs`, `library_tfs`, `library_site_ids`, `library_sources`) along with the library hash/index and solver status. -Output records carry `densegen__sampling_library_hash` so you can join placements to attempts. +Output records carry `densegen__sampling_library_hash` so you can join placements to libraries. 
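The join described above can be sketched with pandas. The column names (`densegen__sampling_library_hash`, `library_hash`) come from this doc; the helper function and file paths are illustrative assumptions, not densegen API:

```python
# Sketch: join output records to the canonical library artifacts via the
# sampling library hash. `join_placements_to_libraries` is a hypothetical
# helper; the join keys are the columns named in the docs above.
import pandas as pd

def join_placements_to_libraries(arrays: pd.DataFrame, builds: pd.DataFrame) -> pd.DataFrame:
    """Left-join accepted sequences onto the library builds they were sampled from."""
    return arrays.merge(
        builds,
        left_on="densegen__sampling_library_hash",
        right_on="library_hash",
        how="left",
    )

# Typical usage against a run's outputs directory (paths assumed):
# arrays = pd.read_parquet("outputs/dense_arrays.parquet")
# builds = pd.read_parquet("outputs/libraries/library_builds.parquet")
# joined = join_placements_to_libraries(arrays, builds)
```

A left join keeps every output record even if a library build row is missing, which makes gaps in provenance visible as nulls rather than silently dropping sequences.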
--- @@ -115,9 +127,11 @@ DenseGen can materialize Stage‑A/Stage‑B artifacts without running the solve - `outputs/pools/__pool.parquet` - `dense stage-b build-libraries` writes: - `outputs/libraries/library_builds.parquet` + - `outputs/libraries/library_members.parquet` - `outputs/libraries/library_manifest.json` -These are optional inspection artifacts and are not required for a normal `dense run`. +Stage‑B expects Stage‑A pools (default `outputs/pools`). These are optional inspection artifacts +and are not required for a normal `dense run`. --- @@ -130,7 +144,8 @@ densegen:{input_name}:{plan_name} ``` Per-placement provenance (TFBS, offsets, orientations) is recorded in -`densegen__used_tfbs_detail` and the attempts log. +`densegen__used_tfbs_detail` (including `motif_id`/`tfbs_id`), `outputs/composition.parquet`, +and the attempts log. --- diff --git a/src/dnadesign/densegen/src/adapters/outputs/parquet.py b/src/dnadesign/densegen/src/adapters/outputs/parquet.py index 0751907b..a64335da 100644 --- a/src/dnadesign/densegen/src/adapters/outputs/parquet.py +++ b/src/dnadesign/densegen/src/adapters/outputs/parquet.py @@ -107,6 +107,8 @@ def _meta_arrow_type(name: str, pa): [ pa.field("tf", pa.string()), pa.field("tfbs", pa.string()), + pa.field("motif_id", pa.string()), + pa.field("tfbs_id", pa.string()), pa.field("orientation", pa.string()), pa.field("offset", pa.int64()), pa.field("offset_raw", pa.int64()), diff --git a/src/dnadesign/densegen/src/adapters/sources/binding_sites.py b/src/dnadesign/densegen/src/adapters/sources/binding_sites.py index 6c74a022..dd8b6479 100644 --- a/src/dnadesign/densegen/src/adapters/sources/binding_sites.py +++ b/src/dnadesign/densegen/src/adapters/sources/binding_sites.py @@ -19,6 +19,7 @@ import pandas as pd +from ...core.artifacts.ids import hash_label_motif, hash_tfbs_id from .base import BaseDataSource, infer_format, resolve_path log = logging.getLogger(__name__) @@ -125,6 +126,17 @@ def load_data(self, *, rng=None, 
outputs_root: Path | None = None): if source_col: out["source"] = df[source_col].astype(str).str.strip() + motif_id_map = {tf: hash_label_motif(label=tf, source_kind="binding_sites") for tf in tf_clean.unique()} + out["motif_id"] = tf_clean.map(motif_id_map) + out["tfbs_id"] = [ + hash_tfbs_id( + motif_id=motif_id_map[tf], + sequence=seq, + scoring_backend="binding_sites", + ) + for tf, seq in zip(tf_clean.tolist(), seq_clean.tolist()) + ] + out = out.reset_index(drop=True) source_default = str(data_path) src_vals = out.get("source") diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_artifact.py b/src/dnadesign/densegen/src/adapters/sources/pwm_artifact.py index 1339aa68..d29783ae 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_artifact.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_artifact.py @@ -18,6 +18,7 @@ from pathlib import Path from typing import Any, List +from ...core.artifacts.ids import hash_pwm_motif, hash_tfbs_id from .base import BaseDataSource, resolve_path from .pwm_sampling import PWMMotif, normalize_background, sample_pwm_sites @@ -160,6 +161,12 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): raise FileNotFoundError(f"PWM artifact not found. 
Looked here:\n - {artifact_path}") motif = _load_artifact(artifact_path) + motif_hash = hash_pwm_motif( + motif_label=motif.motif_id, + matrix=motif.matrix, + background=motif.background, + source_kind="pwm_artifact", + ) sampling = dict(self.sampling or {}) strategy = str(sampling.get("strategy", "stochastic")) @@ -228,9 +235,27 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): rows = [] for seq in selected: - row = {"tf": motif.motif_id, "tfbs": seq, "source": str(artifact_path)} - if meta_by_seq: - row.update(meta_by_seq.get(seq, {})) + meta = meta_by_seq.get(seq, {}) if meta_by_seq else {} + start = meta.get("fimo_start") + stop = meta.get("fimo_stop") + strand = meta.get("fimo_strand") + tfbs_id = hash_tfbs_id( + motif_id=motif_hash, + sequence=seq, + scoring_backend=scoring_backend, + matched_start=int(start) if start is not None else None, + matched_stop=int(stop) if stop is not None else None, + matched_strand=str(strand) if strand is not None else None, + ) + row = { + "tf": motif.motif_id, + "tfbs": seq, + "source": str(artifact_path), + "motif_id": motif_hash, + "tfbs_id": tfbs_id, + } + if meta: + row.update(meta) rows.append(row) df_out = pd.DataFrame(rows) return entries, df_out diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_artifact_set.py b/src/dnadesign/densegen/src/adapters/sources/pwm_artifact_set.py index 9a9353af..c3b620a5 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_artifact_set.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_artifact_set.py @@ -16,6 +16,7 @@ from pathlib import Path from typing import List +from ...core.artifacts.ids import hash_pwm_motif, hash_tfbs_id from .base import BaseDataSource, resolve_path from .pwm_artifact import load_artifact from .pwm_sampling import sample_pwm_sites @@ -54,6 +55,12 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): entries = [] all_rows = [] for motif, path in zip(motifs, resolved): + motif_hash = hash_pwm_motif( + 
motif_label=motif.motif_id, + matrix=motif.matrix, + background=motif.background, + source_kind="pwm_artifact_set", + ) sampling_cfg = sampling override = overrides.get(motif.motif_id) if override: @@ -120,9 +127,27 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): for seq in selected: entries.append((motif.motif_id, seq, str(path))) - row = {"tf": motif.motif_id, "tfbs": seq, "source": str(path)} - if meta_by_seq: - row.update(meta_by_seq.get(seq, {})) + meta = meta_by_seq.get(seq, {}) if meta_by_seq else {} + start = meta.get("fimo_start") + stop = meta.get("fimo_stop") + strand = meta.get("fimo_strand") + tfbs_id = hash_tfbs_id( + motif_id=motif_hash, + sequence=seq, + scoring_backend=scoring_backend, + matched_start=int(start) if start is not None else None, + matched_stop=int(stop) if stop is not None else None, + matched_strand=str(strand) if strand is not None else None, + ) + row = { + "tf": motif.motif_id, + "tfbs": seq, + "source": str(path), + "motif_id": motif_hash, + "tfbs_id": tfbs_id, + } + if meta: + row.update(meta) all_rows.append(row) import pandas as pd diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_jaspar.py b/src/dnadesign/densegen/src/adapters/sources/pwm_jaspar.py index bb08ba6d..e0062364 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_jaspar.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_jaspar.py @@ -17,6 +17,7 @@ from pathlib import Path from typing import List, Optional +from ...core.artifacts.ids import hash_pwm_motif, hash_tfbs_id from .base import BaseDataSource, resolve_path from .pwm_sampling import PWMMotif, normalize_background, sample_pwm_sites @@ -133,6 +134,12 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): entries = [] all_rows = [] for motif in motifs: + motif_hash = hash_pwm_motif( + motif_label=motif.motif_id, + matrix=motif.matrix, + background=motif.background, + source_kind="pwm_jaspar", + ) return_meta = scoring_backend == "fimo" result = 
sample_pwm_sites( rng, @@ -167,9 +174,27 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): meta_by_seq = {} for seq in selected: entries.append((motif.motif_id, seq, str(jaspar_path))) - row = {"tf": motif.motif_id, "tfbs": seq, "source": str(jaspar_path)} - if meta_by_seq: - row.update(meta_by_seq.get(seq, {})) + meta = meta_by_seq.get(seq, {}) if meta_by_seq else {} + start = meta.get("fimo_start") + stop = meta.get("fimo_stop") + strand = meta.get("fimo_strand") + tfbs_id = hash_tfbs_id( + motif_id=motif_hash, + sequence=seq, + scoring_backend=scoring_backend, + matched_start=int(start) if start is not None else None, + matched_stop=int(stop) if stop is not None else None, + matched_strand=str(strand) if strand is not None else None, + ) + row = { + "tf": motif.motif_id, + "tfbs": seq, + "source": str(jaspar_path), + "motif_id": motif_hash, + "tfbs_id": tfbs_id, + } + if meta: + row.update(meta) all_rows.append(row) import pandas as pd diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_matrix_csv.py b/src/dnadesign/densegen/src/adapters/sources/pwm_matrix_csv.py index 7e313dad..496d4d5c 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_matrix_csv.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_matrix_csv.py @@ -17,6 +17,7 @@ import pandas as pd +from ...core.artifacts.ids import hash_pwm_motif, hash_tfbs_id from .base import BaseDataSource, resolve_path from .pwm_sampling import PWMMotif, normalize_background, sample_pwm_sites @@ -64,6 +65,12 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): matrix.append({b: v / total for b, v in vals.items()}) motif = PWMMotif(motif_id=str(self.motif_id).strip(), matrix=matrix, background=normalize_background(None)) + motif_hash = hash_pwm_motif( + motif_label=motif.motif_id, + matrix=motif.matrix, + background=motif.background, + source_kind="pwm_matrix_csv", + ) sampling = dict(self.sampling or {}) strategy = str(sampling.get("strategy", "stochastic")) 
@@ -130,9 +137,27 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): entries = [(motif.motif_id, seq, str(csv_path)) for seq in selected] rows = [] for seq in selected: - row = {"tf": motif.motif_id, "tfbs": seq, "source": str(csv_path)} - if meta_by_seq: - row.update(meta_by_seq.get(seq, {})) + meta = meta_by_seq.get(seq, {}) if meta_by_seq else {} + start = meta.get("fimo_start") + stop = meta.get("fimo_stop") + strand = meta.get("fimo_strand") + tfbs_id = hash_tfbs_id( + motif_id=motif_hash, + sequence=seq, + scoring_backend=scoring_backend, + matched_start=int(start) if start is not None else None, + matched_stop=int(stop) if stop is not None else None, + matched_strand=str(strand) if strand is not None else None, + ) + row = { + "tf": motif.motif_id, + "tfbs": seq, + "source": str(csv_path), + "motif_id": motif_hash, + "tfbs_id": tfbs_id, + } + if meta: + row.update(meta) rows.append(row) df_out = pd.DataFrame(rows) return entries, df_out diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_meme.py b/src/dnadesign/densegen/src/adapters/sources/pwm_meme.py index bce0a6fe..f7dac8dc 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_meme.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_meme.py @@ -18,6 +18,7 @@ from dnadesign.cruncher.io.parsers.meme import MemeMotif, parse_meme_file +from ...core.artifacts.ids import hash_pwm_motif, hash_tfbs_id from .base import BaseDataSource, resolve_path from .pwm_sampling import PWMMotif, normalize_background, sample_pwm_sites @@ -112,6 +113,12 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): all_rows = [] for motif in motifs: pwm = _motif_to_pwm(motif, background) + motif_hash = hash_pwm_motif( + motif_label=pwm.motif_id, + matrix=pwm.matrix, + background=pwm.background, + source_kind="pwm_meme", + ) return_meta = scoring_backend == "fimo" result = sample_pwm_sites( rng, @@ -147,9 +154,27 @@ def load_data(self, *, rng=None, outputs_root: Path | None = 
None): for seq in selected: entries.append((pwm.motif_id, seq, str(meme_path))) - row = {"tf": pwm.motif_id, "tfbs": seq, "source": str(meme_path)} - if meta_by_seq: - row.update(meta_by_seq.get(seq, {})) + meta = meta_by_seq.get(seq, {}) if meta_by_seq else {} + start = meta.get("fimo_start") + stop = meta.get("fimo_stop") + strand = meta.get("fimo_strand") + tfbs_id = hash_tfbs_id( + motif_id=motif_hash, + sequence=seq, + scoring_backend=scoring_backend, + matched_start=int(start) if start is not None else None, + matched_stop=int(stop) if stop is not None else None, + matched_strand=str(strand) if strand is not None else None, + ) + row = { + "tf": pwm.motif_id, + "tfbs": seq, + "source": str(meme_path), + "motif_id": motif_hash, + "tfbs_id": tfbs_id, + } + if meta: + row.update(meta) all_rows.append(row) import pandas as pd diff --git a/src/dnadesign/densegen/src/adapters/sources/pwm_meme_set.py b/src/dnadesign/densegen/src/adapters/sources/pwm_meme_set.py index c081095b..492fe857 100644 --- a/src/dnadesign/densegen/src/adapters/sources/pwm_meme_set.py +++ b/src/dnadesign/densegen/src/adapters/sources/pwm_meme_set.py @@ -18,6 +18,7 @@ from dnadesign.cruncher.io.parsers.meme import MemeMotif, parse_meme_file +from ...core.artifacts.ids import hash_pwm_motif, hash_tfbs_id from .base import BaseDataSource, resolve_path from .pwm_meme import _background_from_meta, _motif_to_pwm from .pwm_sampling import sample_pwm_sites @@ -106,6 +107,12 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): all_rows = [] for motif, background, path in motifs_payload: pwm = _motif_to_pwm(motif, background) + motif_hash = hash_pwm_motif( + motif_label=pwm.motif_id, + matrix=pwm.matrix, + background=pwm.background, + source_kind="pwm_meme_set", + ) return_meta = scoring_backend == "fimo" result = sample_pwm_sites( rng, @@ -140,9 +147,27 @@ def load_data(self, *, rng=None, outputs_root: Path | None = None): meta_by_seq = {} for seq in selected: 
entries.append((pwm.motif_id, seq, str(path))) - row = {"tf": pwm.motif_id, "tfbs": seq, "source": str(path)} - if meta_by_seq: - row.update(meta_by_seq.get(seq, {})) + meta = meta_by_seq.get(seq, {}) if meta_by_seq else {} + start = meta.get("fimo_start") + stop = meta.get("fimo_stop") + strand = meta.get("fimo_strand") + tfbs_id = hash_tfbs_id( + motif_id=motif_hash, + sequence=seq, + scoring_backend=scoring_backend, + matched_start=int(start) if start is not None else None, + matched_stop=int(stop) if stop is not None else None, + matched_strand=str(strand) if strand is not None else None, + ) + row = { + "tf": pwm.motif_id, + "tfbs": seq, + "source": str(path), + "motif_id": motif_hash, + "tfbs_id": tfbs_id, + } + if meta: + row.update(meta) all_rows.append(row) import pandas as pd diff --git a/src/dnadesign/densegen/src/cli.py b/src/dnadesign/densegen/src/cli.py index 807b26d9..1c56a492 100644 --- a/src/dnadesign/densegen/src/cli.py +++ b/src/dnadesign/densegen/src/cli.py @@ -31,7 +31,6 @@ import contextlib import io -import json import logging import os import platform @@ -61,6 +60,13 @@ resolve_run_scoped_path, schema_version_at_least, ) +from .core.artifacts.library import write_library_artifact +from .core.artifacts.pool import ( + POOL_MODE_SEQUENCE, + POOL_MODE_TFBS, + build_pool_artifact, + load_pool_artifact, +) from .core.pipeline import ( _load_existing_library_index, _load_failure_counts_from_attempts, @@ -80,6 +86,7 @@ console = Console() _PYARROW_SYSCTL_PATTERN = re.compile(r"sysctlbyname failed for 'hw\.") log = logging.getLogger(__name__) +install_native_stderr_filters() @contextlib.contextmanager @@ -353,17 +360,6 @@ def _print_inputs_summary(loaded) -> None: ) -def _pool_manifest_path(out_dir: Path) -> Path: - return out_dir / "pool_manifest.json" - - -def _load_pool_manifest(out_dir: Path) -> dict: - manifest_path = _pool_manifest_path(out_dir) - if not manifest_path.exists(): - raise FileNotFoundError(f"Pool manifest not found: 
{manifest_path}") - return json.loads(manifest_path.read_text()) - - def _list_dir_entries(path: Path, *, limit: int = 10) -> list[str]: if not path.exists() or not path.is_dir(): return [] @@ -393,6 +389,25 @@ def _collect_missing_input_paths(loaded, cfg_path: Path) -> list[Path]: return missing +def _collect_relative_input_paths_from_raw(dense_cfg: dict) -> list[str]: + rel_paths: list[str] = [] + inputs_cfg = dense_cfg.get("inputs") or [] + for inp in inputs_cfg: + if not isinstance(inp, dict): + continue + raw_path = inp.get("path") + if isinstance(raw_path, str) and raw_path.strip(): + if not Path(raw_path).is_absolute(): + rel_paths.append(raw_path) + raw_paths = inp.get("paths") + if isinstance(raw_paths, list): + for path in raw_paths: + if isinstance(path, str) and path.strip(): + if not Path(path).is_absolute(): + rel_paths.append(path) + return rel_paths + + def _render_missing_input_hint(cfg_path: Path, loaded, exc: Exception) -> None: console.print(f"[bold red]Input error:[/] {exc}") missing = _collect_missing_input_paths(loaded, cfg_path) @@ -753,6 +768,16 @@ def workspace_init( config_path = run_dir / "config.yaml" config_path.write_text(yaml.safe_dump(raw, sort_keys=False)) + if not copy_inputs: + rel_paths = _collect_relative_input_paths_from_raw(dense) + if rel_paths: + console.print( + "[yellow]Workspace uses file-based inputs with relative paths.[/]" + " They will resolve relative to the new workspace." 
+ ) + for rel_path in rel_paths[:6]: + console.print(f" - {rel_path}") + console.print("[yellow]Tip[/]: re-run with --copy-inputs or update paths in config.yaml.") console.print(f":sparkles: [bold green]Workspace staged[/]: {config_path}") @@ -1323,24 +1348,26 @@ def stage_a_build_pool( outputs_root = run_root / "outputs" outputs_root.mkdir(parents=True, exist_ok=True) - rows = [] - manifest_inputs: list[dict] = [] - for inp in cfg.inputs: - if selected and inp.name not in selected: - continue - src = deps.source_factory(inp, cfg_path) - data_entries, meta_df = src.load_data(rng=rng, outputs_root=outputs_root) - if meta_df is None: - df = pd.DataFrame({"sequence": [str(s) for s in data_entries]}) - else: - df = meta_df.copy() - df.insert(0, "input_name", inp.name) - filename = f"{_sanitize_filename(inp.name)}__pool.parquet" - dest = out_dir / filename - if dest.exists() and not overwrite: - console.print(f"[bold red]Pool already exists:[/] {dest}") + with _suppress_pyarrow_sysctl_warnings(): + try: + artifact, pool_data = build_pool_artifact( + cfg=cfg, + cfg_path=cfg_path, + deps=deps, + rng=rng, + outputs_root=outputs_root, + out_dir=out_dir, + overwrite=overwrite, + selected_inputs=selected if selected else None, + ) + except FileExistsError as exc: + console.print(f"[bold red]{exc}[/]") raise typer.Exit(code=1) - df.to_parquet(dest, index=False) + + for pool in pool_data.values(): + if pool.df is None: + continue + df = pool.df if "fimo_bin_id" in df.columns: bin_counts = df["fimo_bin_id"].value_counts().sort_index() bin_table = Table("bin_id", "pvalue_range", "count") @@ -1360,39 +1387,14 @@ def stage_a_build_pool( else: range_label = "-" bin_table.add_row(str(bin_id), range_label, str(int(count))) - console.print(f"[bold]FIMO p-value bins for {inp.name}[/]") + console.print(f"[bold]FIMO p-value bins for {pool.name}[/]") console.print(bin_table) - manifest_inputs.append( - { - "name": inp.name, - "type": inp.type, - "pool_path": dest.name, - "rows": 
int(len(df)), - "columns": list(df.columns), - } - ) - rows.append((inp.name, inp.type, str(len(df)), dest.name)) - - if not rows: - console.print("[yellow]No pools built (no matching inputs).[/]") - raise typer.Exit(code=1) - - manifest = { - "schema_version": "1.0", - "created_at": datetime.now(timezone.utc).isoformat(), - "run_id": cfg.run.id, - "run_root": str(run_root), - "config_path": str(cfg_path), - "inputs": manifest_inputs, - } - manifest_path = _pool_manifest_path(out_dir) - manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True)) table = Table("input", "type", "rows", "pool_file") - for row in rows: - table.add_row(*row) + for entry in artifact.inputs.values(): + table.add_row(entry.name, entry.input_type, str(entry.rows), entry.pool_path.name) console.print(table) - console.print(f":sparkles: [bold green]Pool manifest written[/]: {manifest_path}") + console.print(f":sparkles: [bold green]Pool manifest written[/]: {artifact.manifest_path}") @stage_b_app.command("build-libraries", help="Build Stage-B libraries from pools or inputs.") @@ -1402,7 +1404,7 @@ def stage_b_build_libraries( pool: Optional[Path] = typer.Option( None, "--pool", - help="Optional pool directory from `stage-a build-pool` (defaults to reading inputs).", + help="Pool directory from `stage-a build-pool` (defaults to outputs/pools for this workspace).", ), input_name: Optional[list[str]] = typer.Option( None, @@ -1422,15 +1424,9 @@ def stage_b_build_libraries( cfg_path = _resolve_config_path(ctx, config) loaded = _load_config_or_exit(cfg_path) cfg = loaded.root.densegen - if pool is None: - _ensure_fimo_available(cfg, strict=True) run_root = _run_root_for(loaded) out_dir = resolve_run_scoped_path(cfg_path, run_root, out, label="stage-b.out") out_dir.mkdir(parents=True, exist_ok=True) - out_path = out_dir / "library_builds.parquet" - if out_path.exists() and not overwrite: - console.print(f"[bold red]library_builds.parquet already exists:[/] {out_path}") - raise 
typer.Exit(code=1) selected_inputs = {name for name in (input_name or [])} if selected_inputs: @@ -1447,7 +1443,6 @@ def stage_b_build_libraries( if missing: raise typer.BadParameter(f"Unknown plan name(s): {', '.join(missing)}") - deps = default_deps() seed = int(cfg.runtime.random_seed) rng = random.Random(seed) np_rng = np.random.default_rng(seed) @@ -1457,117 +1452,147 @@ def stage_b_build_libraries( failure_counts = _load_failure_counts_from_attempts(outputs_root) libraries_built = _load_existing_library_index(outputs_root) if outputs_root.exists() else 0 - pool_manifest = None - pool_dir = None - if pool is not None: - pool_dir = resolve_relative_path(cfg_path, pool) - if not pool_dir.exists() or not pool_dir.is_dir(): - raise typer.BadParameter(f"Pool directory not found: {pool_dir}") - pool_manifest = _load_pool_manifest(pool_dir) + pool_dir = resolve_relative_path(cfg_path, pool) if pool is not None else (run_root / "outputs" / "pools") + if not pool_dir.exists() or not pool_dir.is_dir(): + raise typer.BadParameter(f"Pool directory not found: {pool_dir}") + try: + pool_artifact = load_pool_artifact(pool_dir) + except FileNotFoundError as exc: + console.print(f"[bold red]{exc}[/]") + entries = _list_dir_entries(pool_dir, limit=10) + if entries: + console.print(f"[bold]Pool directory contents[/]: {', '.join(entries)}") + console.print("[bold]Next steps[/]:") + console.print(f" - dense stage-a build-pool -c {cfg_path}") + console.print(" - ensure --pool points to the outputs/pools directory for this workspace") + raise typer.Exit(code=1) - rows = [] + build_rows = [] + member_rows = [] table = Table("input", "plan", "library_index", "library_hash", "size", "achieved/target", "pool", "sampling") - for inp in cfg.inputs: - if selected_inputs and inp.name not in selected_inputs: - continue - if pool_manifest is not None and pool_dir is not None: - entry = next((e for e in pool_manifest.get("inputs", []) if e.get("name") == inp.name), None) - if entry is None: - 
raise typer.BadParameter(f"Pool manifest missing input: {inp.name}") - pool_path = pool_dir / str(entry.get("pool_path") or "") + with _suppress_pyarrow_sysctl_warnings(): + for inp in cfg.inputs: + if selected_inputs and inp.name not in selected_inputs: + continue + entry = pool_artifact.entry_for(inp.name) + pool_path = pool_dir / entry.pool_path if not pool_path.exists(): raise typer.BadParameter(f"Pool file not found for input {inp.name}: {pool_path}") df = pd.read_parquet(pool_path) - if "tf" in df.columns and "tfbs" in df.columns: + if entry.pool_mode == POOL_MODE_TFBS: meta_df = df - data_entries = df["tfbs"].tolist() - elif "sequence" in df.columns: + data_entries = df["tfbs"].tolist() if "tfbs" in df.columns else [] + elif entry.pool_mode == POOL_MODE_SEQUENCE: meta_df = None data_entries = df["sequence"].tolist() else: - raise typer.BadParameter( - f"Pool file for {inp.name} must contain tf/tfbs or sequence columns: {pool_path}" + raise typer.BadParameter(f"Unsupported pool_mode for input {inp.name}: {entry.pool_mode}") + + for plan_item in resolved_plan: + if selected_plans and plan_item.name not in selected_plans: + continue + library, _parts, reg_labels, info = build_library_for_plan( + source_label=inp.name, + plan_item=plan_item, + data_entries=data_entries, + meta_df=meta_df, + sampling_cfg=sampling_cfg, + seq_len=int(cfg.generation.sequence_length), + min_count_per_tf=int(cfg.runtime.min_count_per_tf), + usage_counts={}, + failure_counts=failure_counts if failure_counts else None, + rng=rng, + np_rng=np_rng, + schema_is_22=schema_is_22, + library_index_start=libraries_built, + ) + libraries_built = int(info.get("library_index", libraries_built)) + library_hash = str(info.get("library_hash") or "") + target_len = int(info.get("target_length") or 0) + achieved_len = int(info.get("achieved_length") or 0) + pool_strategy = str(info.get("pool_strategy") or sampling_cfg.pool_strategy) + sampling_strategy = str(info.get("library_sampling_strategy") or 
sampling_cfg.library_sampling_strategy) + library_id = library_hash + tfbs_id_by_index = info.get("tfbs_id_by_index") or [] + motif_id_by_index = info.get("motif_id_by_index") or [] + row = { + "created_at": datetime.now(timezone.utc).isoformat(), + "input_name": inp.name, + "input_type": inp.type, + "plan_name": plan_item.name, + "library_index": int(info.get("library_index") or 0), + "library_id": library_id, + "library_hash": library_hash, + "library_tfbs": list(library), + "library_tfs": list(reg_labels) if reg_labels else [], + "library_site_ids": list(info.get("site_id_by_index") or []), + "library_sources": list(info.get("source_by_index") or []), + "library_tfbs_ids": list(tfbs_id_by_index), + "library_motif_ids": list(motif_id_by_index), + "pool_strategy": pool_strategy, + "library_sampling_strategy": sampling_strategy, + "library_size": int(info.get("library_size") or len(library)), + "target_length": target_len, + "achieved_length": achieved_len, + "relaxed_cap": bool(info.get("relaxed_cap") or False), + "final_cap": info.get("final_cap"), + "iterative_max_libraries": int(info.get("iterative_max_libraries") or 0), + "iterative_min_new_solutions": int(info.get("iterative_min_new_solutions") or 0), + "required_regulators_selected": info.get("required_regulators_selected"), + } + build_rows.append(row) + for idx, tfbs in enumerate(list(library)): + member_rows.append( + { + "library_id": library_id, + "library_hash": library_hash, + "library_index": int(info.get("library_index") or 0), + "input_name": inp.name, + "plan_name": plan_item.name, + "position": int(idx), + "tf": reg_labels[idx] if idx < len(reg_labels or []) else "", + "tfbs": tfbs, + "tfbs_id": tfbs_id_by_index[idx] if idx < len(tfbs_id_by_index) else None, + "motif_id": motif_id_by_index[idx] if idx < len(motif_id_by_index) else None, + "site_id": (info.get("site_id_by_index") or [None])[idx] + if idx < len(info.get("site_id_by_index") or []) + else None, + "source": 
(info.get("source_by_index") or [None])[idx] + if idx < len(info.get("source_by_index") or []) + else None, + } + ) + table.add_row( + inp.name, + plan_item.name, + str(row["library_index"]), + _short_hash(library_hash), + str(len(library)), + f"{achieved_len}/{target_len}", + pool_strategy, + sampling_strategy, ) - else: - src = deps.source_factory(inp, cfg_path) - data_entries, meta_df = src.load_data(rng=np_rng, outputs_root=outputs_root) - - for plan_item in resolved_plan: - if selected_plans and plan_item.name not in selected_plans: - continue - library, _parts, reg_labels, info = build_library_for_plan( - source_label=inp.name, - plan_item=plan_item, - data_entries=data_entries, - meta_df=meta_df, - sampling_cfg=sampling_cfg, - seq_len=int(cfg.generation.sequence_length), - min_count_per_tf=int(cfg.runtime.min_count_per_tf), - usage_counts={}, - failure_counts=failure_counts if failure_counts else None, - rng=rng, - np_rng=np_rng, - schema_is_22=schema_is_22, - library_index_start=libraries_built, - ) - libraries_built = int(info.get("library_index", libraries_built)) - library_hash = str(info.get("library_hash") or "") - target_len = int(info.get("target_length") or 0) - achieved_len = int(info.get("achieved_length") or 0) - pool_strategy = str(info.get("pool_strategy") or sampling_cfg.pool_strategy) - sampling_strategy = str(info.get("library_sampling_strategy") or sampling_cfg.library_sampling_strategy) - row = { - "created_at": datetime.now(timezone.utc).isoformat(), - "input_name": inp.name, - "input_type": inp.type, - "plan_name": plan_item.name, - "library_index": int(info.get("library_index") or 0), - "library_hash": library_hash, - "library_tfbs": list(library), - "library_tfs": list(reg_labels) if reg_labels else [], - "library_site_ids": list(info.get("site_id_by_index") or []), - "library_sources": list(info.get("source_by_index") or []), - "pool_strategy": pool_strategy, - "library_sampling_strategy": sampling_strategy, - "library_size": 
int(info.get("library_size") or len(library)), - "target_length": target_len, - "achieved_length": achieved_len, - "relaxed_cap": bool(info.get("relaxed_cap") or False), - "final_cap": info.get("final_cap"), - "iterative_max_libraries": int(info.get("iterative_max_libraries") or 0), - "iterative_min_new_solutions": int(info.get("iterative_min_new_solutions") or 0), - "required_regulators_selected": info.get("required_regulators_selected"), - } - rows.append(row) - table.add_row( - inp.name, - plan_item.name, - str(row["library_index"]), - _short_hash(library_hash), - str(len(library)), - f"{achieved_len}/{target_len}", - pool_strategy, - sampling_strategy, - ) - if not rows: - console.print("[yellow]No libraries built (no matching inputs/plans).[/]") - raise typer.Exit(code=1) + if not build_rows: + console.print("[yellow]No libraries built (no matching inputs/plans).[/]") + raise typer.Exit(code=1) - df_out = pd.DataFrame(rows) - df_out.to_parquet(out_path, index=False) - manifest = { - "schema_version": "1.0", - "created_at": datetime.now(timezone.utc).isoformat(), - "run_id": cfg.run.id, - "run_root": str(run_root), - "config_path": str(cfg_path), - "library_builds_path": str(out_path), - } - manifest_path = out_dir / "library_manifest.json" - manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True)) + try: + artifact = write_library_artifact( + out_dir=out_dir, + builds=build_rows, + members=member_rows, + cfg_path=cfg_path, + run_id=str(cfg.run.id), + run_root=run_root, + overwrite=overwrite, + ) + except FileExistsError as exc: + console.print(f"[bold red]{exc}[/]") + raise typer.Exit(code=1) console.print(table) - console.print(f":sparkles: [bold green]Library builds written[/]: {out_path}") + console.print(f":sparkles: [bold green]Library builds written[/]: {artifact.builds_path}") + console.print(f":sparkles: [bold green]Library members written[/]: {artifact.members_path}") @app.command(help="Run generation for the job. 
Optionally auto-run plots declared in YAML.") @@ -1582,6 +1607,7 @@ def run( root = loaded.root cfg = root.densegen run_root = _run_root_for(loaded) + _ensure_fimo_available(cfg, strict=True) # Logging setup log_cfg = cfg.logging diff --git a/src/dnadesign/densegen/src/config/__init__.py b/src/dnadesign/densegen/src/config/__init__.py index e8f7b920..a820c42b 100644 --- a/src/dnadesign/densegen/src/config/__init__.py +++ b/src/dnadesign/densegen/src/config/__init__.py @@ -42,8 +42,8 @@ def _construct_mapping(loader, node, deep: bool = False): _StrictLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, _construct_mapping) -LATEST_SCHEMA_VERSION = "2.3" -SUPPORTED_SCHEMA_VERSIONS = {"2.1", "2.2", LATEST_SCHEMA_VERSION} +LATEST_SCHEMA_VERSION = "2.4" +SUPPORTED_SCHEMA_VERSIONS = {"2.1", "2.2", "2.3", LATEST_SCHEMA_VERSION} def parse_schema_version(value: str) -> tuple[int, int]: @@ -356,6 +356,9 @@ def _score_mode(self): self.mining = PWMMiningConfig() if self.pvalue_bins is None: self.pvalue_bins = list(CANONICAL_PVALUE_BINS) + if self.mining is not None and self.mining.max_candidates is not None: + if int(self.mining.max_candidates) < int(self.n_sites): + raise ValueError("pwm.sampling.mining.max_candidates must be >= n_sites") if self.mining is not None and self.mining.retain_bin_ids is not None: bins = list(self.pvalue_bins) if self.pvalue_bins is not None else list(CANONICAL_PVALUE_BINS) max_idx = len(bins) - 1 diff --git a/src/dnadesign/densegen/src/core/artifacts/ids.py b/src/dnadesign/densegen/src/core/artifacts/ids.py new file mode 100644 index 00000000..c31b3245 --- /dev/null +++ b/src/dnadesign/densegen/src/core/artifacts/ids.py @@ -0,0 +1,76 @@ +""" +Stable identifier helpers for DenseGen artifacts. + +These hashes are intended to be deterministic and join-friendly across runs. 
+""" + +from __future__ import annotations + +import hashlib +import json +from typing import Mapping, Sequence + +_BASES = ("A", "C", "G", "T") +_FLOAT_DIGITS = 10 + + +def _fmt_float(value: float) -> str: + return format(float(value), f".{_FLOAT_DIGITS}g") + + +def _stable_json(payload: dict) -> str: + return json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=True) + + +def _hash_payload(payload: dict) -> str: + return hashlib.sha256(_stable_json(payload).encode("utf-8")).hexdigest() + + +def hash_pwm_motif( + *, + motif_label: str, + matrix: Sequence[Mapping[str, float]], + background: Mapping[str, float], + source_kind: str, + source_label: str | None = None, +) -> str: + rows = [] + for row in matrix: + rows.append([_fmt_float(row.get(base, 0.0)) for base in _BASES]) + payload = { + "source_kind": source_kind, + "source_label": source_label or "", + "motif_label": str(motif_label), + "matrix": rows, + "background": {base: _fmt_float(background.get(base, 0.0)) for base in _BASES}, + } + return _hash_payload(payload) + + +def hash_label_motif(*, label: str | None, source_kind: str, source_label: str | None = None) -> str: + payload = { + "source_kind": source_kind, + "source_label": source_label or "", + "label": str(label or ""), + } + return _hash_payload(payload) + + +def hash_tfbs_id( + *, + motif_id: str | None, + sequence: str, + scoring_backend: str, + matched_start: int | None = None, + matched_stop: int | None = None, + matched_strand: str | None = None, +) -> str: + payload = { + "motif_id": str(motif_id or ""), + "sequence": str(sequence), + "scoring_backend": str(scoring_backend), + "matched_start": matched_start, + "matched_stop": matched_stop, + "matched_strand": matched_strand or "", + } + return _hash_payload(payload) diff --git a/src/dnadesign/densegen/src/core/artifacts/library.py b/src/dnadesign/densegen/src/core/artifacts/library.py new file mode 100644 index 00000000..e77b727e --- /dev/null +++ 
b/src/dnadesign/densegen/src/core/artifacts/library.py @@ -0,0 +1,97 @@ +""" +Stage-B library artifacts. +""" + +from __future__ import annotations + +import json +from dataclasses import dataclass +from datetime import datetime, timezone +from pathlib import Path + +import pandas as pd + +from ...utils.logging_utils import install_native_stderr_filters + +LIBRARY_SCHEMA_VERSION = "1.0" + + +@dataclass(frozen=True) +class LibraryArtifact: + manifest_path: Path + builds_path: Path + members_path: Path + schema_version: str + run_id: str + run_root: str + config_path: str + + @classmethod + def load(cls, manifest_path: Path) -> "LibraryArtifact": + payload = json.loads(manifest_path.read_text()) + return cls( + manifest_path=manifest_path, + builds_path=Path(payload.get("library_builds_path", "")), + members_path=Path(payload.get("library_members_path", "")), + schema_version=str(payload.get("schema_version")), + run_id=str(payload.get("run_id")), + run_root=str(payload.get("run_root")), + config_path=str(payload.get("config_path")), + ) + + +def _library_manifest_path(out_dir: Path) -> Path: + return out_dir / "library_manifest.json" + + +def write_library_artifact( + *, + out_dir: Path, + builds: list[dict], + members: list[dict], + cfg_path: Path, + run_id: str, + run_root: Path, + overwrite: bool = False, +) -> LibraryArtifact: + out_dir.mkdir(parents=True, exist_ok=True) + install_native_stderr_filters() + builds_path = out_dir / "library_builds.parquet" + members_path = out_dir / "library_members.parquet" + + if not overwrite: + if builds_path.exists(): + raise FileExistsError(f"Library builds already exist: {builds_path}") + if members_path.exists(): + raise FileExistsError(f"Library members already exist: {members_path}") + + pd.DataFrame(builds).to_parquet(builds_path, index=False) + pd.DataFrame(members).to_parquet(members_path, index=False) + + manifest = { + "schema_version": LIBRARY_SCHEMA_VERSION, + "created_at": datetime.now(timezone.utc).isoformat(), 
+ "run_id": str(run_id), + "run_root": str(run_root), + "config_path": str(cfg_path), + "library_builds_path": str(builds_path), + "library_members_path": str(members_path), + } + manifest_path = _library_manifest_path(out_dir) + manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True)) + return LibraryArtifact( + manifest_path=manifest_path, + builds_path=builds_path, + members_path=members_path, + schema_version=LIBRARY_SCHEMA_VERSION, + run_id=str(run_id), + run_root=str(run_root), + config_path=str(cfg_path), + ) + + +def load_library_artifact(out_dir: Path) -> LibraryArtifact: + manifest_path = _library_manifest_path(out_dir) + if not manifest_path.exists(): + raise FileNotFoundError(f"Library manifest not found: {manifest_path}") + return LibraryArtifact.load(manifest_path) diff --git a/src/dnadesign/densegen/src/core/artifacts/pool.py b/src/dnadesign/densegen/src/core/artifacts/pool.py new file mode 100644 index 00000000..fb9ce5df --- /dev/null +++ b/src/dnadesign/densegen/src/core/artifacts/pool.py @@ -0,0 +1,228 @@ +""" +Stage-A TFBS pool artifacts. 
+""" + +from __future__ import annotations + +import json +import re +from dataclasses import dataclass +from datetime import datetime, timezone +from pathlib import Path +from typing import Iterable + +import pandas as pd + +from ...utils.logging_utils import install_native_stderr_filters +from .ids import hash_tfbs_id + +POOL_SCHEMA_VERSION = "1.0" +POOL_MODE_TFBS = "tfbs" +POOL_MODE_SEQUENCE = "sequence" +_SAFE_FILENAME_RE = re.compile(r"[^A-Za-z0-9_.-]+") + + +def _sanitize_filename(name: str) -> str: + cleaned = _SAFE_FILENAME_RE.sub("_", str(name).strip()) + return cleaned or "densegen" + + +@dataclass(frozen=True) +class PoolInputEntry: + name: str + input_type: str + pool_path: Path + rows: int + columns: list[str] + pool_mode: str + + +@dataclass(frozen=True) +class PoolData: + name: str + input_type: str + pool_mode: str + df: pd.DataFrame | None + sequences: list[str] + pool_path: Path + + +@dataclass(frozen=True) +class TFBSPoolArtifact: + manifest_path: Path + inputs: dict[str, PoolInputEntry] + schema_version: str + run_id: str + run_root: str + config_path: str + + @classmethod + def load(cls, manifest_path: Path) -> "TFBSPoolArtifact": + payload = json.loads(manifest_path.read_text()) + entries = {} + for item in payload.get("inputs", []): + entry = PoolInputEntry( + name=str(item.get("name")), + input_type=str(item.get("type")), + pool_path=Path(item.get("pool_path")), + rows=int(item.get("rows", 0)), + columns=list(item.get("columns") or []), + pool_mode=str(item.get("pool_mode") or POOL_MODE_TFBS), + ) + entries[entry.name] = entry + return cls( + manifest_path=manifest_path, + inputs=entries, + schema_version=str(payload.get("schema_version")), + run_id=str(payload.get("run_id")), + run_root=str(payload.get("run_root")), + config_path=str(payload.get("config_path")), + ) + + def entry_for(self, input_name: str) -> PoolInputEntry: + if input_name not in self.inputs: + raise KeyError(f"Pool manifest missing input: {input_name}") + return 
self.inputs[input_name] + + +def _pool_manifest_path(out_dir: Path) -> Path: + return out_dir / "pool_manifest.json" + + +def load_pool_artifact(out_dir: Path) -> TFBSPoolArtifact: + manifest_path = _pool_manifest_path(out_dir) + if not manifest_path.exists(): + raise FileNotFoundError(f"Pool manifest not found: {manifest_path}") + return TFBSPoolArtifact.load(manifest_path) + + +def _resolve_pool_mode(df: pd.DataFrame) -> str: + if "tf" in df.columns and "tfbs" in df.columns: + return POOL_MODE_TFBS + if "sequence" in df.columns: + return POOL_MODE_SEQUENCE + raise ValueError("Pool dataframe must contain tf/tfbs columns or a sequence column.") + + +def _ensure_tfbs_ids(df: pd.DataFrame) -> None: + missing = [col for col in ("motif_id", "tfbs_id") if col not in df.columns] + if missing: + raise ValueError(f"TFBS pool missing required columns: {', '.join(missing)}") + + +def _build_sequence_pool(sequences: Iterable[str]) -> pd.DataFrame: + seqs = [str(s) for s in sequences] + df = pd.DataFrame({"sequence": seqs}) + df["tfbs_id"] = [ + hash_tfbs_id( + motif_id=None, + sequence=seq, + scoring_backend="sequence_library", + ) + for seq in seqs + ] + return df + + +def build_pool_artifact( + *, + cfg, + cfg_path: Path, + deps, + rng, + outputs_root: Path, + out_dir: Path, + overwrite: bool = False, + selected_inputs: set[str] | None = None, +) -> tuple[TFBSPoolArtifact, dict[str, PoolData]]: + out_dir.mkdir(parents=True, exist_ok=True) + install_native_stderr_filters() + pool_entries: dict[str, PoolInputEntry] = {} + pool_data: dict[str, PoolData] = {} + used_names: dict[str, int] = {} + rows: list[tuple[str, str, str, Path]] = [] + + for inp in cfg.inputs: + if selected_inputs and inp.name not in selected_inputs: + continue + src = deps.source_factory(inp, cfg_path) + data_entries, meta_df = src.load_data(rng=rng, outputs_root=outputs_root) + if meta_df is None: + df = _build_sequence_pool(data_entries) + else: + df = meta_df.copy() + df.insert(0, "input_name", 
inp.name) + + pool_mode = _resolve_pool_mode(df) + if pool_mode == POOL_MODE_TFBS: + _ensure_tfbs_ids(df) + + base = _sanitize_filename(inp.name) + count = used_names.get(base, 0) + used_names[base] = count + 1 + suffix = f"{base}__{count}" if count else base + filename = f"{suffix}__pool.parquet" + dest = out_dir / filename + if dest.exists() and not overwrite: + raise FileExistsError(f"Pool already exists: {dest}") + df.to_parquet(dest, index=False) + + entry = PoolInputEntry( + name=inp.name, + input_type=str(inp.type), + pool_path=Path(filename), + rows=int(len(df)), + columns=list(df.columns), + pool_mode=pool_mode, + ) + pool_entries[inp.name] = entry + sequences: list[str] + if pool_mode == POOL_MODE_SEQUENCE: + sequences = df["sequence"].tolist() + pool_df = None + else: + sequences = df["tfbs"].tolist() if "tfbs" in df.columns else [] + pool_df = df + pool_data[inp.name] = PoolData( + name=inp.name, + input_type=str(inp.type), + pool_mode=pool_mode, + df=pool_df, + sequences=sequences, + pool_path=dest, + ) + rows.append((inp.name, str(inp.type), str(len(df)), dest)) + + if not rows: + raise ValueError("No pools built (no matching inputs).") + + manifest = { + "schema_version": POOL_SCHEMA_VERSION, + "created_at": datetime.now(timezone.utc).isoformat(), + "run_id": cfg.run.id, + "run_root": str(cfg.run.root), + "config_path": str(cfg_path), + "inputs": [ + { + "name": entry.name, + "type": entry.input_type, + "pool_path": entry.pool_path.name, + "rows": entry.rows, + "columns": entry.columns, + "pool_mode": entry.pool_mode, + } + for entry in pool_entries.values() + ], + } + manifest_path = _pool_manifest_path(out_dir) + manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True)) + + artifact = TFBSPoolArtifact( + manifest_path=manifest_path, + inputs=pool_entries, + schema_version=POOL_SCHEMA_VERSION, + run_id=str(cfg.run.id), + run_root=str(cfg.run.root), + config_path=str(cfg_path), + ) + return artifact, pool_data diff --git 
a/src/dnadesign/densegen/src/core/metadata_schema.py b/src/dnadesign/densegen/src/core/metadata_schema.py index ea568c9d..45a85f6f 100644 --- a/src/dnadesign/densegen/src/core/metadata_schema.py +++ b/src/dnadesign/densegen/src/core/metadata_schema.py @@ -27,7 +27,7 @@ class MetaField: META_FIELDS: list[MetaField] = [ - MetaField("schema_version", (str,), "DenseGen schema version (e.g., 2.1)."), + MetaField("schema_version", (str,), "DenseGen schema version (e.g., 2.4)."), MetaField("created_at", (str,), "UTC ISO8601 timestamp for record creation."), MetaField("run_id", (str,), "Run identifier (densegen.run.id)."), MetaField("run_root", (str,), "Resolved run root path (densegen.run.root)."), @@ -54,7 +54,7 @@ class MetaField: MetaField( "used_tfbs_detail", (list,), - "Per-placement detail: tf/tfbs/orientation/offset (offset uses final sequence coordinates).", + "Per-placement detail: tf/tfbs/motif_id/tfbs_id/orientation/offset (offset uses final coordinates).", ), MetaField("used_tf_counts", (list,), "Per-TF placement counts ({tf, count})."), MetaField("used_tf_list", (list,), "TFs used in the final sequence."), diff --git a/src/dnadesign/densegen/src/core/pipeline.py b/src/dnadesign/densegen/src/core/pipeline.py index 07a2012e..3770d8a7 100644 --- a/src/dnadesign/densegen/src/core/pipeline.py +++ b/src/dnadesign/densegen/src/core/pipeline.py @@ -43,6 +43,10 @@ resolve_run_root, schema_version_at_least, ) +from ..utils.logging_utils import install_native_stderr_filters +from .artifacts.ids import hash_tfbs_id +from .artifacts.library import write_library_artifact +from .artifacts.pool import build_pool_artifact from .metadata import build_metadata from .postprocess import random_fill from .pvalue_bins import resolve_pvalue_bins @@ -537,7 +541,16 @@ def _input_metadata(source_cfg, cfg_path: Path) -> dict: return meta -def _compute_used_tf_info(sol, library_for_opt, regulator_labels, fixed_elements, site_id_by_index, source_by_index): +def _compute_used_tf_info( + 
sol, + library_for_opt, + regulator_labels, + fixed_elements, + site_id_by_index, + source_by_index, + tfbs_id_by_index, + motif_id_by_index, +): promoter_motifs = set() if fixed_elements is not None: if hasattr(fixed_elements, "promoter_constraints"): @@ -593,6 +606,14 @@ def _compute_used_tf_info(sol, library_for_opt, regulator_labels, fixed_elements source = source_by_index[base_idx] if source is not None: entry["source"] = source + if tfbs_id_by_index is not None and base_idx < len(tfbs_id_by_index): + tfbs_id = tfbs_id_by_index[base_idx] + if tfbs_id is not None: + entry["tfbs_id"] = tfbs_id + if motif_id_by_index is not None and base_idx < len(motif_id_by_index): + motif_id = motif_id_by_index[base_idx] + if motif_id is not None: + entry["motif_id"] = motif_id used_detail.append(entry) if tf_label: counts[tf_label] = counts.get(tf_label, 0) + 1 @@ -980,6 +1001,8 @@ def _finalize( *, site_id_by_index: list[str | None] | None, source_by_index: list[str | None] | None, + tfbs_id_by_index: list[str | None] | None, + motif_id_by_index: list[str | None] | None, ) -> tuple[list[str], list[str], list[str], dict]: nonlocal libraries_built libraries_built += 1 @@ -987,6 +1010,8 @@ def _finalize( info["library_hash"] = _hash_library(library, reg_labels, site_id_by_index, source_by_index) info["site_id_by_index"] = site_id_by_index info["source_by_index"] = source_by_index + info["tfbs_id_by_index"] = tfbs_id_by_index + info["motif_id_by_index"] = motif_id_by_index return library, parts, reg_labels, info if meta_df is not None and isinstance(meta_df, pd.DataFrame): @@ -1022,6 +1047,8 @@ def _finalize( parts = [f"{tf}:{tfbs}" for tf, tfbs in zip(reg_labels, lib_df["tfbs"].tolist())] site_id_by_index = lib_df["site_id"].tolist() if "site_id" in lib_df.columns else None source_by_index = lib_df["source"].tolist() if "source" in lib_df.columns else None + tfbs_id_by_index = lib_df["tfbs_id"].tolist() if "tfbs_id" in lib_df.columns else None + motif_id_by_index = 
lib_df["motif_id"].tolist() if "motif_id" in lib_df.columns else None info = { "target_length": seq_len + subsample_over, "achieved_length": sum(len(s) for s in library), @@ -1039,6 +1066,8 @@ def _finalize( info, site_id_by_index=site_id_by_index, source_by_index=source_by_index, + tfbs_id_by_index=tfbs_id_by_index, + motif_id_by_index=motif_id_by_index, ) sampler = TFSampler(meta_df, np_rng) @@ -1126,6 +1155,8 @@ def _finalize( ) site_id_by_index = info.get("site_id_by_index") source_by_index = info.get("source_by_index") + tfbs_id_by_index = info.get("tfbs_id_by_index") + motif_id_by_index = info.get("motif_id_by_index") return _finalize( library, parts, @@ -1133,6 +1164,8 @@ def _finalize( info, site_id_by_index=site_id_by_index, source_by_index=source_by_index, + tfbs_id_by_index=tfbs_id_by_index, + motif_id_by_index=motif_id_by_index, ) if required_regulators or plan_min_count_by_regulator or min_required_regulators is not None: @@ -1183,7 +1216,19 @@ def _finalize( "iterative_max_libraries": iterative_max_libraries, "iterative_min_new_solutions": iterative_min_new_solutions, } - return _finalize(library, tf_parts, reg_labels, info, site_id_by_index=None, source_by_index=None) + tfbs_id_by_index = [ + hash_tfbs_id(motif_id=None, sequence=seq, scoring_backend="sequence_library") for seq in library + ] + return _finalize( + library, + tf_parts, + reg_labels, + info, + site_id_by_index=None, + source_by_index=None, + tfbs_id_by_index=tfbs_id_by_index, + motif_id_by_index=None, + ) def _compute_sampling_fraction( @@ -1245,6 +1290,14 @@ def _consolidate_parts(outputs_root: Path, *, part_glob: str, final_name: str) - return True +def _emit_event(events_path: Path, *, event: str, payload: dict) -> None: + record = {"event": event, "created_at": datetime.now(timezone.utc).isoformat()} + record.update(payload) + events_path.parent.mkdir(parents=True, exist_ok=True) + with events_path.open("a", encoding="utf-8") as handle: + handle.write(json.dumps(record, 
sort_keys=True) + "\n") + + ATTEMPTS_CHUNK_SIZE = 256 @@ -1551,6 +1604,10 @@ def _process_plan_for_source( write_state: Callable[[], None] | None = None, site_failure_counts: dict[tuple[str, str, str, str, str | None], dict[str, int]] | None = None, source_cache: dict[str, tuple[list, pd.DataFrame | None]] | None = None, + library_build_rows: list[dict] | None = None, + library_member_rows: list[dict] | None = None, + composition_rows: list[dict] | None = None, + events_path: Path | None = None, ) -> tuple[int, dict]: source_label = source_cfg.name plan_name = plan_item.name @@ -1560,6 +1617,73 @@ def _process_plan_for_source( seq_len = int(gen.sequence_length) sampling_cfg = gen.sampling + def _record_library_build( + *, + sampling_info: dict, + library_tfbs: list[str], + library_tfs: list[str], + library_tfbs_ids: list[str], + library_motif_ids: list[str], + library_site_ids: list[str | None], + library_sources: list[str | None], + ) -> None: + if library_build_rows is None or library_member_rows is None: + return + library_index = int(sampling_info.get("library_index") or 0) + library_hash = str(sampling_info.get("library_hash") or "") + library_id = library_hash or f"{source_label}:{plan_name}:{library_index}" + row = { + "created_at": datetime.now(timezone.utc).isoformat(), + "input_name": source_label, + "plan_name": plan_name, + "library_index": library_index, + "library_id": library_id, + "library_hash": library_hash, + "pool_strategy": sampling_info.get("pool_strategy"), + "library_sampling_strategy": sampling_info.get("library_sampling_strategy"), + "library_size": int(sampling_info.get("library_size") or len(library_tfbs)), + "target_length": sampling_info.get("target_length"), + "achieved_length": sampling_info.get("achieved_length"), + "relaxed_cap": sampling_info.get("relaxed_cap"), + "final_cap": sampling_info.get("final_cap"), + "iterative_max_libraries": sampling_info.get("iterative_max_libraries"), + "iterative_min_new_solutions": 
sampling_info.get("iterative_min_new_solutions"), + "required_regulators_selected": sampling_info.get("required_regulators_selected"), + } + library_build_rows.append(row) + if events_path is not None: + try: + _emit_event( + events_path, + event="LIBRARY_BUILT", + payload={ + "input_name": source_label, + "plan_name": plan_name, + "library_index": library_index, + "library_hash": library_hash, + "library_size": int(row.get("library_size") or len(library_tfbs)), + }, + ) + except Exception: + log.debug("Failed to emit LIBRARY_BUILT event.", exc_info=True) + for idx, tfbs in enumerate(library_tfbs): + library_member_rows.append( + { + "library_id": library_id, + "library_hash": library_hash, + "library_index": library_index, + "input_name": source_label, + "plan_name": plan_name, + "position": int(idx), + "tf": library_tfs[idx] if idx < len(library_tfs) else "", + "tfbs": tfbs, + "tfbs_id": library_tfbs_ids[idx] if idx < len(library_tfbs_ids) else None, + "motif_id": library_motif_ids[idx] if idx < len(library_motif_ids) else None, + "site_id": library_site_ids[idx] if idx < len(library_site_ids) else None, + "source": library_sources[idx] if idx < len(library_sources) else None, + } + ) + pool_strategy = str(sampling_cfg.pool_strategy) library_sampling_strategy = str(sampling_cfg.library_sampling_strategy) iterative_max_libraries = int(sampling_cfg.iterative_max_libraries) @@ -1822,12 +1946,25 @@ def _process_plan_for_source( libraries_built = int(sampling_info.get("library_index", libraries_built)) site_id_by_index = sampling_info.get("site_id_by_index") source_by_index = sampling_info.get("source_by_index") + tfbs_id_by_index = sampling_info.get("tfbs_id_by_index") + motif_id_by_index = sampling_info.get("motif_id_by_index") sampling_library_index = sampling_info.get("library_index", 0) sampling_library_hash = sampling_info.get("library_hash", "") library_tfbs = list(library_for_opt) library_tfs = list(regulator_labels) if regulator_labels else [] 
library_site_ids = list(site_id_by_index) if site_id_by_index else [] library_sources = list(source_by_index) if source_by_index else [] + library_tfbs_ids = list(tfbs_id_by_index) if tfbs_id_by_index else [] + library_motif_ids = list(motif_id_by_index) if motif_id_by_index else [] + _record_library_build( + sampling_info=sampling_info, + library_tfbs=library_tfbs, + library_tfs=library_tfs, + library_tfbs_ids=library_tfbs_ids, + library_motif_ids=library_motif_ids, + library_site_ids=library_site_ids, + library_sources=library_sources, + ) max_tfbs_len = max((len(str(m)) for m in library_tfbs), default=0) required_len = max(max_tfbs_len, fixed_elements_max_len) if seq_len < required_len: @@ -2015,6 +2152,21 @@ def _make_generator(_library_for_opt: List[str], _regulator_labels: List[str]): stall_seconds, ) stall_events += 1 + if events_path is not None: + try: + _emit_event( + events_path, + event="STALL_DETECTED", + payload={ + "input_name": source_label, + "plan_name": plan_name, + "stall_seconds": float(now - subsample_started), + "library_index": int(sampling_library_index), + "library_hash": str(sampling_library_hash), + }, + ) + except Exception: + log.debug("Failed to emit STALL_DETECTED event.", exc_info=True) stall_triggered = True break if (now - last_log_warn >= stall_warn_every) and (produced_this_library == 0): @@ -2051,6 +2203,8 @@ def _make_generator(_library_for_opt: List[str], _regulator_labels: List[str]): fixed_elements, site_id_by_index, source_by_index, + tfbs_id_by_index, + motif_id_by_index, ) tf_list_from_library = sorted(set(regulator_labels)) if regulator_labels else [] solver_status = getattr(sol, "status", None) @@ -2412,6 +2566,30 @@ def _make_generator(_library_for_opt: List[str], _regulator_labels: List[str]): ) continue + if composition_rows is not None: + for placement_index, entry in enumerate(used_tfbs_detail or []): + composition_rows.append( + { + "sequence_id": record.id, + "input_name": source_label, + "plan_name": plan_name, 
+ "library_index": int(sampling_library_index), + "library_hash": str(sampling_library_hash), + "placement_index": int(placement_index), + "tf": entry.get("tf"), + "tfbs": entry.get("tfbs"), + "motif_id": entry.get("motif_id"), + "tfbs_id": entry.get("tfbs_id"), + "orientation": entry.get("orientation"), + "offset": entry.get("offset"), + "length": entry.get("length"), + "end": entry.get("end"), + "pad_left": entry.get("pad_left"), + "site_id": entry.get("site_id"), + "source": entry.get("source"), + } + ) + _append_attempt( outputs_root, run_id=run_id, @@ -2634,6 +2812,13 @@ def _make_generator(_library_for_opt: List[str], _regulator_labels: List[str]): iterative_min_new_solutions, ) + resample_reason = "resample" + if produced_this_library == 0: + resample_reason = "stall_no_solution" if stall_triggered else "no_solution" + elif pool_strategy == "iterative_subsample" and iterative_min_new_solutions > 0: + if produced_this_library < iterative_min_new_solutions: + resample_reason = "min_new_solutions" + # Resample # Alignment (2): allow reactive resampling for subsample under schema>=2.2. 
allow_resample = pool_strategy == "iterative_subsample" or (schema_is_22 and pool_strategy == "subsample") @@ -2645,6 +2830,22 @@ def _make_generator(_library_for_opt: List[str], _regulator_labels: List[str]): ) resamples_in_try += 1 total_resamples += 1 + if events_path is not None: + try: + _emit_event( + events_path, + event="RESAMPLE_TRIGGERED", + payload={ + "input_name": source_label, + "plan_name": plan_name, + "reason": resample_reason, + "produced_this_library": int(produced_this_library), + "library_index": int(sampling_library_index), + "library_hash": str(sampling_library_hash), + }, + ) + except Exception: + log.debug("Failed to emit RESAMPLE_TRIGGERED event.", exc_info=True) if max_total_resamples > 0 and total_resamples > max_total_resamples: raise RuntimeError(f"[{source_label}/{plan_name}] Exceeded max_total_resamples={max_total_resamples}.") if resamples_in_try > max_resample_attempts: @@ -2683,12 +2884,25 @@ def _make_generator(_library_for_opt: List[str], _regulator_labels: List[str]): libraries_built = int(sampling_info.get("library_index", libraries_built)) site_id_by_index = sampling_info.get("site_id_by_index") source_by_index = sampling_info.get("source_by_index") + tfbs_id_by_index = sampling_info.get("tfbs_id_by_index") + motif_id_by_index = sampling_info.get("motif_id_by_index") sampling_library_index = sampling_info.get("library_index", sampling_library_index) sampling_library_hash = sampling_info.get("library_hash", sampling_library_hash) library_tfbs = list(library_for_opt) library_tfs = list(regulator_labels) if regulator_labels else [] library_site_ids = list(site_id_by_index) if site_id_by_index else [] library_sources = list(source_by_index) if source_by_index else [] + library_tfbs_ids = list(tfbs_id_by_index) if tfbs_id_by_index else [] + library_motif_ids = list(motif_id_by_index) if motif_id_by_index else [] + _record_library_build( + sampling_info=sampling_info, + library_tfbs=library_tfbs, + library_tfs=library_tfs, + 
library_tfbs_ids=library_tfbs_ids, + library_motif_ids=library_motif_ids, + library_site_ids=library_site_ids, + library_sources=library_sources, + ) # Alignment (7): sampling_fraction uses unique TFBS strings and is bounded. sampling_fraction = _compute_sampling_fraction( library_for_opt, @@ -2779,6 +2993,7 @@ def _make_generator(_library_for_opt: List[str], _regulator_labels: List[str]): def run_pipeline(loaded: LoadedConfig, *, deps: PipelineDeps | None = None) -> RunSummary: deps = deps or default_deps() + install_native_stderr_filters() cfg = loaded.root.densegen run_root = resolve_run_root(loaded.path, cfg.run.root) run_root_str = str(run_root) @@ -2811,8 +3026,45 @@ def run_pipeline(loaded: LoadedConfig, *, deps: PipelineDeps | None = None) -> R plan_leaderboards: dict[tuple[str, str], dict] = {} inputs_manifest_entries: dict[str, dict] = {} source_cache: dict[str, tuple[list, pd.DataFrame | None]] = {} + library_build_rows: list[dict] = [] + library_member_rows: list[dict] = [] + composition_rows: list[dict] = [] outputs_root = run_outputs_root(run_root) outputs_root.mkdir(parents=True, exist_ok=True) + events_path = outputs_root / "meta" / "events.jsonl" + pool_dir = outputs_root / "pools" + try: + _pool_artifact, pool_data = build_pool_artifact( + cfg=cfg, + cfg_path=loaded.path, + deps=deps, + rng=np_rng, + outputs_root=outputs_root, + out_dir=pool_dir, + overwrite=True, + ) + except Exception as exc: + raise RuntimeError(f"Failed to build Stage-A TFBS pools: {exc}") from exc + try: + _emit_event( + events_path, + event="POOL_BUILT", + payload={ + "inputs": [ + { + "name": pool.name, + "input_type": pool.input_type, + "pool_mode": pool.pool_mode, + "rows": int(pool.df.shape[0]) if pool.df is not None else int(len(pool.sequences)), + } + for pool in pool_data.values() + ] + }, + ) + except Exception: + log.debug("Failed to emit POOL_BUILT event.", exc_info=True) + for name, pool in pool_data.items(): + source_cache[name] = (pool.sequences, pool.df) 
ensure_run_meta_dir(run_root) state_path = run_state_path(run_root) state_created_at = datetime.now(timezone.utc).isoformat() @@ -2996,6 +3248,10 @@ def _write_state() -> None: write_state=_write_state, site_failure_counts=site_failure_counts, source_cache=source_cache, + library_build_rows=library_build_rows, + library_member_rows=library_member_rows, + composition_rows=composition_rows, + events_path=events_path, ) per_plan[(s.name, item.name)] = per_plan.get((s.name, item.name), 0) + produced total += produced @@ -3044,6 +3300,10 @@ def _write_state() -> None: write_state=_write_state, site_failure_counts=site_failure_counts, source_cache=source_cache, + library_build_rows=library_build_rows, + library_member_rows=library_member_rows, + composition_rows=composition_rows, + events_path=events_path, ) produced_counts[key] = current + produced leaderboard_latest = stats.get("leaderboard_latest") @@ -3059,6 +3319,77 @@ def _write_state() -> None: outputs_root = run_outputs_root(run_root) _consolidate_parts(outputs_root, part_glob="attempts_part-*.parquet", final_name="attempts.parquet") + if library_build_rows: + libraries_dir = outputs_root / "libraries" + existing_builds: list[dict] = [] + existing_members: list[dict] = [] + builds_path = libraries_dir / "library_builds.parquet" + members_path = libraries_dir / "library_members.parquet" + if builds_path.exists(): + try: + existing_builds = pd.read_parquet(builds_path).to_dict("records") + except Exception: + log.warning("Failed to read existing library_builds.parquet; overwriting.", exc_info=True) + existing_builds = [] + if members_path.exists(): + try: + existing_members = pd.read_parquet(members_path).to_dict("records") + except Exception: + log.warning("Failed to read existing library_members.parquet; overwriting.", exc_info=True) + existing_members = [] + + existing_indices = { + int(row.get("library_index") or 0) for row in existing_builds if row.get("library_index") is not None + } + new_builds = [row for 
row in library_build_rows if int(row.get("library_index") or 0) not in existing_indices] + build_rows = existing_builds + new_builds + + existing_member_keys = { + ( + int(row.get("library_index") or 0), + int(row.get("position") or 0), + ) + for row in existing_members + } + new_members = [ + row + for row in library_member_rows + if (int(row.get("library_index") or 0), int(row.get("position") or 0)) not in existing_member_keys + ] + member_rows = existing_members + new_members + + try: + write_library_artifact( + out_dir=libraries_dir, + builds=build_rows, + members=member_rows, + cfg_path=loaded.path, + run_id=str(cfg.run.id), + run_root=run_root, + overwrite=True, + ) + except Exception as exc: + raise RuntimeError(f"Failed to write library artifacts: {exc}") from exc + + if composition_rows: + composition_path = outputs_root / "composition.parquet" + existing_rows: list[dict] = [] + if composition_path.exists(): + try: + existing_rows = pd.read_parquet(composition_path).to_dict("records") + except Exception: + log.warning("Failed to read existing composition.parquet; overwriting.", exc_info=True) + existing_rows = [] + existing_keys = { + (str(row.get("sequence_id") or ""), int(row.get("placement_index") or 0)) for row in existing_rows + } + new_rows = [ + row + for row in composition_rows + if (str(row.get("sequence_id") or ""), int(row.get("placement_index") or 0)) not in existing_keys + ] + pd.DataFrame(existing_rows + new_rows).to_parquet(composition_path, index=False) + manifest_items = [ PlanManifest( input_name=key[0], diff --git a/src/dnadesign/densegen/src/core/reporting.py b/src/dnadesign/densegen/src/core/reporting.py index c4049d75..d2ff922e 100644 --- a/src/dnadesign/densegen/src/core/reporting.py +++ b/src/dnadesign/densegen/src/core/reporting.py @@ -25,6 +25,7 @@ from ..adapters.outputs import load_records_from_config from ..config import RootConfig, resolve_run_root, resolve_run_scoped_path +from .artifacts.pool import POOL_MODE_TFBS, 
load_pool_artifact from .run_manifest import load_run_manifest from .run_paths import run_manifest_path, run_outputs_root @@ -119,6 +120,8 @@ def _explode_used(df: pd.DataFrame) -> pd.DataFrame: "input_name": str(row.get(input_col) or ""), "tf": tf, "tfbs": tfbs, + "motif_id": entry.get("motif_id"), + "tfbs_id": entry.get("tfbs_id"), "orientation": entry.get("orientation"), "offset": entry.get("offset"), "length": entry.get("length"), @@ -376,6 +379,48 @@ def collect_report_data( tables: Dict[str, pd.DataFrame] = {} + stage_a_bins = pd.DataFrame(columns=["input_name", "tf", "bin_id", "bin_low", "bin_high", "count", "total"]) + pool_dir = outputs_root / "pools" + if pool_dir.exists(): + try: + pool_artifact = load_pool_artifact(pool_dir) + rows: list[dict[str, Any]] = [] + for entry in pool_artifact.inputs.values(): + if entry.pool_mode != POOL_MODE_TFBS: + continue + pool_path = pool_dir / entry.pool_path + if not pool_path.exists(): + continue + df_pool = pd.read_parquet(pool_path) + if "fimo_bin_id" not in df_pool.columns or "tf" not in df_pool.columns: + continue + total_counts = df_pool.groupby("tf").size().to_dict() + grouped = df_pool.groupby(["tf", "fimo_bin_id"]) + for (tf, bin_id), group in grouped: + bin_low = None + bin_high = None + if "fimo_bin_low" in group.columns and not group["fimo_bin_low"].empty: + bin_low = float(group["fimo_bin_low"].iloc[0]) + if "fimo_bin_high" in group.columns and not group["fimo_bin_high"].empty: + bin_high = float(group["fimo_bin_high"].iloc[0]) + rows.append( + { + "input_name": entry.name, + "tf": tf, + "bin_id": int(bin_id), + "bin_low": bin_low, + "bin_high": bin_high, + "count": int(len(group)), + "total": int(total_counts.get(tf, len(group))), + } + ) + if rows: + stage_a_bins = pd.DataFrame(rows) + except Exception: + log.warning("Failed to load Stage-A pool bins for report.", exc_info=True) + + tables["stage_a_bins"] = stage_a_bins + library_summary = pd.DataFrame( columns=["library_hash", "library_index", 
"input_name", "plan_name", "size", "total_bp", "outputs"] ) @@ -481,6 +526,13 @@ def collect_report_data( tables["tf_cooccurrence"] = _compute_cooccurrence(used_df) tables["tf_adjacency"] = _compute_adjacency(used_df) + composition_path = outputs_root / "composition.parquet" + if composition_path.exists(): + try: + tables["composition"] = pd.read_parquet(composition_path) + except Exception: + log.warning("Failed to load composition.parquet for report tables.", exc_info=True) + library_hashes = df[_dg("sampling_library_hash")].dropna().unique().tolist() tf_counts = used_df["tf"].value_counts().to_dict() if not used_df.empty else {} tfbs_counts = used_df["tfbs"].value_counts().to_dict() if not used_df.empty else {} @@ -604,7 +656,28 @@ def _render_report_md(bundle: ReportBundle) -> str: "## Outputs", "- outputs/dense_arrays.parquet", "- outputs/attempts.parquet", + "- outputs/composition.parquet", + "- outputs/libraries/library_builds.parquet", + "- outputs/libraries/library_members.parquet", + "- outputs/pools/pool_manifest.json", ] + stage_a_bins = bundle.tables.get("stage_a_bins") + if stage_a_bins is not None and not stage_a_bins.empty: + lines.extend(["", "## Stage-A p-value bins"]) + for (input_name, tf), sub in stage_a_bins.groupby(["input_name", "tf"]): + sub = sub.sort_values("bin_id") + parts = [] + for _, row in sub.iterrows(): + bin_id = int(row.get("bin_id") or 0) + count = int(row.get("count") or 0) + low = row.get("bin_low") + high = row.get("bin_high") + if low is not None and high is not None: + label = f"({float(low):.0e},{float(high):.0e}]" + else: + label = f"bin{bin_id}" + parts.append(f"{label}:{count}") + lines.append(f"- {input_name}/{tf}: " + " ".join(parts)) leaderboard = report.get("leaderboard_latest") or {} leader_tf = leaderboard.get("tf") or [] leader_tfbs = leaderboard.get("tfbs") or [] diff --git a/src/dnadesign/densegen/src/core/sampler.py b/src/dnadesign/densegen/src/core/sampler.py index e80d1e4b..4fd65208 100644 --- 
a/src/dnadesign/densegen/src/core/sampler.py +++ b/src/dnadesign/densegen/src/core/sampler.py @@ -99,15 +99,21 @@ def generate_binding_site_subsample( labels: list[str] = [] site_ids: list[str | None] = [] sources: list[str | None] = [] + tfbs_ids: list[str | None] = [] + motif_ids: list[str | None] = [] seen_tfbs = set() # for unique_binding_sites (tf, tfbs) used_per_tf: dict[str, int] = {} has_site_id = "site_id" in self.df.columns has_source = "source" in self.df.columns + has_tfbs_id = "tfbs_id" in self.df.columns + has_motif_id = "motif_id" in self.df.columns def _append_provenance(row) -> None: site_ids.append(str(row["site_id"]) if has_site_id else None) sources.append(str(row["source"]) if has_source else None) + tfbs_ids.append(str(row["tfbs_id"]) if has_tfbs_id else None) + motif_ids.append(str(row["motif_id"]) if has_motif_id else None) unique_tfs = self.df["tf"].unique().tolist() self.rng.shuffle(unique_tfs) @@ -235,6 +241,8 @@ def _add_required_tfs() -> None: "final_cap": cap, "site_id_by_index": site_ids if has_site_id else None, "source_by_index": sources if has_source else None, + "tfbs_id_by_index": tfbs_ids if has_tfbs_id else None, + "motif_id_by_index": motif_ids if has_motif_id else None, } return sites, meta, labels, info @@ -278,6 +286,8 @@ def generate_binding_site_library( has_site_id = "site_id" in df.columns has_source = "source" in df.columns + has_tfbs_id = "tfbs_id" in df.columns + has_motif_id = "motif_id" in df.columns total_unique_tfbs = len(df.drop_duplicates(["tf", "tfbs"])) unique_tfs = sorted(df["tf"].unique().tolist()) @@ -313,6 +323,8 @@ def generate_binding_site_library( reasons: list[str] = [] site_ids: list[str | None] = [] sources: list[str | None] = [] + tfbs_ids: list[str | None] = [] + motif_ids: list[str | None] = [] seen_tfbs = set() used_per_tf: dict[str, int] = {} @@ -330,6 +342,8 @@ def _append_row(row, reason: str) -> bool: used_per_tf[tf] = used_per_tf.get(tf, 0) + 1 site_ids.append(str(row["site_id"]) if 
has_site_id else None) sources.append(str(row["source"]) if has_source else None) + tfbs_ids.append(str(row["tfbs_id"]) if has_tfbs_id else None) + motif_ids.append(str(row["motif_id"]) if has_motif_id else None) return True def _pick_for_tf(tf: str, *, reason: str, cap_override: int | None = None) -> bool: @@ -528,6 +542,8 @@ def _fill_uniform_over_pairs() -> None: "final_cap": cap, "site_id_by_index": site_ids if has_site_id else None, "source_by_index": sources if has_source else None, + "tfbs_id_by_index": tfbs_ids if has_tfbs_id else None, + "motif_id_by_index": motif_ids if has_motif_id else None, "selection_reason_by_index": reasons, } return sites, meta, labels, info diff --git a/src/dnadesign/densegen/tests/test_artifacts_ids.py b/src/dnadesign/densegen/tests/test_artifacts_ids.py new file mode 100644 index 00000000..bab0e3b7 --- /dev/null +++ b/src/dnadesign/densegen/tests/test_artifacts_ids.py @@ -0,0 +1,33 @@ +from __future__ import annotations + +from dnadesign.densegen.src.core.artifacts.ids import hash_pwm_motif, hash_tfbs_id + + +def test_hash_tfbs_id_is_deterministic() -> None: + a = hash_tfbs_id(motif_id="M1", sequence="ACGT", scoring_backend="fimo", matched_start=1, matched_stop=4) + b = hash_tfbs_id(motif_id="M1", sequence="ACGT", scoring_backend="fimo", matched_start=1, matched_stop=4) + assert a == b + + +def test_hash_tfbs_id_changes_with_inputs() -> None: + base = hash_tfbs_id(motif_id="M1", sequence="ACGT", scoring_backend="fimo", matched_start=1, matched_stop=4) + diff_seq = hash_tfbs_id(motif_id="M1", sequence="TGCA", scoring_backend="fimo", matched_start=1, matched_stop=4) + diff_match = hash_tfbs_id(motif_id="M1", sequence="ACGT", scoring_backend="fimo", matched_start=2, matched_stop=5) + assert base != diff_seq + assert base != diff_match + + +def test_hash_pwm_motif_changes_with_matrix() -> None: + m1 = hash_pwm_motif( + motif_label="lexA", + matrix=[{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}], + background={"A": 0.25, "C": 0.25, "G": 
0.25, "T": 0.25}, + source_kind="pwm_meme", + ) + m2 = hash_pwm_motif( + motif_label="lexA", + matrix=[{"A": 0.6, "C": 0.2, "G": 0.1, "T": 0.1}], + background={"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}, + source_kind="pwm_meme", + ) + assert m1 != m2 diff --git a/src/dnadesign/densegen/tests/test_artifacts_library.py b/src/dnadesign/densegen/tests/test_artifacts_library.py new file mode 100644 index 00000000..4490599e --- /dev/null +++ b/src/dnadesign/densegen/tests/test_artifacts_library.py @@ -0,0 +1,61 @@ +from __future__ import annotations + +from pathlib import Path + +from dnadesign.densegen.src.core.artifacts.library import load_library_artifact, write_library_artifact + + +def test_write_library_artifact(tmp_path: Path) -> None: + builds = [ + { + "created_at": "2026-01-20T00:00:00+00:00", + "input_name": "demo", + "plan_name": "plan", + "library_index": 1, + "library_id": "libhash", + "library_hash": "libhash", + "pool_strategy": "subsample", + "library_sampling_strategy": "tf_balanced", + "library_size": 2, + "target_length": 20, + "achieved_length": 18, + "relaxed_cap": False, + "final_cap": None, + "iterative_max_libraries": 0, + "iterative_min_new_solutions": 0, + "required_regulators_selected": None, + } + ] + members = [ + { + "library_id": "libhash", + "library_hash": "libhash", + "library_index": 1, + "input_name": "demo", + "plan_name": "plan", + "position": 0, + "tf": "TF1", + "tfbs": "AAAA", + "tfbs_id": "id1", + "motif_id": "motif1", + "site_id": None, + "source": "src", + } + ] + artifact = write_library_artifact( + out_dir=tmp_path, + builds=builds, + members=members, + cfg_path=Path("config.yaml"), + run_id="demo", + run_root=tmp_path, + overwrite=True, + ) + + assert artifact.manifest_path.exists() + assert artifact.builds_path.exists() + assert artifact.members_path.exists() + + loaded = load_library_artifact(tmp_path) + assert loaded.builds_path.name == artifact.builds_path.name + assert loaded.members_path.name == 
artifact.members_path.name diff --git a/src/dnadesign/densegen/tests/test_artifacts_pool.py b/src/dnadesign/densegen/tests/test_artifacts_pool.py new file mode 100644 index 00000000..9f3222ad --- /dev/null +++ b/src/dnadesign/densegen/tests/test_artifacts_pool.py @@ -0,0 +1,83 @@ +from __future__ import annotations + +from pathlib import Path + +import numpy as np +import yaml + +from dnadesign.densegen.src.config import load_config +from dnadesign.densegen.src.core.artifacts.pool import build_pool_artifact +from dnadesign.densegen.src.core.pipeline import default_deps + + +def test_build_pool_artifact_binding_sites(tmp_path: Path) -> None: + csv_path = tmp_path / "sites.csv" + csv_path.write_text("tf,tfbs\nTF1,AAAA\nTF2,CCCC\n") + cfg_path = tmp_path / "config.yaml" + cfg_path.write_text( + yaml.safe_dump( + { + "densegen": { + "schema_version": "2.4", + "run": {"id": "demo", "root": "."}, + "inputs": [ + { + "name": "demo input", + "type": "binding_sites", + "path": str(csv_path), + "format": "csv", + } + ], + "output": { + "targets": ["parquet"], + "schema": {"bio_type": "dna", "alphabet": "dna_4"}, + "parquet": {"path": str(tmp_path / "out.parquet")}, + }, + "generation": { + "sequence_length": 10, + "quota": 1, + "plan": [{"name": "default", "quota": 1}], + }, + "solver": {"backend": "CBC", "strategy": "iterate", "options": []}, + "runtime": { + "round_robin": False, + "arrays_generated_before_resample": 10, + "min_count_per_tf": 0, + "max_duplicate_solutions": 5, + "stall_seconds_before_resample": 10, + "stall_warning_every_seconds": 10, + "max_resample_attempts": 1, + "max_total_resamples": 1, + "max_seconds_per_plan": 0, + "max_failed_solutions": 0, + "checkpoint_every": 0, + "leaderboard_every": 50, + }, + "logging": {"log_dir": "outputs/logs", "level": "INFO"}, + "postprocess": {"gap_fill": {"mode": "off"}}, + } + } + ) + ) + + loaded = load_config(cfg_path) + cfg = loaded.root.densegen + out_dir = tmp_path / "outputs" / "pools" + outputs_root = tmp_path 
/ "outputs" + artifact, pool_data = build_pool_artifact( + cfg=cfg, + cfg_path=cfg_path, + deps=default_deps(), + rng=np.random.default_rng(0), + outputs_root=outputs_root, + out_dir=out_dir, + overwrite=False, + ) + + assert artifact.manifest_path.exists() + entry = artifact.entry_for("demo input") + assert " " not in entry.pool_path.name + pool = pool_data["demo input"] + assert pool.df is not None + assert "tfbs_id" in pool.df.columns + assert "motif_id" in pool.df.columns diff --git a/src/dnadesign/densegen/tests/test_cli_workspace_init.py b/src/dnadesign/densegen/tests/test_cli_workspace_init.py new file mode 100644 index 00000000..668c21e3 --- /dev/null +++ b/src/dnadesign/densegen/tests/test_cli_workspace_init.py @@ -0,0 +1,112 @@ +from __future__ import annotations + +import textwrap +from pathlib import Path + +from typer.testing import CliRunner + +from dnadesign.densegen.src.cli import app + + +def _write_template_config(path: Path) -> None: + path.write_text( + textwrap.dedent( + """ + densegen: + schema_version: "2.4" + run: + id: demo + root: "." + inputs: + - name: demo + type: binding_sites + path: inputs/sites.csv + """ + ).strip() + + "\n" + ) + + +def _write_min_config(path: Path) -> None: + path.write_text( + textwrap.dedent( + """ + densegen: + schema_version: "2.4" + run: + id: demo + root: "." 
+ inputs: + - name: demo + type: binding_sites + path: inputs.csv + + output: + targets: [parquet] + schema: + bio_type: dna + alphabet: dna_4 + parquet: + path: outputs/dense_arrays.parquet + + generation: + sequence_length: 10 + quota: 1 + plan: + - name: default + quota: 1 + + solver: + backend: CBC + strategy: iterate + + logging: + log_dir: outputs/logs + """ + ).strip() + + "\n" + ) + + +def test_workspace_init_warns_on_relative_inputs_without_copy(tmp_path: Path) -> None: + template_path = tmp_path / "template.yaml" + _write_template_config(template_path) + runner = CliRunner() + result = runner.invoke( + app, + [ + "workspace", + "init", + "--id", + "demo_run", + "--root", + str(tmp_path), + "--template", + str(template_path), + ], + ) + assert result.exit_code == 0, result.output + assert "Workspace uses file-based inputs with relative paths" in result.output + assert (tmp_path / "demo_run" / "config.yaml").exists() + + +def test_stage_b_reports_missing_pool_manifest(tmp_path: Path) -> None: + cfg_path = tmp_path / "config.yaml" + _write_min_config(cfg_path) + pool_dir = tmp_path / "pools" + pool_dir.mkdir() + runner = CliRunner() + result = runner.invoke( + app, + [ + "stage-b", + "build-libraries", + "-c", + str(cfg_path), + "--pool", + str(pool_dir), + ], + ) + assert result.exit_code != 0, result.output + assert "Pool manifest not found" in result.output + assert "dense stage-a build-pool" in result.output diff --git a/src/dnadesign/densegen/tests/test_used_tfbs_offsets.py b/src/dnadesign/densegen/tests/test_used_tfbs_offsets.py index fab47c33..3d0da07c 100644 --- a/src/dnadesign/densegen/tests/test_used_tfbs_offsets.py +++ b/src/dnadesign/densegen/tests/test_used_tfbs_offsets.py @@ -22,6 +22,8 @@ def test_used_tfbs_offsets_shift_with_5prime_padding() -> None: None, None, None, + None, + None, ) assert used_tfbs == ["TF1:TT", "TF2:GG"] assert used_counts == {"TF1": 1, "TF2": 1} diff --git a/src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml 
b/src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml index 3fed86a2..281471b7 100644 --- a/src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml +++ b/src/dnadesign/densegen/workspaces/demo_meme_two_tf/config.yaml @@ -4,7 +4,7 @@ # Motif widths: lexA=22, cpxR=21. densegen: - schema_version: "2.3" + schema_version: "2.4" run: id: demo_meme_two_tf root: "." From 47ce75c6b583e7b9932cc62444b49ce1f41fb213 Mon Sep 17 00:00:00 2001 From: Eric South Date: Tue, 20 Jan 2026 15:21:38 -0500 Subject: [PATCH 10/40] pixi: add pytest task for MEME-enabled tests --- pixi.toml | 1 + 1 file changed, 1 insertion(+) diff --git a/pixi.toml b/pixi.toml index 7b4672b4..6084d7ab 100644 --- a/pixi.toml +++ b/pixi.toml @@ -7,6 +7,7 @@ platforms = ["osx-arm64", "osx-64", "linux-64"] [tasks] cruncher = "uv run cruncher" dense = "uv run dense" +pytest = "uv run pytest -q" [dependencies] meme = "*" From e6566cc41dcc2faa7f0471bd777faa3a59186d2d Mon Sep 17 00:00:00 2001 From: Eric South Date: Tue, 20 Jan 2026 16:10:39 -0500 Subject: [PATCH 11/40] densegen: harden sampling UX and reporting --- .../densegen/docs/demo/demo_basic.md | 10 +- .../densegen/docs/guide/generation.md | 6 + src/dnadesign/densegen/docs/guide/inputs.md | 3 +- .../densegen/docs/guide/outputs-metadata.md | 34 ++- src/dnadesign/densegen/docs/reference/cli.md | 22 +- .../densegen/docs/reference/config.md | 7 + .../densegen/docs/reference/outputs.md | 5 +- .../src/adapters/sources/pwm_sampling.py | 155 ++++++++++++ src/dnadesign/densegen/src/cli.py | 64 ++--- src/dnadesign/densegen/src/config/__init__.py | 53 ++++ .../densegen/src/core/artifacts/library.py | 2 +- .../densegen/src/core/artifacts/pool.py | 2 +- src/dnadesign/densegen/src/core/pipeline.py | 226 +++++++++++++++--- src/dnadesign/densegen/src/core/reporting.py | 180 +++++++++++++- .../densegen/src/core/run_manifest.py | 12 + .../densegen/src/core/runtime_policy.py | 37 +++ src/dnadesign/densegen/src/core/seeding.py | 22 ++ 
.../densegen/src/utils/logging_utils.py | 55 +++-- src/dnadesign/densegen/src/utils/mpl_utils.py | 18 ++ .../tests/test_cli_summarize_library.py | 4 + .../densegen/tests/test_config_strict.py | 16 ++ .../densegen/tests/test_run_manifest.py | 5 + .../densegen/tests/test_source_cache.py | 3 +- 23 files changed, 836 insertions(+), 105 deletions(-) create mode 100644 src/dnadesign/densegen/src/core/runtime_policy.py create mode 100644 src/dnadesign/densegen/src/core/seeding.py create mode 100644 src/dnadesign/densegen/src/utils/mpl_utils.py diff --git a/src/dnadesign/densegen/docs/demo/demo_basic.md b/src/dnadesign/densegen/docs/demo/demo_basic.md index 9adf7e50..b0589e96 100644 --- a/src/dnadesign/densegen/docs/demo/demo_basic.md +++ b/src/dnadesign/densegen/docs/demo/demo_basic.md @@ -200,13 +200,12 @@ Quota plan: meme_demo=50 🎉 Run complete. ``` -On macOS you may see Arrow sysctl warnings after generation; they are emitted by pyarrow and do -not indicate a DenseGen failure. +DenseGen suppresses noisy pyarrow sysctl warnings to keep stdout clean during long runs. ## 8) Inspect run summary -DenseGen writes `outputs/meta/run_manifest.json` and `outputs/meta/inputs_manifest.json`. Summarize the -run manifest: +DenseGen writes `outputs/meta/run_manifest.json`, `outputs/meta/inputs_manifest.json`, and +`outputs/meta/effective_config.json`. Summarize the run manifest: ```bash uv run dense inspect run --run /private/tmp/densegen-demo-20260115-1405/demo_press @@ -244,7 +243,7 @@ Generate an audit-grade summary of the run: uv run dense report -c /private/tmp/densegen-demo-20260115-1405/demo_press/config.yaml --format all ``` -This writes `outputs/report.json`, `outputs/report.md`, and `outputs/report.html`. +This writes `outputs/report.json`, `outputs/report.md`, `outputs/report.html`, and `outputs/report_assets/`. 
## 10) Inspect outputs @@ -265,6 +264,7 @@ pools report.html report.json report.md +report_assets ``` Inspect Stage‑A pools and Stage‑B libraries: diff --git a/src/dnadesign/densegen/docs/guide/generation.md b/src/dnadesign/densegen/docs/guide/generation.md index 041c458c..cab41eaa 100644 --- a/src/dnadesign/densegen/docs/guide/generation.md +++ b/src/dnadesign/densegen/docs/guide/generation.md @@ -76,6 +76,7 @@ DenseGen exposes dense-arrays solution modes via `solver.strategy`: - `optimal` - only the best solution per library. - `approximate` - heuristic solution per library (no solver options; backend optional). - `strands` - `single | double` (default: `double`). +Use `solver.fallback_to_cbc` to allow a CBC fallback if the preferred solver is not available. ```yaml solver: @@ -83,8 +84,13 @@ solver: strategy: diverse options: ["Threads=8", "TimeLimit=10"] strands: double + fallback_to_cbc: false + allow_unknown_options: false ``` +DenseGen validates solver option keys for known backends and fails fast on unknown options. If you +need to pass custom solver flags, set `solver.allow_unknown_options: true` explicitly. + --- ### Sampling controls diff --git a/src/dnadesign/densegen/docs/guide/inputs.md b/src/dnadesign/densegen/docs/guide/inputs.md index 3ad6b50d..1189bcab 100644 --- a/src/dnadesign/densegen/docs/guide/inputs.md +++ b/src/dnadesign/densegen/docs/guide/inputs.md @@ -111,7 +111,8 @@ Required sampling fields: - `retain_bin_ids` (optional list of ints): keep only specific p‑value bins - `log_every_batches` (int > 0): log yield summaries every N batches - `bgfile` (optional): MEME bfile-format background model for FIMO -- `keep_all_candidates_debug` (optional): write raw FIMO TSVs to `outputs/meta/fimo/` for inspection +- `keep_all_candidates_debug` (optional): write raw FIMO TSVs and candidate-level Parquet + (`candidates__