NVIDIA-NeMo · mohammadaaftabv · Jun 5, 2026 · Jun 5, 2026 · Jun 5, 2026 · Jun 5, 2026
diff --git a/.cursor/skills/review-curator-pr/SKILL.md b/.cursor/skills/review-curator-pr/SKILL.md
@@ -0,0 +1,140 @@
+---
+name: review-curator-pr
+description: >-
+  Review NVIDIA-NeMo/Curator pull requests against the project's stage
+  contracts and contribution standards. Pulls fresh PR data with the GitHub
+  CLI, diffs the changed code, applies Curator's documented review lenses, and
+  emits a structured review digest plus a paste-ready comment queue. Use when a
+  contributor or maintainer asks to review, re-review, or triage a Curator PR
+  (e.g. "review PR 1967", "what's still open on PR 1898", "triage this PR"), or
+  to draft replies to reviewer comments.
+---
+
+# Review NeMo Curator PRs
+
+Produce a defensible, reproducible review of an `NVIDIA-NeMo/Curator` PR from a
+fresh GitHub pull plus the code diff. The audience is the PR author (who must
+respond to reviewers and land the change) and maintainers triaging the queue.
+
+Requirements: the GitHub CLI (`gh`) authenticated against `github.com`, and a
+local checkout of this repository (the one containing this skill).
+
+## Two deliverables (always)
+
+Each review run writes two Markdown files into a scratch directory (default
+`.curator-pr-review/`, override with `--outdir`):
+
+1. `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md` - full digest: PR state,
+   commits, changed-file table, reviews, every inline comment grouped by file
+   with OPEN / OUTDATED / RESOLVED status, and issue comments.
+2. `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md` - only the OPEN root
+   threads that need an author reply or code change, paste-ready, with stale
+   threads listed at the bottom for traceability.
+
+See [templates.md](templates.md) for the exact section layout. Keeping the
+structure stable makes diffs across review passes mechanical, so a re-review
+only surfaces what changed.
+
+## Workflow
+
+```
+- [ ] 1. Identify the PR number
+- [ ] 2. Pull fresh GitHub data (gh)
+- [ ] 3. Check out the PR head and diff against the base
+- [ ] 4. Read the changed code through the review lenses
+- [ ] 5. Generate the digest + comment queue
+- [ ] 6. Classify findings P0-P3 and (optionally) draft replies
+```
+
+### Step 1 - Identify the PR
+
+Confirm the target: `gh pr view <N> --repo NVIDIA-NeMo/Curator`. If a prior
+review digest for this PR exists, read it first and record its head SHA as the
+baseline so commits and comments can be marked NEW vs already-seen.
+
+### Step 2 - Pull fresh GitHub data
+
+Run the pull script (writes `pr<N>_*_latest.json` plus timestamped snapshots so
+prior pulls are preserved for delta analysis):
+
+```bash
+.cursor/skills/review-curator-pr/scripts/pr_review_pull.sh <N>
+```
+
+It pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue
+`comments`, `files`, `commits`) and the GraphQL review threads, which carry the
+`isResolved` / `isOutdated` flags the REST inline endpoint omits.
+
+### Step 3 - Diff the code
+
+```bash
+gh pr checkout <N>                       # or: git fetch origin pull/<N>/head:pr-<N>
+git diff origin/main...HEAD             # changed files on this PR
+```
+
+Always review the actual diff, never the comment text alone. Use `main` as the
+baseline for "how the project already does this".
+
+### Step 4 - Read through the review lenses
+
+Apply the lenses in [recurring-themes.md](recurring-themes.md): stage
+contracts, setup/teardown lifecycle, dependency and optional-extra hygiene,
+secret-safe logging, memory/serialization of large payloads, streaming and
+performance, tutorials/docs, tests/coverage, and PR size/reviewability.
+
+These build on the project's own always-on rules in `.cursor/rules/`
+(`processing-stage-patterns`, `composite-stage-patterns`, `task-patterns`,
+`executors`, `resources-configuration`, `pipeline-structure`,
+`modality-structure`, `coding-standards`) - treat those as the canonical
+contract and cite them when a change deviates.
+
+### Step 5 - Generate the two files
+
+```bash
+.cursor/skills/review-curator-pr/scripts/build_digest.py <N> --today <YYYY-MM-DD>
+```
+
+The builder joins the pulled JSON and classifies each comment by thread status:
+**OPEN** (actionable), **OUTDATED** (pre-dates the current head), **RESOLVED**.
+The comment queue lists only OPEN root threads.
+
+### Step 6 - Findings and replies
+
+In the digest's findings section, classify each issue by severity:
+
+- **P0** - merge blocker: data loss / crash on a valid config, a secret leaked
+  into logs, or an import that breaks a supported install.
+- **P1** - fix before merge: undeclared runtime dependencies, missing required
+  tests/coverage, environment-specific code in a reusable path, a PR too large
+  to review safely.
+- **P2** - should fix: duplicated logic, inefficient batch handling, incomplete
+  metrics/observability, documented-but-unwired config.
+- **P3** - nice to have: temporary workarounds that need a test + comment,
+  debug-only knobs to document.
+
+When drafting replies, cite the exact `path:line-range` on the current head,
+state whether the point is already addressed (with the commit SHA) or will be,
+and keep each reply self-contained. See [templates.md](templates.md) section C.
+
+## Conventions to enforce (from CONTRIBUTING.md and rules)
+
+- **DCO sign-off** is required on every commit (`git commit -s`).
+- **Tests** must accompany source changes; GPU-only tests use
+  `@pytest.mark.gpu`; the project enforces 80% coverage under `nemo_curator/`,
+  and `tests/` mirrors the `nemo_curator/` layout.
+- **NVIDIA Apache-2.0 copyright header** on every non-empty Python file.
+- **Ruff** lint/format, 119-char lines; **loguru** for logging.
+- Supported Python: 3.10-3.12.
+
+## Knowledge sources
+
+Full index in [reference.md](reference.md). Quick map:
+
+| Need | Source |
+|------|--------|
+| Stage / pipeline contracts | `.cursor/rules/*.mdc`, `nemo_curator/stages/base.py` |
+| Modality guides | `nemo_curator/stages/<modality>/README.md`, `tutorials/<modality>/README.md` |
+| Backends / executors | `nemo_curator/backends/`, `nemo_curator/pipeline/pipeline.py` |
+| Perf expectations | `benchmarking/` |
+| Contribution rules | `CONTRIBUTING.md` |
+| Live PR data | `gh` (see scripts) |
diff --git a/.cursor/skills/review-curator-pr/recurring-themes.md b/.cursor/skills/review-curator-pr/recurring-themes.md
@@ -0,0 +1,127 @@
+# Review lenses - recurring themes in Curator PRs
+
+Apply each lens to the diff and cite exact `path:line` evidence. These extend
+the always-on rules in `.cursor/rules/`; when a change violates a rule, link the
+rule by name. Most points are modality-agnostic; audio examples are
+illustrative.
+
+## Stage contracts and reuse
+
+- A stage subclasses `ProcessingStage[InputTask, OutputTask]` with explicit
+  `inputs()` / `outputs()`, a meaningful `name`, and appropriate `resources` /
+  `batch_size` (set as class attributes / dataclass fields, never as the
+  read-only `_name` / `_resources` / `_batch_size` properties).
+- GPU or vectorizable work belongs in `process_batch()` with a real
+  `batch_size`; `__init__` stays lightweight (it runs on the driver).
+- Reuse existing readers, writers, and shared model abstractions instead of
+  bespoke orchestration. Use `CompositeStage` to present several internal stages
+  as one user-facing stage.
+- If `inputs()` does not declare a required key, runtime validation will not
+  check it, and a direct `task.data[key]` access then raises an opaque
+  `KeyError` instead of a clear validation error. Declare required keys.
+- Fan-out stages (one task -> many) need deterministic child IDs and propagated
+  `_metadata` / `_stage_perf`. First-stage discovery from an empty input should
+  constrain placement so work is not duplicated across every worker.
+
+## Setup / teardown lifecycle and dependencies
+
+- `setup_on_node()` = once-per-node prep (e.g. model downloads / shared cache).
+  `setup()` = per-worker init (model load, GPU allocation). `teardown()` frees
+  resources. Don't conflate them.
+- Import heavy or optional dependencies lazily inside `setup()` (or behind a
+  guarded import), never unconditionally at module top level, when the dependency
+  ships only with an optional extra. A top-level `import <optional_pkg>` breaks
+  `import nemo_curator...` on any install without that extra.
+- Declare the full runtime dependency set in the appropriate `pyproject.toml`
+  optional extra; add an import smoke test under that extra. Don't rely on an
+  external Docker image to supply dependencies a reusable stage needs.
+- Avoid hard version pins (`pkg==x.y.z`) and global overrides for libraries
+  shared across modalities (e.g. `transformers`, `huggingface-hub`,
+  `accelerate`); prefer compatible ranges so other extras don't break.
+
+## Secret-safe logging
+
+- Never log a full resolved config object; it can contain tokens / credentials
+  (e.g. an HF token, access keys). Redact secret-like keys before logging, or
+  log only a non-secret subset, and add a test asserting the secret value never
+  appears in logged output.
+
+## Memory and serialization of large payloads
+
+- Drop large in-memory payloads (waveform arrays, image/video tensors) from
+  `task.data` once a stage no longer needs them. Leaked arrays reach downstream
+  JSON writers and raise `TypeError: Object of type ndarray is not JSON
+  serializable` (which can fail an entire batch) or cause memory blowups. Strip
+  by key plus a duck-typed guard, and wrap `json.dumps` so a stray
+  non-serializable value reports the offending key instead of crashing.
+- Guard external I/O edge cases (e.g. `tarfile.extractfile()` returns `None` for
+  non-regular members even after `isfile()`; check before `.read()`).
+
+## Streaming, performance, benchmarking
+
+- Streaming execution with per-stage resource configuration is the default model
+  for scalable pipelines (see `executors.mdc`).
+- Emit stage metrics through the supported metrics path; keep records
+  JSON-shaped. For multi-node analysis, include worker/node/GPU identity and
+  latency distributions (p50/p95/p99), not just totals/averages.
+- Writers that loop per task inside `process_batch()` reopen output files
+  repeatedly; group by output target and open once per batch.
+- Standalone benchmark scripts belong in the `benchmarking/` flow with config
+  entries and comparable parameters - not freestanding.
+
+## Environment-specific code in reusable paths
+
+- Keep infrastructure assumptions (specific object-store URLs, auth env vars,
+  cluster-specific paths) out of generic reader/writer code. Make remote access
+  pluggable: accept local paths, ordinary cloud URIs, or a user-provided
+  opener/resolver, and isolate any environment-specific support behind a small,
+  documented, optional component.
+
+## Tutorials and docs
+
+- Tutorials must be runnable, minimal, and use public APIs only - no private
+  APIs, no machine-local absolute paths, no committed noisy notebook output, no
+  executor mismatch between a notebook and its CLI. Prefer `uv` for environment
+  setup if the tutorial standard does.
+- Don't duplicate pipeline-construction logic between a tutorial entrypoint and
+  an `examples/` script; extract one shared builder both call.
+- Every config key documented for users must actually reach a stage; validate
+  config-to-stage mapping and fail (or warn) on unused keys.
+- Don't commit generated analysis/scratch docs into the source tree.
+
+## Tests and standards (CONTRIBUTING + coding-standards)
+
+- Source changes need accompanying tests; `tests/` mirrors `nemo_curator/`. The
+  project enforces 80% coverage under `nemo_curator/`. GPU-only tests use
+  `@pytest.mark.gpu`.
+- Every non-empty Python file carries the NVIDIA Apache-2.0 copyright header.
+- Ruff clean at 119-char lines; type annotations on functions; loguru for logs.
+- Every commit is DCO signed-off.
+
+## PR size and reviewability
+
+- Large multi-purpose PRs are hard to review and to defend. Prefer splitting by
+  stage / module / tutorial / benchmark. If a single PR must remain, organize it
+  as if sliced: no duplicated builder logic, clear module ownership, and a
+  reviewable commit sequence.
+
+## Smaller recurring nits
+
+- Missing copyright headers on new files.
+- Top-level imports that should be lazy/guarded.
+- Long inline strings (prompts, configs) that should be module constants.
+- Dataclass fields validated in `__post_init__` that could just be typed.
+- Dead conditionals / redundant truthiness checks.
+- Tests added for tutorials that don't warrant them.
+- Debug-only knobs presented as features - document them as debug throttles.
+- `trust_remote_code=True` (or similar) hardcoded - expose as a parameter so
+  security-conscious users can opt out.
+
+## Severity mapping
+
+| Severity | Meaning |
+|----------|---------|
+| P0 | Merge blocker: crash/data-loss on a valid config, secret leaked to logs, import breaks a supported install |
+| P1 | Fix before merge: undeclared deps, missing required tests/coverage, env-specific code in reusable path, PR too large |
+| P2 | Should fix: duplicated logic, inefficient batching, incomplete metrics, documented-but-unwired config |
+| P3 | Nice to have: temporary workaround needing test+comment, debug knob docs |
diff --git a/.cursor/skills/review-curator-pr/reference.md b/.cursor/skills/review-curator-pr/reference.md
@@ -0,0 +1,86 @@
+# Reference - NeMo Curator PR review
+
+All paths are relative to the repository root. The skill is self-contained: it
+relies only on this repository plus the GitHub CLI (`gh`).
+
+## Canonical contracts (`.cursor/rules/`)
+
+These always-on rules are the authoritative description of how stages and
+pipelines must be written. Cite them when a PR deviates.
+
+| Rule | Covers |
+|------|--------|
+| `processing-stage-patterns.mdc` | `ProcessingStage` subclassing, `inputs()`/`outputs()`, `process()`, lifecycle hooks |
+| `composite-stage-patterns.mdc` | `CompositeStage` decomposition |
+| `task-patterns.mdc` | `Task` / typed task subclasses, metadata flow |
+| `executors.mdc` | Xenna / Ray Data backends and when to use each |
+| `resources-configuration.mdc` | `Resources`, `batch_size`, GPU allocation |
+| `pipeline-structure.mdc` | Pipeline assembly |
+| `modality-structure.mdc` | Where modality code/tests/tutorials live |
+| `coding-standards.mdc` | Ruff, copyright header, loguru, pytest, Python versions |
+
+## Framework code (diff ground truth)
+
+- `nemo_curator/stages/base.py` - `ProcessingStage` base class.
+- `nemo_curator/backends/` - `base.py`, `xenna/`, `ray_data/`,
+  `ray_actor_pool/`; adapters define how stages run on each backend.
+- `nemo_curator/pipeline/pipeline.py` - pipeline construction and guards.
+- `nemo_curator/tasks/` - task types.
+
+## Modality guides
+
+| Modality | Stage README | Tutorials |
+|----------|--------------|-----------|
+| audio | `nemo_curator/stages/audio/README.md` | `tutorials/audio/` |
+| interleaved | `nemo_curator/stages/interleaved/README.md` | `tutorials/interleaved/` |
+| text / image / video / math / synthetic | (stage code under `nemo_curator/stages/`) | `tutorials/{text,image,video,math,synthetic}/` |
+
+## Performance / benchmarking
+
+- `benchmarking/README.md` - how benchmarks are wired and run.
+- `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md` - modality
+  perf expectations. Standalone benchmark scripts should be integrated here, not
+  left freestanding.
+
+## Contribution rules
+
+- `CONTRIBUTING.md` - DCO sign-off (`git commit -s`), PR process, test
+  requirements, coverage threshold, copyright header.
+
+## GitHub data via `gh`
+
+`scripts/pr_review_pull.sh <N>` pulls these into the scratch outdir:
+
+| File | Endpoint |
+|------|----------|
+| `pr<N>_gh_latest.json` | `gh pr view <N> --json ...` |
+| `pr<N>_reviews_latest.json` | `repos/:owner/:repo/pulls/<N>/reviews` |
+| `pr<N>_review_comments_latest.json` | `repos/:owner/:repo/pulls/<N>/comments` (inline) |
+| `pr<N>_issue_comments_latest.json` | `repos/:owner/:repo/issues/<N>/comments` |
+| `pr<N>_files_latest.json` | `repos/:owner/:repo/pulls/<N>/files` |
+| `pr<N>_commits_latest.json` | `repos/:owner/:repo/pulls/<N>/commits` |
+| `pr<N>_review_threads_latest.json` | GraphQL `pullRequest.reviewThreads` (isResolved/isOutdated) |
+
+The REST inline-comments endpoint does not expose resolve/outdate state; the
+GraphQL thread payload does. The builder joins them by comment `databaseId`
+(== REST `id`), falling back to a `(path, body-prefix)` match when an older
+thread dump lacks `databaseId`.
+
+GraphQL query used by the pull script:
+
+```graphql
+query($owner:String!,$repo:String!,$pr:Int!){
+  repository(owner:$owner,name:$repo){
+    pullRequest(number:$pr){
+      reviewThreads(first:100){ nodes{
+        id isResolved isOutdated isCollapsed line originalLine path
+        comments(first:50){ nodes{ databaseId } }
+      }}}}}
+```
+
+## Scripts
+
+- `scripts/pr_review_pull.sh <N> [--outdir DIR] [--repo OWNER/REPO]` - fetch.
+- `scripts/build_digest.py <N> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]` - render the digest + comment queue.
+
+Default outdir: `.curator-pr-review/` (scratch; safe to delete or gitignore).
diff --git a/.cursor/skills/review-curator-pr/scripts/README.md b/.cursor/skills/review-curator-pr/scripts/README.md
@@ -0,0 +1,32 @@
+# Scripts - review-curator-pr
+
+Both require the GitHub CLI (`gh`) authenticated against github.com. They write
+to a scratch directory (default `.curator-pr-review/`) that is safe to delete;
+do not commit its contents.
+
+## pr_review_pull.sh
+
+```bash
+.cursor/skills/review-curator-pr/scripts/pr_review_pull.sh <PR_NUMBER> [--outdir DIR] [--repo OWNER/REPO]
+```
+
+Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue
+`comments`, `files`, `commits`) plus the GraphQL review threads into
+`pr<N>_*_latest.json` (and timestamped snapshots for delta analysis).
+
+## build_digest.py
+
+```bash
+.cursor/skills/review-curator-pr/scripts/build_digest.py <PR_NUMBER> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]
+```
+
+Joins the `pr<N>_*_latest.json` files and writes
+`curator_pr<N>_fresh_review_<date>.md` + `curator_pr<N>_github_comment_queue_<date>.md`.
+
+- `--prev-head` sets the SHA in the "commits since prior reviewed head" header.
+- `--baseline-ts <TS>` (a prior pull's timestamp suffix) marks
+  comments/reviews/commits NEW vs already-seen.
+
+The thread join uses comment `databaseId` when present, else a
+(path, body-prefix) fallback, so both GraphQL thread-dump shapes classify
+correctly.