Skip to content
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
140 changes: 140 additions & 0 deletions .cursor/skills/review-curator-pr/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,140 @@
---
name: review-curator-pr
description: >-
Review NVIDIA-NeMo/Curator pull requests against the project's stage
contracts and contribution standards. Pulls fresh PR data with the GitHub
CLI, diffs the changed code, applies Curator's documented review lenses, and
emits a structured review digest plus a paste-ready comment queue. Use when a
contributor or maintainer asks to review, re-review, or triage a Curator PR
(e.g. "review PR 1967", "what's still open on PR 1898", "triage this PR"), or
to draft replies to reviewer comments.
---

# Review NeMo Curator PRs

Produce a defensible, reproducible review of an `NVIDIA-NeMo/Curator` PR from a
fresh GitHub pull plus the code diff. The audience is the PR author (who must
respond to reviewers and land the change) and maintainers triaging the queue.

Requirements: the GitHub CLI (`gh`) authenticated against `github.com`, and a
local checkout of this repository (the one containing this skill).

## Two deliverables (always)

Each review run writes two Markdown files into a scratch directory (default
`.curator-pr-review/`, override with `--outdir`):

1. `curator_pr<N>_fresh_review_<YYYY_MM_DD>.md` - full digest: PR state,
commits, changed-file table, reviews, every inline comment grouped by file
with OPEN / OUTDATED / RESOLVED status, and issue comments.
2. `curator_pr<N>_github_comment_queue_<YYYY_MM_DD>.md` - only the OPEN root
threads that need an author reply or code change, paste-ready, with stale
threads listed at the bottom for traceability.

See [templates.md](templates.md) for the exact section layout. Keeping the
structure stable makes diffs across review passes mechanical, so a re-review
only surfaces what changed.

## Workflow

```
- [ ] 1. Identify the PR number
- [ ] 2. Pull fresh GitHub data (gh)
- [ ] 3. Check out the PR head and diff against the base
- [ ] 4. Read the changed code through the review lenses
- [ ] 5. Generate the digest + comment queue
- [ ] 6. Classify findings P0-P3 and (optionally) draft replies
```

### Step 1 - Identify the PR

Confirm the target: `gh pr view <N> --repo NVIDIA-NeMo/Curator`. If a prior
review digest for this PR exists, read it first and record its head SHA as the
baseline so commits and comments can be marked NEW vs already-seen.

### Step 2 - Pull fresh GitHub data

Run the pull script (writes `pr<N>_*_latest.json` plus timestamped snapshots so
prior pulls are preserved for delta analysis):

```bash
.cursor/skills/review-curator-pr/scripts/pr_review_pull.sh <N>
```

It pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue
`comments`, `files`, `commits`) and the GraphQL review threads, which carry the
`isResolved` / `isOutdated` flags the REST inline endpoint omits.

### Step 3 - Diff the code

```bash
gh pr checkout <N> # or: git fetch origin pull/<N>/head:pr-<N>
git diff origin/main...HEAD # changed files on this PR
```

Always review the actual diff, never the comment text alone. Use `main` as the
baseline for "how the project already does this".

### Step 4 - Read through the review lenses

Apply the lenses in [recurring-themes.md](recurring-themes.md): stage
contracts, setup/teardown lifecycle, dependency and optional-extra hygiene,
secret-safe logging, memory/serialization of large payloads, streaming and
performance, tutorials/docs, tests/coverage, and PR size/reviewability.

These build on the project's own always-on rules in `.cursor/rules/`
(`processing-stage-patterns`, `composite-stage-patterns`, `task-patterns`,
`executors`, `resources-configuration`, `pipeline-structure`,
`modality-structure`, `coding-standards`) - treat those as the canonical
contract and cite them when a change deviates.

### Step 5 - Generate the two files

```bash
.cursor/skills/review-curator-pr/scripts/build_digest.py <N> --today <YYYY-MM-DD>
```

The builder joins the pulled JSON and classifies each comment by thread status:
**OPEN** (actionable), **OUTDATED** (pre-dates the current head), **RESOLVED**.
The comment queue lists only OPEN root threads.

### Step 6 - Findings and replies

In the digest's findings section, classify each issue by severity:

- **P0** - merge blocker: data loss / crash on a valid config, a secret leaked
into logs, or an import that breaks a supported install.
- **P1** - fix before merge: undeclared runtime dependencies, missing required
tests/coverage, environment-specific code in a reusable path, a PR too large
to review safely.
- **P2** - should fix: duplicated logic, inefficient batch handling, incomplete
metrics/observability, documented-but-unwired config.
- **P3** - nice to have: temporary workarounds that need a test + comment,
debug-only knobs to document.

When drafting replies, cite the exact `path:line-range` on the current head,
state whether the point is already addressed (with the commit SHA) or will be,
and keep each reply self-contained. See [templates.md](templates.md) section C.

## Conventions to enforce (from CONTRIBUTING.md and rules)

- **DCO sign-off** is required on every commit (`git commit -s`).
- **Tests** must accompany source changes; GPU-only tests use
`@pytest.mark.gpu`; the project enforces 80% coverage under `nemo_curator/`,
and `tests/` mirrors the `nemo_curator/` layout.
- **NVIDIA Apache-2.0 copyright header** on every non-empty Python file.
- **Ruff** lint/format, 119-char lines; **loguru** for logging.
- Supported Python: 3.10-3.12.

## Knowledge sources

Full index in [reference.md](reference.md). Quick map:

| Need | Source |
|------|--------|
| Stage / pipeline contracts | `.cursor/rules/*.mdc`, `nemo_curator/stages/base.py` |
| Modality guides | `nemo_curator/stages/<modality>/README.md`, `tutorials/<modality>/README.md` |
| Backends / executors | `nemo_curator/backends/`, `nemo_curator/pipeline/pipeline.py` |
| Perf expectations | `benchmarking/` |
| Contribution rules | `CONTRIBUTING.md` |
| Live PR data | `gh` (see scripts) |
127 changes: 127 additions & 0 deletions .cursor/skills/review-curator-pr/recurring-themes.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,127 @@
# Review lenses - recurring themes in Curator PRs

Apply each lens to the diff and cite exact `path:line` evidence. These extend
the always-on rules in `.cursor/rules/`; when a change violates a rule, link the
rule by name. Most points are modality-agnostic; audio examples are
illustrative.

## Stage contracts and reuse

- A stage subclasses `ProcessingStage[InputTask, OutputTask]` with explicit
`inputs()` / `outputs()`, a meaningful `name`, and appropriate `resources` /
`batch_size` (set as class attributes / dataclass fields, never as the
read-only `_name` / `_resources` / `_batch_size` properties).
- GPU or vectorizable work belongs in `process_batch()` with a real
`batch_size`; `__init__` stays lightweight (it runs on the driver).
- Reuse existing readers, writers, and shared model abstractions instead of
bespoke orchestration. Use `CompositeStage` to present several internal stages
as one user-facing stage.
- If `inputs()` does not declare a required key, runtime validation will not
check it, and a direct `task.data[key]` access then raises an opaque
`KeyError` instead of a clear validation error. Declare required keys.
- Fan-out stages (one task -> many) need deterministic child IDs and propagated
`_metadata` / `_stage_perf`. First-stage discovery from an empty input should
constrain placement so work is not duplicated across every worker.

## Setup / teardown lifecycle and dependencies

- `setup_on_node()` = once-per-node prep (e.g. model downloads / shared cache).
`setup()` = per-worker init (model load, GPU allocation). `teardown()` frees
resources. Don't conflate them.
- Import heavy or optional dependencies lazily inside `setup()` (or behind a
guarded import), never unconditionally at module top level, when the dependency
ships only with an optional extra. A top-level `import <optional_pkg>` breaks
`import nemo_curator...` on any install without that extra.
- Declare the full runtime dependency set in the appropriate `pyproject.toml`
optional extra; add an import smoke test under that extra. Don't rely on an
external Docker image to supply dependencies a reusable stage needs.
- Avoid hard version pins (`pkg==x.y.z`) and global overrides for libraries
shared across modalities (e.g. `transformers`, `huggingface-hub`,
`accelerate`); prefer compatible ranges so other extras don't break.

## Secret-safe logging

- Never log a full resolved config object; it can contain tokens / credentials
(e.g. an HF token, access keys). Redact secret-like keys before logging, or
log only a non-secret subset, and add a test asserting the secret value never
appears in logged output.

## Memory and serialization of large payloads

- Drop large in-memory payloads (waveform arrays, image/video tensors) from
`task.data` once a stage no longer needs them. Leaked arrays reach downstream
JSON writers and raise `TypeError: Object of type ndarray is not JSON
serializable` (which can fail an entire batch) or cause memory blowups. Strip
by key plus a duck-typed guard, and wrap `json.dumps` so a stray
non-serializable value reports the offending key instead of crashing.
- Guard external I/O edge cases (e.g. `tarfile.extractfile()` returns `None` for
non-regular members even after `isfile()`; check before `.read()`).

## Streaming, performance, benchmarking

- Streaming execution with per-stage resource configuration is the default model
for scalable pipelines (see `executors.mdc`).
- Emit stage metrics through the supported metrics path; keep records
JSON-shaped. For multi-node analysis, include worker/node/GPU identity and
latency distributions (p50/p95/p99), not just totals/averages.
- Writers that loop per task inside `process_batch()` reopen output files
repeatedly; group by output target and open once per batch.
- Standalone benchmark scripts belong in the `benchmarking/` flow with config
entries and comparable parameters - not freestanding.

## Environment-specific code in reusable paths

- Keep infrastructure assumptions (specific object-store URLs, auth env vars,
cluster-specific paths) out of generic reader/writer code. Make remote access
pluggable: accept local paths, ordinary cloud URIs, or a user-provided
opener/resolver, and isolate any environment-specific support behind a small,
documented, optional component.

## Tutorials and docs

- Tutorials must be runnable, minimal, and use public APIs only - no private
APIs, no machine-local absolute paths, no committed noisy notebook output, no
executor mismatch between a notebook and its CLI. Prefer `uv` for environment
setup if the tutorial standard does.
- Don't duplicate pipeline-construction logic between a tutorial entrypoint and
an `examples/` script; extract one shared builder both call.
- Every config key documented for users must actually reach a stage; validate
config-to-stage mapping and fail (or warn) on unused keys.
- Don't commit generated analysis/scratch docs into the source tree.

## Tests and standards (CONTRIBUTING + coding-standards)

- Source changes need accompanying tests; `tests/` mirrors `nemo_curator/`. The
project enforces 80% coverage under `nemo_curator/`. GPU-only tests use
`@pytest.mark.gpu`.
- Every non-empty Python file carries the NVIDIA Apache-2.0 copyright header.
- Ruff clean at 119-char lines; type annotations on functions; loguru for logs.
- Every commit is DCO signed-off.

## PR size and reviewability

- Large multi-purpose PRs are hard to review and to defend. Prefer splitting by
stage / module / tutorial / benchmark. If a single PR must remain, organize it
as if sliced: no duplicated builder logic, clear module ownership, and a
reviewable commit sequence.

## Smaller recurring nits

- Missing copyright headers on new files.
- Top-level imports that should be lazy/guarded.
- Long inline strings (prompts, configs) that should be module constants.
- Dataclass fields validated in `__post_init__` that could just be typed.
- Dead conditionals / redundant truthiness checks.
- Tests added for tutorials that don't warrant them.
- Debug-only knobs presented as features - document them as debug throttles.
- `trust_remote_code=True` (or similar) hardcoded - expose as a parameter so
security-conscious users can opt out.

## Severity mapping

| Severity | Meaning |
|----------|---------|
| P0 | Merge blocker: crash/data-loss on a valid config, secret leaked to logs, import breaks a supported install |
| P1 | Fix before merge: undeclared deps, missing required tests/coverage, env-specific code in reusable path, PR too large |
| P2 | Should fix: duplicated logic, inefficient batching, incomplete metrics, documented-but-unwired config |
| P3 | Nice to have: temporary workaround needing test+comment, debug knob docs |
86 changes: 86 additions & 0 deletions .cursor/skills/review-curator-pr/reference.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
# Reference - NeMo Curator PR review

All paths are relative to the repository root. The skill is self-contained: it
relies only on this repository plus the GitHub CLI (`gh`).

## Canonical contracts (`.cursor/rules/`)

These always-on rules are the authoritative description of how stages and
pipelines must be written. Cite them when a PR deviates.

| Rule | Covers |
|------|--------|
| `processing-stage-patterns.mdc` | `ProcessingStage` subclassing, `inputs()`/`outputs()`, `process()`, lifecycle hooks |
| `composite-stage-patterns.mdc` | `CompositeStage` decomposition |
| `task-patterns.mdc` | `Task` / typed task subclasses, metadata flow |
| `executors.mdc` | Xenna / Ray Data backends and when to use each |
| `resources-configuration.mdc` | `Resources`, `batch_size`, GPU allocation |
| `pipeline-structure.mdc` | Pipeline assembly |
| `modality-structure.mdc` | Where modality code/tests/tutorials live |
| `coding-standards.mdc` | Ruff, copyright header, loguru, pytest, Python versions |

## Framework code (diff ground truth)

- `nemo_curator/stages/base.py` - `ProcessingStage` base class.
- `nemo_curator/backends/` - `base.py`, `xenna/`, `ray_data/`,
`ray_actor_pool/`; adapters define how stages run on each backend.
- `nemo_curator/pipeline/pipeline.py` - pipeline construction and guards.
- `nemo_curator/tasks/` - task types.

## Modality guides

| Modality | Stage README | Tutorials |
|----------|--------------|-----------|
| audio | `nemo_curator/stages/audio/README.md` | `tutorials/audio/` |
| interleaved | `nemo_curator/stages/interleaved/README.md` | `tutorials/interleaved/` |
| text / image / video / math / synthetic | (stage code under `nemo_curator/stages/`) | `tutorials/{text,image,video,math,synthetic}/` |

## Performance / benchmarking

- `benchmarking/README.md` - how benchmarks are wired and run.
- `benchmarking/AUDIO_PROFILING.md`, `benchmarking/ALM_BENCHMARK.md` - modality
perf expectations. Standalone benchmark scripts should be integrated here, not
left freestanding.

## Contribution rules

- `CONTRIBUTING.md` - DCO sign-off (`git commit -s`), PR process, test
requirements, coverage threshold, copyright header.

## GitHub data via `gh`

`scripts/pr_review_pull.sh <N>` pulls these into the scratch outdir:

| File | Endpoint |
|------|----------|
| `pr<N>_gh_latest.json` | `gh pr view <N> --json ...` |
| `pr<N>_reviews_latest.json` | `repos/:owner/:repo/pulls/<N>/reviews` |
| `pr<N>_review_comments_latest.json` | `repos/:owner/:repo/pulls/<N>/comments` (inline) |
| `pr<N>_issue_comments_latest.json` | `repos/:owner/:repo/issues/<N>/comments` |
| `pr<N>_files_latest.json` | `repos/:owner/:repo/pulls/<N>/files` |
| `pr<N>_commits_latest.json` | `repos/:owner/:repo/pulls/<N>/commits` |
| `pr<N>_review_threads_latest.json` | GraphQL `pullRequest.reviewThreads` (isResolved/isOutdated) |

The REST inline-comments endpoint does not expose resolve/outdate state; the
GraphQL thread payload does. The builder joins them by comment `databaseId`
(== REST `id`), falling back to a `(path, body-prefix)` match when an older
thread dump lacks `databaseId`.

GraphQL query used by the pull script:

```graphql
query($owner:String!,$repo:String!,$pr:Int!){
repository(owner:$owner,name:$repo){
pullRequest(number:$pr){
reviewThreads(first:100){ nodes{
id isResolved isOutdated isCollapsed line originalLine path
comments(first:50){ nodes{ databaseId } }
}}}}}
```

## Scripts

- `scripts/pr_review_pull.sh <N> [--outdir DIR] [--repo OWNER/REPO]` - fetch.
- `scripts/build_digest.py <N> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]` - render the digest + comment queue.

Default outdir: `.curator-pr-review/` (scratch; safe to delete or gitignore).
32 changes: 32 additions & 0 deletions .cursor/skills/review-curator-pr/scripts/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
# Scripts - review-curator-pr

Both require the GitHub CLI (`gh`) authenticated against github.com. They write
to a scratch directory (default `.curator-pr-review/`) that is safe to delete;
do not commit its contents.

## pr_review_pull.sh

```bash
.cursor/skills/review-curator-pr/scripts/pr_review_pull.sh <PR_NUMBER> [--outdir DIR] [--repo OWNER/REPO]
```

Pulls six REST endpoints (`pr view`, `reviews`, inline `comments`, issue
`comments`, `files`, `commits`) plus the GraphQL review threads into
`pr<N>_*_latest.json` (and timestamped snapshots for delta analysis).

## build_digest.py

```bash
.cursor/skills/review-curator-pr/scripts/build_digest.py <PR_NUMBER> [--outdir DIR] [--today YYYY-MM-DD] [--prev-head SHA] [--baseline-ts TS]
```

Joins the `pr<N>_*_latest.json` files and writes
`curator_pr<N>_fresh_review_<date>.md` + `curator_pr<N>_github_comment_queue_<date>.md`.

- `--prev-head` sets the SHA in the "commits since prior reviewed head" header.
- `--baseline-ts <TS>` (a prior pull's timestamp suffix) marks
comments/reviews/commits NEW vs already-seen.

The thread join uses comment `databaseId` when present, else a
(path, body-prefix) fallback, so both GraphQL thread-dump shapes classify
correctly.
Loading
Loading