NeMo Curator 26.04 (v1.2.0)
⚠️ Python 3.10 support ends in 26.06. This is the last release to support Python 3.10 — upgrade environments to 3.11+ before the next release.
Highlights
- vLLM & Sentence Transformers embeddings — new VLLMEmbeddingModelStage and SentenceTransformerEmbeddingModelStage; EmbeddingCreatorStage gains use_sentence_transformer and cache_dir.
- Inference Server (Ray Serve) — new InferenceServer / InferenceModelConfig to serve OpenAI-compatible LLMs inside a Ray cluster; new inference_server and sdg_cuda12 extras.
- RayDataExecutor promoted out of experimental → nemo_curator.backends.ray_data.
- Semantic dedup defaults to vLLM with google/embeddinggemma-300m (was SentenceTransformers + all-MiniLM-L6-v2).
- Per-stage runtime environments — declare runtime_env on a ProcessingStage for isolated Python deps per stage.
- Cosmos-Xenna 0.2.0 — simplified Resources API (gpu_memory_gb or gpus; nvdecs/nvencs/entire_gpu removed); Ray ≥ 2.54.
- Multi-node Ray on SLURM — drop-in SlurmRayClient + reference tutorials (container & bare-metal).
- NeMo Data Designer integration — new DataDesignerStage + NDD-backed Nemotron-CC stages.
- Megatron tokenization writer — produce Megatron-LM .bin/.idx directly from a Curator pipeline.
- Audio overhaul — AudioBatch → AudioTask; new VAD / Band / SIGMOS / UTMOS / Speaker-Separation stages, AudioDataFilterStage composite, streaming Sortformer diarization, ALM pipeline, DNS Challenge ReadSpeech tutorial.
- Video — Nemotron Nano 12B V2 VLM captioning (bf16/fp8/nvfp4); fused DocumentIterateExtractStage for 3-stage acquisition pipelines.
- Interleaved IO — InterleavedParquetReader + InterleavedWebdatasetWriter close the WDS ⇄ Parquet round-trip; four new filters (blur, QR-code, CLIP-score, image/text ratio).
- PDF pipeline — four-stage Nemotron-Parse Xenna pipeline (pypdfium2 dep added).
- CommonCrawl S3 transport — opt-in via use_s3=True / CC_USE_S3.
- Workflow results API — all dedup workflows now return WorkflowRunResult with structured per-stage metadata.
- Multi-user metrics isolation — per-UID metrics dirs, PID-based tracking, auto Ray dashboards.
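Because RayDataExecutor moved out of the experimental package in this release, code that has to run against both older and newer installs can try the new module path first and fall back to the old one. The following is a minimal sketch that assumes only the two module paths named in these notes; `import_first_available` and `load_ray_data_backend` are hypothetical helpers, not part of NeMo Curator.

```python
import importlib
from types import ModuleType


def import_first_available(*module_names: str) -> ModuleType:
    """Return the first module in module_names that imports successfully."""
    errors = []
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # ModuleNotFoundError is a subclass of ImportError
            errors.append(f"{name}: {exc}")
    raise ImportError(
        "none of the candidate modules could be imported: " + "; ".join(errors)
    )


def load_ray_data_backend() -> ModuleType:
    # Try the 26.04 location first, then the pre-26.04 experimental location.
    return import_first_available(
        "nemo_curator.backends.ray_data",
        "nemo_curator.backends.experimental.ray_data",
    )
```

Assuming the class is exposed as an attribute of that module, it could then be fetched with `load_ray_data_backend().RayDataExecutor`. Once every environment is on 26.04 or later, a direct import from the new path is simpler and the shim can be dropped.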
Security
Breaking Changes
- Minimum Ray 2.54 (was 2.50).
- TextSemanticDeduplicationWorkflow: default backend is now vLLM; default model is google/embeddinggemma-300m; removed embedding_model_inference_batch_size, embedding_pooling, embedding_padding_side, embedding_max_seq_length — use embedding_vllm_init_kwargs.
- Resources: removed nvdecs, nvencs, entire_gpu — use gpus or gpu_memory_gb.
- AudioBatch removed → use AudioTask (single dict, not list[dict]).
- RayDataExecutor moved: nemo_curator.backends.experimental.ray_data → nemo_curator.backends.ray_data.
- DocumentExtractStage removed; DocumentIterateStage replaced by DocumentIterateExtractStage. Data acquisition is now 3 stages (URL gen → download → iterate-extract).
- Dedup workflow run() returns: ExactDeduplicationWorkflow, FuzzyDeduplicationWorkflow (was None), SemanticDeduplicationWorkflow, TextSemanticDeduplicationWorkflow (was dict), TextDuplicatesRemovalWorkflow (was list[FileGroupTask] | None) — all now return WorkflowRunResult.
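When upgrading, it can help to scan existing configuration for keyword arguments that this release removed. The sketch below is a standalone illustration, not a NeMo Curator utility: the removed names and their replacements are copied from the list above, while `REMOVED_KWARGS` and `find_removed_kwargs` are hypothetical names introduced here.

```python
# Removed keyword names and their suggested replacements, taken from the
# Breaking Changes list. The table and helper are illustrative only.
REMOVED_KWARGS = {
    "TextSemanticDeduplicationWorkflow": {
        "embedding_model_inference_batch_size": "embedding_vllm_init_kwargs",
        "embedding_pooling": "embedding_vllm_init_kwargs",
        "embedding_padding_side": "embedding_vllm_init_kwargs",
        "embedding_max_seq_length": "embedding_vllm_init_kwargs",
    },
    "Resources": {
        "nvdecs": "gpus or gpu_memory_gb",
        "nvencs": "gpus or gpu_memory_gb",
        "entire_gpu": "gpus or gpu_memory_gb",
    },
}


def find_removed_kwargs(target: str, kwargs: dict) -> list[str]:
    """Return one message per keyword in kwargs that 26.04 removed from target."""
    removed = REMOVED_KWARGS.get(target, {})
    return [
        f"{target}: '{name}' was removed; use {removed[name]}"
        for name in kwargs
        if name in removed
    ]
```

For example, `find_removed_kwargs("Resources", {"gpus": 1, "entire_gpu": True})` returns a single message about `entire_gpu`. The check only looks at keyword names and does not validate values or call into the library.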
Deprecations
- Python 3.10 — last supported in 26.04; removed in 26.06.
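To surface the Python 3.10 deprecation early, a pipeline entry point can check the interpreter version at startup. This is a minimal sketch using only the standard library; `MIN_PYTHON_AFTER_26_04` and `python_is_supported` are hypothetical names, and the 3.11 threshold comes from the notice above.

```python
import sys

# Minimum interpreter version once Python 3.10 support is removed (per these notes).
MIN_PYTHON_AFTER_26_04 = (3, 11)


def python_is_supported(version=None, minimum=MIN_PYTHON_AFTER_26_04) -> bool:
    """Return True if version (major, minor, ...) meets the minimum."""
    if version is None:
        version = sys.version_info
    # Compare only (major, minor); patch level does not matter here.
    return tuple(version[:2]) >= minimum


if not python_is_supported():
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} detected; "
        "upgrade to 3.11+ before moving to 26.06.",
        file=sys.stderr,
    )
```

The check prints a warning rather than exiting, since 26.04 itself still supports Python 3.10.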
Full Release Notes
docs.nvidia.com/nemo/curator/v26.04/about/release-notes