NeMo Curator 26.04 (v1.2.0)

⚠️ Python 3.10 support ends in 26.06. This is the last release to support Python 3.10 — upgrade environments to 3.11+ before the next release.

Highlights

vLLM & Sentence Transformers embeddings — new VLLMEmbeddingModelStage and SentenceTransformerEmbeddingModelStage; EmbeddingCreatorStage gains use_sentence_transformer and cache_dir.
Inference Server (Ray Serve) — new InferenceServer / InferenceModelConfig to serve OpenAI-compatible LLMs inside a Ray cluster; new inference_server and sdg_cuda12 extras.
RayDataExecutor promoted out of experimental → nemo_curator.backends.ray_data.
Semantic dedup defaults to vLLM with google/embeddinggemma-300m (was SentenceTransformers + all-MiniLM-L6-v2).
Per-stage runtime environments — declare runtime_env on a ProcessingStage for isolated Python deps per stage.
Cosmos-Xenna 0.2.0 — simplified Resources API (gpu_memory_gb or gpus; nvdecs/nvencs/entire_gpu removed); Ray ≥ 2.54.
Multi-node Ray on SLURM — drop-in SlurmRayClient + reference tutorials (container & bare-metal).
NeMo Data Designer integration — new DataDesignerStage + NDD-backed Nemotron-CC stages.
Megatron tokenization writer — produce Megatron-LM .bin/.idx directly from a Curator pipeline.
Audio overhaul — AudioBatch → AudioTask; new VAD / Band / SIGMOS / UTMOS / Speaker-Separation stages, AudioDataFilterStage composite, streaming Sortformer diarization, ALM pipeline, DNS Challenge ReadSpeech tutorial.
Video — Nemotron Nano 12B V2 VLM captioning (bf16/fp8/nvfp4); fused DocumentIterateExtractStage for 3-stage acquisition pipelines.
Interleaved IO — InterleavedParquetReader + InterleavedWebdatasetWriter close the WDS ⇄ Parquet round-trip; four new filters (blur, QR-code, CLIP-score, image/text ratio).
PDF pipeline — four-stage Nemotron-Parse Xenna pipeline (pypdfium2 dep added).
CommonCrawl S3 transport — opt-in via use_s3=True / CC_USE_S3.
Workflow results API — all dedup workflows now return WorkflowRunResult with structured per-stage metadata.
Multi-user metrics isolation — per-UID metrics dirs, PID-based tracking, auto Ray dashboards.
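
Code that has to run against both 26.04 and earlier releases can absorb the RayDataExecutor move with an import fallback. The sketch below is illustrative only: `import_first` is a hypothetical helper, not part of NeMo Curator, and it assumes the class is importable directly from the two module paths named in these notes.

```python
import importlib


def import_first(attr_name, module_paths):
    """Return attr_name from the first module in module_paths that provides it."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            continue  # module not present in this release; try the next path
        if hasattr(module, attr_name):
            return getattr(module, attr_name)
    raise ImportError(f"{attr_name} not found in any of: {', '.join(module_paths)}")


# Module paths as listed in these notes: new location first, old location second.
RAY_DATA_EXECUTOR_PATHS = [
    "nemo_curator.backends.ray_data",
    "nemo_curator.backends.experimental.ray_data",
]

# Usage (requires nemo_curator to be installed):
# RayDataExecutor = import_first("RayDataExecutor", RAY_DATA_EXECUTOR_PATHS)
```

Listing the new path first means the fallback stops being exercised once every environment is on 26.04 or later, at which point it can be replaced with a plain import.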

Security

nemo-toolkit RCE (CVE-2025-33245, CVE-2025-33253) — bumped to >=2.7.2.
xgrammar DoS (CVE-2026-25048) — override to >=0.1.32.
jackson-core DoS (GHSA-72hv-8253-57qq) — Ray's bundled ray_dist.jar removed from the container image.
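
To confirm that an environment picked up the patched minimums above, installed versions can be compared against them. This is a minimal sketch, assuming the distributions are installed under the names `nemo-toolkit` and `xgrammar` as written in these notes. `parse_release`, `meets_minimum`, and `installed_meets` are hypothetical helpers, and the comparison looks only at the numeric release segments (pre-release suffixes are ignored), so a full version parser is preferable for anything beyond a quick check.

```python
import re
from importlib import metadata

# Minimum patched versions as stated in the Security section.
PATCHED_MINIMUMS = {"nemo-toolkit": "2.7.2", "xgrammar": "0.1.32"}


def parse_release(version):
    """Extract the leading dotted numeric part of a version string as a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def _pad(parts, width):
    """Right-pad a release tuple with zeros so 2.8 compares like 2.8.0."""
    return parts + (0,) * (width - len(parts))


def meets_minimum(installed, minimum):
    """True if installed >= minimum, comparing numeric release segments only."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    return _pad(a, width) >= _pad(b, width)


def installed_meets(dist_name, minimum):
    """Return True/False for an installed distribution, or None if it is absent."""
    try:
        return meets_minimum(metadata.version(dist_name), minimum)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name, minimum in PATCHED_MINIMUMS.items():
        print(name, ">=", minimum, "->", installed_meets(name, minimum))
```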

Breaking Changes

Minimum Ray 2.54 (was 2.50).
TextSemanticDeduplicationWorkflow: default backend is now vLLM; default model is google/embeddinggemma-300m; removed embedding_model_inference_batch_size, embedding_pooling, embedding_padding_side, embedding_max_seq_length — use embedding_vllm_init_kwargs.
Resources: removed nvdecs, nvencs, entire_gpu — use gpus or gpu_memory_gb.
AudioBatch removed → use AudioTask (single dict, not list[dict]).
RayDataExecutor moved: nemo_curator.backends.experimental.ray_data → nemo_curator.backends.ray_data.
DocumentExtractStage removed; DocumentIterateStage replaced by DocumentIterateExtractStage. Data acquisition is now 3 stages (URL gen → download → iterate-extract).
Dedup workflow run() now returns WorkflowRunResult for all workflows: ExactDeduplicationWorkflow and FuzzyDeduplicationWorkflow (was None); SemanticDeduplicationWorkflow and TextSemanticDeduplicationWorkflow (was dict); TextDuplicatesRemovalWorkflow (was list[FileGroupTask] | None).
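
When migrating call sites, a quick scan for the removed arguments listed above can flag stale keyword arguments before they fail at runtime. The helper below is hypothetical and not a NeMo Curator API: the removed names and replacement hints are copied from this section, while the lookup table keyed by class name is an illustrative choice.

```python
# Removed keyword arguments and replacement hints, copied from the 26.04 notes.
REMOVED_KWARGS = {
    "TextSemanticDeduplicationWorkflow": {
        "embedding_model_inference_batch_size": "embedding_vllm_init_kwargs",
        "embedding_pooling": "embedding_vllm_init_kwargs",
        "embedding_padding_side": "embedding_vllm_init_kwargs",
        "embedding_max_seq_length": "embedding_vllm_init_kwargs",
    },
    "Resources": {
        "nvdecs": "gpus or gpu_memory_gb",
        "nvencs": "gpus or gpu_memory_gb",
        "entire_gpu": "gpus or gpu_memory_gb",
    },
}


def find_removed_kwargs(target, kwargs):
    """Return {removed_name: replacement_hint} for stale kwargs passed to target."""
    removed = REMOVED_KWARGS.get(target, {})
    return {name: removed[name] for name in kwargs if name in removed}


# Example: report stale arguments in a pre-26.04 Resources call.
stale = find_removed_kwargs("Resources", {"gpus": 1, "entire_gpu": True})
for name, hint in stale.items():
    print(f"Resources: '{name}' was removed; use {hint}")
```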

Deprecations

Python 3.10 — last supported in 26.04; removed in 26.06.
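
A startup guard can surface this deprecation early in environments still on Python 3.10. This is a minimal sketch using only the standard library; `is_future_supported` and `warn_if_python_310` are hypothetical helpers, not something NeMo Curator ships.

```python
import sys
import warnings

# Per these notes, 26.06 requires Python 3.11 or newer.
MIN_FUTURE_PYTHON = (3, 11)


def is_future_supported(version_info=None):
    """True if the given (or running) interpreter meets the 26.06 minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_FUTURE_PYTHON


def warn_if_python_310(version_info=None):
    """Emit a FutureWarning and return True when the interpreter is too old."""
    if is_future_supported(version_info):
        return False
    warnings.warn(
        "Python 3.10 support is removed in NeMo Curator 26.06; upgrade to 3.11+.",
        FutureWarning,
        stacklevel=2,
    )
    return True
```

`FutureWarning` is used rather than `DeprecationWarning` because Python shows it to end users by default.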

Full Release Notes

docs.nvidia.com/nemo/curator/v26.04/about/release-notes
