NVIDIA NeMo Curator 1.2.0

@chtruong814 released this 14 May 21:43
f07fa0e

NeMo Curator 26.04 (v1.2.0)

⚠️ Python 3.10 support ends in 26.06. This is the last release to support Python 3.10 — upgrade environments to 3.11+ before the next release.

Highlights

  • vLLM & Sentence Transformers embeddings — new VLLMEmbeddingModelStage and SentenceTransformerEmbeddingModelStage; EmbeddingCreatorStage gains use_sentence_transformer and cache_dir.
  • Inference Server (Ray Serve) — new InferenceServer / InferenceModelConfig to serve OpenAI-compatible LLMs inside a Ray cluster; new inference_server and sdg_cuda12 extras.
  • RayDataExecutor promoted out of experimental → nemo_curator.backends.ray_data.
  • Semantic dedup defaults to vLLM with google/embeddinggemma-300m (was SentenceTransformers + all-MiniLM-L6-v2).
  • Per-stage runtime environments — declare runtime_env on a ProcessingStage for isolated Python deps per stage.
  • Cosmos-Xenna 0.2.0 — simplified Resources API (gpu_memory_gb or gpus; nvdecs/nvencs/entire_gpu removed); Ray ≥ 2.54.
  • Multi-node Ray on SLURM — drop-in SlurmRayClient + reference tutorials (container & bare-metal).
  • NeMo Data Designer integration — new DataDesignerStage + NDD-backed Nemotron-CC stages.
  • Megatron tokenization writer — produce Megatron-LM .bin/.idx directly from a Curator pipeline.
  • Audio overhaul: AudioBatch → AudioTask; new VAD / Band / SIGMOS / UTMOS / Speaker-Separation stages, AudioDataFilterStage composite, streaming Sortformer diarization, ALM pipeline, DNS Challenge ReadSpeech tutorial.
  • Video — Nemotron Nano 12B V2 VLM captioning (bf16/fp8/nvfp4); fused DocumentIterateExtractStage for 3-stage acquisition pipelines.
  • Interleaved IO: InterleavedParquetReader + InterleavedWebdatasetWriter close the WDS ⇄ Parquet round-trip; four new filters (blur, QR-code, CLIP-score, image/text ratio).
  • PDF pipeline — four-stage Nemotron-Parse Xenna pipeline (pypdfium2 dep added).
  • CommonCrawl S3 transport — opt-in via use_s3=True / CC_USE_S3.
  • Workflow results API — all dedup workflows now return WorkflowRunResult with structured per-stage metadata.
  • Multi-user metrics isolation — per-UID metrics dirs, PID-based tracking, auto Ray dashboards.
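
Because RayDataExecutor changed import location in this release, code that must run against both older and newer installs may want a small import shim. The following is a minimal sketch, not part of NeMo Curator: `resolve_first` is a hypothetical helper, and it assumes the class is importable by name from the two module paths listed in these notes.

```python
import importlib
from typing import Any, Optional, Sequence


def resolve_first(module_paths: Sequence[str], attr: str) -> Optional[Any]:
    """Return `attr` from the first importable module in `module_paths`, else None."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            continue  # module missing in this install; try the next location
        if hasattr(module, attr):
            return getattr(module, attr)
    return None


# New location first, pre-1.2.0 experimental location as a fallback.
# Assumption: RayDataExecutor is exposed by name in both modules.
RAY_DATA_MODULES = (
    "nemo_curator.backends.ray_data",
    "nemo_curator.backends.experimental.ray_data",
)

RayDataExecutor = resolve_first(RAY_DATA_MODULES, "RayDataExecutor")
if RayDataExecutor is None:
    print("RayDataExecutor not found; is nemo_curator installed?")
```

Once every environment is on 1.2.0 or later, a plain import from the new module path is simpler and the shim can be dropped.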

Security

Breaking Changes

  • Minimum Ray 2.54 (was 2.50).
  • TextSemanticDeduplicationWorkflow: default backend is now vLLM; default model is google/embeddinggemma-300m; removed embedding_model_inference_batch_size, embedding_pooling, embedding_padding_side, embedding_max_seq_length — use embedding_vllm_init_kwargs.
  • Resources: removed nvdecs, nvencs, entire_gpu — use gpus or gpu_memory_gb.
  • AudioBatch removed → use AudioTask (single dict, not list[dict]).
  • RayDataExecutor moved: nemo_curator.backends.experimental.ray_data → nemo_curator.backends.ray_data.
  • DocumentExtractStage removed; DocumentIterateStage replaced by DocumentIterateExtractStage. Data acquisition is now 3 stages (URL gen → download → iterate-extract).
  • Dedup workflow run() return types changed: ExactDeduplicationWorkflow, FuzzyDeduplicationWorkflow (was None); SemanticDeduplicationWorkflow, TextSemanticDeduplicationWorkflow (was dict); TextDuplicatesRemovalWorkflow (was list[FileGroupTask] | None). All now return WorkflowRunResult.
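
When migrating, it can help to scan existing call sites for the keyword arguments removed above. The sketch below is a standalone helper, not a NeMo Curator API: `REMOVED_KWARGS` and `find_removed_kwargs` are hypothetical names, and the table simply restates the removals and replacements listed in these notes.

```python
from typing import Dict, List, Mapping

# Removed keyword arguments and their suggested replacements, per the list above.
REMOVED_KWARGS: Dict[str, Dict[str, str]] = {
    "Resources": {
        "nvdecs": "gpus or gpu_memory_gb",
        "nvencs": "gpus or gpu_memory_gb",
        "entire_gpu": "gpus or gpu_memory_gb",
    },
    "TextSemanticDeduplicationWorkflow": {
        "embedding_model_inference_batch_size": "embedding_vllm_init_kwargs",
        "embedding_pooling": "embedding_vllm_init_kwargs",
        "embedding_padding_side": "embedding_vllm_init_kwargs",
        "embedding_max_seq_length": "embedding_vllm_init_kwargs",
    },
}


def find_removed_kwargs(target: str, kwargs: Mapping[str, object]) -> List[str]:
    """List one message per keyword in `kwargs` that was removed for `target`."""
    removed = REMOVED_KWARGS.get(target, {})
    return [
        f"{target}: '{name}' was removed; use {removed[name]}"
        for name in kwargs
        if name in removed
    ]


# Example: keyword arguments from a hypothetical pre-1.2.0 call site.
old_call = {"gpus": 1, "entire_gpu": True}
for message in find_removed_kwargs("Resources", old_call):
    print(message)
```

The helper only inspects names; it does not translate old values into the new arguments, which still has to be done by hand.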

Deprecations

  • Python 3.10 — last supported in 26.04; removed in 26.06.
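
To catch affected environments ahead of 26.06, a startup check along these lines may be useful. It is a minimal sketch using only the standard library; the function and constant names are hypothetical, and the 3.11 floor is taken from the warning at the top of these notes.

```python
import sys
from typing import Tuple

# Minimum Python version once 3.10 support is removed (per these notes: 3.11+).
MIN_PYTHON_AFTER_2604: Tuple[int, int] = (3, 11)


def supported_after_2604(version: Tuple[int, ...]) -> bool:
    """True if the (major, minor, ...) version meets the post-26.04 minimum."""
    return tuple(version[:2]) >= MIN_PYTHON_AFTER_2604


if not supported_after_2604(sys.version_info):
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} "
        "will not be supported after NeMo Curator 26.04; upgrade to 3.11+."
    )
```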

Full Release Notes

docs.nvidia.com/nemo/curator/v26.04/about/release-notes