feat: Add DataChef recipe generation integration (Issue #1760) by anushkagupta200615-jpg · Pull Request #2095 · NVIDIA-NeMo/Curator

anushkagupta200615-jpg · 2026-06-22T14:28:41Z

Description

Usage

# Add snippet demonstrating usage

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Resolves #1760

Description
Adds integration with the DataChef LLM (arXiv:2602.11089) to dynamically generate end-to-end NeMo Curator pipeline specifications from a target benchmark and a base model.

Changes

nemo_curator/recipe/datachef.py: Implements the DataChefRecipeGenerator class. It manages API payload construction, YAML parsing, dry-run configuration validation via NeMo Curator pipelines, and fallback configuration generation.
tutorials/datachef_math_recipe.py: Adds an end-to-end tutorial script demonstrating math-specialization recipe creation.
tests/test_datachef.py: Adds pytest integration tests that use pytest-httpserver to cover the generation, validation, and fallback paths.

Notes for Reviewers

The proxy reward mechanism currently only logs a placeholder evaluation. It can be extended once the actual metric function is integrated into the repo.
The fallback logic is configured to use the FineMathClassifier stage from nemo_curator.stages.math as a best-practice template for math domains when the DataChef API is unreachable.

copy-pr-bot · 2026-06-22T14:28:45Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-06-22T14:32:38Z

Greptile Summary

This PR adds a DataChefRecipeGenerator class under a new nemo_curator/recipe module. It calls an external LLM API (DataChef, arXiv:2602.11089) to generate NeMo Curator pipeline YAML configs dynamically based on a target benchmark and base model, validates the config by instantiating stages via Hydra, and falls back to a hardcoded FineMathClassifier template when the API is unreachable.

nemo_curator/recipe/datachef.py: Core generator class with API call, YAML validation via create_pipeline_from_yaml, a proxy reward stub, and a math-specific fallback config.
tests/test_datachef.py: Two integration tests using pytest-httpserver covering the happy path and the HTTP 500 fallback case; module-level sys.platform mutation is not guarded with try/finally.
tutorials/datachef_math_recipe.py: End-to-end tutorial script demonstrating math-specialization recipe generation.

Confidence Score: 4/5

The new recipe module works for the happy path, but the fallback always returns a math-specific config regardless of the requested benchmark, and requests is not a declared core dependency — these are functional gaps that should be resolved before merging.

The _get_fallback_config method unconditionally returns a FineMathClassifier pipeline template regardless of target_benchmark, so any caller targeting a non-math benchmark who hits an API failure silently receives a wrong-domain config with no warning. Additionally, requests is imported at the module's top level but is only available as a transitive dependency via transformers; a future slim install would break the import.

nemo_curator/recipe/datachef.py — fallback config domain mismatch and missing requests dependency declaration.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_curator/recipe/datachef.py | New DataChefRecipeGenerator class that calls an external LLM API to generate pipeline YAML, validates it, and falls back to a hardcoded FineMathClassifier template. Several concerns: `requests` is not a declared core dependency, fallback config hardcodes a math-domain classifier regardless of `target_benchmark`, `_evaluate_proxy_reward` is a non-functional stub, and the license header is missing. |
| tests/test_datachef.py | Integration tests using pytest-httpserver. Module-level sys.platform mutation is not wrapped in try/finally, which can corrupt the test process state if the import fails. Tests do not mock `_validate_config`, so both tests will attempt to fully instantiate FineMathClassifier stages at test runtime. |
| nemo_curator/recipe/`__init__.py` | Minimal `__init__.py` exporting DataChefRecipeGenerator. No issues. |
| tutorials/datachef_math_recipe.py | Tutorial script demonstrating DataChefRecipeGenerator usage. Straightforward; no issues. |
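
The note on tests/test_datachef.py implies the tests could be isolated from stage instantiation. A minimal sketch of that idea with unittest.mock follows. `GeneratorStandIn` is a hypothetical stand-in written for this illustration (the real constructor and internals of DataChefRecipeGenerator are not shown in this thread); only the method name `_validate_config` comes from the PR.

```python
from unittest import mock


class GeneratorStandIn:
    """Hypothetical stand-in; not the PR's DataChefRecipeGenerator."""

    def _validate_config(self, yaml_config: str) -> None:
        # In the PR this step instantiates every stage; simulate that cost
        # by failing loudly if the real method is ever reached.
        raise RuntimeError("would instantiate FineMathClassifier")

    def generate(self) -> str:
        yaml_config = "stages: []"
        self._validate_config(yaml_config)
        return yaml_config


def generate_with_validation_mocked() -> tuple[str, mock.MagicMock]:
    """Run generate() with _validate_config replaced by a no-op mock."""
    with mock.patch.object(
        GeneratorStandIn, "_validate_config", return_value=None
    ) as validate:
        result = GeneratorStandIn().generate()
    return result, validate
```

Patching at the class level keeps the test focused on the generation and fallback paths; a separate, explicitly marked test could still exercise the real validation where the math extras are installed.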

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Caller
    participant DataChefRecipeGenerator
    participant DataChefAPI
    participant create_pipeline_from_yaml
    participant Pipeline

    Caller->>DataChefRecipeGenerator: generate()
    DataChefRecipeGenerator->>DataChefAPI: POST /v1/generate (payload)
    alt API success
        DataChefAPI-->>DataChefRecipeGenerator: "{"config_yaml": "..."}"
    else API failure / missing field
        DataChefAPI-->>DataChefRecipeGenerator: error / 5xx
        DataChefRecipeGenerator->>DataChefRecipeGenerator: _get_fallback_config()
        Note right of DataChefRecipeGenerator: Always returns FineMathClassifier YAML
    end
    DataChefRecipeGenerator->>create_pipeline_from_yaml: _validate_config(yaml_config)
    create_pipeline_from_yaml->>create_pipeline_from_yaml: OmegaConf.create(yaml)
    create_pipeline_from_yaml->>create_pipeline_from_yaml: hydra.utils.instantiate() per stage
    create_pipeline_from_yaml-->>Pipeline: Pipeline instance
    Pipeline->>Pipeline: build() — decompose stages
    Pipeline-->>DataChefRecipeGenerator: validated
    DataChefRecipeGenerator->>DataChefRecipeGenerator: _evaluate_proxy_reward() [stub]
    DataChefRecipeGenerator-->>Caller: yaml_config (str)
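
The control flow in the diagram can be sketched in a few lines. This is an illustration under stated assumptions, not the PR's implementation: `generate_recipe` and its three callables are hypothetical names, and the network call is injected so the sketch runs without `requests`. Catching `OSError` is meant to stand in for network-level failures (as far as I know, `requests.RequestException` derives from it), while `KeyError` covers a response with no `config_yaml` field.

```python
from typing import Callable, Mapping


def generate_recipe(
    fetch: Callable[[], Mapping[str, str]],
    fallback: Callable[[], str],
    validate: Callable[[str], None],
) -> str:
    """Return the API's config_yaml, or the fallback if the call fails."""
    try:
        yaml_config = fetch()["config_yaml"]
    except (OSError, KeyError):
        # OSError: network-level failure. KeyError: the response
        # did not contain the "config_yaml" field.
        yaml_config = fallback()
    # Both branches pass through validation, as in the diagram.
    validate(yaml_config)
    return yaml_config
```

Because the fallback result is validated too, anything expensive in validation is also paid on the failure path.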

Reviews (2): Last reviewed commit: "Merge branch 'main' into feature/datache..."

greptile-apps · 2026-06-22T14:32:43Z

+        return yaml_config
+
+    def _validate_config(self, yaml_config: str) -> None:
+        """
+        Runs a dry-run of the generated pipeline to ensure the YAML is valid 
+        and can be built into an executable plan by NeMo Curator.
+        """
+        logger.info("Validating generated pipeline configuration...")
+        try:
+            cfg = OmegaConf.create(yaml_config)
+            # Create pipeline and disable logging out the full config each time
+            pipeline = create_pipeline_from_yaml(cfg, log_config=False)
+            if not isinstance(pipeline, Pipeline):
+                raise ValueError("Parsed configuration did not yield a Pipeline object.")


Validation is not a dry-run — it fully instantiates stage objects

create_pipeline_from_yaml calls hydra.utils.instantiate() on every stage config, which invokes each stage class's __init__. For ML stages like FineMathClassifier, this means loading a model, allocating memory, and potentially requiring optional GPU/CUDA dependencies just to "validate" a config string. The comment "Build the pipeline execution plan (dry-run)" on pipeline.build() is misleading — stage constructors have already run before build() is called. Any user calling generate() when the API is unreachable will silently trigger a full FineMathClassifier model load, which can fail if math-specific extras aren't installed or exhaust memory in a lightweight environment.
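
One lighter-weight alternative is to check only that each `_target_` resolves to an importable class, without calling any constructor. The sketch below assumes the YAML has already been parsed into a list of stage dicts; `unresolved_targets` is a hypothetical helper, not part of NeMo Curator or Hydra. Importing a module still runs its module-level code and may require optional dependencies, so this reduces the cost rather than eliminating it.

```python
import importlib
import inspect


def unresolved_targets(stages: list[dict]) -> list[str]:
    """Return the `_target_` paths that do not resolve to an importable class."""
    missing = []
    for stage in stages:
        target = str(stage.get("_target_") or "")
        module_path, _, attr = target.rpartition(".")
        try:
            # Import the module and look up the attribute; never call it.
            obj = getattr(importlib.import_module(module_path), attr)
        except (ImportError, AttributeError, ValueError):
            # ValueError covers an empty or dot-less target path.
            missing.append(target)
            continue
        if not inspect.isclass(obj):
            missing.append(target)
    return missing
```

An empty result means every stage class can be located; it says nothing about whether the constructor arguments in the config are valid.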

greptile-apps · 2026-06-22T14:32:44Z

+    def _get_fallback_config(self) -> str:
+        """
+        Outputs a best-practice template config for the domain if DataChef is unavailable.
+        """
+        return '''
+stages:
+  - _target_: nemo_curator.stages.math.classifiers.finemath.FineMathClassifier
+'''


Fallback config ignores target_benchmark entirely

_get_fallback_config always returns the FineMathClassifier pipeline regardless of what target_benchmark was set to. A caller using target_benchmark="MMLU" or any non-math benchmark who hits an API failure will receive a math-domain-specific config with no indication that it is inappropriate for their use case. At minimum the fallback should log a warning that it is math-specific.

Suggested change

-    def _get_fallback_config(self) -> str:
-        """
-        Outputs a best-practice template config for the domain if DataChef is unavailable.
-        """
-        return '''
-stages:
-  - _target_: nemo_curator.stages.math.classifiers.finemath.FineMathClassifier
-'''
+    def _get_fallback_config(self) -> str:
+        """
+        Outputs a best-practice template config for the domain if DataChef is unavailable.
+        NOTE: Currently this template is only appropriate for math-domain benchmarks.
+        """
+        logger.warning(
+            "Falling back to a math-domain template (FineMathClassifier). "
+            "This may not be suitable for target_benchmark=%r.",
+            self.target_benchmark,
+        )
+        return '''
+stages:
+  - _target_: nemo_curator.stages.math.classifiers.finemath.FineMathClassifier
+'''
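
A stricter option than logging a warning would be to refuse to return a wrong-domain template at all. The sketch below is one possible shape, not something proposed in the PR: the `BENCHMARK_DOMAINS` mapping, its entries, and the `get_fallback_config` function are hypothetical, and only the FineMathClassifier target path is taken from the diff.

```python
# Domain templates; only the math entry exists in the PR.
FALLBACK_TEMPLATES = {
    "math": """
stages:
  - _target_: nemo_curator.stages.math.classifiers.finemath.FineMathClassifier
""",
}

# Hypothetical benchmark-to-domain mapping, for illustration only.
BENCHMARK_DOMAINS = {"gsm8k": "math", "math": "math"}


def get_fallback_config(target_benchmark: str) -> str:
    """Return a fallback template, or fail loudly if no domain matches."""
    domain = BENCHMARK_DOMAINS.get(target_benchmark.lower())
    if domain is None:
        raise LookupError(
            f"No fallback template for target_benchmark={target_benchmark!r}"
        )
    return FALLBACK_TEMPLATES[domain]
```

Raising turns a silent domain mismatch into an explicit failure the caller has to handle; whether that is preferable to a warning depends on how the generator is used.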

greptile-apps · 2026-06-22T14:32:45Z

@@ -0,0 +1,98 @@
+import logging
+import requests
+from typing import List


requests is not declared as a direct dependency

requests is imported at the top level but is not listed in [project].dependencies in pyproject.toml. It is currently available only as a transitive dependency (via transformers). Relying on transitive availability is fragile — a future update to transformers or a minimal install that avoids it could break the import silently. requests should be added to core dependencies or to a dedicated optional extra (e.g., recipe) alongside this module.
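
If the optional-extra route is taken, the module could defer the import and fail with an actionable message. A minimal sketch, assuming an extra named `recipe` as floated above; `require_module` is a hypothetical helper and the error wording is illustrative.

```python
import importlib
from types import ModuleType


def require_module(name: str, extra: str) -> ModuleType:
    """Import `name`, or raise an ImportError that names the missing extra."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"'{name}' is required for this feature; "
            f"install the optional '{extra}' extra."
        ) from exc
```

Calling `require_module("requests", "recipe")` inside the method that performs the HTTP call, rather than at module import time, keeps `import nemo_curator.recipe` working in a slim install.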

greptile-apps · 2026-06-22T14:32:46Z

+original_platform = sys.platform
+sys.platform = "linux"
+import nemo_curator
+sys.platform = original_platform


Module-level sys.platform mutation is unsafe for parallel test runs

sys.platform is overwritten to "linux" at import time and restored after the import nemo_curator call. If tests run in parallel (e.g. with pytest-xdist) or if another module under test reads sys.platform during collection while this value is active, it can produce incorrect behavior. The restore on line 15 also doesn't use a try/finally, so an unexpected import error could leave sys.platform permanently set to "linux" for the rest of the test process. Consider using unittest.mock.patch as a context manager instead.
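
The context-manager approach can be sketched as follows. `import_as_platform` is a hypothetical helper; `mock.patch.object` restores the original attribute on exit even if the import raises. This addresses the missing try/finally, but the override is still visible to anything else in the same process that reads `sys.platform` while the block is active.

```python
import importlib
import sys
from types import ModuleType
from unittest import mock


def import_as_platform(module_name: str, platform: str = "linux") -> ModuleType:
    """Import a module while sys.platform is temporarily overridden."""
    with mock.patch.object(sys, "platform", platform):
        # The original value is restored when the block exits,
        # whether the import succeeds or raises.
        return importlib.import_module(module_name)
```

In the test file this would replace the four module-level lines with a single call such as `nemo_curator = import_as_platform("nemo_curator")`.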

feat: Add DataChef recipe generation integration (Issue NVIDIA-NeMo#1760)

54eca05

anushkagupta200615-jpg requested a review from a team as a code owner June 22, 2026 14:28

anushkagupta200615-jpg requested review from praateekmahajan and removed request for a team June 22, 2026 14:28

github-actions Bot added the community-request label Jun 22, 2026

greptile-apps Bot reviewed Jun 22, 2026

svcnvidia-nemo-ci added the waiting-on-maintainers label Jun 24, 2026

Merge branch 'main' into feature/datachef-recipe-generation

61bf9f0

anushkagupta200615-jpg wants to merge 2 commits into NVIDIA-NeMo:main from anushkagupta200615-jpg:feature/datachef-recipe-generation
