feat: Add DataChef recipe generation integration (Issue #1760) #2095

Open
anushkagupta200615-jpg wants to merge 2 commits into NVIDIA-NeMo:main from anushkagupta200615-jpg:feature/datachef-recipe-generation

Conversation

@anushkagupta200615-jpg

Description

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Resolves #1760

Description
Adds integration with the DataChef LLM model (arXiv:2602.11089) to generate end-to-end NeMo Curator pipeline specifications dynamically, based on a target benchmark and base model.

Changes

  • nemo_curator/recipe/datachef.py: Implements the DataChefRecipeGenerator class. It manages API payload construction, YAML parsing, dry-run configuration validation via NeMo Curator pipelines, and fallback configuration generation.
  • tutorials/datachef_math_recipe.py: Adds an end-to-end tutorial script demonstrating math-specialization recipe creation.
  • tests/test_datachef.py: Adds pytest integration tests that use pytest-httpserver to cover generation, validation, and the fallback path.

Notes for Reviewers

  • The proxy reward mechanism currently only logs a placeholder evaluation. It can be extended once the actual metric function is integrated into the repo.
  • The fallback logic is configured to use the FineMathClassifier stage from nemo_curator.stages.math as a best-practice template for math domains when the DataChef API is unreachable.

@anushkagupta200615-jpg anushkagupta200615-jpg requested a review from a team as a code owner June 22, 2026 14:28
@anushkagupta200615-jpg anushkagupta200615-jpg requested review from praateekmahajan and removed request for a team June 22, 2026 14:28
@copy-pr-bot

copy-pr-bot Bot commented Jun 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented Jun 22, 2026

Greptile Summary

This PR adds a DataChefRecipeGenerator class under a new nemo_curator/recipe module. It calls an external LLM API (DataChef, arXiv:2602.11089) to generate NeMo Curator pipeline YAML configs dynamically based on a target benchmark and base model, validates the config by instantiating stages via Hydra, and falls back to a hardcoded FineMathClassifier template when the API is unreachable.

  • nemo_curator/recipe/datachef.py: Core generator class with API call, YAML validation via create_pipeline_from_yaml, a proxy reward stub, and a math-specific fallback config.
  • tests/test_datachef.py: Two integration tests using pytest-httpserver covering the happy path and the HTTP 500 fallback case; module-level sys.platform mutation is not guarded with try/finally.
  • tutorials/datachef_math_recipe.py: End-to-end tutorial script demonstrating math-specialization recipe generation.

Confidence Score: 4/5

The new recipe module works for the happy path, but the fallback always returns a math-specific config regardless of the requested benchmark, and requests is not a declared core dependency — these are functional gaps that should be resolved before merging.

The _get_fallback_config method unconditionally returns a FineMathClassifier pipeline template regardless of target_benchmark, so any caller targeting a non-math benchmark who hits an API failure silently receives a wrong-domain config with no warning. Additionally, requests is imported at the module's top level but is only available as a transitive dependency via transformers; a future slim install would break the import.

nemo_curator/recipe/datachef.py — fallback config domain mismatch and missing requests dependency declaration.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_curator/recipe/datachef.py | New DataChefRecipeGenerator class that calls an external LLM API to generate pipeline YAML, validates it, and falls back to a hardcoded FineMathClassifier template. Several concerns: requests is not a declared core dependency, fallback config hardcodes a math-domain classifier regardless of target_benchmark, _evaluate_proxy_reward is a non-functional stub, and the license header is missing. |
| tests/test_datachef.py | Integration tests using pytest-httpserver. Module-level sys.platform mutation is not wrapped in try/finally, which can corrupt the test process state if the import fails. Tests do not mock _validate_config, so both tests will attempt to fully instantiate FineMathClassifier stages at test runtime. |
| nemo_curator/recipe/__init__.py | Minimal __init__.py exporting DataChefRecipeGenerator. No issues. |
| tutorials/datachef_math_recipe.py | Tutorial script demonstrating DataChefRecipeGenerator usage. Straightforward; no issues. |
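
One way to address the note that the tests do not mock _validate_config is to patch that method for the duration of a test. The sketch below is only illustrative: StandInGenerator is a hypothetical stand-in written for this example, not the PR's DataChefRecipeGenerator, and its method bodies are assumptions.

```python
from unittest import mock


class StandInGenerator:
    """Hypothetical stand-in for DataChefRecipeGenerator; not the PR's class."""

    def _fetch_config(self) -> str:
        # The real class would call the DataChef API here.
        return "stages: []"

    def _validate_config(self, yaml_config: str) -> None:
        # Stands in for the expensive step that instantiates every stage.
        raise RuntimeError("heavy validation should not run in unit tests")

    def generate(self) -> str:
        yaml_config = self._fetch_config()
        self._validate_config(yaml_config)
        return yaml_config


def run_generate_without_validation() -> tuple[str, int]:
    """Call generate() with _validate_config replaced by a mock."""
    generator = StandInGenerator()
    with mock.patch.object(StandInGenerator, "_validate_config") as fake_validate:
        result = generator.generate()
    # The patch is undone when the with-block exits.
    return result, fake_validate.call_count
```

With the patch active, generate() still exercises the fetch and return path, while the call count confirms that validation was requested exactly once without any stage being constructed.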

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Caller
    participant DataChefRecipeGenerator
    participant DataChefAPI
    participant create_pipeline_from_yaml
    participant Pipeline

    Caller->>DataChefRecipeGenerator: generate()
    DataChefRecipeGenerator->>DataChefAPI: POST /v1/generate (payload)
    alt API success
        DataChefAPI-->>DataChefRecipeGenerator: "{"config_yaml": "..."}"
    else API failure / missing field
        DataChefAPI-->>DataChefRecipeGenerator: error / 5xx
        DataChefRecipeGenerator->>DataChefRecipeGenerator: _get_fallback_config()
        Note right of DataChefRecipeGenerator: Always returns FineMathClassifier YAML
    end
    DataChefRecipeGenerator->>create_pipeline_from_yaml: _validate_config(yaml_config)
    create_pipeline_from_yaml->>create_pipeline_from_yaml: OmegaConf.create(yaml)
    create_pipeline_from_yaml->>create_pipeline_from_yaml: hydra.utils.instantiate() per stage
    create_pipeline_from_yaml-->>Pipeline: Pipeline instance
    Pipeline->>Pipeline: build() — decompose stages
    Pipeline-->>DataChefRecipeGenerator: validated
    DataChefRecipeGenerator->>DataChefRecipeGenerator: _evaluate_proxy_reward() [stub]
    DataChefRecipeGenerator-->>Caller: yaml_config (str)
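
The control flow in the diagram can be sketched as a small function. This is a minimal illustration, not the PR's implementation: generate_recipe and its three callable parameters are hypothetical names, and only the config_yaml response field is taken from the diagram. The API call is injected as a callable so the sketch makes no assumption about the request payload.

```python
import logging
from typing import Callable, Mapping

logger = logging.getLogger(__name__)


def generate_recipe(
    fetch: Callable[[], Mapping[str, object]],
    fallback: Callable[[], str],
    validate: Callable[[str], None],
) -> str:
    """Return a validated YAML string, using the fallback when fetch fails."""
    try:
        response = fetch()
        yaml_config = response["config_yaml"]  # KeyError if the field is missing
        if not isinstance(yaml_config, str) or not yaml_config.strip():
            raise ValueError("config_yaml is missing or empty")
    except Exception as exc:  # broad on purpose: any failure triggers the fallback
        logger.warning("Recipe API call failed (%s); using fallback config.", exc)
        yaml_config = fallback()
    # Validation runs on whichever config was chosen, as in the diagram.
    validate(yaml_config)
    return yaml_config
```

Passing the fetch, fallback, and validation steps as callables keeps each branch of the diagram testable without a network connection.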

Reviews (2): Last reviewed commit: "Merge branch 'main' into feature/datache..."

Comment on lines +63 to +76
        return yaml_config

    def _validate_config(self, yaml_config: str) -> None:
        """
        Runs a dry-run of the generated pipeline to ensure the YAML is valid
        and can be built into an executable plan by NeMo Curator.
        """
        logger.info("Validating generated pipeline configuration...")
        try:
            cfg = OmegaConf.create(yaml_config)
            # Create pipeline and disable logging out the full config each time
            pipeline = create_pipeline_from_yaml(cfg, log_config=False)
            if not isinstance(pipeline, Pipeline):
                raise ValueError("Parsed configuration did not yield a Pipeline object.")

P1 Validation is not a dry-run — it fully instantiates stage objects

create_pipeline_from_yaml calls hydra.utils.instantiate() on every stage config, which invokes each stage class's __init__. For ML stages like FineMathClassifier, this means loading a model, allocating memory, and potentially requiring optional GPU/CUDA dependencies just to "validate" a config string. The comment "Build the pipeline execution plan (dry-run)" on pipeline.build() is misleading — stage constructors have already run before build() is called. Any user calling generate() when the API is unreachable will silently trigger a full FineMathClassifier model load, which can fail if math-specific extras aren't installed or exhaust memory in a lightweight environment.
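
A lighter check is possible if the goal is only to confirm that each stage's _target_ points at something importable. The sketch below is an assumption-laden illustration, not NeMo Curator's API: resolve_target and check_stage_targets are hypothetical helpers, and the config is assumed to be an already-parsed mapping with a top-level stages list. Importing a module still runs its module-level code, so this avoids constructor cost but not import cost.

```python
import importlib
from typing import Any, Mapping, Sequence


def resolve_target(dotted_path: str) -> Any:
    """Import the object named by a dotted path without calling it."""
    module_name, _, attr_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Not a dotted path: {dotted_path!r}")
    module = importlib.import_module(module_name)
    try:
        return getattr(module, attr_name)
    except AttributeError as exc:
        raise ImportError(f"{module_name} has no attribute {attr_name!r}") from exc


def check_stage_targets(config: Mapping[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means every target resolved."""
    problems: list[str] = []
    stages = config.get("stages")
    if not isinstance(stages, Sequence) or isinstance(stages, (str, bytes)):
        return ["'stages' must be a list of stage mappings"]
    for index, stage in enumerate(stages):
        target = stage.get("_target_") if isinstance(stage, Mapping) else None
        if not isinstance(target, str):
            problems.append(f"stage {index}: missing '_target_'")
            continue
        try:
            resolved = resolve_target(target)
        except (ImportError, ValueError) as exc:
            problems.append(f"stage {index}: {exc}")
            continue
        # The class is looked up but never instantiated, so no model is loaded.
        if not callable(resolved):
            problems.append(f"stage {index}: {target} is not callable")
    return problems
```

A check like this cannot catch bad constructor arguments, so it complements full instantiation rather than replacing it.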

Comment on lines +91 to +98
    def _get_fallback_config(self) -> str:
        """
        Outputs a best-practice template config for the domain if DataChef is unavailable.
        """
        return '''
stages:
- _target_: nemo_curator.stages.math.classifiers.finemath.FineMathClassifier
'''

P1 Fallback config ignores target_benchmark entirely

_get_fallback_config always returns the FineMathClassifier pipeline regardless of what target_benchmark was set to. A caller using target_benchmark="MMLU" or any non-math benchmark who hits an API failure will receive a math-domain-specific config with no indication that it is inappropriate for their use case. At minimum the fallback should log a warning that it is math-specific.

Suggested change
-    def _get_fallback_config(self) -> str:
-        """
-        Outputs a best-practice template config for the domain if DataChef is unavailable.
-        """
-        return '''
-stages:
-- _target_: nemo_curator.stages.math.classifiers.finemath.FineMathClassifier
-'''
+    def _get_fallback_config(self) -> str:
+        """
+        Outputs a best-practice template config for the domain if DataChef is unavailable.
+        NOTE: Currently this template is only appropriate for math-domain benchmarks.
+        """
+        logger.warning(
+            "Falling back to a math-domain template (FineMathClassifier). "
+            "This may not be suitable for target_benchmark=%r.",
+            self.target_benchmark,
+        )
+        return '''
+stages:
+- _target_: nemo_curator.stages.math.classifiers.finemath.FineMathClassifier
+'''
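
A stricter alternative to logging a warning is to return the math template only when the benchmark is known to be math-related. The following is a sketch under stated assumptions: select_fallback_config is a hypothetical helper, and the benchmark names in MATH_BENCHMARKS are placeholders chosen for illustration, since the PR does not define which benchmarks count as math. Only the stage target path is copied from the PR's template.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

# Stage target copied from the PR's fallback template.
MATH_FALLBACK_YAML = """\
stages:
  - _target_: nemo_curator.stages.math.classifiers.finemath.FineMathClassifier
"""

# Hypothetical benchmark names; the PR does not say which benchmarks are math.
MATH_BENCHMARKS = frozenset({"gsm8k", "math"})


def select_fallback_config(target_benchmark: str) -> Optional[str]:
    """Return the math template only for math benchmarks, otherwise None."""
    if target_benchmark.strip().lower() in MATH_BENCHMARKS:
        return MATH_FALLBACK_YAML
    logger.warning("No fallback template for target_benchmark=%r.", target_benchmark)
    return None
```

Returning None forces the caller to decide what to do when no suitable template exists, instead of silently receiving a wrong-domain pipeline.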

@@ -0,0 +1,98 @@
import logging
import requests
from typing import List

P1 requests is not declared as a direct dependency

requests is imported at the top level but is not listed in [project].dependencies in pyproject.toml. It is currently available only as a transitive dependency (via transformers). Relying on transitive availability is fragile — a future update to transformers or a minimal install that avoids it could break the import silently. requests should be added to core dependencies or to a dedicated optional extra (e.g., recipe) alongside this module.
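
If requests ends up in an optional extra rather than core dependencies, the module could defer the import and fail with an actionable message. This is a minimal sketch, assuming a hypothetical helper named require_optional; the extra name recipe is the example given in this comment, not something that exists in pyproject.toml today.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import module_name or raise an ImportError naming the extra to install."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with the '{extra}' extra."
        ) from exc
```

Calling require_optional("requests", "recipe") inside the method that performs the HTTP call would keep the rest of the package importable when requests is absent.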

Comment thread tests/test_datachef.py
Comment on lines +12 to +15
original_platform = sys.platform
sys.platform = "linux"
import nemo_curator
sys.platform = original_platform

P2 Module-level sys.platform mutation is unsafe for parallel test runs

sys.platform is overwritten to "linux" at import time and restored after the import nemo_curator call. If tests run in parallel (e.g. with pytest-xdist) or if another module under test reads sys.platform during collection while this value is active, it can produce incorrect behavior. The restore on line 15 also doesn't use a try/finally, so an unexpected import error could leave sys.platform permanently set to "linux" for the rest of the test process. Consider using unittest.mock.patch as a context manager instead.
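
The context-manager approach could look like the sketch below. import_with_platform is a hypothetical helper written for this example; unittest.mock.patch.object restores the original attribute on exit even when the import raises. Note that a module already cached in sys.modules is returned as-is, so the patched value only matters on first import.

```python
import importlib
import sys
from types import ModuleType
from unittest import mock


def import_with_platform(module_name: str, platform: str) -> ModuleType:
    """Import module_name while sys.platform temporarily reports `platform`."""
    with mock.patch.object(sys, "platform", platform):
        # sys.platform is restored when this block exits, including on error.
        return importlib.import_module(module_name)
```

In the test module this would replace the four lines above with a single call such as import_with_platform("nemo_curator", "linux"). The patch is still process-wide while active, so it narrows the window for interference rather than removing it.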

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label Jun 24, 2026

Labels

community-request waiting-on-maintainers Waiting on maintainers to respond

Development

Successfully merging this pull request may close these issues.

RL-Trained LLM for End-to-End Data Recipe Generation

2 participants