feat: Add DataChef recipe generation integration (Issue #1760)#2095
anushkagupta200615-jpg wants to merge 2 commits into
| return yaml_config | ||
|
|
||
| def _validate_config(self, yaml_config: str) -> None: | ||
| """ | ||
| Runs a dry-run of the generated pipeline to ensure the YAML is valid | ||
| and can be built into an executable plan by NeMo Curator. | ||
| """ | ||
| logger.info("Validating generated pipeline configuration...") | ||
| try: | ||
| cfg = OmegaConf.create(yaml_config) | ||
| # Create pipeline and disable logging out the full config each time | ||
| pipeline = create_pipeline_from_yaml(cfg, log_config=False) | ||
| if not isinstance(pipeline, Pipeline): | ||
| raise ValueError("Parsed configuration did not yield a Pipeline object.") |
Validation is not a dry-run — it fully instantiates stage objects
create_pipeline_from_yaml calls hydra.utils.instantiate() on every stage config, which invokes each stage class's __init__. For ML stages like FineMathClassifier, this means loading a model, allocating memory, and potentially requiring optional GPU/CUDA dependencies just to "validate" a config string. The comment "Build the pipeline execution plan (dry-run)" on pipeline.build() is misleading — stage constructors have already run before build() is called. Any user calling generate() when the API is unreachable will silently trigger a full FineMathClassifier model load, which can fail if math-specific extras aren't installed or exhaust memory in a lightweight environment.
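One way to keep validation closer to a true dry-run is to check the config structure and resolve each `_target_` path without ever calling a constructor. The following is a minimal sketch under stated assumptions: the YAML has already been parsed into a plain dict, every stage is a mapping with a dotted `_target_` string, and `check_stage_targets` is a hypothetical helper name, not part of NeMo Curator or Hydra.

```python
import importlib
from typing import Any


def check_stage_targets(config: dict[str, Any]) -> list[str]:
    """Return the resolved `_target_` paths without instantiating any stage.

    Hypothetical helper: it only imports the module and looks up the attribute,
    so no stage __init__ (and therefore no model load) is triggered.
    """
    stages = config.get("stages")
    if not isinstance(stages, list) or not stages:
        raise ValueError("Config must contain a non-empty 'stages' list.")

    resolved = []
    for index, stage in enumerate(stages):
        target = stage.get("_target_") if isinstance(stage, dict) else None
        if not isinstance(target, str) or "." not in target:
            raise ValueError(f"Stage {index} has no dotted '_target_' path.")

        module_path, _, attr = target.rpartition(".")
        try:
            module = importlib.import_module(module_path)
        except ImportError as exc:
            raise ValueError(f"Stage {index}: cannot import {module_path!r}.") from exc

        if not hasattr(module, attr):
            raise ValueError(f"Stage {index}: {module_path!r} has no attribute {attr!r}.")
        resolved.append(target)
    return resolved
```

Importing a stage's module still executes that module's top-level imports, so missing optional extras would surface here as a validation error, but nothing is constructed and no weights are loaded.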
| def _get_fallback_config(self) -> str: | ||
| """ | ||
| Outputs a best-practice template config for the domain if DataChef is unavailable. | ||
| """ | ||
| return ''' | ||
| stages: | ||
| - _target_: nemo_curator.stages.math.classifiers.finemath.FineMathClassifier | ||
| ''' |
Fallback config ignores target_benchmark entirely
_get_fallback_config always returns the FineMathClassifier pipeline regardless of what target_benchmark was set to. A caller using target_benchmark="MMLU" or any non-math benchmark who hits an API failure will receive a math-domain-specific config with no indication that it is inappropriate for their use case. At minimum the fallback should log a warning that it is math-specific.
| def _get_fallback_config(self) -> str: | |
| """ | |
| Outputs a best-practice template config for the domain if DataChef is unavailable. | |
| """ | |
| return ''' | |
| stages: | |
| - _target_: nemo_curator.stages.math.classifiers.finemath.FineMathClassifier | |
| ''' | |
| def _get_fallback_config(self) -> str: | |
| """ | |
| Outputs a best-practice template config for the domain if DataChef is unavailable. | |
| NOTE: Currently this template is only appropriate for math-domain benchmarks. | |
| """ | |
| logger.warning( | |
| "Falling back to a math-domain template (FineMathClassifier). " | |
| "This may not be suitable for target_benchmark=%r.", | |
| self.target_benchmark, | |
| ) | |
| return ''' | |
| stages: | |
| - _target_: nemo_curator.stages.math.classifiers.finemath.FineMathClassifier | |
| ''' |
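A stricter alternative to the warning is to refuse the fallback when the benchmark is not known to be math-related. This is only a sketch: the `MATH_BENCHMARKS` set, its contents, and the standalone `get_fallback_config` function are assumptions made for illustration, not names from the PR.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical allow-list of math-domain benchmark names (not taken from the PR).
MATH_BENCHMARKS = frozenset({"gsm8k", "math"})

MATH_FALLBACK_YAML = """
stages:
  - _target_: nemo_curator.stages.math.classifiers.finemath.FineMathClassifier
"""


def get_fallback_config(target_benchmark: str) -> str:
    """Return the math template only when the benchmark is in the allow-list."""
    if target_benchmark.strip().lower() not in MATH_BENCHMARKS:
        # Fail loudly instead of handing back a config from the wrong domain.
        raise ValueError(
            f"No fallback template is available for target_benchmark={target_benchmark!r}."
        )
    logger.warning(
        "DataChef unavailable; using the math-domain template for %r.",
        target_benchmark,
    )
    return MATH_FALLBACK_YAML
```

Raising makes the domain mismatch impossible to miss, whereas the warning-only version still returns a config that a caller might run without reading the logs.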
| @@ -0,0 +1,98 @@ | |||
| import logging | |||
| import requests | |||
| from typing import List | |||
requests is not declared as a direct dependency
requests is imported at the top level but is not listed in [project].dependencies in pyproject.toml. It is currently available only as a transitive dependency (via transformers). Relying on transitive availability is fragile — a future update to transformers or a minimal install that avoids it could break the import silently. requests should be added to core dependencies or to a dedicated optional extra (e.g., recipe) alongside this module.
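If `requests` ends up in an optional extra rather than in core dependencies, the module could defer the import and fail with an actionable message. A minimal sketch, assuming a hypothetical helper named `require_optional` and a caller-supplied install hint; the extra name `recipe` comes from the suggestion above and does not exist yet.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency, or raise ImportError with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature. {install_hint}"
        ) from exc


def load_http_client() -> ModuleType:
    # Called lazily (for example inside the generator's request method) so that
    # importing the recipe module does not itself require `requests`.
    return require_optional(
        "requests",
        "Install it directly, or via the proposed 'recipe' extra once it exists.",
    )
```

Deferring the import this way keeps `import nemo_curator.recipe.datachef` working in minimal installs and moves the failure to the point where the network call is actually attempted.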
| original_platform = sys.platform | ||
| sys.platform = "linux" | ||
| import nemo_curator | ||
| sys.platform = original_platform |
Module-level sys.platform mutation is unsafe for parallel test runs
sys.platform is overwritten to "linux" at import time and restored after the import nemo_curator call. If tests run in parallel (e.g. with pytest-xdist) or if another module under test reads sys.platform during collection while this value is active, it can produce incorrect behavior. The restore on line 15 also doesn't use a try/finally, so an unexpected import error could leave sys.platform permanently set to "linux" for the rest of the test process. Consider using unittest.mock.patch as a context manager instead.
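The context-manager approach could look roughly like this. It is a sketch rather than the test file's actual code, and `import_with_platform` is a hypothetical helper name; `unittest.mock.patch.object` restores the original attribute when the block exits, including when the import raises.

```python
import importlib
import sys
from types import ModuleType
from unittest import mock


def import_with_platform(module_name: str, platform: str = "linux") -> ModuleType:
    """Import a module while sys.platform is temporarily overridden."""
    # patch.object restores sys.platform on exit, even if the import fails.
    with mock.patch.object(sys, "platform", platform):
        return importlib.import_module(module_name)
```

In the test module this would replace the manual save and restore, for example `nemo_curator = import_with_platform("nemo_curator")`. The override is still process-wide while the import runs, and a module already cached in `sys.modules` is returned without re-running its platform checks, so placing the call in a fixture rather than at module level may be the safer choice.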
Resolves #1760
Description
Adds integration with the DataChef LLM model (arXiv:2602.11089) to generate end-to-end NeMo Curator pipeline specifications dynamically, based on a target benchmark and base model.
Changes
- nemo_curator/recipe/datachef.py: Implements the DataChefRecipeGenerator class. It manages API payload construction, YAML parsing, dry-run configuration validation via NeMo Curator pipelines, and fallback configuration generation.
- tutorials/datachef_math_recipe.py: Adds an end-to-end tutorial script demonstrating math-specialization recipe creation.
- tests/test_datachef.py: Adds pytest-based integration tests that use pytest-httpserver to verify that generation, validation, and the fallback mechanism work.
Notes for Reviewers
- The fallback uses the FineMathClassifier stage from nemo_curator.stages.math as a best-practice template for math domains when the DataChef API is unreachable.