v0.3.0 2026-01-08
🎨 NeMo Data Designer v0.3.0 Release Notes
Data Designer v0.3.0 introduces breaking changes, which are highlighted below.
💥 Breaking Change: config validation
The Data Designer config validation method .validate has been moved from the config builder to the DataDesigner object.
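For code that has to run against both release lines during a migration, the move can be bridged with a small helper. This is a minimal sketch that relies only on the two call shapes described in this section; `validate_config` is a hypothetical name, not part of Data Designer, and the check assumes that a v0.2.x DataDesigner object does not expose a `validate` attribute.

```python
def validate_config(data_designer, config_builder):
    """Validate a config under either call shape (hypothetical helper, not a Data Designer API)."""
    # Assumption: only v0.3.x DataDesigner objects expose a callable `validate`.
    validate = getattr(data_designer, "validate", None)
    if callable(validate):
        # v0.3.x: validation lives on the DataDesigner object
        return validate(config_builder)
    # v0.2.x: validation lives on the config builder
    return config_builder.validate()
```

Once a project is on v0.3.x only, calling `data_designer.validate(config_builder)` directly is simpler than keeping the helper.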
Before (v0.2.x):
from data_designer.essentials import DataDesigner, DataDesignerConfigBuilder
data_designer = DataDesigner()
config_builder = DataDesignerConfigBuilder()
# ... build your config ...
# validate config
config_builder.validate()
After (v0.3.x):
from data_designer.essentials import DataDesigner, DataDesignerConfigBuilder
data_designer = DataDesigner()
config_builder = DataDesignerConfigBuilder()
# ... build your config ...
# validate config
data_designer.validate(config_builder)
💥 Breaking Change: seed datasets
Working with seed datasets has been simplified with the introduction of SeedSource objects, which are passed directly to config_builder.with_seed_dataset. This removes the step of making a seed reference with datastore settings (when needed).
Before (v0.2.x):
Seed from a local file:
from data_designer.essentials import DataDesigner, DataDesignerConfigBuilder
data_designer = DataDesigner()
config_builder = DataDesignerConfigBuilder()
seed_dataset_reference = data_designer.make_seed_reference_from_file("my_seed_dataset.parquet")
config_builder.with_seed_dataset(seed_dataset_reference)
Seed from a DataFrame:
from data_designer.essentials import DataDesigner, DataDesignerConfigBuilder
# define dataframe `df`
data_designer = DataDesigner()
config_builder = DataDesignerConfigBuilder()
# the dataframe must be written to file in v0.2.x
seed_dataset_reference = data_designer.make_seed_reference_from_dataframe(df, "my_seed_dataset.parquet")
config_builder.with_seed_dataset(seed_dataset_reference)
After (v0.3.x):
Seed from a local file:
from data_designer.essentials import DataDesigner, DataDesignerConfigBuilder, LocalFileSeedSource
config_builder = DataDesignerConfigBuilder()
config_builder.with_seed_dataset(LocalFileSeedSource(path="my_seed_dataset.parquet"))
Seed from a DataFrame:
from data_designer.essentials import DataDesigner, DataDesignerConfigBuilder, DataFrameSeedSource
# define dataframe `df`
config_builder = DataDesignerConfigBuilder()
# no need to specify a file, as the dataframe will be sampled directly in memory
config_builder.with_seed_dataset(DataFrameSeedSource(df=df))
Seed from Hugging Face Hub:
from data_designer.essentials import DataDesigner, DataDesignerConfigBuilder, HuggingFaceSeedSource
config_builder = DataDesignerConfigBuilder()
config_builder.with_seed_dataset(HuggingFaceSeedSource(path="datasets/my-username/my-dataset/data/*.parquet"))
💥 Breaking Change: plugins
When defining plugins, there are two important updates:
- task -> impl: the task_cls argument is now impl_qualified_name.
- The arguments of the Plugin object are now given as fully-qualified object names (e.g., "my_plugin.module.PluginObject") rather than the actual objects.
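A fully-qualified object name is a dotted import path whose last segment is the object itself. These notes do not show how Data Designer resolves such names internally, so the following is only a minimal sketch of the general technique using the standard library; `resolve_qualified_name` is a hypothetical helper, not a Data Designer function.

```python
import importlib


def resolve_qualified_name(qualified_name: str):
    """Import and return the object named by a dotted path such as 'pkg.module.ClassName'."""
    # Split "pkg.module.ClassName" into the module path and the attribute name.
    module_path, _, attr_name = qualified_name.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected 'module.Object', got {qualified_name!r}")
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


# Standard-library example: resolves to the OrderedDict class.
ordered_dict_cls = resolve_qualified_name("collections.OrderedDict")
```

One consequence of passing strings instead of objects is that a plugin can be declared without importing its implementation module up front.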
Before (v0.2.x):
from my_plugin.multiple_column_generator import IndexMultiplierColumnGenerator, IndexMultiplierColumnConfig
from data_designer.plugins import Plugin, PluginType
plugin = Plugin(
    task_cls=IndexMultiplierColumnGenerator,
    config_cls=IndexMultiplierColumnConfig,
    plugin_type=PluginType.COLUMN_GENERATOR,
    emoji="🔌",
)
After (v0.3.x):
from data_designer.plugins import Plugin, PluginType
plugin = Plugin(
    impl_qualified_name="my_plugin.multiple_column_generator.IndexMultiplierColumnGenerator",
    config_qualified_name="my_plugin.multiple_column_generator.IndexMultiplierColumnConfig",
    plugin_type=PluginType.COLUMN_GENERATOR,
    emoji="🔌",
)
What's Changed
- fix: make doc building workflow use python 3.11 by @johnnygreco in #170
- refactor: plugin system updates by @mikeknep in #168
- feat: add OpenRouter as one of the default providers by @nabinchha in #161
- feat: Allow defining extra headers on model providers by @mikeknep in #174
- docs: fix documentation on max_tokens by @nabinchha in #176
- docs: Add extra_headers to model provider docs by @mikeknep in #178
- fix: Decimal in structured generation leads to errors by @andreatgretel in #171
- fix: litellm max callbacks override by @nabinchha in #180
- fix: deserializing instantiates seed columns twice by @andreatgretel in #188
- chore: deprecate InferenceParameters by @nabinchha in #183
- refactor: Overhaul to seed datasets by @mikeknep in #167
- refactor: Plugins rename task to impl by @mikeknep in #189
- chore: limit update upper bound on litellm version by @johnnygreco in #190
- feat: Expose shutdown options as RunConfig by @eric-tramel in #186
New Contributors
- @eric-tramel made their first contribution in #186
Full Changelog: v0.2.2...v0.3.0