diff --git a/docs/code_reference/analysis.md b/docs/code_reference/analysis.md deleted file mode 100644 index 7c8e10340..000000000 --- a/docs/code_reference/analysis.md +++ /dev/null @@ -1,31 +0,0 @@ -# Analysis - -The `analysis` modules provide tools for profiling and analyzing generated datasets. It includes statistics tracking, column profiling, and reporting capabilities. - -## Column Statistics - -Column statistics are automatically computed for every column after generation. They provide basic metrics specific to the column type. For example, LLM columns track token usage statistics, sampler columns track distribution information, and validation columns track validation success rates. - -The classes below are result objects that store the computed statistics for each column type and provide methods for formatting these results for display in reports. - -::: data_designer.config.analysis.column_statistics - -## Column Profilers - -Column profilers are optional analysis tools that provide deeper insights into specific column types. Currently, the only column profiler available is the Judge Score Profiler. - -The classes below are result objects that store the computed profiler results and provide methods for formatting these results for display in reports. - -::: data_designer.config.analysis.column_profilers - -## Dataset Profiler - -The [DatasetProfilerResults](#data_designer.config.analysis.dataset_profiler.DatasetProfilerResults) class contains complete profiling results for a generated dataset. It aggregates column-level statistics, metadata, and profiler results, and provides methods to: - -- Compute dataset-level metrics (completion percentage, column type summary) -- Filter statistics by column type -- Generate formatted analysis reports via the `to_report()` method - -Reports can be displayed in the console or exported to HTML/SVG formats. 
- -::: data_designer.config.analysis.dataset_profiler diff --git a/docs/code_reference/column_configs.md b/docs/code_reference/column_configs.md deleted file mode 100644 index d6d613422..000000000 --- a/docs/code_reference/column_configs.md +++ /dev/null @@ -1,8 +0,0 @@ -# Column Configurations - -The `column_configs` module defines configuration objects for all Data Designer column types. Each configuration inherits from [SingleColumnConfig](#data_designer.config.base.SingleColumnConfig), which provides shared arguments like the column `name`, whether to `drop` the column after generation, and the `column_type`. - -!!! info "`column_type` is a discriminator field" - The `column_type` argument is used to identify column types when deserializing the [Data Designer Config](data_designer_config.md) from JSON/YAML. It acts as the discriminator in a [discriminated union](https://docs.pydantic.dev/latest/concepts/unions/#discriminated-unions), allowing Pydantic to automatically determine which column configuration class to instantiate. - -::: data_designer.config.column_configs diff --git a/docs/code_reference/config/analysis.md b/docs/code_reference/config/analysis.md new file mode 100644 index 000000000..fa59221a0 --- /dev/null +++ b/docs/code_reference/config/analysis.md @@ -0,0 +1,31 @@ +# Analysis + +Profiling result objects and report helpers returned after generation. + +## Column Statistics + +`DataDesigner.create()` and `DataDesigner.preview()` run the dataset profiler after generation. The profiler computes statistics for each configured column; side-effect columns are recorded separately in `DatasetProfilerResults.side_effect_column_names`. + +Statistics result classes store computed metrics for each column type and format those metrics for reports. + +::: data_designer.config.analysis.column_statistics + +## Column Profilers + +Column profilers are optional analysis tools that provide deeper insights into specific column types. 
Currently, the only column profiler available is the Judge Score Profiler. + +Profiler result classes store computed profiler output and format it for reports. + +::: data_designer.config.analysis.column_profilers + +## Dataset Profiler + +The [DatasetProfilerResults](#data_designer.config.analysis.dataset_profiler.DatasetProfilerResults) class stores profiling results for a generated dataset. It aggregates column-level statistics, side-effect column names, and optional profiler results, and provides methods to: + +- Compute dataset-level metrics (completion percentage, column type summary) +- Filter statistics by column type +- Generate formatted analysis reports via the `to_report()` method + +Reports can be displayed in the console or exported to HTML/SVG formats. + +::: data_designer.config.analysis.dataset_profiler diff --git a/docs/code_reference/config/column_configs.md b/docs/code_reference/config/column_configs.md new file mode 100644 index 000000000..4ff2e8f2f --- /dev/null +++ b/docs/code_reference/config/column_configs.md @@ -0,0 +1,18 @@ +# Column Configurations + +Column configs declare Data Designer's built-in column types. Each configuration inherits from [SingleColumnConfig](#data_designer.config.base.SingleColumnConfig), which provides shared arguments like the column `name`, whether to `drop` the column after generation, and the `column_type`. + +For column generator implementation classes, see [column_generators](../engine/column_generators.md). + +!!! info "`column_type` is a discriminator field" + The `column_type` argument is used to identify column types when deserializing the [Data Designer Config](data_designer_config.md) from JSON/YAML. It acts as the discriminator in a [discriminated union](https://docs.pydantic.dev/latest/concepts/unions/#discriminated-unions), allowing Pydantic to automatically determine which column configuration class to instantiate. 
+ +## `SingleColumnConfig` {#data_designer.config.base.SingleColumnConfig} + +::: data_designer.config.base.SingleColumnConfig + options: + show_root_toc_entry: false + +## Column configurations + +::: data_designer.config.column_configs diff --git a/docs/code_reference/config/config_builder.md b/docs/code_reference/config/config_builder.md new file mode 100644 index 000000000..1aad978ae --- /dev/null +++ b/docs/code_reference/config/config_builder.md @@ -0,0 +1,10 @@ +# Data Designer's Config Builder + +Use [DataDesignerConfigBuilder](#data_designer.config.config_builder.DataDesignerConfigBuilder) to construct [DataDesignerConfig](data_designer_config.md#data_designer.config.data_designer_config.DataDesignerConfig) objects. The builder accumulates model configs, tool configs, column configs, constraints, seed settings, processors, and profilers. + +Inputs can come from scratch, a `dict`, [BuilderConfig](#data_designer.config.config_builder.BuilderConfig), a local YAML/JSON file, or an HTTP(S) YAML/JSON URL via [`from_config()`](#data_designer.config.config_builder.DataDesignerConfigBuilder.from_config). Use [`build()`](#data_designer.config.config_builder.DataDesignerConfigBuilder.build) to create a [DataDesignerConfig](data_designer_config.md#data_designer.config.data_designer_config.DataDesignerConfig), or [`write_config()`](#data_designer.config.config_builder.DataDesignerConfigBuilder.write_config) to serialize the current builder config to YAML or JSON. + +!!! info "Model config loading" + [DataDesignerConfigBuilder](#data_designer.config.config_builder.DataDesignerConfigBuilder) accepts model configs as a list of [ModelConfig](models.md#data_designer.config.models.ModelConfig) objects, a YAML/JSON config path, or `None`. When `model_configs=None`, the builder loads default model configs if Data Designer can run locally; otherwise initialization raises BuilderConfigurationError. 
Model configs define the aliases referenced by model-backed columns such as [`LLMTextColumnConfig`](column_configs.md#data_designer.config.column_configs.LLMTextColumnConfig), [`LLMCodeColumnConfig`](column_configs.md#data_designer.config.column_configs.LLMCodeColumnConfig), [`LLMStructuredColumnConfig`](column_configs.md#data_designer.config.column_configs.LLMStructuredColumnConfig), [`LLMJudgeColumnConfig`](column_configs.md#data_designer.config.column_configs.LLMJudgeColumnConfig), [`EmbeddingColumnConfig`](column_configs.md#data_designer.config.column_configs.EmbeddingColumnConfig), and [`ImageColumnConfig`](column_configs.md#data_designer.config.column_configs.ImageColumnConfig). + +::: data_designer.config.config_builder diff --git a/docs/code_reference/config/data_designer_config.md b/docs/code_reference/config/data_designer_config.md new file mode 100644 index 000000000..d6329a9fa --- /dev/null +++ b/docs/code_reference/config/data_designer_config.md @@ -0,0 +1,7 @@ +# Data Designer Configuration + +[DataDesignerConfig](#data_designer.config.data_designer_config.DataDesignerConfig) is the top-level configuration object passed to Data Designer. It declares the columns to generate and may include model configs, tool configs, seed settings, sampler constraints, processors, and profiler configs. + +Prefer [DataDesignerConfigBuilder](config_builder.md#data_designer.config.config_builder.DataDesignerConfigBuilder) for programmatic construction. Direct [DataDesignerConfig](#data_designer.config.data_designer_config.DataDesignerConfig) instantiation is also supported. + +::: data_designer.config.data_designer_config diff --git a/docs/code_reference/config/index.md b/docs/code_reference/config/index.md new file mode 100644 index 000000000..1ec8b4de0 --- /dev/null +++ b/docs/code_reference/config/index.md @@ -0,0 +1,7 @@ +# Config Package + +The `data-designer-config` package provides `data_designer.config`, the configuration layer of Data Designer. 
It contains the objects used to describe dataset structure, model access, tool access, seed data, sampler parameters, validators, processors, run settings, plugin registrations, and analysis results. + +This package is the base of the dependency chain. Engine and interface code consume these config objects, but config objects do not execute generation directly. + +For programmatic configuration work, start with [config_builder](config_builder.md) and [data_designer_config](data_designer_config.md). Use the narrower pages for exact constructor fields for columns, models, MCP tools, seeds, processors, samplers, validators, or profiling results. diff --git a/docs/code_reference/config/mcp.md b/docs/code_reference/config/mcp.md new file mode 100644 index 000000000..49b6f5cfb --- /dev/null +++ b/docs/code_reference/config/mcp.md @@ -0,0 +1,16 @@ +# MCP Configuration + +MCP config objects tell Data Designer which Model Context Protocol providers exist and which tools an LLM column may use. + +[MCPProvider](#data_designer.config.mcp.MCPProvider) configures remote MCP servers via SSE or Streamable HTTP transport. [LocalStdioMCPProvider](#data_designer.config.mcp.LocalStdioMCPProvider) configures local MCP servers as subprocesses via stdio transport. [ToolConfig](#data_designer.config.mcp.ToolConfig) sets which tools are available for LLM columns and how they are constrained. + +For MCP execution internals, see [Engine MCP](../engine/mcp.md). 
Related guides: + +- **[MCP Providers](../../concepts/mcp/mcp-providers.md)** - Configure local or remote MCP providers +- **[Tool Configs](../../concepts/mcp/tool-configs.md)** - Define tool permissions and limits +- **[Enabling Tools](../../concepts/mcp/enabling-tools.md)** - Use tools in LLM columns +- **[Traces](../../concepts/traces.md)** - Capture full conversation history + +## API Reference + +::: data_designer.config.mcp diff --git a/docs/code_reference/config/models.md b/docs/code_reference/config/models.md new file mode 100644 index 000000000..e14e8cfdb --- /dev/null +++ b/docs/code_reference/config/models.md @@ -0,0 +1,12 @@ +# Models + +[ModelProvider](#data_designer.config.models.ModelProvider) stores connection and authentication details for model providers. [ModelConfig](#data_designer.config.models.ModelConfig) stores a model alias, model identifier, provider settings, and inference parameters. [Inference Parameters](../../concepts/models/inference-parameters.md) control model behavior. Chat-completion parameters include `temperature`, `top_p`, and `max_tokens`; `temperature` and `top_p` can be fixed values or configured distributions. [ImageContext](#data_designer.config.models.ImageContext) provides image inputs to multimodal models, and [ImageInferenceParams](#data_designer.config.models.ImageInferenceParams) configures image generation models. 
+ +Related guides: + +- **[Model Providers](../../concepts/models/model-providers.md)** +- **[Model Configs](../../concepts/models/model-configs.md)** +- **[Image Context](../../notebooks/4-providing-images-as-context.ipynb)** +- **[Generating Images](../../notebooks/5-generating-images.ipynb)** + +::: data_designer.config.models diff --git a/docs/code_reference/config/plugins.md b/docs/code_reference/config/plugins.md new file mode 100644 index 000000000..93f4533de --- /dev/null +++ b/docs/code_reference/config/plugins.md @@ -0,0 +1,17 @@ +# Plugins + +Plugin packages register [Plugin](#data_designer.plugins.plugin.Plugin) objects through entry points in the `data_designer.plugins` group. A plugin registration ties a config class to its implementation class and declares its [PluginType](#data_designer.plugins.plugin.PluginType). + +Related pages: [Build Your Own](../../plugins/build_your_own.md), [Column Generators](../engine/column_generators.md), [Seed Readers](../engine/seed_readers.md), [Engine Processors](../engine/processors.md), and [Processor Configurations](processors.md). + +## `Plugin` {#data_designer.plugins.plugin.Plugin} + +::: data_designer.plugins.plugin.Plugin + options: + show_root_toc_entry: false + +## `PluginType` {#data_designer.plugins.plugin.PluginType} + +::: data_designer.plugins.plugin.PluginType + options: + show_root_toc_entry: false diff --git a/docs/code_reference/config/processors.md b/docs/code_reference/config/processors.md new file mode 100644 index 000000000..a1795643b --- /dev/null +++ b/docs/code_reference/config/processors.md @@ -0,0 +1,7 @@ +# Processor Configurations + +Processor configs request data transformations after generation. Add them to a `DataDesignerConfig` or `DataDesignerConfigBuilder`; the engine later compiles them into runtime processor implementations. + +Related pages: [engine processors](../engine/processors.md) and [Build Your Own](../../plugins/build_your_own.md). 
+ +::: data_designer.config.processors diff --git a/docs/code_reference/run_config.md b/docs/code_reference/config/run_config.md similarity index 64% rename from docs/code_reference/run_config.md rename to docs/code_reference/config/run_config.md index ae358d5e0..f39dbb7f3 100644 --- a/docs/code_reference/run_config.md +++ b/docs/code_reference/config/run_config.md @@ -1,14 +1,14 @@ # Run Config -The `run_config` module defines runtime settings that control dataset generation behavior, -including early shutdown thresholds, batch sizing, non-inference worker concurrency, -and the Jinja rendering engine used by the runtime. +`RunConfig` controls dataset generation behavior, including early shutdown thresholds, +batch sizing, non-inference worker concurrency, and the Jinja rendering engine used by +the runtime. `JinjaRenderingEngine.SECURE` is the default. Set `JinjaRenderingEngine.NATIVE` when you want Jinja2's broader built-in sandbox behavior instead of Data Designer's hardened renderer. -For guidance on when to use each mode, see [Security](../concepts/security.md). +For guidance on when to use each mode, see [Security](../../concepts/security.md). ## Usage diff --git a/docs/code_reference/sampler_params.md b/docs/code_reference/config/sampler_params.md similarity index 51% rename from docs/code_reference/sampler_params.md rename to docs/code_reference/config/sampler_params.md index ecb75b2d0..751fc604d 100644 --- a/docs/code_reference/sampler_params.md +++ b/docs/code_reference/config/sampler_params.md @@ -1,6 +1,6 @@ # Sampler Parameters -The `sampler_params` module defines parameter configuration objects for all Data Designer sampler types. Sampler parameters are used within the [SamplerColumnConfig](column_configs.md#data_designer.config.column_configs.SamplerColumnConfig) to specify how values should be generated for sampled columns. +Sampler parameter classes configure Data Designer's built-in samplers. 
Use them in [SamplerColumnConfig](column_configs.md#data_designer.config.column_configs.SamplerColumnConfig) to specify how sampled column values are generated. !!! tip "Displaying available samplers and their parameters" The config builder has an `info` attribute that can be used to display the diff --git a/docs/code_reference/config/seeds.md b/docs/code_reference/config/seeds.md new file mode 100644 index 000000000..a3b77ac64 --- /dev/null +++ b/docs/code_reference/config/seeds.md @@ -0,0 +1,19 @@ +# Seeds + +Seed configs declare existing data used as input during generation. A [SeedConfig](#data_designer.config.seed.SeedConfig) combines a seed source with optional row sampling and selection settings. Seed source objects declare where seed data comes from; the engine reads them through seed readers. + +Use these objects with `DataDesignerConfigBuilder.with_seed_dataset()`. Related pages: [Seed Datasets](../../concepts/seed-datasets.md) and [seed readers](../engine/seed_readers.md). + +Built-in seed sources include local files, Hugging Face paths, in-memory DataFrames, directories, file contents, and agent rollout traces. Plugin seed sources can extend the same discriminated union through the plugin system. + +## Seed Config + +::: data_designer.config.seed + +## Built-In Seed Sources + +::: data_designer.config.seed_source + +## DataFrame Seed Source + +::: data_designer.config.seed_source_dataframe diff --git a/docs/code_reference/config/validator_params.md b/docs/code_reference/config/validator_params.md new file mode 100644 index 000000000..c69773da6 --- /dev/null +++ b/docs/code_reference/config/validator_params.md @@ -0,0 +1,6 @@ +# Validator Parameters + +`ValidationColumnConfig` selects a validator with `validator_type` and configures it with `validator_params`. +The `validator_type` field can be `code`, `local_callable`, or `remote`. 
The matching `validator_params` objects are: + +::: data_designer.config.validator_params diff --git a/docs/code_reference/config_builder.md b/docs/code_reference/config_builder.md deleted file mode 100644 index 0465933ad..000000000 --- a/docs/code_reference/config_builder.md +++ /dev/null @@ -1,10 +0,0 @@ -# Data Designer's Config Builder - -The `config_builder` module provides a high-level interface for constructing Data Designer configurations through the [DataDesignerConfigBuilder](#data_designer.config.config_builder.DataDesignerConfigBuilder) class, enabling programmatic creation of [DataDesignerConfig](data_designer_config.md#data_designer.config.data_designer_config.DataDesignerConfig) objects by incrementally adding column configurations, constraints, processors, and profilers. - -You can use the builder to create Data Designer configurations from scratch or from existing configurations stored in YAML/JSON files via [`from_config()`](#data_designer.config.config_builder.DataDesignerConfigBuilder.from_config). The builder includes validation capabilities to catch configuration errors early and can work with seed datasets from local sources or external datastores. Once configured, use [`build()`](#data_designer.config.config_builder.DataDesignerConfigBuilder.build) to generate the final configuration object or [`write_config()`](#data_designer.config.config_builder.DataDesignerConfigBuilder.write_config) to serialize it to disk. - -!!! info "Model configs are required" - [DataDesignerConfigBuilder](#data_designer.config.config_builder.DataDesignerConfigBuilder) requires a list of model configurations at initialization. 
This tells the builder which model aliases can be referenced by LLM-generated columns (such as [`LLMTextColumnConfig`](column_configs.md#data_designer.config.column_configs.LLMTextColumnConfig), [`LLMCodeColumnConfig`](column_configs.md#data_designer.config.column_configs.LLMCodeColumnConfig), [`LLMStructuredColumnConfig`](column_configs.md#data_designer.config.column_configs.LLMStructuredColumnConfig), and [`LLMJudgeColumnConfig`](column_configs.md#data_designer.config.column_configs.LLMJudgeColumnConfig)). Each model configuration specifies the model alias, model provider, model ID, and inference parameters that will be used during data generation. - -::: data_designer.config.config_builder diff --git a/docs/code_reference/data_designer_config.md b/docs/code_reference/data_designer_config.md deleted file mode 100644 index 5d7e4cbce..000000000 --- a/docs/code_reference/data_designer_config.md +++ /dev/null @@ -1,7 +0,0 @@ -# Data Designer Configuration - -[DataDesignerConfig](#data_designer.config.data_designer_config.DataDesignerConfig) is the main configuration object for builder datasets with Data Designer. It is a declarative configuration for defining the dataset you want to generate column-by-column, including options for dataset post-processing, validation, and profiling. - -Generally, you should use the [DataDesignerConfigBuilder](config_builder.md#data_designer.config.config_builder.DataDesignerConfigBuilder) to build your configuration, but you can also build it manually by instantiating the [DataDesignerConfig](#data_designer.config.data_designer_config.DataDesignerConfig) class directly. 
- -::: data_designer.config.data_designer_config diff --git a/docs/code_reference/engine/column_generators.md b/docs/code_reference/engine/column_generators.md new file mode 100644 index 000000000..b2aff0ce1 --- /dev/null +++ b/docs/code_reference/engine/column_generators.md @@ -0,0 +1,53 @@ +# Column Generators + +Column generators execute column generation in the Data Designer engine. A generator receives the upstream data needed for its task, returns row or batch data with generated values added, and reports the generation strategy the scheduler should use. + +Related pages: [column_configs](../config/column_configs.md), [Build Your Own](../../plugins/build_your_own.md), [Using Models in Plugins](../../plugins/models.md), and [Custom Columns](../../concepts/custom_columns.md). + +## Configuration + +User-facing column configs inherit from [SingleColumnConfig](../config/column_configs.md#data_designer.config.base.SingleColumnConfig) and define a unique `column_type` discriminator. During compilation, the engine may group related configs into multi-column configs for generators that create sampler or seed columns together. + +## Generation strategy + +Column generator base classes return [GenerationStrategy](../config/column_configs.md#data_designer.config.column_configs.GenerationStrategy) values to tell the engine whether they run per row or over a full batch. + +## Implementation bases + +Generators that operate on a full batch can inherit from [ColumnGeneratorFullColumn](#data_designer.engine.column_generators.generators.base.ColumnGeneratorFullColumn). Row-oriented non-model generators can inherit from [ColumnGeneratorCellByCell](#data_designer.engine.column_generators.generators.base.ColumnGeneratorCellByCell). Generators that create initial rows use [FromScratchColumnGenerator](#data_designer.engine.column_generators.generators.base.FromScratchColumnGenerator). 
Model-backed plugin generators should use [ColumnGeneratorWithModelRegistry](#data_designer.engine.column_generators.generators.base.ColumnGeneratorWithModelRegistry) or [ColumnGeneratorWithModel](#data_designer.engine.column_generators.generators.base.ColumnGeneratorWithModel); see [Using Models in Plugins](../../plugins/models.md) for authoring guidance. + +### `ColumnGenerator` {#data_designer.engine.column_generators.generators.base.ColumnGenerator} + +::: data_designer.engine.column_generators.generators.base.ColumnGenerator + options: + show_root_toc_entry: false + +### `ColumnGeneratorFullColumn` {#data_designer.engine.column_generators.generators.base.ColumnGeneratorFullColumn} + +::: data_designer.engine.column_generators.generators.base.ColumnGeneratorFullColumn + options: + show_root_toc_entry: false + +### `ColumnGeneratorCellByCell` {#data_designer.engine.column_generators.generators.base.ColumnGeneratorCellByCell} + +::: data_designer.engine.column_generators.generators.base.ColumnGeneratorCellByCell + options: + show_root_toc_entry: false + +### `FromScratchColumnGenerator` {#data_designer.engine.column_generators.generators.base.FromScratchColumnGenerator} + +::: data_designer.engine.column_generators.generators.base.FromScratchColumnGenerator + options: + show_root_toc_entry: false + +### `ColumnGeneratorWithModelRegistry` {#data_designer.engine.column_generators.generators.base.ColumnGeneratorWithModelRegistry} + +::: data_designer.engine.column_generators.generators.base.ColumnGeneratorWithModelRegistry + options: + show_root_toc_entry: false + +### `ColumnGeneratorWithModel` {#data_designer.engine.column_generators.generators.base.ColumnGeneratorWithModel} + +::: data_designer.engine.column_generators.generators.base.ColumnGeneratorWithModel + options: + show_root_toc_entry: false diff --git a/docs/code_reference/engine/index.md b/docs/code_reference/engine/index.md new file mode 100644 index 000000000..06dfa4e6d --- /dev/null +++ 
b/docs/code_reference/engine/index.md @@ -0,0 +1,5 @@ +# Engine Package + +The `data-designer-engine` package provides `data_designer.engine`, the runtime layer of Data Designer. It consumes `data_designer.config` objects and maps them to execution behavior through generators, seed readers, processors, registries, model access, and MCP tool execution. + +This package sits between config and interface: it depends on config, and the public interface calls into it. Use these pages for plugin implementation contracts, registry behavior, seed reader internals, processor execution, column generator bases, and MCP runtime behavior. diff --git a/docs/code_reference/engine/mcp.md b/docs/code_reference/engine/mcp.md new file mode 100644 index 000000000..a9b333b97 --- /dev/null +++ b/docs/code_reference/engine/mcp.md @@ -0,0 +1,94 @@ +# Engine MCP + +Execution-time MCP registries, facades, session handling, schema discovery, and tool calls. + +For user-facing provider and tool config objects, see [MCP configuration](../config/mcp.md). + +## Parallel Structure + +| Model layer | MCP layer | Purpose | +|-------------|-----------|---------| +| `ModelProviderRegistry` | `MCPProviderRegistry` | Holds provider configurations. | +| `ModelRegistry` | `MCPRegistry` | Manages configs by alias and lazily creates facades. | +| `ModelFacade` | `MCPFacade` | Provides a lightweight runtime facade scoped to one config. | +| `ModelConfig.alias` | `ToolConfig.tool_alias` | Alias referenced by column configs. 
| + +## Registry + +### `MCPToolDefinition` {#data_designer.engine.mcp.registry.MCPToolDefinition} + +::: data_designer.engine.mcp.registry.MCPToolDefinition + options: + show_root_toc_entry: false + +### `MCPToolResult` {#data_designer.engine.mcp.registry.MCPToolResult} + +::: data_designer.engine.mcp.registry.MCPToolResult + options: + show_root_toc_entry: false + +### `MCPRegistry` {#data_designer.engine.mcp.registry.MCPRegistry} + +::: data_designer.engine.mcp.registry.MCPRegistry + options: + show_root_toc_entry: false + +### `create_mcp_registry` {#data_designer.engine.mcp.factory.create_mcp_registry} + +::: data_designer.engine.mcp.factory.create_mcp_registry + options: + show_root_toc_entry: false + +## Facade + +`ModelFacade.generate()` accepts a `tool_alias` parameter. When it is provided, `ModelFacade` looks up the matching `MCPFacade` from `MCPRegistry`, fetches tool schemas, passes them to the model, processes tool calls after each completion, tracks tool-call turns, and returns messages that include tool results for trace capture. + +### `MCPFacade` {#data_designer.engine.mcp.facade.MCPFacade} + +::: data_designer.engine.mcp.facade.MCPFacade + options: + show_root_toc_entry: false + +## I/O Service + +The I/O service owns a background event loop, pools MCP sessions by provider config, coalesces concurrent tool schema lookups, and executes parallel tool calls. 
+ +### `MCPIOService` {#data_designer.engine.mcp.io.MCPIOService} + +::: data_designer.engine.mcp.io.MCPIOService + options: + show_root_toc_entry: false + +### Runtime Helpers + +::: data_designer.engine.mcp.io.list_tools + options: + show_root_toc_entry: false + +::: data_designer.engine.mcp.io.list_tool_names + options: + show_root_toc_entry: false + +::: data_designer.engine.mcp.io.call_tools + options: + show_root_toc_entry: false + +::: data_designer.engine.mcp.io.clear_provider_caches + options: + show_root_toc_entry: false + +::: data_designer.engine.mcp.io.clear_tools_cache + options: + show_root_toc_entry: false + +::: data_designer.engine.mcp.io.get_cache_info + options: + show_root_toc_entry: false + +::: data_designer.engine.mcp.io.clear_session_pool + options: + show_root_toc_entry: false + +::: data_designer.engine.mcp.io.get_session_pool_info + options: + show_root_toc_entry: false diff --git a/docs/code_reference/engine/processors.md b/docs/code_reference/engine/processors.md new file mode 100644 index 000000000..e11653ead --- /dev/null +++ b/docs/code_reference/engine/processors.md @@ -0,0 +1,43 @@ +# Engine Processor Implementations + +Runtime processor classes and processor registry helpers. + +Plugin processors inherit from [Processor](#data_designer.engine.processing.processors.base.Processor) and override one or more callback methods: `process_before_batch`, `process_after_batch`, or `process_after_generation`. + +For user-facing processor config objects, see [processor configurations](../config/processors.md). 
+ +## Base Contract + +### `Processor` {#data_designer.engine.processing.processors.base.Processor} + +::: data_designer.engine.processing.processors.base.Processor + options: + show_root_toc_entry: false + +## Built-In Implementations + +### `DropColumnsProcessor` {#data_designer.engine.processing.processors.drop_columns.DropColumnsProcessor} + +::: data_designer.engine.processing.processors.drop_columns.DropColumnsProcessor + options: + show_root_toc_entry: false + +### `SchemaTransformProcessor` {#data_designer.engine.processing.processors.schema_transform.SchemaTransformProcessor} + +::: data_designer.engine.processing.processors.schema_transform.SchemaTransformProcessor + options: + show_root_toc_entry: false + +## Registry + +### `ProcessorRegistry` {#data_designer.engine.processing.processors.registry.ProcessorRegistry} + +::: data_designer.engine.processing.processors.registry.ProcessorRegistry + options: + show_root_toc_entry: false + +### `create_default_processor_registry` {#data_designer.engine.processing.processors.registry.create_default_processor_registry} + +::: data_designer.engine.processing.processors.registry.create_default_processor_registry + options: + show_root_toc_entry: false diff --git a/docs/code_reference/engine/seed_readers.md b/docs/code_reference/engine/seed_readers.md new file mode 100644 index 000000000..5f6294a34 --- /dev/null +++ b/docs/code_reference/engine/seed_readers.md @@ -0,0 +1,101 @@ +# Seed Readers + +Seed readers are engine-side adapters that turn a configured seed source into tabular seed rows. The engine attaches a `SeedSource` and secret resolver, asks the reader for column names and dataset size, then streams batches into generation. + +Related pages: [seeds](../config/seeds.md), [Seed Datasets](../../concepts/seed-datasets.md), and [Build Your Own](../../plugins/build_your_own.md). 
+ +## Core Contracts + +### `SeedReader` {#data_designer.engine.resources.seed_reader.SeedReader} + +::: data_designer.engine.resources.seed_reader.SeedReader + options: + show_root_toc_entry: false + +### `FileSystemSeedReader` {#data_designer.engine.resources.seed_reader.FileSystemSeedReader} + +::: data_designer.engine.resources.seed_reader.FileSystemSeedReader + options: + show_root_toc_entry: false + +### `SeedReaderFileSystemContext` {#data_designer.engine.resources.seed_reader.SeedReaderFileSystemContext} + +::: data_designer.engine.resources.seed_reader.SeedReaderFileSystemContext + options: + show_root_toc_entry: false + +### `SeedReaderBatch` {#data_designer.engine.resources.seed_reader.SeedReaderBatch} + +::: data_designer.engine.resources.seed_reader.SeedReaderBatch + options: + show_root_toc_entry: false + +### `SeedReaderBatchReader` {#data_designer.engine.resources.seed_reader.SeedReaderBatchReader} + +::: data_designer.engine.resources.seed_reader.SeedReaderBatchReader + options: + show_root_toc_entry: false + +### `PandasSeedReaderBatch` {#data_designer.engine.resources.seed_reader.PandasSeedReaderBatch} + +::: data_designer.engine.resources.seed_reader.PandasSeedReaderBatch + options: + show_root_toc_entry: false + +### `create_seed_reader_output_dataframe` {#data_designer.engine.resources.seed_reader.create_seed_reader_output_dataframe} + +::: data_designer.engine.resources.seed_reader.create_seed_reader_output_dataframe + options: + show_root_toc_entry: false + +## Built-In Readers + +### `LocalFileSeedReader` {#data_designer.engine.resources.seed_reader.LocalFileSeedReader} + +::: data_designer.engine.resources.seed_reader.LocalFileSeedReader + options: + show_root_toc_entry: false + +### `HuggingFaceSeedReader` {#data_designer.engine.resources.seed_reader.HuggingFaceSeedReader} + +::: data_designer.engine.resources.seed_reader.HuggingFaceSeedReader + options: + show_root_toc_entry: false + +### `DataFrameSeedReader` 
{#data_designer.engine.resources.seed_reader.DataFrameSeedReader} + +::: data_designer.engine.resources.seed_reader.DataFrameSeedReader + options: + show_root_toc_entry: false + +### `DirectorySeedReader` {#data_designer.engine.resources.seed_reader.DirectorySeedReader} + +::: data_designer.engine.resources.seed_reader.DirectorySeedReader + options: + show_root_toc_entry: false + +### `FileContentsSeedReader` {#data_designer.engine.resources.seed_reader.FileContentsSeedReader} + +::: data_designer.engine.resources.seed_reader.FileContentsSeedReader + options: + show_root_toc_entry: false + +### `AgentRolloutSeedReader` {#data_designer.engine.resources.seed_reader.AgentRolloutSeedReader} + +::: data_designer.engine.resources.seed_reader.AgentRolloutSeedReader + options: + show_root_toc_entry: false + +## Registry and Errors + +### `SeedReaderRegistry` {#data_designer.engine.resources.seed_reader.SeedReaderRegistry} + +::: data_designer.engine.resources.seed_reader.SeedReaderRegistry + options: + show_root_toc_entry: false + +### `SeedReaderError` {#data_designer.engine.resources.seed_reader.SeedReaderError} + +::: data_designer.engine.resources.seed_reader.SeedReaderError + options: + show_root_toc_entry: false diff --git a/docs/code_reference/index.md b/docs/code_reference/index.md new file mode 100644 index 000000000..5263b0ffe --- /dev/null +++ b/docs/code_reference/index.md @@ -0,0 +1,11 @@ +# Code Reference + +Data Designer is implemented as three installable packages that share the `data_designer` namespace. The packages are layered: user-facing interface code calls the engine, and the engine consumes declarative config objects. + +| Package | Namespace | Role | +|---------|-----------|------| +| [`data-designer-config`](config/index.md) | `data_designer.config` | Configuration schemas, builder APIs, plugin registration objects, and result schemas. 
| +| [`data-designer-engine`](engine/index.md) | `data_designer.engine` | Runtime contracts and implementations for generation, seed reading, processing, and MCP tool execution. | +| [`data-designer`](interface/index.md) | `data_designer.interface` | Public entry points for previewing, creating, and inspecting generated datasets. | + +The dependency direction is `interface -> engine -> config`. Config objects describe what should happen, engine objects implement how it happens, and interface objects expose the supported public API. diff --git a/docs/code_reference/interface/data_designer.md b/docs/code_reference/interface/data_designer.md new file mode 100644 index 000000000..050ba6242 --- /dev/null +++ b/docs/code_reference/interface/data_designer.md @@ -0,0 +1,11 @@ +# DataDesigner Interface + +[DataDesigner](#data_designer.interface.data_designer.DataDesigner) validates configs, generates in-memory previews, creates persisted datasets, lists configured MCP tools, and exposes default model settings. + +For runtime settings passed through `set_run_config()`, see [run_config](../config/run_config.md). For persisted creation results returned by `create()`, see [results](results.md). + +## `DataDesigner` {#data_designer.interface.data_designer.DataDesigner} + +::: data_designer.interface.data_designer.DataDesigner + options: + show_root_toc_entry: false diff --git a/docs/code_reference/interface/errors.md b/docs/code_reference/interface/errors.md new file mode 100644 index 000000000..a969cf8fe --- /dev/null +++ b/docs/code_reference/interface/errors.md @@ -0,0 +1,29 @@ +# Interface Errors + +Interface errors represent failures surfaced at the public API boundary. DataDesignerGenerationError wraps dataset generation failures from `create()` and `preview()`, DataDesignerEarlyShutdownError identifies generation runs that terminate early without producing records, and DataDesignerProfilingError wraps profiling failures from those methods. 
These errors inherit from `data_designer.errors.DataDesignerError`, allowing callers to catch either specific interface failures or the project-wide base error type. + +The package-level `data_designer.interface` export lazily exposes [DataDesignerGenerationError](#data_designer.interface.errors.DataDesignerGenerationError), [DataDesignerEarlyShutdownError](#data_designer.interface.errors.DataDesignerEarlyShutdownError), and [DataDesignerProfilingError](#data_designer.interface.errors.DataDesignerProfilingError). [InvalidBufferValueError](#data_designer.interface.errors.InvalidBufferValueError) is defined in this module. + +## `DataDesignerGenerationError` {#data_designer.interface.errors.DataDesignerGenerationError} + +::: data_designer.interface.errors.DataDesignerGenerationError + options: + show_root_toc_entry: false + +## `DataDesignerEarlyShutdownError` {#data_designer.interface.errors.DataDesignerEarlyShutdownError} + +::: data_designer.interface.errors.DataDesignerEarlyShutdownError + options: + show_root_toc_entry: false + +## `DataDesignerProfilingError` {#data_designer.interface.errors.DataDesignerProfilingError} + +::: data_designer.interface.errors.DataDesignerProfilingError + options: + show_root_toc_entry: false + +## `InvalidBufferValueError` {#data_designer.interface.errors.InvalidBufferValueError} + +::: data_designer.interface.errors.InvalidBufferValueError + options: + show_root_toc_entry: false diff --git a/docs/code_reference/interface/index.md b/docs/code_reference/interface/index.md new file mode 100644 index 000000000..e43caa783 --- /dev/null +++ b/docs/code_reference/interface/index.md @@ -0,0 +1,7 @@ +# Interface Package + +The `data-designer` package provides the top-level user-facing package surface. This section covers `data_designer.interface`, which contains `DataDesigner`, persisted dataset creation results, and interface-level errors. + +This package sits above engine and config. 
`DataDesigner` accepts Data Designer configs, calls the runtime layer, and returns preview or persisted creation results. + +Start with [DataDesigner](data_designer.md) for previewing, creating, and inspecting datasets from a config. Use [results](results.md) for the object returned by persisted dataset creation, and [errors](errors.md) for exceptions surfaced at the public API boundary. diff --git a/docs/code_reference/interface/results.md b/docs/code_reference/interface/results.md new file mode 100644 index 000000000..044ca6ccf --- /dev/null +++ b/docs/code_reference/interface/results.md @@ -0,0 +1,11 @@ +# Dataset Creation Results + +[DatasetCreationResults](#data_designer.interface.results.DatasetCreationResults) is returned by [DataDesigner.create()](data_designer.md#data_designer.interface.data_designer.DataDesigner.create). It provides access to persisted creation artifacts, including the generated dataset, profiling analysis, processor outputs, task traces, dataset metadata, and Hugging Face Hub upload support. + +Preview generation uses the in-memory `data_designer.config.preview_results.PreviewResults` object returned by [DataDesigner.preview()](data_designer.md#data_designer.interface.data_designer.DataDesigner.preview). Persisted dataset creation uses [DatasetCreationResults](#data_designer.interface.results.DatasetCreationResults). + +## `DatasetCreationResults` {#data_designer.interface.results.DatasetCreationResults} + +::: data_designer.interface.results.DatasetCreationResults + options: + show_root_toc_entry: false diff --git a/docs/code_reference/mcp.md b/docs/code_reference/mcp.md deleted file mode 100644 index cbabce846..000000000 --- a/docs/code_reference/mcp.md +++ /dev/null @@ -1,104 +0,0 @@ -# MCP (Model Context Protocol) - -The `mcp` module defines configuration and execution classes for tool use via MCP (Model Context Protocol). 
- -## Configuration Classes - -[MCPProvider](#data_designer.config.mcp.MCPProvider) configures remote MCP servers via SSE or Streamable HTTP transport. [LocalStdioMCPProvider](#data_designer.config.mcp.LocalStdioMCPProvider) configures local MCP servers as subprocesses via stdio transport. [ToolConfig](#data_designer.config.mcp.ToolConfig) defines which tools are available for LLM columns and how they are constrained. - -For user-facing guides, see: - -- **[MCP Providers](../concepts/mcp/mcp-providers.md)** - Configure local or remote MCP providers -- **[Tool Configs](../concepts/mcp/tool-configs.md)** - Define tool permissions and limits -- **[Enabling Tools](../concepts/mcp/enabling-tools.md)** - Use tools in LLM columns -- **[Traces](../concepts/traces.md)** - Capture full conversation history - -## Internal Architecture - -### Parallel Structure - -| Model Layer | MCP Layer | Purpose | -|-------------|-----------|---------| -| `ModelProviderRegistry` | `MCPProviderRegistry` | Holds provider configurations | -| `ModelRegistry` | `MCPRegistry` | Manages configs by alias, lazy facade creation | -| `ModelFacade` | `MCPFacade` | Lightweight facade scoped to specific config | -| `ModelConfig.alias` | `ToolConfig.tool_alias` | Alias for referencing in column configs | - -### MCPProviderRegistry - -Holds MCP provider configurations. Can be empty (MCP is optional). Created first during resource initialization. - -### MCPRegistry - -The central registry for tool configurations: - -- Holds `ToolConfig` instances by `tool_alias` -- Lazily creates `MCPFacade` instances via `get_mcp(tool_alias)` -- Manages shared connection pool and tool cache across all facades -- Validates that tool configs reference valid providers - -### MCPFacade - -A lightweight facade scoped to a specific `ToolConfig`. 
Key methods: - -| Method | Description | -|--------|-------------| -| `tool_call_count(response)` | Count tool calls in a completion response | -| `has_tool_calls(response)` | Check if response contains tool calls | -| `get_tool_schemas()` | Get OpenAI-format tool schemas for this config | -| `process_completion_response(response)` | Execute tool calls and return messages | -| `refuse_completion_response(response)` | Refuse tool calls gracefully (budget exhaustion) | - -Properties: `tool_alias`, `providers`, `max_tool_call_turns`, `allow_tools`, `timeout_sec` - -### I/O Layer (mcp/io.py) - -The `io.py` module provides low-level MCP communication with performance optimizations: - -**Single event loop architecture:** -All MCP operations funnel through a dedicated background daemon thread running an asyncio event loop. This allows: - -- Efficient concurrent I/O without per-thread event loop overhead -- Natural session sharing across all worker threads -- Clean async implementation for parallel tool calls - -**Session pooling:** -MCP sessions are created lazily and kept alive for the program's duration: - -- One session per provider (keyed by serialized config) -- No per-call connection/handshake overhead -- Graceful cleanup on program exit via `atexit` handler - -**Request coalescing:** -The `list_tools` operation uses request coalescing to prevent thundering herd: - -- When multiple workers request tools from the same provider simultaneously -- Only one request is made; others wait for the cached result -- Uses asyncio.Lock per provider key - -**Parallel tool execution:** -The `call_tools_parallel()` function executes multiple tool calls concurrently via `asyncio.gather()`. This is used by MCPFacade when the model returns parallel tool calls in a single response. 
- -### Integration with ModelFacade.generate() - -The `ModelFacade.generate()` method accepts an optional `tool_alias` parameter: - -```python -output, messages = model_facade.generate( - prompt="Search and answer...", - parser=my_parser, - tool_alias="my-tools", # Enables tool calling for this generation -) -``` - -When `tool_alias` is provided: - -1. `ModelFacade` looks up the `MCPFacade` from `MCPRegistry` -2. Tool schemas are fetched and passed to the LLM -3. After each completion, `MCPFacade` processes tool calls -4. Turn counting tracks iterations; refusal kicks in when budget exhausted -5. Messages (including tool results) are returned for trace capture - -## Config Module - -::: data_designer.config.mcp diff --git a/docs/code_reference/models.md b/docs/code_reference/models.md deleted file mode 100644 index 98023d517..000000000 --- a/docs/code_reference/models.md +++ /dev/null @@ -1,12 +0,0 @@ -# Models - -The `models` module defines configuration objects for model-based generation. [ModelProvider](#data_designer.config.models.ModelProvider) specifies connection and authentication details for custom providers. [ModelConfig](#data_designer.config.models.ModelConfig) encapsulates model details including the model alias, identifier, and inference parameters. [Inference Parameters](../concepts/models/inference-parameters.md) controls model behavior through settings like `temperature`, `top_p`, and `max_tokens`, with support for both fixed values and distribution-based sampling. The module includes [ImageContext](#data_designer.config.models.ImageContext) for providing image inputs to multimodal models, and [ImageInferenceParams](#data_designer.config.models.ImageInferenceParams) for configuring image generation models. 
- -For more information on how they are used, see below: - -- **[Model Providers](../concepts/models/model-providers.md)** -- **[Model Configs](../concepts/models/model-configs.md)** -- **[Image Context](../notebooks/4-providing-images-as-context.ipynb)** -- **[Generating Images](../notebooks/5-generating-images.ipynb)** - -::: data_designer.config.models diff --git a/docs/code_reference/processors.md b/docs/code_reference/processors.md deleted file mode 100644 index 1c798c4e9..000000000 --- a/docs/code_reference/processors.md +++ /dev/null @@ -1,6 +0,0 @@ -# Processors - -The `processors` module defines configuration objects for post-generation data transformations. Processors run after column generation and can modify the dataset schema or content before output. - -::: data_designer.config.processors - diff --git a/docs/code_reference/validator_params.md b/docs/code_reference/validator_params.md deleted file mode 100644 index 5e3511bb2..000000000 --- a/docs/code_reference/validator_params.md +++ /dev/null @@ -1,6 +0,0 @@ -# Validator Parameters - -When creating a `ValidationColumnConfig`, two parameters are used to define the validator: `validator_type` and `validator_config`. -The `validator_type` parameter can be set to either `code`, `local_callable` or `remote`. The `validator_config` accompanying each of these is, respectively: - -::: data_designer.config.validator_params \ No newline at end of file diff --git a/docs/concepts/columns.md b/docs/concepts/columns.md index 03fda5de6..45b87d174 100644 --- a/docs/concepts/columns.md +++ b/docs/concepts/columns.md @@ -213,4 +213,4 @@ Computed property listing columns created implicitly alongside the primary colum - `{name}__trace`: Created when `with_trace` is not `TraceType.NONE` on the column. - `{name}__reasoning_content`: Created when `extract_reasoning_content=True` on the column. 
-For detailed information on each column type, refer to the [column configuration code reference](../code_reference/column_configs.md). +For detailed information on each column type, refer to the [column configuration code reference](../code_reference/config/column_configs.md). diff --git a/docs/concepts/custom_columns.md b/docs/concepts/custom_columns.md index 447c6f96d..3d9ae3954 100644 --- a/docs/concepts/custom_columns.md +++ b/docs/concepts/custom_columns.md @@ -191,5 +191,5 @@ Mocking only `generate()` will silently no-op under the async engine because the ## See Also -- [Column Configs Reference](../code_reference/column_configs.md) +- [Column Configs Reference](../code_reference/config/column_configs.md) - [Plugins Overview](../plugins/overview.md) diff --git a/docs/concepts/deployment-options.md b/docs/concepts/deployment-options.md index ca7278ffa..35e325e2f 100644 --- a/docs/concepts/deployment-options.md +++ b/docs/concepts/deployment-options.md @@ -78,7 +78,7 @@ dd = DataDesigner() ### You Need Maximum Flexibility -- **Custom plugins**: Extend Data Designer with custom column generators, validators, or processors +- **Custom plugins**: Extend Data Designer with custom column generators, seed readers, or processors - **Local development**: Rapid iteration with immediate feedback - **Integration**: Embed Data Designer into existing Python pipelines or notebooks - **Experimentation**: Research workflows with custom models or configurations diff --git a/docs/concepts/models/model-configs.md b/docs/concepts/models/model-configs.md index d18c71937..888a7bdca 100644 --- a/docs/concepts/models/model-configs.md +++ b/docs/concepts/models/model-configs.md @@ -143,5 +143,5 @@ model_config = dd.ModelConfig( - **[Default Model Settings](default-model-settings.md)**: Pre-configured model settings included with Data Designer - **[Custom Model Settings](custom-model-settings.md)**: Learn how to create custom providers and model configurations - **[Configure Model 
Settings With the CLI](configure-model-settings-with-the-cli.md)**: Use the CLI to manage model settings -- **[Column Configurations](../../code_reference/column_configs.md)**: Learn how to use models in column configurations +- **[Column Configurations](../../code_reference/config/column_configs.md)**: Learn how to use models in column configurations - **[Architecture & Performance](../architecture-and-performance.md)**: Understanding separation of concerns and optimizing concurrency diff --git a/docs/concepts/person_sampling.md b/docs/concepts/person_sampling.md index 5452e7b98..3c9e5eaf6 100644 --- a/docs/concepts/person_sampling.md +++ b/docs/concepts/person_sampling.md @@ -40,7 +40,7 @@ config_builder.add_column( ) ``` -For mor details, see the documentation for [`SamplerColumnConfig`](../code_reference/column_configs.md#data_designer.config.column_configs.SamplerColumnConfig) and [`PersonFromFakerSamplerParams`](../code_reference/sampler_params.md#data_designer.config.sampler_params.PersonFromFakerSamplerParams). +For more details, see the documentation for [`SamplerColumnConfig`](../code_reference/config/column_configs.md#data_designer.config.column_configs.SamplerColumnConfig) and [`PersonFromFakerSamplerParams`](../code_reference/config/sampler_params.md#data_designer.config.sampler_params.PersonFromFakerSamplerParams). --- @@ -161,7 +161,7 @@ config_builder.add_column( ) ``` -For more details, see the documentation for [`SamplerColumnConfig`](../code_reference/column_configs.md#data_designer.config.column_configs.SamplerColumnConfig) and [`PersonSamplerParams`](../code_reference/sampler_params.md#data_designer.config.sampler_params.PersonSamplerParams). +For more details, see the documentation for [`SamplerColumnConfig`](../code_reference/config/column_configs.md#data_designer.config.column_configs.SamplerColumnConfig) and [`PersonSamplerParams`](../code_reference/config/sampler_params.md#data_designer.config.sampler_params.PersonSamplerParams).
### Available Data Fields diff --git a/docs/concepts/processors.md b/docs/concepts/processors.md index 1b0f75943..290eff00c 100644 --- a/docs/concepts/processors.md +++ b/docs/concepts/processors.md @@ -88,7 +88,7 @@ processor = dd.SchemaTransformProcessorConfig( - Each key in `template` becomes a column in the transformed dataset - Values are Jinja2 templates with access to all columns in the batch - Complex structures (lists, nested dicts) are supported -- Output is saved to the `processors-outputs/{name}/` directory +- Output is saved to the `processors-files/{name}/` directory - The original dataset passes through unchanged **Template Capabilities:** @@ -143,13 +143,7 @@ Processors execute in the order they're added. Plan accordingly when one process ## Processor Plugins -You can extend Data Designer with custom processors via the [plugin system](../plugins/overview.md). A processor plugin is a Python package that provides: - -- A **config class** inheriting from `ProcessorConfig` with a `processor_type: Literal["your-type"]` discriminator -- An **implementation class** inheriting from `Processor` that overrides the desired callback methods -- A **`Plugin` instance** connecting the two - -Once installed, plugin processors are automatically discovered and can be used with `add_processor()` like built-in processors. +You can extend Data Designer with custom processors via the [plugin system](../plugins/overview.md). Once installed, plugin processors are automatically discovered and can be used with `add_processor()` like built-in processors. ```python from my_processor_plugin.config import MyProcessorConfig @@ -162,14 +156,7 @@ builder.add_processor( ) ``` -**Entry point configuration** in `pyproject.toml`: - -```toml -[project.entry-points."data_designer.plugins"] -my-processor = "my_plugin.plugin:my_processor_plugin" -``` - -See the [plugins overview](../plugins/overview.md) for the full guide on creating plugins. 
+For implementation instructions across all plugin types, see [Build Your Own](../plugins/build_your_own.md). ## Configuration Parameters diff --git a/docs/concepts/security.md b/docs/concepts/security.md index 1b98bd1a9..6b365befd 100644 --- a/docs/concepts/security.md +++ b/docs/concepts/security.md @@ -200,4 +200,4 @@ For example, this is often reasonable in a notebook, local script, or other sing ## Related Reading - [Deployment Options](deployment-options.md) -- [Run Config Reference](../code_reference/run_config.md) +- [Run Config Reference](../code_reference/config/run_config.md) diff --git a/docs/concepts/seed-datasets.md b/docs/concepts/seed-datasets.md index a64add812..581fc2f8f 100644 --- a/docs/concepts/seed-datasets.md +++ b/docs/concepts/seed-datasets.md @@ -167,7 +167,7 @@ Path: {{ relative_path }} - `content` — decoded text contents of the matched file !!! tip "Custom Filesystem Readers" - If you need custom row construction, fan-out behavior, or expensive hydration logic for any directory-backed seed source, build a custom `FileSystemSeedReader` and pass it via `DataDesigner(seed_readers=[...])`. See the [FileSystemSeedReader Plugins](../plugins/filesystem_seed_reader.md) guide. + If you need custom row construction, fan-out behavior, or expensive hydration logic for any directory-backed seed source, build a custom `FileSystemSeedReader` and pass it via `DataDesigner(seed_readers=[...])`. For packaging and registration, see [Build Your Own](../plugins/build_your_own.md). !!! note "Encoding" `encoding="utf-8"` is the default. Set a different Python codec name if your files use another text encoding. 
diff --git a/docs/concepts/tool_use_and_mcp.md b/docs/concepts/tool_use_and_mcp.md index 1f1891421..ec2771f3f 100644 --- a/docs/concepts/tool_use_and_mcp.md +++ b/docs/concepts/tool_use_and_mcp.md @@ -66,4 +66,4 @@ See the [PDF Q&A Recipe](../recipes/mcp_and_tooluse/pdf_qa.md) for a complete wo ## Code Reference -For internal architecture and API documentation, see [MCP Code Reference](../code_reference/mcp.md). +For config objects, see [MCP Configuration Reference](../code_reference/config/mcp.md). For runtime internals, see [Engine MCP Reference](../code_reference/engine/mcp.md). diff --git a/docs/concepts/validators.md b/docs/concepts/validators.md index b6ccb8f5f..043694ee7 100644 --- a/docs/concepts/validators.md +++ b/docs/concepts/validators.md @@ -288,7 +288,7 @@ The `target_columns` parameter specifies which columns to validate. All target c ### Configuration Parameters -See more about parameters used to instantiate `ValidationColumnConfig` in the [code reference](../../code_reference/column_configs/#data_designer.config.column_configs.ValidationColumnConfig). +See more about parameters used to instantiate `ValidationColumnConfig` in the [code reference](../code_reference/config/column_configs.md#data_designer.config.column_configs.ValidationColumnConfig). ### Batch Size Considerations @@ -330,4 +330,4 @@ builder.add_column( ## See Also -- [Validator Parameters Reference](../code_reference/validator_params.md): Configuration object schemas +- [Validator Parameters Reference](../code_reference/config/validator_params.md): Configuration object schemas diff --git a/docs/css/mkdocstrings.css b/docs/css/mkdocstrings.css index 6fd2a45d9..56ba05c64 100644 --- a/docs/css/mkdocstrings.css +++ b/docs/css/mkdocstrings.css @@ -78,3 +78,55 @@ div.doc-contents:not(.first) { .doc-symbol-toc.doc-symbol-method::after { content: "method"; } + + /* Keep API section tables readable when Python type annotations are long. 
*/ + div.doc-contents:has(table:has(thead th:nth-child(3))) { + overflow-x: auto; + } + + div.doc-contents table:has(thead th:nth-child(3)) { + table-layout: fixed; + width: 100%; + min-width: 42rem; + } + + div.doc-contents table:has(thead th:nth-child(3)) td { + vertical-align: top; + } + + div.doc-contents table:has(thead th:nth-child(3)) code { + white-space: normal; + overflow-wrap: anywhere; + word-break: normal; + } + + /* Attributes: Name, Type, Description. */ + div.doc-contents table:has(thead th:nth-child(3)):not(:has(thead th:nth-child(4))) th:nth-child(1), + div.doc-contents table:has(thead th:nth-child(3)):not(:has(thead th:nth-child(4))) td:nth-child(1) { + width: clamp(9rem, 18%, 12rem); + } + + div.doc-contents table:has(thead th:nth-child(3)):not(:has(thead th:nth-child(4))) th:nth-child(2), + div.doc-contents table:has(thead th:nth-child(3)):not(:has(thead th:nth-child(4))) td:nth-child(2) { + width: clamp(16rem, 38%, 34rem); + } + + /* Parameters: Name, Type, Description, Default. 
*/ + div.doc-contents table:has(thead th:nth-child(4)) { + min-width: 54rem; + } + + div.doc-contents table:has(thead th:nth-child(4)) th:nth-child(1), + div.doc-contents table:has(thead th:nth-child(4)) td:nth-child(1) { + width: clamp(9rem, 16%, 11rem); + } + + div.doc-contents table:has(thead th:nth-child(4)) th:nth-child(2), + div.doc-contents table:has(thead th:nth-child(4)) td:nth-child(2) { + width: clamp(16rem, 32%, 28rem); + } + + div.doc-contents table:has(thead th:nth-child(4)) th:nth-child(4), + div.doc-contents table:has(thead th:nth-child(4)) td:nth-child(4) { + width: clamp(5rem, 9%, 7rem); + } diff --git a/docs/css/style.css b/docs/css/style.css index 54c525904..474add1f7 100644 --- a/docs/css/style.css +++ b/docs/css/style.css @@ -162,6 +162,19 @@ h2 { .md-typeset__table > table { max-height: 60vh; + min-width: 100%; + width: max-content; + } + + .md-typeset__table { + display: block; + overflow-x: auto; + } + + .md-typeset__table code { + white-space: nowrap; + word-break: normal; + overflow-wrap: normal; } .md-typeset__table > table thead { diff --git a/docs/notebook_source/_README.md b/docs/notebook_source/_README.md index ff7dd541f..97bcdf8cb 100644 --- a/docs/notebook_source/_README.md +++ b/docs/notebook_source/_README.md @@ -136,7 +136,7 @@ Understanding these concepts will help you make the most of the tutorials: Quick reference guides for the main configuration objects: -- **[column_configs](../code_reference/column_configs.md)** - All column configuration types -- **[config_builder](../code_reference/config_builder.md)** - The `DataDesignerConfigBuilder` API -- **[data_designer_config](../code_reference/data_designer_config.md)** - Main configuration schema -- **[validator_params](../code_reference/validator_params.md)** - Validator configuration options +- **[column_configs](../code_reference/config/column_configs.md)** - All column configuration types +- **[config_builder](../code_reference/config/config_builder.md)** - The 
`DataDesignerConfigBuilder` API +- **[data_designer_config](../code_reference/config/data_designer_config.md)** - Main configuration schema +- **[validator_params](../code_reference/config/validator_params.md)** - Validator configuration options diff --git a/docs/plugins/available.md b/docs/plugins/available.md index 2489dcfdc..be855222e 100644 --- a/docs/plugins/available.md +++ b/docs/plugins/available.md @@ -1,3 +1,20 @@ -# 🚧 Coming Soon +# Available Plugins -This page will list available Data Designer plugins. Stay tuned! +Data Designer plugins come from two places: NVIDIA-maintained first-party packages and community packages shared by Data Designer users. + +## First-party plugins + +NVIDIA-maintained Data Designer plugins are developed and tested in the [DataDesignerPlugins](https://github.com/NVIDIA-NeMo/DataDesignerPlugins) repository. + +- [Browse first-party plugin documentation](https://nvidia-nemo.github.io/DataDesignerPlugins/plugins/) for available plugins and usage instructions. +- [View plugin packages](https://github.com/NVIDIA-NeMo/DataDesignerPlugins/tree/main/plugins) for source code, package metadata, and tests. + +## Community plugins + +There are no community plugins listed yet, but yours could be the first! If you build a plugin, please consider publishing and requesting a listing so other Data Designer users can find it here. + +To build and share a community plugin: + +1. Build the plugin following the [Build Your Own](build_your_own.md) guide. +2. Publish the plugin package to [PyPI](https://pypi.org/). +3. Open a GitHub issue on the [Data Designer repository](https://github.com/NVIDIA-NeMo/DataDesigner/issues) with the package name, PyPI URL, source repository, documentation link, supported Data Designer version, and plugin type. 
diff --git a/docs/plugins/build_your_own.md b/docs/plugins/build_your_own.md new file mode 100644 index 000000000..649b8bdd7 --- /dev/null +++ b/docs/plugins/build_your_own.md @@ -0,0 +1,307 @@ +# Build Your Own + +Data Designer supports three plugin types: **column generators**, **seed readers**, and **processors**. They all use the same package shape: a config class, an implementation class, and a `Plugin` object registered through a `data_designer.plugins` entry point. + +Use this page as the implementation checklist for plugin packages. Each tab below shows the core files for one plugin type. + +## Package shape + +Use the same structure for each plugin package: + +```text +data-designer-my-plugin/ +|-- pyproject.toml +`-- src/ + `-- data_designer_my_plugin/ + |-- __init__.py + |-- config.py + |-- impl.py + `-- plugin.py +``` + +## Implementation patterns + +=== "Column generator" + + This `index-multiplier` plugin adds a custom column whose value is the row index multiplied by a configurable integer. + + !!! note "Model-backed generators" + If your column generator interacts with models, include at least one `model_alias` field in the config and use the model registry from the implementation. See [Using Models in Plugins](models.md) for the registry access pattern. + + !!! info "Full-column vs cell-by-cell generators" + The example below uses `ColumnGeneratorFullColumn` because it can fill the whole batch from the DataFrame index. Use `ColumnGeneratorCellByCell` when each row can be generated independently from its upstream values and your `generate` method should receive and return a row dictionary. Cell-by-cell generation is especially useful for independent LLM calls because the async engine can run rows concurrently; the built-in [LLM completion generators](https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/packages/data-designer-engine/src/data_designer/engine/column_generators/generators/llm_completion.py) are good examples. 
Prefer `ColumnGeneratorFullColumn` for vectorized pandas operations, batched external APIs, or logic that needs to inspect or update the full batch at once. + + `config.py`: + + ```python + from __future__ import annotations + + from typing import Literal + + from data_designer.config.base import SingleColumnConfig + + + class IndexMultiplierColumnConfig(SingleColumnConfig): + column_type: Literal["index-multiplier"] = "index-multiplier" + multiplier: int = 2 + + @staticmethod + def get_column_emoji() -> str: + return "✖️" + + @property + def required_columns(self) -> list[str]: + return [] + + @property + def side_effect_columns(self) -> list[str]: + return [] + ``` + + `impl.py`: + + ```python + from __future__ import annotations + + from typing import TYPE_CHECKING + + from data_designer.engine.column_generators.generators.base import ColumnGeneratorFullColumn + + from data_designer_index_multiplier.config import IndexMultiplierColumnConfig + + if TYPE_CHECKING: + import pandas as pd + + + class IndexMultiplierColumnGenerator(ColumnGeneratorFullColumn[IndexMultiplierColumnConfig]): + def generate(self, data: pd.DataFrame) -> pd.DataFrame: + data[self.config.name] = data.index * self.config.multiplier + return data + ``` + + `plugin.py`: + + ```python + from __future__ import annotations + + from data_designer.plugins import Plugin, PluginType + + plugin = Plugin( + config_qualified_name="data_designer_index_multiplier.config.IndexMultiplierColumnConfig", + impl_qualified_name="data_designer_index_multiplier.impl.IndexMultiplierColumnGenerator", + plugin_type=PluginType.COLUMN_GENERATOR, + ) + ``` + + Entry point: + + ```toml + [project.entry-points."data_designer.plugins"] + index-multiplier = "data_designer_index_multiplier.plugin:plugin" + ``` + + For the generator implementation contract, see [Column Generators](../code_reference/engine/column_generators.md). For inline custom functions, see [Custom Columns](../concepts/custom_columns.md). 
+ +=== "Seed reader" + + This `prefixed-text-files` plugin loads text files from a directory and emits a seed dataset with prefixed file contents. + + `config.py`: + + ```python + from __future__ import annotations + + from typing import Literal + + from data_designer.config.seed_source import FileSystemSeedSource + + + class PrefixedTextSeedSource(FileSystemSeedSource): + seed_type: Literal["prefixed-text-files"] = "prefixed-text-files" + prefix: str = "plugin" + ``` + + `impl.py`: + + ```python + from __future__ import annotations + + from pathlib import Path + from typing import Any + + import data_designer.lazy_heavy_imports as lazy + from data_designer.engine.resources.seed_reader import ( + FileSystemSeedReader, + SeedReaderFileSystemContext, + ) + + from data_designer_prefixed_text_seed_reader.config import PrefixedTextSeedSource + + + class PrefixedTextSeedReader(FileSystemSeedReader[PrefixedTextSeedSource]): + output_columns = ["relative_path", "file_name", "prefixed_content"] + + def build_manifest( + self, + *, + context: SeedReaderFileSystemContext, + ) -> lazy.pd.DataFrame | list[dict[str, str]]: + matched_paths = self.get_matching_relative_paths( + context=context, + file_pattern=self.source.file_pattern, + recursive=self.source.recursive, + ) + return [ + { + "relative_path": relative_path, + "file_name": Path(relative_path).name, + } + for relative_path in matched_paths + ] + + def hydrate_row( + self, + *, + manifest_row: dict[str, Any], + context: SeedReaderFileSystemContext, + ) -> dict[str, str]: + relative_path = str(manifest_row["relative_path"]) + with context.fs.open(relative_path, "r", encoding="utf-8") as handle: + content = handle.read().strip() + return { + "relative_path": relative_path, + "file_name": str(manifest_row["file_name"]), + "prefixed_content": f"{self.source.prefix}:{content}", + } + ``` + + `plugin.py`: + + ```python + from __future__ import annotations + + from data_designer.plugins import Plugin, PluginType + + plugin = 
Plugin( + config_qualified_name="data_designer_prefixed_text_seed_reader.config.PrefixedTextSeedSource", + impl_qualified_name="data_designer_prefixed_text_seed_reader.impl.PrefixedTextSeedReader", + plugin_type=PluginType.SEED_READER, + ) + ``` + + Entry point: + + ```toml + [project.entry-points."data_designer.plugins"] + prefixed-text-files = "data_designer_prefixed_text_seed_reader.plugin:plugin" + ``` + + For the engine API behind this example, see [Seed Readers](../code_reference/engine/seed_readers.md). + +=== "Processor" + + This `regex-filter` plugin keeps only the rows whose column value matches a regular expression; set `invert` to keep the non-matching rows instead. + + `config.py`: + + ```python + from __future__ import annotations + + from typing import Literal + + from pydantic import Field + + from data_designer.config.base import ProcessorConfig + + + class RegexFilterProcessorConfig(ProcessorConfig): + processor_type: Literal["regex-filter"] = "regex-filter" + column: str = Field(description="Column to match against.") + pattern: str = Field(description="Regex pattern to match.") + invert: bool = Field(default=False, description="If True, keep rows that do not match.") + ``` + + `impl.py`: + + ```python + from __future__ import annotations + + from typing import TYPE_CHECKING + + from data_designer.engine.processing.processors.base import Processor + + from data_designer_regex_filter.config import RegexFilterProcessorConfig + + if TYPE_CHECKING: + import pandas as pd + + + class RegexFilterProcessor(Processor[RegexFilterProcessorConfig]): + def process_after_generation(self, data: pd.DataFrame) -> pd.DataFrame: + mask = data[self.config.column].astype(str).str.contains(self.config.pattern, regex=True) + if self.config.invert: + mask = ~mask + return data[mask].reset_index(drop=True) + ``` + + `plugin.py`: + + ```python + from __future__ import annotations + + from data_designer.plugins import Plugin, PluginType + + plugin = Plugin( +
config_qualified_name="data_designer_regex_filter.config.RegexFilterProcessorConfig", + impl_qualified_name="data_designer_regex_filter.impl.RegexFilterProcessor", + plugin_type=PluginType.PROCESSOR, + ) + ``` + + Entry point: + + ```toml + [project.entry-points."data_designer.plugins"] + regex-filter = "data_designer_regex_filter.plugin:plugin" + ``` + + For callback selection and processor execution details, see [Processors](../concepts/processors.md). For the engine API behind this example, see [Engine Processors code reference](../code_reference/engine/processors.md). + +## Install and use locally + +Install any plugin package in editable mode from the package directory: + +```bash +uv pip install -e . +``` + +The editable install registers the `data_designer.plugins` entry point so Data Designer can discover the plugin. + +!!! note "Restart your kernel after installing" + Data Designer caches the plugin registry on first import, so an `import data_designer` that already happened in your Python process — typical in a notebook — won't pick up a freshly installed plugin. After `uv pip install -e .`, restart the kernel (or interpreter) so the next import rebuilds the registry. + +## Validate plugins + +Data Designer provides a testing utility for common plugin structure checks: + +```python +from data_designer.engine.testing.utils import assert_valid_plugin +from data_designer_index_multiplier.plugin import plugin + +assert_valid_plugin(plugin) +``` + +`assert_valid_plugin` checks that the plugin's config inherits from `ConfigBase` and that the implementation class inherits from the appropriate base for its plugin type (`ConfigurableTask` for column generators, `SeedReader` for seed readers). + +For published plugins, add at least one functional test that runs the plugin through `DataDesigner.preview(...)`. This catches packaging and entry point issues that a direct implementation test can miss. 
+ +## Multiple plugins in one package + +A single Python package can register multiple plugins by defining multiple `Plugin` objects and entry points: + +```toml +[project.entry-points."data_designer.plugins"] +my-column-generator = "my_package.plugins.column_generator.plugin:column_generator_plugin" +my-seed-reader = "my_package.plugins.seed_reader.plugin:seed_reader_plugin" +my-processor = "my_package.plugins.processor.plugin:processor_plugin" +``` diff --git a/docs/plugins/example.md b/docs/plugins/example.md deleted file mode 100644 index ce847be93..000000000 --- a/docs/plugins/example.md +++ /dev/null @@ -1,289 +0,0 @@ -!!! warning "Experimental Feature" - The plugin system is currently **experimental** and under active development. The documentation, examples, and plugin interface are subject to significant changes in future releases. If you encounter any issues, have questions, or have ideas for improvement, please consider starting [a discussion on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner/discussions). - - -# Example Plugin: Column Generator - -Data Designer supports three plugin types: **column generators**, **seed readers**, and **processors**. This page walks through a complete column generator example. For filesystem-backed seed reader plugins, see [FileSystemSeedReader Plugins](filesystem_seed_reader.md). - -A Data Designer plugin is implemented as a Python package with three main components: - -1. **Configuration Class**: Defines the parameters users can configure -2. **Implementation Class**: Contains the core logic of the plugin -3. **Plugin Object**: Connects the config and implementation classes to make the plugin discoverable - -We recommend separating these into individual files (`config.py`, `impl.py`, `plugin.py`) within a plugin subdirectory. 
This keeps the code organized, makes it easy to test each component independently, and guards against circular dependencies — since the config module can be imported without pulling in the engine-level implementation classes, and the plugin object can be discovered without importing either. - ---- - -## Column Generator Plugin: Index Multiplier - -In this section, we will build a simple column generator plugin that generates values by multiplying the row index by a user-specified multiplier. - -### Step 1: Create a Python package - -We recommend the following structure for column generator plugins: - -``` -data-designer-index-multiplier/ -├── pyproject.toml -└── src/ - └── data_designer_index_multiplier/ - ├── __init__.py - ├── config.py - ├── impl.py - └── plugin.py -``` - -### Step 2: Create the config class - -The configuration class defines what parameters users can set when using your plugin. For column generator plugins, it must inherit from [SingleColumnConfig](../code_reference/column_configs.md#data_designer.config.column_configs.SingleColumnConfig) and include a [discriminator field](https://docs.pydantic.dev/latest/concepts/unions/#discriminated-unions). 
- -Create `src/data_designer_index_multiplier/config.py`: - -```python -from typing import Literal - -from data_designer.config.base import SingleColumnConfig - - -class IndexMultiplierColumnConfig(SingleColumnConfig): - """Configuration for the index multiplier column generator.""" - - # Required: discriminator field with a unique Literal type - # This value identifies your plugin and becomes its column_type - column_type: Literal["index-multiplier"] = "index-multiplier" - - # Configurable parameter for this plugin - multiplier: int = 2 - - @staticmethod - def get_column_emoji() -> str: - return "✖️" - - @property - def required_columns(self) -> list[str]: - """Columns that must exist before this generator runs.""" - return [] - - @property - def side_effect_columns(self) -> list[str]: - """Additional columns produced beyond the primary column.""" - return [] -``` - -**Key points:** - -- The `column_type` field must be a `Literal` type with a string default -- This value uniquely identifies your plugin (use kebab-case) -- Add any custom parameters your plugin needs (here: `multiplier`) -- `SingleColumnConfig` is a Pydantic model, so you can leverage all of Pydantic's validation features -- `get_column_emoji()` returns the emoji displayed in logs for this column type -- `required_columns` lists any columns this generator depends on (empty if none) -- `side_effect_columns` lists any additional columns this generator produces beyond the primary column (empty if none) - -**If your plugin can expand or retract the number of rows (1:N or N:1):** set `allow_resize=True` in the config class so the pipeline updates batch bookkeeping correctly. For example: - -```python -class MyColumnConfig(SingleColumnConfig): - column_type: Literal["my-plugin"] = "my-plugin" - allow_resize: bool = True # required when output row count can differ from input - # ... 
-``` - -The default is `False`; only set it to `True` when your `generate` method can return more or fewer rows than it receives. - -### Step 3: Create the implementation class - -The implementation class defines the actual business logic of the plugin. For column generator plugins, inherit from `ColumnGeneratorFullColumn` or `ColumnGeneratorCellByCell` and implement the `generate` method. - -Create `src/data_designer_index_multiplier/impl.py`: - -```python -import logging - -import pandas as pd -from data_designer.engine.column_generators.generators.base import ColumnGeneratorFullColumn - -from data_designer_index_multiplier.config import IndexMultiplierColumnConfig - -logger = logging.getLogger(__name__) - - -class IndexMultiplierColumnGenerator(ColumnGeneratorFullColumn[IndexMultiplierColumnConfig]): - - def generate(self, data: pd.DataFrame) -> pd.DataFrame: - """Generate the column data. - - Args: - data: The current DataFrame being built - - Returns: - The DataFrame with the new column added - """ - logger.info( - f"Generating column {self.config.name} " - f"with multiplier {self.config.multiplier}" - ) - - data[self.config.name] = data.index * self.config.multiplier - - return data -``` - -**Key points:** - -- Generic type `ColumnGeneratorFullColumn[IndexMultiplierColumnConfig]` connects the implementation to its config -- You have access to the configuration parameters via `self.config` - -!!! info "Understanding generation_strategy" - The `generation_strategy` specifies how the column generator will generate data. You choose a strategy by inheriting from the corresponding base class: - - - **`ColumnGeneratorFullColumn`**: Generates the full column (at the batch level) in a single call to `generate` - - `generate` must take as input a `pd.DataFrame` with all previous columns and return a `pd.DataFrame` with the generated column appended. 
- - - **`ColumnGeneratorCellByCell`**: Generates one cell at a time - - `generate` must take as input a `dict` with key/value pairs for all previous columns and return a `dict` with an additional key/value for the generated cell - - Supports concurrent workers via a `max_parallel_requests` parameter on the configuration - -### Step 4: Create the plugin object - -Create a `Plugin` object that makes the plugin discoverable and connects the implementation and config classes. - -Create `src/data_designer_index_multiplier/plugin.py`: - -```python -from data_designer.plugins import Plugin, PluginType - -plugin = Plugin( - config_qualified_name="data_designer_index_multiplier.config.IndexMultiplierColumnConfig", - impl_qualified_name="data_designer_index_multiplier.impl.IndexMultiplierColumnGenerator", - plugin_type=PluginType.COLUMN_GENERATOR, -) -``` - -### Step 5: Package your plugin - -Create a `pyproject.toml` file to define your package and register the entry point: - -```toml -[project] -name = "data-designer-index-multiplier" -version = "1.0.0" -description = "Data Designer index multiplier plugin" -requires-python = ">=3.10" -dependencies = [ - "data-designer", -] - -# Register this plugin via entry points -[project.entry-points."data_designer.plugins"] -index-multiplier = "data_designer_index_multiplier.plugin:plugin" - -[build-system] -requires = ["hatchling"] -build-backend = "hatchling.build" - -[tool.hatch.build.targets.wheel] -packages = ["src/data_designer_index_multiplier"] -``` - -!!! info "Entry Point Registration" - Plugins are discovered automatically using [Python entry points](https://packaging.python.org/en/latest/guides/creating-and-discovering-plugins/#using-package-metadata). It is important to register your plugin as an entry point under the `data_designer.plugins` group. 
- - The entry point format is: - ```toml - [project.entry-points."data_designer.plugins"] - = ":" - ``` - -### Step 6: Install and use your plugin locally - -Install your plugin in editable mode — this is all you need to start using it. No PyPI publishing required: - -```bash -# From the plugin directory -uv pip install -e . -``` - -That's it. The editable install registers the entry point so Data Designer discovers your plugin automatically. Any changes you make to the plugin source code are picked up immediately without reinstalling. - -Once installed, your plugin works just like built-in column types: - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -from data_designer_index_multiplier.config import IndexMultiplierColumnConfig - -data_designer = DataDesigner() -builder = dd.DataDesignerConfigBuilder() - -# Add a regular column -builder.add_column( - dd.SamplerColumnConfig( - name="category", - sampler_type="category", - params=dd.CategorySamplerParams(values=["A", "B", "C"]), - ) -) - -# Add your custom plugin column -builder.add_column( - IndexMultiplierColumnConfig( - name="scaled_index", - multiplier=5, - ) -) - -# Generate data -results = data_designer.create(builder, num_records=10) -print(results.load_dataset()) -``` - -Output: -``` - category scaled_index -0 B 0 -1 A 5 -2 C 10 -3 A 15 -4 B 20 -... -``` - ---- - -## Validating Your Plugin - -Data Designer provides a testing utility to validate that your plugin is structured correctly. 
Use `assert_valid_plugin` to check that your config and implementation classes are properly defined: - -```python -from data_designer.engine.testing.utils import assert_valid_plugin -from data_designer_index_multiplier.plugin import plugin - -# Raises AssertionError with a descriptive message if anything is wrong with the general plugin structure -assert_valid_plugin(plugin) -``` - -This validates that: - -- The config class is a subclass of `ConfigBase` -- For column generator plugins: the implementation class is a subclass of `ConfigurableTask` -- For seed reader plugins: the implementation class is a subclass of `SeedReader` - ---- - -## Multiple Plugins in One Package - -A single Python package can register multiple plugins. Simply define multiple `Plugin` instances and register each one as a separate entry point: - -```toml -[project.entry-points."data_designer.plugins"] -my-column-generator = "my_package.plugins.column_generator.plugin:column_generator_plugin" -my-seed-reader = "my_package.plugins.seed_reader.plugin:seed_reader_plugin" -``` - -For an example of this pattern, see the end-to-end test plugins in the [tests_e2e/](https://github.com/NVIDIA-NeMo/DataDesigner/tree/main/tests_e2e) directory. - -That's it! You now know how to create a Data Designer plugin. A local editable install (`uv pip install -e .`) is all you need to develop, test, and use your plugin. If you want to make it available for others to install via `pip install`, publish it to PyPI or your organization's package index. diff --git a/docs/plugins/filesystem_seed_reader.md b/docs/plugins/filesystem_seed_reader.md deleted file mode 100644 index 04e32daff..000000000 --- a/docs/plugins/filesystem_seed_reader.md +++ /dev/null @@ -1,167 +0,0 @@ -# FileSystemSeedReader Plugins - -!!! warning "Experimental Feature" - The plugin system is currently **experimental** and under active development. 
The documentation, examples, and plugin interface are subject to significant changes in future releases. If you encounter any issues, have questions, or have ideas for improvement, please consider starting [a discussion on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner/discussions). - -`FileSystemSeedReader` is the simplest way to build a seed reader plugin when your source data lives in a directory of files. You describe the files cheaply in `build_manifest(...)`, then optionally read and reshape them in `hydrate_row(...)`. - -This guide focuses on the filesystem-specific contract. The fastest way to learn it is usually to start with an inline reader over `DirectorySeedSource`, then package that reader later only if you need automatic plugin discovery or a brand-new `seed_type`. For a runnable single-file example, see the [Markdown Section Seed Reader recipe](../recipes/plugin_development/markdown_seed_reader.md). - -## What the framework owns - -When you inherit from `FileSystemSeedReader`, Data Designer already handles: - -- attachment-scoped filesystem context reuse -- file matching with `file_pattern` and `recursive` -- manifest sampling, `IndexRange`, `PartitionBlock`, and shuffle -- batching and DuckDB registration -- hydrated output schema validation via `output_columns` - -Most readers only need to implement `build_manifest(...)` and `hydrate_row(...)`. - -## Start with an existing filesystem config - -If your source data already fits `DirectorySeedSource` or `FileContentsSeedSource`, you do not need a new config model just to learn or prototype a reader. Reuse the built-in source type and override how one `DataDesigner` instance interprets that seed source. 
- -The Markdown recipe uses `DirectorySeedSource(path=..., file_pattern="*.md")` and pairs it with an inline reader: - -```python -import data_designer.config as dd -from pathlib import Path -from typing import Any - -from data_designer.engine.resources.seed_reader import FileSystemSeedReader, SeedReaderFileSystemContext - - -class MarkdownSectionDirectorySeedReader(FileSystemSeedReader[dd.DirectorySeedSource]): - output_columns = [ - "relative_path", - "file_name", - "section_index", - "section_header", - "section_content", - ] - - def build_manifest(self, *, context: SeedReaderFileSystemContext) -> list[dict[str, str]]: - matched_paths = self.get_matching_relative_paths( - context=context, - file_pattern=self.source.file_pattern, - recursive=self.source.recursive, - ) - return [ - { - "relative_path": relative_path, - "file_name": Path(relative_path).name, - } - for relative_path in matched_paths - ] - - def hydrate_row( - self, - *, - manifest_row: dict[str, Any], - context: SeedReaderFileSystemContext, - ) -> list[dict[str, Any]]: - ... -``` - -This approach lets you inspect the manifest and hydration contract without first creating a package, entry points, or a new `seed_type`. - -## Step 1: Build a cheap manifest - -`build_manifest(...)` should be inexpensive. Usually that means enumerating matching files and returning one logical row per file, without reading file contents yet. - -In this example, the manifest only tracks: - -- `relative_path` -- `file_name` - -That keeps selection and partitioning file-based. 
- -## Step 2: Hydrate one file into one or many rows - -`hydrate_row(...)` can return either: - -- a single record dict for `1:1` hydration -- an iterable of record dicts for `1:N` hydration - -If hydration changes the schema, set `output_columns` to the exact emitted schema: - -```python -output_columns = [ - "relative_path", - "file_name", - "section_index", - "section_header", - "section_content", -] -``` - -In the recipe implementation, `hydrate_row(...)` reads one file and emits one record per ATX heading section. - -Every emitted record must match `output_columns` exactly. Data Designer will raise a plugin-facing error if a hydrated record is missing a declared column or includes an undeclared one. - -## Step 3: Pass the reader to Data Designer - -Register the inline reader on the `DataDesigner` instance you want to use: - -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner - -data_designer = DataDesigner(seed_readers=[MarkdownSectionDirectorySeedReader()]) - -builder = dd.DataDesignerConfigBuilder() -builder.with_seed_dataset( - dd.DirectorySeedSource(path="sample_data", file_pattern="*.md"), -) -``` - -That pattern overrides how this `DataDesigner` instance handles the built-in `directory` seed source. Because `seed_readers` sets the registry for that instance, include any other readers you still want available. This is a good fit for local experiments, tests, and docs recipes. - -## Manifest-Based Selection Semantics - -Selection stays manifest-based even when `hydrate_row(...)` fans out. 
- -If the matched files are: - -```text -0 -> faq.md -1 -> guide.md -``` - -and `guide.md` hydrates into two section rows, then: - -```python -import data_designer.config as dd -from data_designer.config.seed import IndexRange - -builder.with_seed_dataset( - dd.DirectorySeedSource(path="sample_data", file_pattern="*.md"), - selection_strategy=IndexRange(start=1, end=1), -) -``` - -selects only `guide.md`, then returns **all** section rows emitted from `guide.md`. - -That means `get_seed_dataset_size()`, `IndexRange`, `PartitionBlock`, and shuffle all operate on manifest rows before hydration. - -## Package it later when needed - -If you want the same reader to be installable and auto-discovered as a plugin, then move from the inline pattern to a package: - -- define a config class that inherits from `FileSystemSeedSource` -- give it a unique `seed_type` -- create a `Plugin` object with `plugin_type=PluginType.SEED_READER` -- register that plugin via a `data_designer.plugins` entry point - -That extra packaging step is only necessary when you need a reusable plugin boundary. The reader logic itself still lives in the same `build_manifest(...)` and `hydrate_row(...)` methods shown above. - -## Advanced Hooks - -If you need more control, `FileSystemSeedReader` also lets you override: - -- `on_attach(...)` for per-attachment setup -- `create_filesystem_context(...)` for custom rooted filesystem behavior - -Most filesystem plugins do not need either hook. diff --git a/docs/plugins/models.md b/docs/plugins/models.md new file mode 100644 index 000000000..a83f5b79f --- /dev/null +++ b/docs/plugins/models.md @@ -0,0 +1,195 @@ +# Using Models in Plugins + +Model access belongs in column generator implementations, not config objects. Keep the config declarative by asking users for model aliases, then resolve those aliases at runtime through the model registry. 
+ +Do not construct model clients in plugin configs, read API keys in configs, or bypass Data Designer's model providers. The engine builds a `ResourceProvider` and exposes its model registry to every generator at: + +```python +self.resource_provider.model_registry +``` + +## Access the registry + +Use a model-aware column generator base whenever your plugin needs the registry: + +| Need | Base class | Registry access | +|------|------------|-----------------| +| Primary model alias | `ColumnGeneratorWithModel` | Use `self.model`, `self.model_config`, and `self.inference_parameters`. | +| Multiple aliases or provider inspection | `ColumnGeneratorWithModelRegistry` | Use `self.get_model(alias)`, `self.get_model_config(alias)`, and `self.get_model_provider_name(alias)`. | + +`ColumnGeneratorWithModel` is a convenience subclass of `ColumnGeneratorWithModelRegistry`. It expects the config to have a `model_alias` field and resolves that one alias for you. For independent model calls, return `GenerationStrategy.CELL_BY_CELL` so the runtime can fan out rows like the built-in LLM, embedding, and image generators. Use full-column generation only when your plugin intentionally calls a batched API for the whole DataFrame. 
+ +```python +from __future__ import annotations + +from data_designer.config.column_configs import GenerationStrategy +from data_designer.engine.column_generators.generators.base import ColumnGeneratorWithModel +from data_designer.engine.models.parsers.errors import ParserException + +from data_designer_sentiment_label.config import SentimentLabelColumnConfig + + +def parse_sentiment_label(response: str) -> str: + label = response.strip().lower() + if label not in {"positive", "neutral", "negative"}: + raise ParserException("Expected exactly one of: positive, neutral, negative.", source=response) + return label + + +class SentimentLabelColumnGenerator(ColumnGeneratorWithModel[SentimentLabelColumnConfig]): + @staticmethod + def get_generation_strategy() -> GenerationStrategy: + return GenerationStrategy.CELL_BY_CELL + + async def agenerate(self, data: dict) -> dict: + label, _ = await self.model.agenerate( + prompt=f"Classify the sentiment of this text: {data[self.config.source_column]}", + system_prompt="Return exactly one label: positive, neutral, or negative.", + parser=parse_sentiment_label, + max_correction_steps=self.resource_provider.run_config.max_conversation_correction_steps, + max_conversation_restarts=self.resource_provider.run_config.max_conversation_restarts, + purpose=f"running generation for column '{self.config.name}'", + ) + data[self.config.name] = label + return data +``` + +The matching config must include `model_alias: str` as a normal user-facing field: + +```python +from __future__ import annotations + +from typing import Literal + +from data_designer.config.base import SingleColumnConfig + + +class SentimentLabelColumnConfig(SingleColumnConfig): + column_type: Literal["sentiment-label"] = "sentiment-label" + source_column: str + model_alias: str + + @property + def required_columns(self) -> list[str]: + return [self.source_column] + + @property + def side_effect_columns(self) -> list[str]: + return [] +``` + +Users set that alias from 
default model settings or from `DataDesignerConfigBuilder(model_configs=...)`. + +## Use multiple models + +If your plugin uses multiple model aliases, inherit from `ColumnGeneratorWithModelRegistry` and resolve each alias explicitly with `self.get_model(...)`. + +The config must include a primary `model_alias: str` field. Startup health checks read it directly from any column config whose generator inherits from `ColumnGeneratorWithModelRegistry`, including generators that inherit through `ColumnGeneratorWithModel`. A config for this pattern might also define `judge_model_alias`, `critic_model_alias`, or another task-specific alias. + +Validate additional alias fields in `_validate()` or `_initialize()` with `get_model_config(...)` so missing aliases fail before generation starts. `get_model_config(alias)` only verifies that the alias is registered; it does not call the endpoint. Endpoint reachability is only exercised for the primary `model_alias` collected by the standard startup health check. 
+ +The matching config shows which alias gets the standard startup health check and which alias the plugin validates itself: + +```python +from __future__ import annotations + +from typing import Literal + +from data_designer.config.base import SingleColumnConfig + + +class PairwiseJudgeColumnConfig(SingleColumnConfig): + column_type: Literal["pairwise-judge"] = "pairwise-judge" + question_column: str + model_alias: str + judge_model_alias: str + + @property + def required_columns(self) -> list[str]: + return [self.question_column] + + @property + def side_effect_columns(self) -> list[str]: + return [] +``` + +```python +from __future__ import annotations + +from data_designer.config.column_configs import GenerationStrategy +from data_designer.engine.column_generators.generators.base import ColumnGeneratorWithModelRegistry +from data_designer.engine.models.parsers.errors import ParserException + +from data_designer_pairwise_judge.config import PairwiseJudgeColumnConfig + + +def parse_score(response: str) -> int: + text = response.strip() + if text not in {"1", "2", "3", "4", "5"}: + raise ParserException("Expected an integer score from 1 to 5.", source=response) + return int(text) + + +class PairwiseJudgeColumnGenerator(ColumnGeneratorWithModelRegistry[PairwiseJudgeColumnConfig]): + @staticmethod + def get_generation_strategy() -> GenerationStrategy: + return GenerationStrategy.CELL_BY_CELL + + def _validate(self) -> None: + self.get_model_config(self.config.model_alias) + self.get_model_config(self.config.judge_model_alias) + + async def agenerate(self, data: dict) -> dict: + generator_model = self.get_model(self.config.model_alias) + judge_model = self.get_model(self.config.judge_model_alias) + retry_kwargs = { + "max_correction_steps": self.resource_provider.run_config.max_conversation_correction_steps, + "max_conversation_restarts": self.resource_provider.run_config.max_conversation_restarts, + } + + draft, _ = await generator_model.agenerate( + prompt=f"Draft 
an answer for: {data[self.config.question_column]}", + purpose=f"drafting an answer for column '{self.config.name}'", + **retry_kwargs, + ) + score, _ = await judge_model.agenerate( + prompt=f"Score this answer from 1 to 5: {draft}", + system_prompt="Return exactly one integer from 1 to 5.", + parser=parse_score, + purpose=f"judging an answer for column '{self.config.name}'", + **retry_kwargs, + ) + data[self.config.name] = {"draft": draft, "score": score} + return data +``` + +## What the registry returns + +`get_model(...)` returns a `ModelFacade`. Call the facade based on the modality your plugin needs: + +- Chat completion aliases use `model.generate(...)` or `await model.agenerate(...)` and return `(parsed_output, trace)`. +- Embedding aliases use `model.generate_text_embeddings(...)` or `await model.agenerate_text_embeddings(...)` and return `list[list[float]]`. +- Image aliases use `model.generate_image(...)` or `await model.agenerate_image(...)` and return `list[str]` of base64-encoded image data. + +Choose a model alias whose `ModelConfig.inference_parameters.generation_type` matches the facade method you call. The facade merges the alias's configured inference parameters into each request. + +Pass runtime context such as `prompt`, `system_prompt`, `parser`, `tool_alias`, `multi_modal_context`, `max_correction_steps`, `max_conversation_restarts`, and `purpose` at the call site. Parser functions should raise `ParserException` for invalid model responses; that is what allows `ModelFacade.generate(...)` and `ModelFacade.agenerate(...)` to run correction turns and conversation restarts. + +Prefer implementing `agenerate(...)` for model-backed plugins. The base `generate(...)` method can bridge to `agenerate(...)` for sync runs when the subclass only implements async generation. If your plugin has a sync-specific path, implement both `generate(...)` and `agenerate(...)`, as the built-in generators do. 
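To see why a parser should raise instead of returning a fallback value, consider the toy retry loop below. It is a simplified mental model and not the `ModelFacade` implementation: `ToyParserError` stands in for `ParserException`, `generate_with_corrections` and `ask` are hypothetical names, and the way the retry budget is spent (correction turns inside a conversation, then a fresh conversation on restart) is an assumption made for illustration.

```python
from __future__ import annotations

from typing import Callable


class ToyParserError(Exception):
    """Hypothetical stand-in for ParserException; not the library class."""


def parse_score(response: str) -> int:
    text = response.strip()
    if text not in {"1", "2", "3", "4", "5"}:
        raise ToyParserError(f"Expected an integer from 1 to 5, got {text!r}.")
    return int(text)


def generate_with_corrections(
    ask: Callable[[list[str]], str],
    prompt: str,
    parser: Callable[[str], int],
    max_correction_steps: int = 1,
    max_conversation_restarts: int = 1,
) -> int:
    """Toy loop: retry within a conversation, then restart it, until the parser accepts."""
    for _restart in range(max_conversation_restarts + 1):
        messages = [prompt]  # a restart begins from a fresh conversation
        for _step in range(max_correction_steps + 1):
            response = ask(messages)
            try:
                return parser(response)
            except ToyParserError as error:
                # Feed the parser's complaint back so the next turn can correct it.
                messages += [response, f"Correction needed: {error}"]
    raise ToyParserError("No valid response within the retry budget.")


# Scripted fake model: the first reply is invalid, the second is valid.
replies = iter(["four", "4"])
score = generate_with_corrections(
    lambda messages: next(replies),
    "Score this answer from 1 to 5.",
    parse_score,
)
print(score)  # 4
```

A parser that silently returned a default for `"four"` would give a loop like this nothing to react to, so the invalid response would be stored as if it were a valid result.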
+ +## Health checks and scheduling + +The model-aware bases mark the generator as LLM-bound, so the async scheduler treats the work like other model calls. + +Plugin discovery treats column generator implementations that inherit from `ColumnGeneratorWithModelRegistry` as model-generated column types for startup model health checks. The standard health-check collection reads a primary `model_alias` field directly from the config. Additional alias fields should be registration-validated by the plugin implementation; their endpoints are not pinged by the standard startup health check. + +## Built-in patterns + +The built-in model-backed generators use these same hooks: + +- `LLMTextCellGenerator`, `LLMCodeCellGenerator`, `LLMStructuredCellGenerator`, and `LLMJudgeCellGenerator` inherit through a chat-completion base that uses `ColumnGeneratorWithModel`. They render prompts from row data, call `self.model.generate(...)` or `self.model.agenerate(...)`, pass parsers into the `ModelFacade`, and store optional trace side-effect columns. +- `EmbeddingCellGenerator` uses `ColumnGeneratorWithModel` but calls the facade's embedding methods instead of chat completion. +- `ImageCellGenerator` uses `ColumnGeneratorWithModel`, renders a prompt, calls the facade's image methods, and writes generated media through the artifact storage supplied by the same `ResourceProvider`. +- `CustomColumnGenerator` is the inline-function counterpart: when users declare `model_aliases`, it builds a `models` dict from `resource_provider.model_registry`. Packaged plugins usually use `ColumnGeneratorWithModel` or `ColumnGeneratorWithModelRegistry` directly instead of recreating that dict. + +See [Column Generators](../code_reference/engine/column_generators.md) for the full base-class API and [Custom Model Settings](../concepts/models/custom-model-settings.md) for configuring model aliases. 
diff --git a/docs/plugins/overview.md b/docs/plugins/overview.md index 45a469c80..24fa57692 100644 --- a/docs/plugins/overview.md +++ b/docs/plugins/overview.md @@ -1,88 +1,35 @@ # Data Designer Plugins -!!! warning "Experimental Feature" - The plugin system is currently **experimental** and under active development. The documentation, examples, and plugin interface are subject to significant changes in future releases. If you encounter any issues, have questions, or have ideas for improvement, please consider starting [a discussion on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner/discussions). +Plugins let you add new object types to Data Designer without modifying the core library. Once installed, plugins behave like native Data Designer objects: they use the same declarative config patterns, builder APIs, discovery flow, and runtime execution paths as the built-in objects. -## What are plugins? +## Supported plugin types -Plugins are Python packages that extend Data Designer's capabilities without modifying the core library. Similar to [VS Code extensions](https://marketplace.visualstudio.com/vscode) and [Pytest plugins](https://docs.pytest.org/en/stable/reference/plugin_list.html), the plugin system empowers you to build specialized extensions for your specific use cases and share them with the community. +Data Designer supports three plugin types: -**Current capabilities**: Data Designer supports three plugin types: +- **Column generator plugins**: Custom [column generators](../code_reference/engine/column_generators.md) you pass to the config builder's [add_column](../code_reference/config/config_builder.md#data_designer.config.config_builder.DataDesignerConfigBuilder.add_column) method. +- **Seed reader plugins**: Custom [seed readers](../code_reference/engine/seed_readers.md) that load data from new sources, such as databases, cloud storage, or custom file formats. 
+- **Processor plugins**: Custom [processor implementations](../code_reference/engine/processors.md) configured by processor config objects that transform data before batches, after batches, or after generation completes. Pass them to the config builder's [add_processor](../code_reference/config/config_builder.md#data_designer.config.config_builder.DataDesignerConfigBuilder.add_processor) method. -- **Column Generator Plugins**: Custom column types you pass to the config builder's [add_column](../code_reference/config_builder.md#data_designer.config.config_builder.DataDesignerConfigBuilder.add_column) method. -- **Seed Reader Plugins**: Custom seed dataset readers that let you load data from new sources (e.g., databases, cloud storage, custom formats). -- **Processor Plugins**: Custom processors that transform data before batches, after batches, or after generation completes. Pass them to the config builder's [add_processor](../code_reference/config_builder.md#data_designer.config.config_builder.DataDesignerConfigBuilder.add_processor) method. +## Use an Installed Plugin -## How do you use plugins? +Plugin packages register their `Plugin` objects through Python package [entry points](https://packaging.python.org/en/latest/guides/creating-and-discovering-plugins/#using-package-metadata). Data Designer discovers installed plugin entry points automatically, so no extra registration code is required. Simply install the plugin package and use its new object types in your Data Designer workflow. -A Data Designer plugin is just a Python package configured with an [entry point](https://packaging.python.org/en/latest/guides/creating-and-discovering-plugins/#using-package-metadata) that points to a Data Designer `Plugin` object. Using a plugin is as simple as installing the package: +If you install a plugin after `data_designer` has already been imported, restart the Python process so plugin discovery can rebuild from the new entry points. 
-```bash -# Install a local plugin (for development and testing) -uv pip install -e /path/to/your/plugin +## Build a Plugin -# Or install a published plugin from PyPI -pip install data-designer-{plugin-name} -``` +For implementation instructions across all plugin types, see the [Build Your Own](build_your_own.md) section. -Once installed, plugins are automatically discovered and ready to use — no additional registration or configuration needed. See the [example plugin](example.md) for a complete walkthrough, or jump to [FileSystemSeedReader Plugins](filesystem_seed_reader.md) for filesystem-backed seed reader authoring. +## Find Plugins -## How do you create plugins? +NVIDIA-maintained plugin packages live in the [DataDesignerPlugins](https://github.com/NVIDIA-NeMo/DataDesignerPlugins) repository. See [Available Plugins](available.md) for lists of first-party and community-contributed plugins. -Creating a plugin involves three main steps: +## Discovery troubleshooting -### 1. Implement the Plugin Components +If a plugin is installed but not available, check these items first: -Each plugin has three components, and we recommend organizing them into separate files within a plugin subdirectory: - -- **`config.py`** -- Configuration class defining user-facing parameters - - Column generator plugins: inherit from `SingleColumnConfig` with a `column_type` discriminator - - Seed reader plugins: inherit from `SeedSource` with a `seed_type` discriminator - - Processor plugins: inherit from `ProcessorConfig` with a `processor_type` discriminator -- **`impl.py`** -- Implementation class containing the core logic - - Column generator plugins: inherit from `ColumnGeneratorFullColumn` or `ColumnGeneratorCellByCell` - - Seed reader plugins: inherit from `SeedReader` or `FileSystemSeedReader` for directory-backed sources - - Processor plugins: inherit from `Processor` and override callback methods (`process_before_batch`, `process_after_batch`, `process_after_generation`) -- 
**`plugin.py`** -- A `Plugin` instance that connects the config and implementation classes - -### 2. Package Your Plugin - -- Set up a Python package with `pyproject.toml` -- Register your plugin using entry points under `data_designer.plugins` -- Define dependencies (including `data-designer`) - -### 3. Install and Test Locally - -- Install your plugin locally with `uv pip install -e .` (editable mode) -- No publishing required — your plugin is usable immediately after a local install -- Iterate on your plugin code with fast feedback - -### 4. Share Your Plugin (Optional) - -- Publish to PyPI or another package index to make it installable by anyone via `pip install` -- This step is only needed if you want others outside your environment to use the plugin - -**Example entry point for a processor plugin:** - -```toml -[project.entry-points."data_designer.plugins"] -my-processor = "my_plugin.plugin:my_processor_plugin" -``` - -Where `my_processor_plugin` is a `Plugin` instance with `plugin_type=PluginType.PROCESSOR`: - -```python -from data_designer.plugins.plugin import Plugin, PluginType - -my_processor_plugin = Plugin( - config_qualified_name="my_plugin.config.MyProcessorConfig", - impl_qualified_name="my_plugin.impl.MyProcessor", - plugin_type=PluginType.PROCESSOR, -) -``` - -**Ready to get started?** - -- See the [Example Plugin](example.md) for a column generator walkthrough -- See [FileSystemSeedReader Plugins](filesystem_seed_reader.md) for filesystem-backed seed reader plugins -- See the [Markdown Section Seed Reader recipe](../recipes/plugin_development/markdown_seed_reader.md) for a runnable single-file `1:N` filesystem reader example +- The entry point group must be exactly `data_designer.plugins`. +- Check the value of the `DISABLE_DATA_DESIGNER_PLUGINS` environment variable. If it is set to `true`, entry point discovery is disabled. +- The plugin discriminator default must be a string. 
Use `column_type`, `seed_type`, or `processor_type`, depending on the plugin type. +- Avoid duplicate plugin names. Discovery stores plugins by `plugin.name`, which comes from the discriminator default. +- For plugin packages under development, call `assert_valid_plugin` on the plugin object to catch common structural issues at import time. diff --git a/docs/recipes/plugin_development/markdown_seed_reader.md b/docs/recipes/plugin_development/markdown_seed_reader.md index 22c8a8aed..6f81582df 100644 --- a/docs/recipes/plugin_development/markdown_seed_reader.md +++ b/docs/recipes/plugin_development/markdown_seed_reader.md @@ -9,7 +9,7 @@ This keeps the example focused on the actual seed reader contract: - declaring `output_columns` for the hydrated schema - keeping `IndexRange` selection manifest-based -Because the example reuses `DirectorySeedSource`, it does not register a brand-new `seed_type`. If you later want to package the same reader as an installable plugin, see [FileSystemSeedReader Plugins](../../plugins/filesystem_seed_reader.md). +Because the example reuses `DirectorySeedSource`, it does not register a brand-new `seed_type`. To package the same reader as an installable plugin, see [Build Your Own](../../plugins/build_your_own.md). 
## Run the Recipe diff --git a/mkdocs.yml b/mkdocs.yml index 61859b333..50f49b26d 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -69,20 +69,37 @@ nav: - "Frontier Judge QA Filter": recipes/vlm_long_doc/frontier_judge.md - Plugins: - Overview: plugins/overview.md - - Example Plugin: plugins/example.md - - FileSystemSeedReader Plugins: plugins/filesystem_seed_reader.md - - Available Plugin List: plugins/available.md + - Build Your Own: plugins/build_your_own.md + - Using Models: plugins/models.md + - Available Plugins: plugins/available.md - Code Reference: - - models: code_reference/models.md - - mcp: code_reference/mcp.md - - column_configs: code_reference/column_configs.md - - config_builder: code_reference/config_builder.md - - data_designer_config: code_reference/data_designer_config.md - - run_config: code_reference/run_config.md - - sampler_params: code_reference/sampler_params.md - - validator_params: code_reference/validator_params.md - - processors: code_reference/processors.md - - analysis: code_reference/analysis.md + - Overview: code_reference/index.md + # Keep module reference pages ordered alphabetically by nav label within each package group. 
+ - Config: + - Overview: code_reference/config/index.md + - analysis: code_reference/config/analysis.md + - column_configs: code_reference/config/column_configs.md + - config_builder: code_reference/config/config_builder.md + - data_designer_config: code_reference/config/data_designer_config.md + - mcp: code_reference/config/mcp.md + - models: code_reference/config/models.md + - plugins: code_reference/config/plugins.md + - processors: code_reference/config/processors.md + - run_config: code_reference/config/run_config.md + - sampler_params: code_reference/config/sampler_params.md + - seeds: code_reference/config/seeds.md + - validator_params: code_reference/config/validator_params.md + - Engine: + - Overview: code_reference/engine/index.md + - column_generators: code_reference/engine/column_generators.md + - mcp: code_reference/engine/mcp.md + - processors: code_reference/engine/processors.md + - seed_readers: code_reference/engine/seed_readers.md + - Interface: + - Overview: code_reference/interface/index.md + - data_designer: code_reference/interface/data_designer.md + - errors: code_reference/interface/errors.md + - results: code_reference/interface/results.md - Dev Notes: # NOTE: Order is most recent -> oldest (so sidebar shows recent first!) - devnotes/index.md @@ -197,6 +214,7 @@ markdown_extensions: base_path: - docs/ - . + url_download: true - pymdownx.highlight: pygments_lang_class: true use_pygments: true diff --git a/packages/data-designer-config/src/data_designer/config/analysis/column_profilers.py b/packages/data-designer-config/src/data_designer/config/analysis/column_profilers.py index f175570cc..9b1ccec38 100644 --- a/packages/data-designer-config/src/data_designer/config/analysis/column_profilers.py +++ b/packages/data-designer-config/src/data_designer/config/analysis/column_profilers.py @@ -58,7 +58,8 @@ class JudgeScoreProfilerConfig(ConfigBase): Must match a model alias defined in the Data Designer configuration. 
summary_score_sample_size: Number of score samples to include when prompting the LLM to generate summaries. Larger sample sizes provide more context but increase - token usage. Must be at least 1. Defaults to 20. + token usage. Must be at least 1 when provided. Set to None to skip LLM-generated + summaries. Defaults to 20. """ model_alias: str diff --git a/packages/data-designer-config/src/data_designer/config/analysis/column_statistics.py b/packages/data-designer-config/src/data_designer/config/analysis/column_statistics.py index 6b970926a..89c9883f6 100644 --- a/packages/data-designer-config/src/data_designer/config/analysis/column_statistics.py +++ b/packages/data-designer-config/src/data_designer/config/analysis/column_statistics.py @@ -268,7 +268,7 @@ class ValidationColumnStatistics(GeneralColumnStatistics): Inherits general statistics plus validation-specific metrics including the count and percentage of records that passed validation. Stores results from validation logic - (Python, SQL, or remote) executed against target columns. + (Python, SQL, local callable, or remote) executed against target columns. Attributes: num_valid_records: Number of records that passed validation. diff --git a/packages/data-designer-config/src/data_designer/config/analysis/dataset_profiler.py b/packages/data-designer-config/src/data_designer/config/analysis/dataset_profiler.py index c4ff5b969..5b3c9cfe1 100644 --- a/packages/data-designer-config/src/data_designer/config/analysis/dataset_profiler.py +++ b/packages/data-designer-config/src/data_designer/config/analysis/dataset_profiler.py @@ -24,14 +24,14 @@ class DatasetProfilerResults(BaseModel): """Container for complete dataset profiling and analysis results. - Stores profiling results for a generated dataset, including statistics for all columns, - dataset-level metadata, and optional advanced profiler results. Provides methods for - computing derived metrics and generating formatted reports. 
+ Stores profiling results for a generated dataset, including statistics for configured columns, + dataset-level metadata, side-effect column names, and optional advanced profiler results. + Provides methods for computing derived metrics and generating formatted reports. Attributes: num_records: Actual number of records successfully generated in the dataset. target_num_records: Target number of records that were requested to be generated. - column_statistics: List of statistics objects for all columns in the dataset. Each + column_statistics: List of statistics objects for configured columns. Each column has statistics appropriate to its type. Must contain at least one column. side_effect_column_names: Column names that were generated as side effects of other columns. column_profiles: Column profiler results for specific columns when configured. diff --git a/packages/data-designer-config/src/data_designer/config/base.py b/packages/data-designer-config/src/data_designer/config/base.py index 31f0df571..11a694a6e 100644 --- a/packages/data-designer-config/src/data_designer/config/base.py +++ b/packages/data-designer-config/src/data_designer/config/base.py @@ -90,6 +90,9 @@ class SingleColumnConfig(ConfigBase, ABC): name: Unique name of the column to be generated. drop: If True, the column will be generated but removed from the final dataset. Useful for intermediate columns that are dependencies for other columns. + allow_resize: If True, the generator may emit a different number of rows than + it received (1:N or N:1). Explicit ``skip`` gates are invalid on resize + columns, and upstream skip propagation is not applied to them. column_type: Discriminator field that identifies the specific column type. Subclasses must override this field to specify the column type with a `Literal` value. skip: Optional expression gate for conditional generation. 
@@ -171,6 +174,8 @@ class ProcessorConfig(ConfigBase, ABC): Attributes: name: Unique name of the processor, used to identify the processor in results and to name output artifacts on disk. + processor_type: Discriminator field that identifies the specific processor type. + Subclasses must override this field with a ``Literal`` value. """ name: str = Field( diff --git a/packages/data-designer-config/src/data_designer/config/column_configs.py b/packages/data-designer-config/src/data_designer/config/column_configs.py index 88dffe9a4..e7776f061 100644 --- a/packages/data-designer-config/src/data_designer/config/column_configs.py +++ b/packages/data-designer-config/src/data_designer/config/column_configs.py @@ -28,17 +28,18 @@ class GenerationStrategy(str, Enum): class SamplerColumnConfig(SingleColumnConfig): - """Configuration for columns generated using numerical samplers. + """Configuration for columns generated using built-in samplers. - Sampler columns provide efficient data generation using numerical samplers for - common data types and distributions. Supported samplers include UUID generation, + Sampler columns provide efficient data generation for common data types and + distributions. Supported samplers include UUID generation, datetime/timedelta sampling, person generation, category / subcategory sampling, and various statistical distributions (uniform, gaussian, binomial, poisson, scipy). Attributes: sampler_type (required): Type of sampler to use. Available types include: "uuid", "category", "subcategory", "uniform", "gaussian", "bernoulli", - "bernoulli_mixture", "binomial", "poisson", "scipy", "person", "datetime", "timedelta". + "bernoulli_mixture", "binomial", "poisson", "scipy", "person", + "person_from_faker", "datetime", "timedelta". params (required): Parameters specific to the chosen sampler type. Type varies based on the `sampler_type` (e.g., `CategorySamplerParams`, `UniformSamplerParams`, `PersonSamplerParams`). 
conditional_params: Optional dictionary for conditional parameters. The dict keys @@ -475,7 +476,7 @@ class ValidationColumnConfig(SingleColumnConfig): DataFrame with target columns and must return a DataFrame with validation results. - "local_callable": Call a local Python function with the data. Only supported when running DataDesigner locally. - - "remote": Send data to a remote HTTP endpoint for validation. Useful for + - "remote": Send data to a remote HTTP endpoint for validation. validator_params (required): Parameters specific to the validator type. Type varies by validator: - CodeValidatorParams: Specifies code language (python or SQL dialect like "sql:postgres", "sql:mysql"). diff --git a/packages/data-designer-config/src/data_designer/config/config_builder.py b/packages/data-designer-config/src/data_designer/config/config_builder.py index 7112aec7d..f69e63706 100644 --- a/packages/data-designer-config/src/data_designer/config/config_builder.py +++ b/packages/data-designer-config/src/data_designer/config/config_builder.py @@ -293,7 +293,10 @@ def add_column( The current Data Designer config builder instance. Raises: - BuilderConfigurationError: If the column name collides with an existing seed dataset column. + BuilderConfigurationError: If neither a column config nor the required constructor + arguments are provided. + InvalidColumnTypeError: If the provided column config is not one of the supported + column config types. """ if column_config is None: if name is None or column_type is None: @@ -615,7 +618,7 @@ def get_processor_configs(self) -> list[ProcessorConfigT]: """Get processor configuration objects. Returns: - A dictionary of processor configuration objects by dataset builder stage. + A list of processor configuration objects. 
""" return self._processor_configs diff --git a/packages/data-designer-config/src/data_designer/config/data_designer_config.py b/packages/data-designer-config/src/data_designer/config/data_designer_config.py index 86381332d..d7c42e6a8 100644 --- a/packages/data-designer-config/src/data_designer/config/data_designer_config.py +++ b/packages/data-designer-config/src/data_designer/config/data_designer_config.py @@ -22,7 +22,7 @@ class DataDesignerConfig(ExportableConfigBase): """Configuration for NeMo Data Designer. This class defines the main configuration structure for NeMo Data Designer, - which orchestrates the generation of synthetic data. + which the engine consumes when generating synthetic data. Attributes: columns: Required list of column configurations defining how each column @@ -34,6 +34,7 @@ class DataDesignerConfig(ExportableConfigBase): seed_config: Optional seed dataset settings to use for generation. constraints: Optional list of column constraints. profilers: Optional list of column profilers for analyzing generated data characteristics. + processors: Optional list of processor configurations for post-generation transformations. """ columns: list[Annotated[ColumnConfigT, Field(discriminator="column_type")]] = Field(min_length=1) diff --git a/packages/data-designer-config/src/data_designer/config/models.py b/packages/data-designer-config/src/data_designer/config/models.py index 9e3d8c44c..482f78308 100644 --- a/packages/data-designer-config/src/data_designer/config/models.py +++ b/packages/data-designer-config/src/data_designer/config/models.py @@ -285,7 +285,7 @@ class BaseInferenceParams(ConfigBase, ABC): """Base configuration for inference parameters. Attributes: - generation_type: Type of generation (chat-completion or embedding). Acts as discriminator. + generation_type: Type of generation (chat-completion, embedding, or image). Acts as discriminator. max_parallel_requests: Maximum number of parallel requests to the model API. 
timeout: Timeout in seconds for each request. extra_body: Additional parameters to pass to the model API. diff --git a/packages/data-designer-config/src/data_designer/config/processors.py b/packages/data-designer-config/src/data_designer/config/processors.py index 07c4b8b6e..9f0fa52c2 100644 --- a/packages/data-designer-config/src/data_designer/config/processors.py +++ b/packages/data-designer-config/src/data_designer/config/processors.py @@ -45,7 +45,7 @@ class DropColumnsProcessorConfig(ProcessorConfig): """Drop columns from the output dataset (prefer ``drop=True`` in the column config). This processor removes specified columns from the generated dataset. The dropped - columns are saved separately in a `dropped-columns` directory for reference. + columns are saved separately in the `dropped-columns-parquet-files` directory for reference. When this processor is added via the config builder, the corresponding column configs are automatically marked with `drop = True`. @@ -66,7 +66,7 @@ class SchemaTransformProcessorConfig(ProcessorConfig): This processor creates a new dataset with a transformed schema. Each key in the template becomes a column in the output, and values are Jinja2 templates that can reference any column in the batch. The transformed dataset is written to - a `processors-outputs/{processor_name}/` directory alongside the main dataset. + a `processors-files/{processor_name}/` directory alongside the main dataset. Attributes: template (required): Dictionary defining the output schema. 
Keys are new column names, diff --git a/packages/data-designer-config/src/data_designer/config/sampler_params.py b/packages/data-designer-config/src/data_designer/config/sampler_params.py index c6f73f34b..fa10892d5 100644 --- a/packages/data-designer-config/src/data_designer/config/sampler_params.py +++ b/packages/data-designer-config/src/data_designer/config/sampler_params.py @@ -93,7 +93,7 @@ class DatetimeSamplerParams(ConfigBase): Attributes: start (required): Earliest possible datetime for the sampling range (inclusive). Must be a valid datetime string parseable by pandas.to_datetime(). - end (required): Latest possible datetime for the sampling range (inclusive). Must be a valid + end (required): Exclusive upper bound for the sampling range. Must be a valid datetime string parseable by pandas.to_datetime(). unit: Time unit for sampling granularity. Options: - "Y": Years @@ -105,7 +105,7 @@ class DatetimeSamplerParams(ConfigBase): """ start: str = Field(..., description="Earliest possible datetime for sampling range, inclusive.") - end: str = Field(..., description="Latest possible datetime for sampling range, inclusive.") + end: str = Field(..., description="Exclusive upper bound for datetime sampling range.") unit: Literal["Y", "M", "D", "h", "m", "s"] = Field( default="D", description="Sampling units, e.g. the smallest possible time interval between samples.", @@ -394,13 +394,13 @@ class UniformSamplerParams(ConfigBase): Attributes: low (required): Lower bound of the uniform distribution (inclusive). Can be any real number. - high (required): Upper bound of the uniform distribution (inclusive). Must be greater than `low`. + high (required): Upper bound of the uniform distribution. Must be greater than `low`. decimal_places: Optional number of decimal places to round sampled values to. If None, values are not rounded and may have many decimal places. 
""" low: float = Field(..., description="Lower bound of the uniform distribution, inclusive.") - high: float = Field(..., description="Upper bound of the uniform distribution, inclusive.") + high: float = Field(..., description="Upper bound of the uniform distribution.") decimal_places: int | None = Field( default=None, description="Number of decimal places to round the sampled values to." ) @@ -418,9 +418,9 @@ class PersonSamplerParams(ConfigBase): """Parameters for sampling synthetic person data with demographic attributes. Generates realistic synthetic person data including names, addresses, phone numbers, and other - demographic information. Data can be sampled from managed datasets (when available) or generated - using Faker. The sampler supports filtering by locale, sex, age, geographic location, and can - optionally include synthetic persona descriptions. + demographic information from managed datasets. The sampler supports filtering by locale, sex, age, + geographic location, and selected managed-dataset fields, and can optionally include synthetic + persona descriptions. For Faker-generated person data, use PersonFromFakerSamplerParams. Attributes: locale: Locale string determining the language and geographic region for synthetic people. @@ -436,9 +436,8 @@ class PersonSamplerParams(ConfigBase): with_synthetic_personas: If True, appends additional synthetic persona columns including personality traits, interests, and background descriptions. Only supported for certain locales with managed datasets. - sample_dataset_when_available: If True, samples from curated managed datasets when available - for the specified locale. If False or unavailable, falls back to Faker-generated data. - Managed datasets typically provide more realistic and diverse synthetic people. + select_field_values: Optional field-value filters for managed datasets. Supported field + names are checked against the managed person data fields. 
""" locale: str = Field( diff --git a/packages/data-designer-config/src/data_designer/config/seed.py b/packages/data-designer-config/src/data_designer/config/seed.py index bdd9dae29..901f8890b 100644 --- a/packages/data-designer-config/src/data_designer/config/seed.py +++ b/packages/data-designer-config/src/data_designer/config/seed.py @@ -57,7 +57,7 @@ def to_index_range(self, dataset_size: int) -> IndexRange: class SeedConfig(ConfigBase): """Configuration for sampling data from a seed dataset. - Args: + Attributes: source: A SeedSource defining where the seed data exists sampling_strategy: Strategy for how to sample rows from the dataset. - ORDERED: Read rows sequentially in their original order. diff --git a/packages/data-designer-config/src/data_designer/config/seed_source.py b/packages/data-designer-config/src/data_designer/config/seed_source.py index bfd94fbfc..57a7eb9fc 100644 --- a/packages/data-designer-config/src/data_designer/config/seed_source.py +++ b/packages/data-designer-config/src/data_designer/config/seed_source.py @@ -28,6 +28,10 @@ class SeedSource(BaseModel, ABC): All subclasses must define a `seed_type` field with a Literal value. This serves as a discriminated union discriminator. + + Attributes: + seed_type: Discriminator field that identifies the specific seed source type. + Subclasses must override this field with a ``Literal`` value. """ seed_type: str @@ -88,6 +92,24 @@ class HuggingFaceSeedSource(SeedSource): class FileSystemSeedSource(SeedSource, ABC): + """Base class for seed sources backed by a directory of files. + + Use this base when a seed reader needs to enumerate files under a directory + on disk and turn each (or groups of them) into seed rows. Concrete plugin + configs declare a ``Literal`` ``seed_type`` and pair with a + ``FileSystemSeedReader`` implementation. + + Attributes: + path: Directory containing seed artifacts. 
Relative paths are resolved + from the current working directory when the config is loaded, not + from the config file location. + file_pattern: Case-sensitive filename pattern used to match files under + the provided directory. Patterns match basenames only, not relative + paths. Defaults to ``'*'``. + recursive: Whether to search nested subdirectories under the provided + directory for matching files. Defaults to ``True``. + """ + _runtime_path: str | None = PrivateAttr(default=None) path: str = Field( diff --git a/packages/data-designer-config/src/data_designer/plugins/plugin.py b/packages/data-designer-config/src/data_designer/plugins/plugin.py index a961062ad..0f2ec050e 100644 --- a/packages/data-designer-config/src/data_designer/plugins/plugin.py +++ b/packages/data-designer-config/src/data_designer/plugins/plugin.py @@ -18,6 +18,18 @@ class PluginType(str, Enum): + """The kind of Data Designer extension a plugin contributes. + + Attributes: + COLUMN_GENERATOR: A custom column type whose config inherits from + ``SingleColumnConfig`` and uses ``column_type`` as its discriminator. + SEED_READER: A custom seed dataset reader whose config inherits from + ``SeedSource`` (or ``FileSystemSeedSource``) and uses ``seed_type`` + as its discriminator. + PROCESSOR: A custom processor whose config inherits from + ``ProcessorConfig`` and uses ``processor_type`` as its discriminator. + """ + COLUMN_GENERATOR = "column-generator" SEED_READER = "seed-reader" PROCESSOR = "processor" @@ -65,12 +77,31 @@ def _check_class_exists_in_file(filepath: str, class_name: str) -> None: class Plugin(BaseModel): + """Declares a Data Designer plugin by tying a config class to its implementation class. + + A plugin package exposes one ``Plugin`` instance per extension through an entry + point in the ``data_designer.plugins`` group. 
Data Designer discovers the entry + point on import, loads the referenced classes, and registers the plugin so its + config type is usable like any built-in Data Designer object. + + Attributes: + impl_qualified_name: Fully-qualified import path of the implementation class, + e.g. ``'my_plugin.impl.MyColumnGenerator'``. The plugin loader verifies + that the referenced class exists. + config_qualified_name: Fully-qualified import path of the config class, + e.g. ``'my_plugin.config.MyConfig'``. The class must define a Literal + discriminator field with a string default. + plugin_type: The kind of extension this plugin contributes. Determines which + discriminator field name is required on the config class: ``column_type``, + ``seed_type``, or ``processor_type``. + """ + impl_qualified_name: str = Field( ..., description="The fully-qualified name of the implementation class object, e.g. 'my_plugin.generator.MyColumnGenerator'", ) config_qualified_name: str = Field( - ..., description="The fully-qualified name o the config class object, e.g. 'my_plugin.config.MyConfig'" + ..., description="The fully-qualified name of the config class object, e.g. 
'my_plugin.config.MyConfig'" ) plugin_type: PluginType = Field(..., description="The type of plugin") diff --git a/packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py b/packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py index ff4c8de5f..2431c0eb6 100644 --- a/packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py +++ b/packages/data-designer-engine/src/data_designer/engine/column_generators/generators/base.py @@ -67,7 +67,7 @@ def can_generate_from_scratch(self) -> bool: @property def is_llm_bound(self) -> bool: - """Whether this generator makes LLM/HTTP calls during generation.""" + """Whether this generator makes model/API calls during generation.""" return False @property @@ -217,18 +217,52 @@ def log_pre_generation(self) -> None: class ColumnGeneratorCellByCell(ColumnGenerator[TaskConfigT], ABC): + """Base class for column generators invoked once per row. + + Override ``generate`` to return the complete row mapping after adding the + generated value. The engine calls the generator once per row and may run + calls concurrently. Use this base when generation is independent per row + (e.g. an LLM call per row, a per-row transform). + """ + @staticmethod def get_generation_strategy() -> GenerationStrategy: return GenerationStrategy.CELL_BY_CELL @abstractmethod - def generate(self, data: dict) -> dict: ... + def generate(self, data: dict) -> dict: + """Generate one row's output from a single row's upstream values. + + Args: + data: Current row mapping containing the upstream values available to this column. + + Returns: + Complete row mapping with existing keys preserved and the new column value added. + Include declared side-effect columns when the config creates them. + """ class ColumnGeneratorFullColumn(ColumnGenerator[TaskConfigT], ABC): + """Base class for column generators that transform a full batch at once. 
+
+    Override ``generate`` to return the complete batch DataFrame after adding
+    generated values. Use this base when generation is vectorizable or when an
+    external API accepts batched input more efficiently than per-row calls.
+    """
+
     @staticmethod
     def get_generation_strategy() -> GenerationStrategy:
         return GenerationStrategy.FULL_COLUMN
 
     @abstractmethod
-    def generate(self, data: pd.DataFrame) -> pd.DataFrame: ...
+    def generate(self, data: pd.DataFrame) -> pd.DataFrame:
+        """Generate an entire batch of row outputs.
+
+        Args:
+            data: DataFrame containing the upstream columns this generator depends on.
+
+        Returns:
+            DataFrame containing the input columns plus the new column and any side-effect
+            columns. When ``config.allow_resize`` is ``False``, the row count must match
+            the input; when it is ``True``, the row count may change.
+        """
diff --git a/packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py b/packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py
index 6206b3674..b4c863542 100644
--- a/packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py
+++ b/packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py
@@ -122,7 +122,7 @@ def __repr__(self) -> str:
 class CustomColumnGenerator(ColumnGenerator[CustomColumnConfig]):
     """Column generator that uses a user-provided callable function.
 
-    Supports two strategies based on config.strategy:
+    Supports two strategies based on config.generation_strategy:
     - cell_by_cell: Processes rows one at a time (dict -> dict), parallelized by framework.
     - full_column: Processes entire batch (DataFrame -> DataFrame) for vectorized ops.
 
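The cell-by-cell contract documented in `base.py` above (one row mapping in, the complete row mapping out, calls possibly running concurrently) can be illustrated with a small standalone sketch. This does not use the real `ColumnGeneratorCellByCell` base class or the engine's scheduler, neither of which is shown in this diff; the `generate` function, the `run_cell_by_cell` helper, the `name_length` column, and the use of `ThreadPoolExecutor` are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def generate(data: dict) -> dict:
    """Hypothetical per-row generator: returns the full row plus one new key."""
    # Existing keys are preserved; only the new column value is added.
    return {**data, "name_length": len(data["name"])}


def run_cell_by_cell(rows: list[dict], max_workers: int = 4) -> list[dict]:
    """Stand-in for an engine loop: call generate once per row, possibly concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order even when calls overlap.
        return list(pool.map(generate, rows))


rows = [{"name": "Ada"}, {"name": "Grace"}]
results = run_cell_by_cell(rows)
```

Because each call only reads its own row and returns a new mapping, the calls do not depend on execution order, which is the property the docstring asks for when it says generation should be independent per row.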
diff --git a/packages/data-designer-engine/src/data_designer/engine/processing/__init__.py b/packages/data-designer-engine/src/data_designer/engine/processing/__init__.py
new file mode 100644
index 000000000..52a7a9daf
--- /dev/null
+++ b/packages/data-designer-engine/src/data_designer/engine/processing/__init__.py
@@ -0,0 +1,2 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
diff --git a/packages/data-designer-engine/src/data_designer/engine/processing/processors/__init__.py b/packages/data-designer-engine/src/data_designer/engine/processing/processors/__init__.py
new file mode 100644
index 000000000..52a7a9daf
--- /dev/null
+++ b/packages/data-designer-engine/src/data_designer/engine/processing/processors/__init__.py
@@ -0,0 +1,2 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
diff --git a/packages/data-designer-engine/src/data_designer/engine/resources/__init__.py b/packages/data-designer-engine/src/data_designer/engine/resources/__init__.py
new file mode 100644
index 000000000..52a7a9daf
--- /dev/null
+++ b/packages/data-designer-engine/src/data_designer/engine/resources/__init__.py
@@ -0,0 +1,2 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
diff --git a/packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py b/packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
index 21428f19a..2c1c16c78 100644
--- a/packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
+++ b/packages/data-designer-engine/src/data_designer/engine/resources/seed_reader.py
@@ -51,23 +51,32 @@ class SeedReaderError(DataDesignerError): ...
 
 @dataclass(frozen=True)
 class SeedReaderFileSystemContext:
+    """Filesystem and root path available to filesystem seed-reader plugins."""
+
     fs: AbstractFileSystem
     root_path: Path
 
 
 class SeedReaderBatch(Protocol):
+    """Batch object returned by seed readers and convertible to a DataFrame."""
+
     def to_pandas(self) -> pd.DataFrame: ...
 
 
 class SeedReaderBatchReader(Protocol):
+    """Reader that yields seed batches until exhausted."""
+
     def read_next_batch(self) -> SeedReaderBatch: ...
 
 
 @dataclass
 class PandasSeedReaderBatch:
+    """Seed-reader batch backed by an in-memory pandas DataFrame."""
+
     dataframe: pd.DataFrame
 
     def to_pandas(self) -> pd.DataFrame:
+        """Return the batch as a pandas DataFrame."""
         return self.dataframe
 
 
@@ -76,6 +85,7 @@ def create_seed_reader_output_dataframe(
     records: list[dict[str, Any]],
     output_columns: list[str],
 ) -> pd.DataFrame:
+    """Create a DataFrame and verify hydrated records match the declared output schema."""
     if not records:
         return lazy.pd.DataFrame(records, columns=output_columns)
 
diff --git a/packages/data-designer/src/data_designer/interface/data_designer.py b/packages/data-designer/src/data_designer/interface/data_designer.py
index 242131ba1..9f142afa6 100644
--- a/packages/data-designer/src/data_designer/interface/data_designer.py
+++ b/packages/data-designer/src/data_designer/interface/data_designer.py
@@ -113,13 +113,14 @@ class DataDesigner(DataDesignerInterface[DatasetCreationResults]):
     orchestrates the dataset creation and profiling processes.
 
     Args:
-        artifact_path: Path where generated artifacts will be stored.
-        dataset_name: Name for the generated dataset. Defaults to "dataset".
-            This will be used as the dataset folder name in the artifact path.
+        artifact_path: Path where generated artifacts will be stored. If not
+            provided, artifacts are stored in an `artifacts` directory under the
+            current working directory.
         model_providers: Optional list of model providers for LLM generation.
             If None, uses default providers.
-        secret_resolver: Resolver for handling secrets and credentials. Defaults to
-            EnvironmentResolver which reads secrets from environment variables.
+        secret_resolver: Resolver for handling secrets and credentials. If None,
+            uses the default composite resolver, which checks environment variables
+            and plaintext values.
         seed_readers: Optional list of seed readers. If None, uses default readers.
         managed_assets_path: Path to the managed assets directory. This is used to point
             to the location of managed datasets and other assets used during dataset generation.
@@ -131,7 +132,7 @@ class DataDesigner(DataDesignerInterface[DatasetCreationResults]):
             This allows clients to customize how managed datasets are accessed (e.g., using
             custom fsspec clients for S3 or other remote storage).
         mcp_providers: Optional list of MCP provider configurations to enable tool-calling for
-            LLM generation columns. Supports both MCPProvider (remote/SSE) and
+            LLM generation columns. Supports both MCPProvider (remote SSE or Streamable HTTP) and
             LocalStdioMCPProvider (local subprocess).
     """
 
diff --git a/packages/data-designer/src/data_designer/interface/results.py b/packages/data-designer/src/data_designer/interface/results.py
index 07692ff00..599ad1af0 100644
--- a/packages/data-designer/src/data_designer/interface/results.py
+++ b/packages/data-designer/src/data_designer/interface/results.py
@@ -57,7 +57,7 @@ def load_analysis(self) -> DatasetProfilerResults:
 
         Returns:
             DatasetProfilerResults containing statistical analysis and quality metrics
-            for each column in the generated dataset.
+            for configured columns in the generated dataset.
         """
         return self._analysis
 
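The updated `secret_resolver` docstring in `data_designer.py` describes a default composite resolver that checks environment variables and plaintext values. The sketch below shows one way such a chain could behave; it is not the library's implementation. The names `from_environment`, `as_plaintext`, and `resolve_secret` are hypothetical, and the lookup order (environment first, plaintext as fallback) is an assumption the diff does not confirm.

```python
import os
from typing import Callable, Optional

# A resolver maps a configured secret string to a value, or None if it cannot.
Resolver = Callable[[str], Optional[str]]


def from_environment(secret: str) -> Optional[str]:
    """Treat the configured string as an environment variable name."""
    return os.environ.get(secret)


def as_plaintext(secret: str) -> Optional[str]:
    """Treat the configured string as the secret value itself."""
    return secret


def resolve_secret(
    secret: str,
    resolvers: tuple[Resolver, ...] = (from_environment, as_plaintext),
) -> str:
    """Return the first non-None result from the resolver chain (assumed order)."""
    for resolver in resolvers:
        value = resolver(secret)
        if value is not None:
            return value
    raise ValueError(f"Could not resolve secret {secret!r}")


os.environ["DEMO_API_KEY"] = "from-env"  # demo-only variable for this sketch
resolved_env = resolve_secret("DEMO_API_KEY")
resolved_plain = resolve_secret("sk-literal-value")
```

With this ordering, a string that names a set environment variable resolves to that variable's value, and any other string is passed through unchanged.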