
Conversation

@srthkdev
Contributor

  • Add concatenate_safe argument to make_dataset to standardize column types for safe concatenation
  • Serialize info dictionaries as JSON strings to ensure compatible column types (see the sketch after this list)
  • Always standardize key columns (prompt, completion, answer, task, reward, info) when concatenate_safe is True
  • Convert incompatible column types to strings using PyArrow checks before dataset creation
  • Implement static method concatenate_datasets to merge multiple datasets with schema alignment
  • Handle missing columns, type inconsistencies, and optional split column in concatenated dataset
  • Rename parse_completion_tokens method to process_completion_tokens for clarity
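
A minimal sketch of the serialization and type-standardization steps described above, using only the `datasets` and `pyarrow` libraries; the helper name `standardize_columns` and the column contents are illustrative, not the actual verifiers implementation:

```python
import json

import pyarrow as pa
from datasets import Dataset

def standardize_columns(columns: dict[str, list]) -> dict[str, list]:
    """Coerce columns to concatenation-safe types (illustrative helper)."""
    out = {}
    for name, values in columns.items():
        if name == "info":
            # dicts with different keys infer to incompatible pyarrow structs;
            # a JSON string column is always the same plain string type
            out[name] = [json.dumps(v) for v in values]
            continue
        try:
            inferred = pa.array(values).type
        except (pa.ArrowInvalid, pa.ArrowTypeError):
            # mixed or unsupported values: fall back to strings
            out[name] = [str(v) for v in values]
            continue
        # struct-typed columns vary per environment, so stringify those too
        out[name] = [json.dumps(v) for v in values] if pa.types.is_struct(inferred) else values
    return out

ds = Dataset.from_dict(
    standardize_columns({"info": [{"seed": 1}, {"difficulty": "hard"}], "reward": [0.5, 1.0]})
)
```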

Description

This PR fixes a dataset concatenation issue in the verifiers library: datasets created by env.make_dataset could have different columns with different pyarrow types, preventing them from being concatenated.
Solves #321.
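
For context, the incompatibility is easy to reproduce with the `datasets` library directly; the column contents here are illustrative:

```python
from datasets import Dataset, concatenate_datasets

a = Dataset.from_dict({"info": [{"seed": 1}], "reward": [1.0]})
b = Dataset.from_dict({"info": [{"difficulty": "hard"}], "reward": [0.0]})

# "info" infers to struct<seed: int64> in one dataset and
# struct<difficulty: string> in the other, so the features cannot be
# aligned and concatenation raises an error
combined = concatenate_datasets([a, b])  # raises
```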

The fix implements schema standardization in the make_dataset method and enhances the concatenate_datasets static method to handle type incompatibilities. This lets users run evaluations against different eval environments and push the results as a single dataset with a split for each benchmark.

Key improvements:

  • Enhanced make_dataset method to ensure consistent schemas with standard columns
  • Improved concatenate_datasets method with intelligent type inference and standardization (sketched after this list)
  • Added concatenate_safe parameter (default: True) to ensure compatibility by default
  • Proper handling of the 'info' column to ensure consistent formatting
  • Added split tracking to identify the source of each example in concatenated datasets
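
A rough sketch of the alignment approach; the function name, the None-padding of missing columns, and the per-split tagging are assumptions for illustration, not the exact upstream code:

```python
import json

from datasets import Dataset, concatenate_datasets

def concatenate_aligned(named: dict[str, Dataset]) -> Dataset:
    """Align schemas across datasets and tag each row with its source split."""
    all_columns = {c for ds in named.values() for c in ds.column_names}
    aligned = []
    for split, ds in named.items():
        # serialize dict-typed info entries so the column is a plain string type
        if "info" in ds.column_names:
            ds = ds.map(
                lambda ex: {"info": json.dumps(ex["info"])} if isinstance(ex["info"], dict) else {}
            )
        # pad columns this dataset is missing; recent versions of `datasets`
        # can cast all-None columns to the other datasets' types when aligning
        for col in sorted(all_columns - set(ds.column_names)):
            ds = ds.add_column(col, [None] * len(ds))
        # record which benchmark each example came from
        ds = ds.add_column("split", [split] * len(ds))
        aligned.append(ds)
    return concatenate_datasets(aligned)
```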

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass
  • New tests have been added to cover the changes
  • Tests have been run locally with uv run pytest

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

This fix addresses the issue described in the original report, where prime-rl had to implement a custom make_dataset function to handle concatenation of datasets from different environments. This PR upstreams that logic into the verifiers library itself, with tests covering the new behavior.

Datasets from different environments can now be concatenated regardless of their original schema differences, with consistent type handling and split identification.
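
As a usage illustration (the environment datasets, the Environment class hosting the static method, and the Hub repo id are all hypothetical):

```python
from datasets import DatasetDict

# ds_gsm8k and ds_math stand in for datasets produced by env.make_dataset(...)
# for two different eval environments, with concatenate_safe at its default (True)
combined = Environment.concatenate_datasets([ds_gsm8k, ds_math])

# the split column records each example's source, so the combined dataset can
# be re-split into one Hub split per benchmark before pushing
DatasetDict({
    name: combined.filter(lambda ex, n=name: ex["split"] == n)
    for name in sorted(set(combined["split"]))
}).push_to_hub("your-org/multi-benchmark-evals")
```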
