
Conversation

@srthkdev
Contributor

  • Add concatenate_safe argument to make_dataset to standardize column types for safe concatenation
  • Serialize info dictionaries as JSON strings to ensure compatible column types (see the sketch after this list)
  • Always standardize key columns (prompt, completion, answer, task, reward, info) when concatenate_safe is True
  • Convert incompatible column types to strings using PyArrow checks before dataset creation
  • Implement static method concatenate_datasets to merge multiple datasets with schema alignment
  • Handle missing columns, type inconsistencies, and optional split column in concatenated dataset
  • Rename parse_completion_tokens method to process_completion_tokens for clarity
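
A minimal sketch of the serialization and type-standardization steps described above, using only the `datasets` and `pyarrow` libraries; the helper name `standardize_columns` and the column contents are illustrative, not the actual verifiers implementation:

```python
import json

import pyarrow as pa
from datasets import Dataset

def standardize_columns(columns: dict[str, list]) -> dict[str, list]:
    """Coerce columns to concatenation-safe types (illustrative helper)."""
    out = {}
    for name, values in columns.items():
        if name == "info":
            # dicts with different keys infer to incompatible pyarrow structs;
            # a JSON string column is always the same plain string type
            out[name] = [json.dumps(v) for v in values]
            continue
        try:
            inferred = pa.array(values).type
        except (pa.ArrowInvalid, pa.ArrowTypeError):
            # mixed or unsupported values: fall back to strings
            out[name] = [str(v) for v in values]
            continue
        # struct-typed columns vary per environment, so stringify those too
        out[name] = [json.dumps(v) for v in values] if pa.types.is_struct(inferred) else values
    return out

ds = Dataset.from_dict(
    standardize_columns({"info": [{"seed": 1}, {"difficulty": "hard"}], "reward": [0.5, 1.0]})
)
```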

Description

This PR fixes a dataset concatenation issue in the verifiers library: datasets created by env.make_dataset could have different columns with different pyarrow types, preventing them from being concatenated.
Solves #321.
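
For context, the incompatibility is easy to reproduce with the `datasets` library directly; the column contents here are illustrative:

```python
from datasets import Dataset, concatenate_datasets

a = Dataset.from_dict({"info": [{"seed": 1}], "reward": [1.0]})
b = Dataset.from_dict({"info": [{"difficulty": "hard"}], "reward": [0.0]})

# "info" infers to struct<seed: int64> in one dataset and
# struct<difficulty: string> in the other, so the features cannot be
# aligned and concatenation raises an error
combined = concatenate_datasets([a, b])  # raises
```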

The fix implements schema standardization in the make_dataset method and enhances the concatenate_datasets static method to handle type incompatibilities. This lets users run evaluations against different eval environments and push the results as a single dataset with a split for each benchmark.

Key improvements:

  • Enhanced make_dataset method to ensure consistent schemas with standard columns
  • Improved concatenate_datasets method with intelligent type inference and standardization (sketched after this list)
  • Added concatenate_safe parameter (default: True) to ensure compatibility by default
  • Proper handling of the 'info' column to ensure consistent formatting
  • Added split tracking to identify the source of each example in concatenated datasets
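
A rough sketch of the alignment approach; the function name, the None-padding of missing columns, and the per-split tagging are assumptions for illustration, not the exact upstream code:

```python
import json

from datasets import Dataset, concatenate_datasets

def concatenate_aligned(named: dict[str, Dataset]) -> Dataset:
    """Align schemas across datasets and tag each row with its source split."""
    all_columns = {c for ds in named.values() for c in ds.column_names}
    aligned = []
    for split, ds in named.items():
        # serialize dict-typed info entries so the column is a plain string type
        if "info" in ds.column_names:
            ds = ds.map(
                lambda ex: {"info": json.dumps(ex["info"])} if isinstance(ex["info"], dict) else {}
            )
        # pad columns this dataset is missing; recent versions of `datasets`
        # can cast all-None columns to the other datasets' types when aligning
        for col in sorted(all_columns - set(ds.column_names)):
            ds = ds.add_column(col, [None] * len(ds))
        # record which benchmark each example came from
        ds = ds.add_column("split", [split] * len(ds))
        aligned.append(ds)
    return concatenate_datasets(aligned)
```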

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Test improvement

Testing

  • All existing tests pass
  • New tests have been added to cover the changes
  • Tests have been run locally with uv run pytest

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

This fix addresses the issue described in the original report, where prime-rl had to implement a custom make_dataset function to handle concatenation of datasets from different environments. This PR upstreams that logic into the verifiers library itself, with tests covering the new behavior.

Datasets from different environments can now be concatenated regardless of their original schema differences, with consistent type handling and split identification.
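
As a usage illustration (the environment datasets, the Environment class hosting the static method, and the Hub repo id are all hypothetical):

```python
from datasets import DatasetDict

# ds_gsm8k and ds_math stand in for datasets produced by env.make_dataset(...)
# for two different eval environments, with concatenate_safe at its default (True)
combined = Environment.concatenate_datasets([ds_gsm8k, ds_math])

# the split column records each example's source, so the combined dataset can
# be re-split into one Hub split per benchmark before pushing
DatasetDict({
    name: combined.filter(lambda ex, n=name: ex["split"] == n)
    for name in sorted(set(combined["split"]))
}).push_to_hub("your-org/multi-benchmark-evals")
```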
