Auto dataset concatenation prototype #128

Merged
merged 4 commits into from
Jan 27, 2025

jlamypoirier
Collaborator

✨ Description

Fixes #120. A basic approach, to be refined in #123.

πŸ” Type of change

Select all that apply:

  • πŸ› Bug fix (non-breaking change that addresses a specific issue)
  • πŸš€ New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • πŸ“ˆ Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • πŸ› οΈ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • πŸ“¦ Dependency bump (updates dependencies, including Dockerfile or package changes)
  • πŸ“ Documentation change (updates documentation, including new content or typo fixes)
  • πŸ”§ Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@jlamypoirier marked this pull request as ready for review January 23, 2025 01:00
Collaborator

@tscholak left a comment

LGTM

Comment on lines 169 to 171
class GPTComposedDatasetConfig(GPTIndexedDatasetConfig):
    _abstract: typing.ClassVar[bool] = False
    type_: typing.ClassVar[str | None] = "composed"
Collaborator

Suggested change
-class GPTComposedDatasetConfig(GPTIndexedDatasetConfig):
-    _abstract: typing.ClassVar[bool] = False
-    type_: typing.ClassVar[str | None] = "composed"
+class GPTConcatenatedMemmapConfig(GPTIndexedDatasetConfig):
+    _abstract: typing.ClassVar[bool] = False
+    type_: typing.ClassVar[str | None] = "concatenated_memmap"

for your convenience, so that we can merge this easily.
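
As a rough illustration of why the `type_` string matters in the rename above, a class-level type name can be used to select a config class from a serialized dict such as `{"type": "concatenated_memmap", "path": ...}`. The sketch below is an assumption for illustration only: `DatasetConfig`, `ConcatenatedMemmapConfig`, `_registry`, and `from_dict` are hypothetical names, and this is not Fast-LLM's actual config machinery.

```python
from __future__ import annotations

import typing


class DatasetConfig:
    """Hypothetical base class, not Fast-LLM's real config base."""

    # Subclasses set `type_` to the string used in serialized configs.
    type_: typing.ClassVar[str | None] = None
    # Maps each registered `type_` string to its config class.
    _registry: typing.ClassVar[dict[str, type[DatasetConfig]]] = {}

    def __init__(self, path: str):
        self.path = path

    def __init_subclass__(cls, **kwargs: typing.Any) -> None:
        super().__init_subclass__(**kwargs)
        # Register only concrete classes that declare a type name.
        if cls.type_ is not None:
            DatasetConfig._registry[cls.type_] = cls

    @classmethod
    def from_dict(cls, config: dict[str, typing.Any]) -> DatasetConfig:
        config = dict(config)  # Copy so the caller's dict is not mutated.
        type_name = config.pop("type")
        return cls._registry[type_name](**config)


class ConcatenatedMemmapConfig(DatasetConfig):
    type_: typing.ClassVar[str | None] = "concatenated_memmap"


config = DatasetConfig.from_dict({"type": "concatenated_memmap", "path": "/tmp/dataset"})
print(type(config).__name__)
```

With a scheme like this, renaming the class and its `type_` string together keeps the serialized name and the Python name in sync, which is what the suggested change does for the real classes.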

@@ -10,6 +10,7 @@
 from fast_llm.data.data.gpt.data import GPTData
 from fast_llm.data.dataset.gpt.config import (
     GPTBlendedDatasetConfig,
+    GPTComposedDatasetConfig,
Collaborator

Suggested change
-    GPTComposedDatasetConfig,
+    GPTConcatenatedMemmapConfig,

Comment on lines 408 to 409
{"type": "composed", "path": _DATASET_PREFIX_MIX_COMPOSED},
GPTComposedDatasetConfig,
Collaborator

Suggested change
-{"type": "composed", "path": _DATASET_PREFIX_MIX_COMPOSED},
-GPTComposedDatasetConfig,
+{"type": "concatenated_memmap", "path": _DATASET_PREFIX_MIX_CONCATENATED_MEMMAP},
+GPTConcatenatedMemmapConfig,

@@ -81,11 +88,16 @@ def get_test_data_and_samples(
     return data, samples


-DATASET_PREFIX_MIX_1 = DATASET_PREFIX.with_name("blended_mix_1")
+_DATASET_PREFIX_MIX_1 = DATASET_PREFIX.with_name("blended_mix_1")
+_DATASET_PREFIX_MIX_COMPOSED = DATASET_CACHE / "composed"
Collaborator

Suggested change
-_DATASET_PREFIX_MIX_COMPOSED = DATASET_CACHE / "composed"
+_DATASET_PREFIX_MIX_CONCATENATED_MEMMAP = DATASET_CACHE / "concatenated_memmap"
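
The two path constants in this hunk are built differently: `with_name` replaces the last component of an existing path, while `/` appends a new component under a directory. A small sketch with `pathlib` shows the difference. The concrete values of `DATASET_CACHE` and `DATASET_PREFIX` below are made up for illustration, and the assumption that the prefix lives inside the cache directory is not confirmed by this PR.

```python
import pathlib

# Hypothetical values; the real constants are defined in the test suite.
DATASET_CACHE = pathlib.PurePosixPath("/tmp/fast_llm_tests/dataset")
DATASET_PREFIX = DATASET_CACHE / "common_dataset"

# with_name swaps the final component, giving a sibling of DATASET_PREFIX.
blended = DATASET_PREFIX.with_name("blended_mix_1")
# The / operator appends a component under DATASET_CACHE.
concatenated = DATASET_CACHE / "concatenated_memmap"

print(blended)       # /tmp/fast_llm_tests/dataset/blended_mix_1
print(concatenated)  # /tmp/fast_llm_tests/dataset/concatenated_memmap
```

Under that assumption both constants end up as siblings in the same cache directory, so the two spellings differ only in which existing path they are derived from.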

@jlamypoirier merged commit 6dc77a0 into main Jan 27, 2025
3 of 4 checks passed
@jlamypoirier deleted the auto_concatenate branch January 27, 2025 20:33
Development

Successfully merging this pull request may close these issues.

[feat] Generate concatenated datasets automatically
2 participants