
Add prepare command #38

Merged
merged 31 commits from tscholak/prepare-dataset into main on Nov 13, 2024
Conversation


@tscholak tscholak commented Nov 10, 2024

✨ Description

Extracted and refined the dataset preparation script from #17.
Made it a command like train or convert.
Example call and config:

fast-llm prepare gpt_memmap --config foo.yaml

or

torchrun --standalone --nnodes 1 --nproc_per_node=1 --no_python \
    fast-llm prepare gpt_memmap --config foo.yaml

where foo.yaml contains:

output_path: /tmp/foo

loading_workers: 4
tokenize_workers: 4
saving_workers: 4

dataset:
  name_or_path: stas/openwebtext-10k

tokenizer:
  path: /tmp/SmolLM-135M/tokenizer.json

Run git clone https://huggingface.co/HuggingFaceTB/SmolLM-135M in /tmp to obtain that tokenizer file.

This will produce:

/tmp/foo
├── downloaded_dataset
│   ├── cache-1e5559f36da9962e_00002_of_00004.arrow
│   ├── cache-1e5559f36da9962e_00003_of_00004.arrow
│   ├── cache-1e5559f36da9962e_00001_of_00004.arrow
│   ├── cache-1e5559f36da9962e_00000_of_00004.arrow
│   ├── data-00001-of-00004.arrow
│   ├── dataset_info.json
│   ├── data-00000-of-00004.arrow
│   ├── data-00002-of-00004.arrow
│   ├── data-00003-of-00004.arrow
│   ├── ok
│   └── state.json
├── shard_0_0.idx
├── shard_0_0.bin
└── fast_llm_dataset.json

with fast_llm_dataset.json reading:

{
    "datasets": [
        {
            "prefix": "shard_0_0",
            "num_documents": 10000,
            "num_tokens": 11569536,
            "weight": 1.0
        }
    ]
}

The downloaded_dataset can be deleted afterwards. It is not used by Fast-LLM.
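As an illustration of how such an index file can be consumed, here is a minimal standalone sketch (this is not Fast-LLM's own loader, just a reader for the JSON shown above):

```python
import json

# Parse a fast_llm_dataset.json index and summarize the shards.
# Illustrative snippet only; Fast-LLM has its own loading code.
index = json.loads("""
{
    "datasets": [
        {"prefix": "shard_0_0", "num_documents": 10000,
         "num_tokens": 11569536, "weight": 1.0}
    ]
}
""")

total_tokens = sum(d["num_tokens"] for d in index["datasets"])
total_weight = sum(d["weight"] for d in index["datasets"])
print(total_tokens)  # 11569536
print(total_weight)  # 1.0
```

With a single shard, the weight of 1.0 simply means all samples are drawn from that shard.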

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

  1. Fixed memory-mapped indexed dataset and added round-trip tests
  2. Added prepare_dataset command
  3. Simplified Dockerfile

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General:

  • 📜 I have read and followed the contributing guidelines.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration:

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing:

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact:

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

N/A


📝 Additional Notes

N/A

@tscholak commented Nov 11, 2024

Note to self:

@tscholak commented Nov 11, 2024

@jlamypoirier

  1. I introduced some classes so that we can have different data preparations for different models. This class hierarchy should eventually go somewhere else, depending on what the outcome of [Prototype] Flexible dataset configuration #34 will be. I only made one implementation for GPTs so far. We need another one for VLMs, which can be extracted from StarDoc model training #5. I'm coordinating with @akshaykalkunte to do just that.

  2. About the Dockerfile changes: I realized that we can install Fast-LLM in editable mode globally if we make /app writable for everyone, so I removed the fast-llm user. This also resolves a problem where fast-llm wasn't usable when the user ID in a job didn't match the ID of the fast-llm user baked into the image. These changes are part of this PR because I needed them to test data preparation on Toolkit with the images created by the CI action: on Toolkit the user has ID 13013, whereas CI builds the image with user ID 1000. Since we can't make assumptions about the environment in which the official Fast-LLM image will be deployed, it's better to remove user creation altogether.

  3. I looked into using Fast-LLM's existing Distributed and DistributedConfig but found them too difficult to adapt to this straightforward CPU-only use case. I do not want to have to deal with distributed dims or CUDA rng initializations for running this simple data preparation code on multiple nodes. Users shouldn't have to bring GPUs for data processing if they aren't needed.

@tscholak tscholak mentioned this pull request Nov 11, 2024
from fast_llm.data.gpt.memmap import GPTMemmapDataset
import numpy as np
import pytest
from hypothesis import strategies as st

def dtype_arrays(dtype: np.dtype, min_size: int = 1, max_size: int = 100) -> st.SearchStrategy:
    ...  # remainder of the test excerpt not shown in the review thread
Collaborator:
I'm not following what the hypothesis module brings here. You seem to be just creating a list of random arrays, is that right? This can easily be done in plain numpy with the same function complexity.

Collaborator Author:
The benefit is that Hypothesis will try to shrink the inputs to a minimal reproducing example when a test fails.
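For comparison, the plain-numpy alternative the reviewer alludes to might look like the sketch below (names and parameters are assumed for illustration); it produces random arrays of a given dtype but, unlike Hypothesis, does not shrink failing inputs:

```python
import numpy as np

def random_dtype_arrays(dtype, min_size=1, max_size=100, seed=0):
    # Plain-numpy counterpart of the Hypothesis strategy discussed above:
    # a fixed random list of 1-D arrays, with no input shrinking on failure.
    # Hypothetical sketch, not code from the PR.
    rng = np.random.default_rng(seed)
    n = int(rng.integers(min_size, max_size + 1))
    return [rng.integers(0, 100, size=int(rng.integers(1, 50))).astype(dtype)
            for _ in range(n)]

arrs = random_dtype_arrays(np.int32, min_size=2, max_size=5)
print(all(a.dtype == np.int32 for a in arrs))  # True
```

The trade-off is exactly the one discussed in the thread: simpler code and fewer dependencies versus automatic minimization of counterexamples.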

@jlamypoirier left a comment:

LGTM, assuming my proposed modifications are ok

@tscholak tscholak merged commit 2905d38 into main Nov 13, 2024
2 checks passed
@tscholak tscholak deleted the tscholak/prepare-dataset branch November 13, 2024 01:38
@tscholak tscholak changed the title Add prepare_dataset command Add prepare command Nov 16, 2024