Add prepare command #38
Note to self:
tests/test_memmap_dataset.py
Outdated
from fast_llm.data.gpt.memmap import GPTMemmapDataset
import pytest

def dtype_arrays(dtype: np.dtype, min_size: int=1, max_size: int=100) -> st.SearchStrategy:
I'm not following what the hypothesis module brings here. You seem to be just creating a list of random arrays, is that right? This can easily be done in plain numpy with the same function complexity.
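For reference, the plain-numpy alternative suggested here could look roughly like the sketch below. It is only an illustration: the helper name random_arrays and the max_len and seed parameters are hypothetical and not part of the PR, and it assumes an integer dtype.

```python
import numpy as np


def random_arrays(dtype, min_size=1, max_size=100, max_len=50, seed=0):
    """Return a list of random 1-D integer arrays (hypothetical helper).

    The list holds between min_size and max_size arrays, each with
    between 1 and max_len elements drawn from the full range of dtype.
    """
    rng = np.random.default_rng(seed)
    info = np.iinfo(dtype)  # integer dtypes only
    count = int(rng.integers(min_size, max_size, endpoint=True))
    return [
        rng.integers(
            info.min,
            info.max,
            size=int(rng.integers(1, max_len, endpoint=True)),
            dtype=dtype,
            endpoint=True,
        )
        for _ in range(count)
    ]


arrays = random_arrays(np.int32)
```

This gives reproducible random inputs through the seed, but when a test fails it reports whatever large input happened to trigger the failure.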
The benefit is that hypothesis will try to shrink the inputs to a minimal reproducing example when a test fails.
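To make "shrinking" concrete, here is a toy sketch in plain Python. It is not hypothesis's actual algorithm, which is considerably more sophisticated. The names simpler_candidates, shrink and has_large_value are hypothetical. It assumes the failing input is a list of integer lists and greedily simplifies it while the failure still reproduces.

```python
from typing import Callable, Iterator, List

Arrays = List[List[int]]


def simpler_candidates(arrays: Arrays) -> Iterator[Arrays]:
    """Yield inputs that are one step simpler than `arrays`."""
    # Drop one whole array.
    for i in range(len(arrays)):
        yield arrays[:i] + arrays[i + 1:]
    # Drop one element from one array.
    for i, arr in enumerate(arrays):
        for j in range(len(arr)):
            yield arrays[:i] + [arr[:j] + arr[j + 1:]] + arrays[i + 1:]
    # Move one element one step toward zero.
    for i, arr in enumerate(arrays):
        for j, value in enumerate(arr):
            if value != 0:
                step = value - 1 if value > 0 else value + 1
                yield arrays[:i] + [arr[:j] + [step] + arr[j + 1:]] + arrays[i + 1:]


def shrink(arrays: Arrays, fails: Callable[[Arrays], bool]) -> Arrays:
    """Greedily simplify a failing input while it keeps failing."""
    if not fails(arrays):
        raise ValueError("the starting input must already fail")
    current = arrays
    while True:
        for candidate in simpler_candidates(current):
            if fails(candidate):
                current = candidate  # keep the simpler failing input
                break
        else:
            return current  # no simpler input fails any more


def has_large_value(arrays: Arrays) -> bool:
    """Stand-in for a buggy property: 'fails' if any value is >= 10."""
    return any(value >= 10 for arr in arrays for value in arr)


minimal = shrink([[3, 250, 7], [12], [0, 99]], has_large_value)
print(minimal)  # [[10]]
```

The random starting input is reduced to the single value on the failure boundary, which is much easier to debug than the original three arrays.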
LGTM, assuming my proposed modifications are ok
✨ Description
Extracted and refined the dataset preparation script from #17.
Made it a command like train or convert.
Example call and config:
or
where foo.yaml contains:
Run git clone https://huggingface.co/HuggingFaceTB/SmolLM-135M in tmp to get that tokenizer file.
This will produce:
with fast_llm_dataset.json reading:
The downloaded_dataset can be deleted afterwards. It is not used by Fast-LLM.
🔍 Type of change
Select all that apply:
📝 Changes
prepare_dataset command
Dockerfile
✅ Checklist
Make sure the following tasks are completed before submitting the PR:
General:
Dependencies and Configuration:
Testing:
Performance Impact:
📊 Performance Impact Details
N/A
📝 Additional Notes
N/A