-
Notifications
You must be signed in to change notification settings - Fork 61
[QEff. Finetune]: Added Base dataset class and SFT dataset classes along with its test cases. #647
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Dhiraj Kumar Sah <[email protected]>
Updated the dataset.py file to only enable support for SFTDataset types. Created test file to check the functionalities accordingly. Signed-off-by: Dhiraj Kumar Sah <[email protected]>
quic-meetkuma
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall it has really good changes and extended test cases. Some minor polishing is needed before merge. Thanks. :)
| raise RuntimeError("Either provide completion_template or completion_func in the config.") | ||
|
|
||
| # Call parent class __init__ which will call _initialize_dataset | ||
| super().__init__(dataset_name, split, seed, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good cleanup in init
| self.dataset = splitted_dataset["train"] | ||
| else: | ||
| # Load dataset from HuggingFace | ||
| db = load_dataset_builder(self.dataset_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good addition over load_dataset.
Reduced the use of MagicMock to create dataset to a minimal level. Couldn't find a dummy HF dataset for SFT task so using a dummy dataset for that purpose. Signed-off-by: Dhiraj Kumar Sah <[email protected]>
Signed-off-by: Dhiraj Kumar Sah <[email protected]>
quic-meetkuma
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments. Post that we can merge. Really good suite of testcases which covers almost all the cases of SFTDataset class.
Thanks
Moved apply_train_test_split to dataset_utils.py now. Additional check for JSON file path validity was added and test was added for it as well. _setup_template method doesn't modify self.dataset directly, same for apply_train_test_split. Signed-off-by: Dhiraj Kumar Sah <[email protected]>
quic-meetkuma
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Let us merge it.
quic-akuruvil
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Edited the SFTDataset class to enable custom dataset loading.
Updated the dataset.py file to only enable support for SFTDataset type.
Created test file to check the functionalities.