@quic-dhirajku
Contributor

Edited the SFTDataset class to enable custom dataset loading.
Updated the dataset.py file to support only the SFTDataset type.
Added a test file to check the functionality.

Updated the dataset.py file to support only SFTDataset types.
Added a test file to check the functionality accordingly.

Signed-off-by: Dhiraj Kumar Sah <[email protected]>
@quic-meetkuma changed the title from "Ft datasets" to "[QEff. Finetune]: Added Base dataset class and SFT dataset classes along with its test cases." on Dec 2, 2025
Contributor

@quic-meetkuma left a comment

Overall it has really good changes and extended test cases. Some minor polishing is needed before merge. Thanks. :)

raise RuntimeError("Either provide completion_template or completion_func in the config.")

# Call parent class __init__ which will call _initialize_dataset
super().__init__(dataset_name, split, seed, **kwargs)
Contributor

good cleanup in init
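
The validate-then-delegate pattern in the snippet above can be sketched roughly as follows. Only the error message and the `super().__init__(dataset_name, split, seed, **kwargs)` call come from the diff; the `BaseDataset` stand-in, the keyword-argument names, the default seed, and the assumption that the error is raised only when neither option is supplied are hypothetical.

```python
class BaseDataset:
    """Hypothetical stand-in for the base dataset class added in this PR."""

    def __init__(self, dataset_name, split, seed=42, **kwargs):
        self.dataset_name = dataset_name
        self.split = split
        self.seed = seed
        self.kwargs = kwargs
        # Per the diff comment, the parent __init__ calls _initialize_dataset.
        self._initialize_dataset()

    def _initialize_dataset(self):
        raise NotImplementedError


class SFTDataset(BaseDataset):
    def __init__(self, dataset_name, split, seed=42,
                 completion_template=None, completion_func=None, **kwargs):
        # Assumption: the config must supply at least one of the two options.
        if completion_template is None and completion_func is None:
            raise RuntimeError(
                "Either provide completion_template or completion_func in the config."
            )
        # Store the options before the parent triggers dataset initialization.
        self.completion_template = completion_template
        self.completion_func = completion_func
        super().__init__(dataset_name, split, seed, **kwargs)

    def _initialize_dataset(self):
        # Placeholder: the real class loads and preprocesses data here.
        self.dataset = []
```

Validating before calling the parent constructor means a misconfigured dataset fails fast, before any loading work starts.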

    self.dataset = splitted_dataset["train"]
else:
    # Load dataset from HuggingFace
    db = load_dataset_builder(self.dataset_name)
Contributor

This is a good addition over load_dataset.
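
One reason `load_dataset_builder` helps is that it lets the caller inspect a dataset's metadata, such as its available splits, before downloading any data. The sketch below illustrates that idea; `resolve_split` and `load_split` are hypothetical helpers, and the fallback to `"train"` is an assumption rather than the PR's actual logic.

```python
def resolve_split(available_splits, requested):
    """Return `requested` if it exists, otherwise fall back to "train"."""
    if requested in available_splits:
        return requested
    if "train" in available_splits:
        return "train"
    raise ValueError(
        f"Split {requested!r} not found; available: {sorted(available_splits)}"
    )


def load_split(dataset_name, requested):
    # Imported lazily so resolve_split works without `datasets` installed.
    from datasets import load_dataset, load_dataset_builder

    builder = load_dataset_builder(dataset_name)  # metadata only, no data download
    # info.splits can be None when the dataset ships no split metadata.
    available = list(builder.info.splits or {})
    return load_dataset(dataset_name, split=resolve_split(available, requested))
```

When the requested split is missing, the caller can then carve one out of `"train"` instead of failing inside `load_dataset`.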

Reduced the use of MagicMock for creating datasets to a minimum.
Couldn't find a dummy HF dataset for the SFT task, so the tests use a dummy dataset instead.

Signed-off-by: Dhiraj Kumar Sah <[email protected]>
Contributor

@quic-meetkuma left a comment

Minor comments; once those are addressed we can merge. A really good suite of test cases that covers almost all the cases of the SFTDataset class.

Thanks

Moved apply_train_test_split to dataset_utils.py.
Added a check for JSON file path validity, along with a test for it.
The _setup_template method does not modify self.dataset directly; the same applies to apply_train_test_split.
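
The two ideas in this commit, an early path check and a split helper that returns a new object instead of mutating its input, can be sketched with the standard library alone. The name `apply_train_test_split` comes from the commit message, but the signature, the `split_ratio` parameter, the list-based data, and `validate_json_path` are all hypothetical; the real helper presumably operates on a Hugging Face dataset.

```python
import os
import random


def validate_json_path(path):
    """Fail early with a clear message if `path` is not an existing file."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"JSON dataset file not found: {path}")
    return path


def apply_train_test_split(records, split, split_ratio=0.8, seed=42):
    """Return the requested portion as a new list; `records` is left untouched."""
    shuffled = list(records)               # copy, so the caller's data is not mutated
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    cut = int(len(shuffled) * split_ratio)
    return shuffled[:cut] if split == "train" else shuffled[cut:]
```

Because the helper returns its result, the caller decides where to assign it (for example to `self.dataset`), which keeps the function easy to test in isolation.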

Signed-off-by: Dhiraj Kumar Sah <[email protected]>
Contributor

@quic-meetkuma left a comment

LGTM. Let us merge it.

Contributor

@quic-akuruvil left a comment

LGTM

@quic-meetkuma merged commit 5cd3fd1 into quic:ft_experimental on Dec 5, 2025
3 of 4 checks passed
3 participants