Excessive RAM Usage After Dataset Concatenation with `concatenate_datasets` #7373
Comments
Adding an image from memray (allocation profile; screenshot not reproduced here).
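For reference, a profile like that can be captured with memray's Python tracking API. This is a minimal sketch, not the reporter's actual setup: the `build_dataset` function, the `my_pipeline` module, and the capture file name are placeholders.

```python
# Minimal sketch: capture an allocation profile with memray's Python API.
# `build_dataset`, `my_pipeline`, and the output file name are placeholders.
from memray import Tracker

from my_pipeline import build_dataset  # hypothetical: loads chunks and concatenates them


def main() -> None:
    # Every allocation made inside this context is written to the capture
    # file, which can then be rendered with `memray flamegraph memray_concat.bin`.
    with Tracker("memray_concat.bin"):
        build_dataset()


if __name__ == "__main__":
    main()
```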
I'm having the same issue where concatenation seems to use a huge amount of RAM.

```python
# Load all chunks and concatenate them into a final dataset.
chunk_datasets = [
    Dataset.load_from_disk(file, keep_in_memory=False)
    for file in tqdm(chunk_files, desc="Loading chunk datasets")
]
logging.info("Concatenating chunk datasets...")
final_dataset = concatenate_datasets(chunk_datasets)
```

This is a real issue for me, as the final dataset is a few terabytes in size. I'm using datasets version …
I had a short look; the error seems to come from `datasets/src/datasets/table.py`, lines 48 to 50 at commit `5f8d2ad`.
I haven't had time to test yet: https://github.com/sam-hey/datasets/tree/fix/concatenate_datasets. I will probably take a better look in a couple of days.
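For context, the code in that area of `table.py` concerns how Arrow tables are read back from disk, either fully into memory or via memory mapping. The snippet below is not the `datasets` source, just a rough illustration of that distinction using plain pyarrow; the file name is a placeholder.

```python
# Illustration only (not the datasets source): reading an Arrow IPC stream
# file fully into memory vs. through a memory map.
import pyarrow as pa
import pyarrow.ipc as ipc


def read_in_memory(filename: str) -> pa.Table:
    # The whole file is pulled into RAM before the table is materialized.
    stream = pa.input_stream(filename)
    return ipc.open_stream(stream).read_all()


def read_memory_mapped(filename: str) -> pa.Table:
    # Pages are mapped lazily, so the resident set stays small for very
    # large files until the data is actually touched.
    source = pa.memory_map(filename)
    return ipc.open_stream(source).read_all()
```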
Describe the bug
When loading a dataset from disk, concatenating it, and then starting training, RAM usage progressively increases until the kernel terminates the process due to excessive memory consumption.
Related: #2276
Steps to reproduce the bug
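The original reproduction snippet was not captured here; the sketch below reconstructs it from the description and the comment above. The chunk paths are placeholders.

```python
# Reconstructed reproduction (paths are placeholders): load chunks that were
# previously saved with save_to_disk, then concatenate them.
from datasets import Dataset, concatenate_datasets

chunk_files = ["chunks/chunk_0", "chunks/chunk_1", "chunks/chunk_2"]

# Each chunk is loaded memory-mapped (keep_in_memory=False), so this step
# by itself stays cheap in RAM.
chunk_datasets = [
    Dataset.load_from_disk(path, keep_in_memory=False) for path in chunk_files
]

# RAM usage reportedly climbs during and after this call, eventually leading
# to the kernel killing the training process.
final_dataset = concatenate_datasets(chunk_datasets)
print(final_dataset)
```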
Expected behavior
I would not expect RAM utilization to increase after concatenation. Removing the concatenation step resolves the issue.
Environment info
sentence-transformers==3.1.1
datasets==3.2.0
Python 3.10