Excessive RAM Usage After Dataset Concatenation concatenate_datasets #7373

sam-hey · 2025-01-16T16:33:10Z

Describe the bug

When loading a dataset from disk, concatenating it, and starting the training process, the RAM usage progressively increases until the kernel terminates the process due to excessive memory consumption.

#2276

Steps to reproduce the bug

from datasets import DatasetDict, concatenate_datasets

dataset = DatasetDict.load_from_disk("data")

...
...

combined_dataset = concatenate_datasets(
    [dataset[split] for split in dataset]
)

# Start SentenceTransformer training

Expected behavior

I would not expect RAM utilization to increase after concatenation. Removing the concatenation step resolves the issue.

Environment info

sentence-transformers==3.1.1
datasets==3.2.0

Python 3.10

sam-hey · 2025-01-17T07:54:21Z

Adding an image from memray:
https://gist.github.com/sam-hey/00c958f13fb0f7b54d17197fe353002f
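As a lighter-weight complement to a memray profile, peak resident memory can be logged around individual steps (loading, concatenation, the first training batches) to see where the growth starts. The following is a minimal sketch using only the standard library `resource` module, so it is Unix-only; `peak_rss_mib` and `log_peak_rss` are hypothetical helper names, and the `bytearray` allocation is just a stand-in for the real step.

```python
import resource
import sys
from contextlib import contextmanager


def peak_rss_mib() -> float:
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


@contextmanager
def log_peak_rss(label: str):
    """Record peak RSS before and after the wrapped block and print both."""
    stats = {"before_mib": peak_rss_mib()}
    try:
        yield stats
    finally:
        stats["after_mib"] = peak_rss_mib()
        print(
            f"{label}: peak RSS {stats['before_mib']:.1f} MiB "
            f"-> {stats['after_mib']:.1f} MiB"
        )


# Stand-in workload. In the scenario from this issue, the wrapped step
# would be the concatenate_datasets(...) call or the start of training.
with log_peak_rss("dummy allocation"):
    _buffer = bytearray(50 * 1024 * 1024)
```

Note that touched pages of a memory-mapped file also count toward RSS, so a rising number alone does not distinguish mapped data being read from memory that is never released.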

nepfaff · 2025-03-26T14:41:24Z

I'm having the same issue where concatenation seems to use a huge amount of RAM.

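import logging

from datasets import Dataset, concatenate_datasets
from tqdm import tqdm

# chunk_files: paths of the dataset chunks saved to disk (defined elsewhere).
# Load all chunks and concatenate them into a final dataset.
chunk_datasets = [
    Dataset.load_from_disk(file, keep_in_memory=False)
    for file in tqdm(chunk_files, desc="Loading chunk datasets")
]
logging.info("Concatenating chunk datasets...")
final_dataset = concatenate_datasets(chunk_datasets)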

This is a real issue for me as the final dataset is a few terabytes in size. I'm using datasets version 3.1.0 and also tested with version 3.4.1.

sam-hey · 2025-03-27T17:38:07Z

I had a brief look: the problem seems to come from memory_map and the stream not being closed.

datasets/src/datasets/table.py

Lines 48 to 50 in 5f8d2ad

    
def _memory_mapped_record_batch_reader_from_file(filename: str) -> pa.RecordBatchStreamReader:
    memory_mapped_stream = pa.memory_map(filename)
    return pa.ipc.open_stream(memory_mapped_stream)

I have not had time to test this yet: https://github.com/sam-hey/datasets/tree/fix/concatenate_datasets

I will probably have a better look in a couple of days.

sam-hey mentioned this issue Mar 29, 2025

fix: loading of datasets from Disk(#7373) #7489

Open
