wrong number of shards generated in dataset_info.json

After dataset generation, the output has a total of 32 TFRecord files in train and 8 in val, which is expected.

But in reality, the number of shards is only 25 for some reason, and some shards are missing (as shown below). What could be the cause?

dataset folder: (missing some shards)
<img width="1203" alt="Screenshot 2024-10-03 at 10 22 43" src="https://github.com/user-attachments/assets/98734aa9-de81-4ce4-a098-fd73b6e7b014">

From dataset_info.json:
```
"splits": [
    {
      "filepathTemplate": "{DATASET}-{SPLIT}.{FILEFORMAT}-{SHARD_X_OF_Y}",
      "name": "train",
      "numBytes": "3124634980",
      "shardLengths": [
        "49",
        "65",
        "51",
        "71",
        "70",
        "54",
        "67",
        "66",
        "61",
        "63",
        "56",
        "53",
        "60",
        "61",
        "71",
        "62",
        "68",
        "71",
        "58",
        "68",
        "64",
        "71",
        "71",
        "70",
        "58"
      ]
    },
    {
      "filepathTemplate": "{DATASET}-{SPLIT}.{FILEFORMAT}-{SHARD_X_OF_Y}",
      "name": "val",
      "numBytes": "758231159",
      "shardLengths": [
        "72",
        "67",
        "58",
        "72",
        "82",
        "68",
        "70",
        "61"
      ]
    }
```
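To narrow this down, it may help to compare the shard count recorded in dataset_info.json with the shard files actually present in the folder. Below is a minimal sketch using only the Python standard library. It assumes the shard file names follow the shape `<dataset>-<split>.<format>-<index>-of-<total>` (my reading of the `filepathTemplate` above, not confirmed), and the helper names (`shards_in_metadata`, `shards_on_disk`, `compare`, `check_folder`) are hypothetical, not part of any library.

```python
import json
import re
from collections import defaultdict
from pathlib import Path

# Assumed file-name shape: <dataset>-<split>.<format>-<index>-of-<total>
SHARD_RE = re.compile(
    r"^.+-(?P<split>[^.-]+)\.[^-]+-(?P<index>\d+)-of-(?P<total>\d+)$"
)


def shards_in_metadata(info):
    """Number of shardLengths entries recorded per split."""
    return {
        split["name"]: len(split.get("shardLengths", []))
        for split in info.get("splits", [])
    }


def shards_on_disk(filenames):
    """Shard indices and '-of-<total>' values found in file names, per split."""
    indices = defaultdict(set)
    totals = defaultdict(set)
    for name in filenames:
        match = SHARD_RE.match(name)
        if match:
            indices[match["split"]].add(int(match["index"]))
            totals[match["split"]].add(int(match["total"]))
    return indices, totals


def compare(info, filenames):
    """Per-split report: recorded count, files found, and missing indices."""
    recorded = shards_in_metadata(info)
    indices, totals = shards_on_disk(filenames)
    report = {}
    for split, count in recorded.items():
        present = indices.get(split, set())
        report[split] = {
            "recorded": count,
            "on_disk": len(present),
            "totals_in_names": sorted(totals.get(split, set())),
            # Indices below the recorded count that have no file on disk.
            "missing_indices": sorted(set(range(count)) - present),
        }
    return report


def check_folder(folder):
    """Load dataset_info.json from `folder` and compare it with the files there."""
    folder = Path(folder)
    info = json.loads((folder / "dataset_info.json").read_text())
    names = [p.name for p in folder.iterdir() if p.is_file()]
    return compare(info, names)
```

Calling `check_folder` on the dataset version folder returns, for each split, how many shards the metadata records, how many shard files exist, which `-of-<total>` values appear in the file names, and which indices are missing. If `totals_in_names` disagrees with `recorded`, the mismatch is between the file names and the metadata rather than just missing files.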
