wrong number of shards generated in dataset_info.json #15

@aliciaji1993

After dataset generation, the output has a total of 32 TFRecord files in train and 8 in val, which is expected.

In reality, though, the number of train shards is only 25 for some reason, and some shards are missing (see below). What could be the cause?

Dataset folder (missing some shards):
[Image: Screenshot 2024-10-03 at 10 22 43]

From dataset_info.json:

"splits": [
    {
      "filepathTemplate": "{DATASET}-{SPLIT}.{FILEFORMAT}-{SHARD_X_OF_Y}",
      "name": "train",
      "numBytes": "3124634980",
      "shardLengths": [
        "49",
        "65",
        "51",
        "71",
        "70",
        "54",
        "67",
        "66",
        "61",
        "63",
        "56",
        "53",
        "60",
        "61",
        "71",
        "62",
        "68",
        "71",
        "58",
        "68",
        "64",
        "71",
        "71",
        "70",
        "58"
      ]
    },
    {
      "filepathTemplate": "{DATASET}-{SPLIT}.{FILEFORMAT}-{SHARD_X_OF_Y}",
      "name": "val",
      "numBytes": "758231159",
      "shardLengths": [
        "72",
        "67",
        "58",
        "72",
        "82",
        "68",
        "70",
        "61"
      ]
    }
]
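
To narrow this down, the shard files on disk can be cross-checked against the `shardLengths` entries above. The following is only a minimal sketch, not part of the dataset builder: `check_shards` is a hypothetical helper, and it assumes that `{SHARD_X_OF_Y}` in the `filepathTemplate` expands to a five-digit suffix such as `00003-of-00032` and that `dataset_info.json` sits in the same folder as the shard files.

```python
import json
import re
from pathlib import Path

# Assumption: SHARD_X_OF_Y is rendered as "<5 digits>-of-<5 digits>" at the end of the name.
SHARD_RE = re.compile(r"-(\d{5})-of-(\d{5})$")


def check_shards(dataset_dir):
    """Compare shard files on disk with shardLengths in dataset_info.json."""
    dataset_dir = Path(dataset_dir)
    info = json.loads((dataset_dir / "dataset_info.json").read_text())
    report = {}
    for split in info["splits"]:
        name = split["name"]
        # Number of shards the metadata lists for this split.
        listed = len(split.get("shardLengths", []))
        found, declared = set(), set()
        for path in dataset_dir.iterdir():
            match = SHARD_RE.search(path.name)
            # Assumption: file names look like "<dataset>-<split>.<fileformat>-<X>-of-<Y>".
            if match and f"-{name}." in path.name:
                found.add(int(match.group(1)))     # shard index X
                declared.add(int(match.group(2)))  # total Y claimed by the file name
        total = max(declared, default=0)
        report[name] = {
            "listed_in_json": listed,
            "files_on_disk": len(found),
            "declared_in_names": total,
            "missing_indices": sorted(set(range(total)) - found),
        }
    return report
```

Calling `check_shards("path/to/dataset_folder")` (the path is a placeholder) returns one entry per split. If `listed_in_json` and `files_on_disk` agree but `declared_in_names` is larger, the metadata and the files are consistent with each other and only the `-of-` total in the file names differs.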
