
[fix] automatically regenerate the cache when it doesn't match #72

Open · yjmm10 wants to merge 3 commits into base: TRAIN

Conversation

@yjmm10 commented Aug 14, 2024



Description

[Please include a summary of the changes and the related issue. (Just overwrite this section directly.)]

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist:

  • The code follows the Python style guide.
  • Code and files are well organized.
  • All tests pass.
  • New code is covered by tests.
  • The pull request is directed to the corresponding topic branch.
  • [Optional] We would be very happy if gitmoji 🧑‍💻 were used in the commit message 💬!

Licensing:

By submitting this pull request, I confirm that:

  • My contribution is made under the MIT License.
  • I have not included any code from questionable or non-compliant sources (GPL, AGPL, etc.).
  • I understand that all contributions to this repository must comply with the MIT License, and I promise that my contributions do not violate this license.
  • I have not used any code or content from sources that conflict with the MIT License or are otherwise legally questionable.

Additional Information

Add any other information about the PR here.

Note:
We hope that pull requests are mainly for implementing or modifying project functionality. If it is a documentation error or a small typo, you are very welcome to send it to [email protected]. Alternatively, you may open an issue to report it. Thank you for your understanding and cooperation!

@yjmm10 (Author) commented Aug 14, 2024

automatically regenerate the cache when it doesn't match

@yjmm10 changed the base branch from main to TRAIN on Aug 14, 2024, 03:46
@henrytsui000 (Member) commented
Hi,

I was indeed planning to write validation for the cache, so you've saved me some time! Thank you for your contribution.

I'll proceed with a code review.

Please make sure your code fully complies with the license (i.e., fulfills the licensing checklist).

Best regards,
Henry Tsui

@yjmm10 (Author) commented Aug 14, 2024

Yes, the code satisfies the license.
I added a workflow for the TRAIN branch; if you don't need it, you can remove it.

@henrytsui000 (Member) commented
Hi,

I'm a bit confused about why we don't use images_path directly. What is the benefit of separating them when loading the data and then combining them back when the dataset's __getitem__ method is called?

@yjmm10 (Author) commented Aug 14, 2024

> Hi,
>
> I'm a bit confused about why we don't use images_path directly. What is the benefit of separating them when loading the data and then combining them back when the dataset's __getitem__ method is called?

At first, I also wanted to use the image path directly. But later, after I reorganized the directory, the entire cache needed to be regenerated. If paths are stored relative to the dataset root directory, then moving the root directory around will not affect the cache, which is particularly useful for keeping the cache valid when the dataset is relatively large.

In addition, this makes validation more practical for a single dataset, because I only sample a few entries from the cache and compare their paths; verifying all of them would be roughly equivalent to regenerating the cache. I noticed that yolov9's cache uses a hash for verification.
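To illustrate the idea (a minimal sketch with assumed example paths; the PR itself does the split in filter_data and the join in get_data):

```python
from pathlib import Path

# The cache stores paths relative to the dataset root, so moving or
# renaming the root does not invalidate cached entries.
cached_entry = Path("images") / "train" / "000000000001.jpg"  # assumed example

dataset_root = Path("/data/coco")        # assumed original location
img_path = dataset_root / cached_entry   # full path rebuilt at load time

dataset_root = Path("/mnt/ssd/coco")     # dataset moved: only the root changes
img_path = dataset_root / cached_entry   # the cache entry is still valid
```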

@yjmm10 (Author) commented Aug 14, 2024

If you don't often rename your dataset, I think the full image path is better.

@@ -100,7 +109,7 @@ def filter_data(self, dataset_path: Path, phase_name: str) -> list:

         labels = self.load_valid_labels(image_id, image_seg_annotations)

-        img_path = images_path / image_name
+        img_path = Path("images") / phase_name / image_name
         data.append((img_path, labels))
@Abdul-Mukit (Contributor) commented Aug 15, 2024

Instead of image_path, can we consider using image_id as the key?
That is, do this instead:
data.append((image_id, labels)), where image_id = Path(image_name).stem
This should meet the requirement of this PR that we can move the dataset around freely.
It also makes the code a bit more predictable. I think it is more intuitive for a cache to have keys that are just file names, without information about the train/test split. In txt-based YOLO datasets, the only thing an image and its label .txt have in common is the name without the extension, so it is fitting that the cache keys share that too. Furthermore, you already have the phase name in the file names train.cache and val.cache; there is no need to include phase_name again inside the *.cache files.

The current cache check only tests whether the train/test split matches phase_name: if data[0][0].parent == Path("images")/phase_name:. With this change it would no longer be needed. Please also see my comment about this line inside the load_data function.
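A rough sketch of that suggestion (example values assumed; labels stands in for the result of load_valid_labels):

```python
from pathlib import Path

data = []
image_name = "000000000001.jpg"     # assumed example file name
labels = []                         # stands in for load_valid_labels(...)

# Store only the bare file stem: no path, no train/test split information.
image_id = Path(image_name).stem    # -> "000000000001"
data.append((image_id, labels))

# Later, e.g. in get_data, the full path is rebuilt from context the dataset
# already knows (extension handling is discussed further below):
dataset_path, phase_name = Path("/data/coco"), "train"
img_path = dataset_path / "images" / phase_name / f"{image_id}.jpg"
```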

@yjmm10 (Author) replied

That's a good idea; it solves both the input-path question and the validation question.

@@ -54,6 +55,14 @@ def load_data(self, dataset_path: Path, phase_name: str):
         else:
             data = torch.load(cache_path)
             logger.info("📦 Loaded {} cache", phase_name)
+            # Validate cache
+            if data[0][0].parent == Path("images") / phase_name:
@Abdul-Mukit (Contributor) commented Aug 15, 2024

This is not verifying the content of the cache; it only checks whether the phase_name read back from the cache matches. Assuming we only save the image_id (just the file name without extension), we should verify two things (sketched below):

  1. The number of images listed in instances_<phase_name>.json equals the number of elements in data.
  2. The set of image names (excluding extensions) from the JSON file equals the set of image_ids recovered from the cache.
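A minimal sketch of those two checks (the annotations path and JSON layout are assumed to follow the usual COCO convention; validate_cache is a hypothetical helper):

```python
import json
from pathlib import Path

def validate_cache(data: list, dataset_path: Path, phase_name: str) -> bool:
    """Check a loaded cache of (image_id, labels) tuples against the JSON."""
    # Assumed COCO-style location: <root>/annotations/instances_<phase>.json
    json_path = dataset_path / "annotations" / f"instances_{phase_name}.json"
    with open(json_path) as f:
        images = json.load(f)["images"]

    # Check 1: the cache has exactly as many entries as the JSON lists images.
    if len(images) != len(data):
        return False

    # Check 2: the sets of image ids (file names without extension) match.
    json_ids = {Path(img["file_name"]).stem for img in images}
    cache_ids = {image_id for image_id, _ in data}
    return json_ids == cache_ids
```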

@yjmm10 (Author) replied

Regarding the second point, I'd like to add something. If we fully check that the cache is consistent with the JSON file, then for a relatively large dataset the cost is close to regenerating the cache. At present I only compare one of the entries, and quite a few problems remain with that.

@Abdul-Mukit (Contributor) replied

I agree with you. The second part will be costly.

@@ -133,6 +142,7 @@ def load_valid_labels(self, label_path: str, seg_data_one_img: list) -> Union[Te

     def get_data(self, idx):
         img_path, bboxes = self.data[idx]
+        img_path = self.data_root / Path(img_path)
         img = Image.open(img_path).convert("RGB")
@Abdul-Mukit (Contributor) commented Aug 15, 2024

In the currently proposed solution, we split the image_path in filter_data and then complete it back in this function. Distributing the logic like this makes the code less predictable and harder to grasp unless the developer knows the code base by heart. I think we should come up with a better solution.

@yjmm10 (Author) replied

I wonder if it's possible to achieve an effect similar to yolov8's, leaving these paths to be specified by the user. The existing code is overly strict in its requirements for the data format.

@yjmm10 (Author) replied

When opening the image, it is the same: we get the image path from the image_id.

@Abdul-Mukit (Contributor) replied

Not sure if I fully got your point. If you are suggesting that using image_id (the image file name minus the extension) would make things easier, I agree.

@Abdul-Mukit (Contributor) replied

Maybe setting the key to the actual image name, with its file extension, is better?
Saw your comment: #72 (comment)
We already have the dataset_path and phase_name variables available to us, and the "images" folder name is common to both COCO and YOLO datasets. That means we can easily reconstruct the original image path if we use the image file name (with extension) as the key:
image_path = dataset_path / "images" / phase_name / image_file_name

The extension will help resolve the .jpg/.png ambiguity.
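Something like this (example values assumed):

```python
from pathlib import Path

dataset_path, phase_name = Path("/data/coco"), "train"  # assumed example values

# Keeping the extension in the key makes reconstruction a single join and
# keeps "0001.jpg" and "0001.png" distinct, unlike stem-only keys.
for image_file_name in ("0001.jpg", "0001.png"):
    image_path = dataset_path / "images" / phase_name / image_file_name
    print(image_path)
```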
