
[fix] automatically regenerate the cache when it doesn't match #72

Open · yjmm10 wants to merge 3 commits into base: TRAIN

Conversation

@yjmm10 commented Aug 14, 2024



Description

[Please include a summary of the changes and the related issue. (Just overwrite this section directly.)]

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist:

  • The code follows the Python style guide.
  • Code and files are well organized.
  • All tests pass.
  • New code is covered by tests.
  • The pull request is directed to the corresponding topic branch.
  • [Optional] We would be very happy if gitmoji 🧑‍💻 were used in the commit message 💬!

Licensing:

By submitting this pull request, I confirm that:

  • My contribution is made under the MIT License.
  • I have not included any code from questionable or non-compliant sources (GPL, AGPL, etc.).
  • I understand that all contributions to this repository must comply with the MIT License, and I promise that my contributions do not violate this license.
  • I have not used any code or content from sources that conflict with the MIT License or are otherwise legally questionable.

Additional Information

Add any other information about the PR here.

Note:
We hope that pull requests are mainly for implementing or modifying project functionality. If it is a documentation error or a small typo, you are very welcome to send it to [email protected]. Alternatively, you may open an issue to report it. Thank you for your understanding and cooperation!

@yjmm10 (Author) commented Aug 14, 2024

automatically regenerate the cache when it doesn't match

@yjmm10 changed the base branch from main to TRAIN on Aug 14, 2024, 03:46
@henrytsui000 (Member) commented
Hi,

I was indeed planning to write validation for the cache, so you've saved me some time! Thank you for your contribution.

I'll proceed with a code review.

Please make sure your code fully complies with the license (i.e., fulfills the licensing checklist).

Best regards,
Henry Tsui

@yjmm10 (Author) commented Aug 14, 2024

Yes, the code satisfies the license.
I added a workflow for the TRAIN branch; if you don't need it, you can remove it.

@henrytsui000 (Member) commented
Hi,

I'm a bit confused about why we don't use images_path directly. What is the benefit of separating them when loading the data and then combining them back when the dataset's __getitem__ method is called?

@yjmm10 (Author) commented Aug 14, 2024

> Hi,
>
> I'm a bit confused about why we don't use images_path directly. What is the benefit of separating them when loading the data and then combining them back when the dataset's __getitem__ method is called?

At first, I also wanted to use the image path directly. But later, after I reorganized the directory, the entire cache needed to be regenerated. If paths are stored relative to the dataset root directory, then moving the root directory around will not affect the cache, which is particularly useful for keeping the cache valid when the dataset is relatively large.

In addition, this makes validation more practical for a single dataset, because I only sample a few entries from the cache and compare their paths; verifying all of them would be roughly equivalent to regenerating the cache. I noticed that yolov9's cache uses a hash for verification.
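To illustrate the idea (a minimal sketch with assumed example paths; the PR itself does the split in filter_data and the join in get_data):

```python
from pathlib import Path

# The cache stores paths relative to the dataset root, so moving or
# renaming the root does not invalidate cached entries.
cached_entry = Path("images") / "train" / "000000000001.jpg"  # assumed example

dataset_root = Path("/data/coco")        # assumed original location
img_path = dataset_root / cached_entry   # full path rebuilt at load time

dataset_root = Path("/mnt/ssd/coco")     # dataset moved: only the root changes
img_path = dataset_root / cached_entry   # the cache entry is still valid
```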

@yjmm10 (Author) commented Aug 14, 2024

If you don't often rename your dataset, I think the full image path is better.

@@ -100,7 +109,7 @@ def filter_data(self, dataset_path: Path, phase_name: str) -> list:

         labels = self.load_valid_labels(image_id, image_seg_annotations)

-        img_path = images_path / image_name
+        img_path = Path("images") / phase_name / image_name
         data.append((img_path, labels))
@Abdul-Mukit (Contributor) commented Aug 15, 2024

Instead of image_path, can we consider using image_id as the key?
That is, do this instead:
data.append((image_id, labels)), where image_id = Path(image_name).stem
This should meet the requirement of this PR that we can move the dataset around freely.
It also makes the code a bit more predictable. I think it is more intuitive for a cache to have keys that are just file names, without information about the train/test split. In txt-based YOLO datasets, the only thing an image and its label .txt have in common is the name without the extension, so it is fitting that the cache keys share that too. Furthermore, you already have the phase name in the file names train.cache and val.cache; there is no need to include phase_name again inside the *.cache files.

The current cache check only tests whether the train/test split matches phase_name: if data[0][0].parent == Path("images")/phase_name:. With this change it would no longer be needed. Please also see my comment about this line inside the load_data function.
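A rough sketch of that suggestion (example values assumed; labels stands in for the result of load_valid_labels):

```python
from pathlib import Path

data = []
image_name = "000000000001.jpg"     # assumed example file name
labels = []                         # stands in for load_valid_labels(...)

# Store only the bare file stem: no path, no train/test split information.
image_id = Path(image_name).stem    # -> "000000000001"
data.append((image_id, labels))

# Later, e.g. in get_data, the full path is rebuilt from context the dataset
# already knows (extension handling is discussed further below):
dataset_path, phase_name = Path("/data/coco"), "train"
img_path = dataset_path / "images" / phase_name / f"{image_id}.jpg"
```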

@yjmm10 (Author) replied

That's a good idea; it solves both the input-path question and the validation question.

@@ -54,6 +55,14 @@ def load_data(self, dataset_path: Path, phase_name: str):
         else:
             data = torch.load(cache_path)
             logger.info("📦 Loaded {} cache", phase_name)
+            # Validate cache
+            if data[0][0].parent == Path("images") / phase_name:
@Abdul-Mukit (Contributor) commented Aug 15, 2024

This is not verifying the content of the cache; it only checks whether the phase_name read back from the cache matches. Assuming we only save the image_id (just the file name without extension), we should verify two things (sketched below):

  1. The number of images listed in instances_<phase_name>.json equals the number of elements in data.
  2. The set of image names (excluding extensions) from the JSON file equals the set of image_ids recovered from the cache.
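A minimal sketch of those two checks (the annotations path and JSON layout are assumed to follow the usual COCO convention; validate_cache is a hypothetical helper):

```python
import json
from pathlib import Path

def validate_cache(data: list, dataset_path: Path, phase_name: str) -> bool:
    """Check a loaded cache of (image_id, labels) tuples against the JSON."""
    # Assumed COCO-style location: <root>/annotations/instances_<phase>.json
    json_path = dataset_path / "annotations" / f"instances_{phase_name}.json"
    with open(json_path) as f:
        images = json.load(f)["images"]

    # Check 1: the cache has exactly as many entries as the JSON lists images.
    if len(images) != len(data):
        return False

    # Check 2: the sets of image ids (file names without extension) match.
    json_ids = {Path(img["file_name"]).stem for img in images}
    cache_ids = {image_id for image_id, _ in data}
    return json_ids == cache_ids
```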

@yjmm10 (Author) replied

Regarding the second point, I'd like to add something. If we fully check that the cache is consistent with the JSON file, then for a relatively large dataset the cost is close to regenerating the cache. At present I only compare one of the entries, and quite a few problems remain with that.

@Abdul-Mukit (Contributor) replied

I agree with you. The second part will be costly.

@@ -133,6 +142,7 @@ def load_valid_labels(self, label_path: str, seg_data_one_img: list) -> Union[Te

     def get_data(self, idx):
         img_path, bboxes = self.data[idx]
+        img_path = self.data_root / Path(img_path)
         img = Image.open(img_path).convert("RGB")
@Abdul-Mukit (Contributor) commented Aug 15, 2024

In the currently proposed solution, we split the image_path in filter_data and then complete it back in this function. Distributing the logic like this makes the code less predictable and harder to grasp unless the developer knows the code base by heart. I think we should come up with a better solution.

@yjmm10 (Author) replied

I wonder if it's possible to achieve an effect similar to yolov8's, leaving these paths to be specified by the user. The existing code is overly strict in its requirements for the data format.

@yjmm10 (Author) replied

When opening the image, it is the same: we get the image path from the image_id.

@Abdul-Mukit (Contributor) replied

Not sure if I fully got your point. If you are suggesting that using image_id (the image file name minus the extension) would make things easier, I agree.

@Abdul-Mukit (Contributor) replied

Maybe setting the key to the actual image name, with its file extension, is better?
Saw your comment: #72 (comment)
We already have the dataset_path and phase_name variables available to us, and the "images" folder name is common to both COCO and YOLO datasets. That means we can easily reconstruct the original image path if we use the image file name (with extension) as the key:
image_path = dataset_path / "images" / phase_name / image_file_name

The extension will help resolve the .jpg/.png ambiguity.
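Something like this (example values assumed):

```python
from pathlib import Path

dataset_path, phase_name = Path("/data/coco"), "train"  # assumed example values

# Keeping the extension in the key makes reconstruction a single join and
# keeps "0001.jpg" and "0001.png" distinct, unlike stem-only keys.
for image_file_name in ("0001.jpg", "0001.png"):
    image_path = dataset_path / "images" / phase_name / image_file_name
    print(image_path)
```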
