-
Notifications
You must be signed in to change notification settings - Fork 121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fix] automatically regenerate the cache when it doesn't match #72
base: TRAIN
Are you sure you want to change the base?
Conversation
automatically regenerate the cache when it doesn't match |
Hi, I was indeed planning to write validation for the cache, so you've saved me some time! Thank you for your contribution. I'll proceed with a code review. Please make sure your code fully complies with the license (i.e., fulfills the licensing checklist). Best regards, |
Yes, the code satisftis the lincese. |
Hi, I'm a bit confused about why we don't use |
At first, I also wanted to directly use the image path. But later, after I modified the directory file, the entire cache needed to be regenerated. If the relative dataset root directory is used, then modifying the path of the root directory randomly will not effect the cache, which is particularly effective in keeping the cache available when the data set is relatively large. In addition, the verification of this method is more effective for supporting a single data set. Because I extracted some data from it to compare the paths. If all are verified, it is equivalent to regenerating. I saw that there is a hash for verification in the cache of yolov9. |
If you don't often rename your dataset, I think the full image path is better. |
@@ -100,7 +109,7 @@ def filter_data(self, dataset_path: Path, phase_name: str) -> list: | |||
|
|||
labels = self.load_valid_labels(image_id, image_seg_annotations) | |||
|
|||
img_path = images_path / image_name | |||
img_path = Path("images") / phase_name / image_name | |||
data.append((img_path, labels)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of image_path
can we consider using image_id
as the key?
That is instead do this:
data.append((image_id, labels))
where image_id = Path(image_name).stem
This should meet the requirement of this MR that we can move around dataset randomly.
This also makes the code a bit more predictable. I think it is more intuitive that a cache has keys that are just file names and not the information about train / test split. In txt based YOLO datasets the only thing common among image and label.txt is the common name without the extension. I think it is fitting that the cache should have that in common too. Furthermore, you already have the phase name in the file name train.cache
and val.cache
. No need to include phase_name
again inside the *.cache files.
The current check for cache is just checking whether the train/test split are same as phase_name
: if data[0][0].parent == Path("images")/phase_name:
. This will no longer be needed. Please also see my comment about this line inside the load_data
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good idea that solved the input info and the question of validation.
@@ -54,6 +55,14 @@ def load_data(self, dataset_path: Path, phase_name: str): | |||
else: | |||
data = torch.load(cache_path) | |||
logger.info("📦 Loaded {} cache", phase_name) | |||
# Validate cache | |||
if data[0][0].parent == Path("images")/phase_name: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not verifying the content of the cache. This is just checking whether the phase_name
is the same in the data
read from cache.
Assuming we only save the image_id
(just the file name without extension), we should be verifying two things:
- Lenght of images mentioned in the instances_<phase_name>.json is equals to the number of elements in
data
. - Is the
set
of image names (excluding extension) form the json file is equal to theset
ofimage_id
s recovered from cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the second question, I'd like to make an addition. If a complete check is made to see if the cache is consistent with the json file, when dealing with a relatively large data set, the efficiency is similar to regenerating the cache. At present, I only make a comparison for one of them, and there are still quite a few problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you. The second part will be costly.
@@ -133,6 +142,7 @@ def load_valid_labels(self, label_path: str, seg_data_one_img: list) -> Union[Te | |||
|
|||
def get_data(self, idx): | |||
img_path, bboxes = self.data[idx] | |||
img_path = self.data_root / Path(img_path) | |||
img = Image.open(img_path).convert("RGB") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current proposed solution, we are splitting the image_path in filter_data
then we are completing it back in this function. Such distribution makes the code less predictable and difficult to grasp unless the developers know the code base from the heart. I think we should come up with a better solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it's possible to achieve a similar effect as in yolov8, leaving these paths to be specified by the users. The existing code is overly strict in its requirements for the data format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When open image, it's the same that get the iamge path from image_id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, if I fully got your point. If you are suggesting that using image_id (image file name minus the extension) would make things easier, I agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe setting the key to the actual image name with file extension is better?
Saw your comment: #72 (comment)
We already have available to us the dataset_path
, phase_name
variable. "images" folder name string is common in both coco and yolo datasets. That means we can reconstruct the original path of the image easily if we use the image file name with extension as the key.
image_path = dataset_path / "Images" / phase_name / image_file_name
Extions will help resolve .jpg/.png ambiguity.
name: Pull Request
about: Suggest an idea for this project
title: ''
labels: ''
assignees: ''
Description
[Please include a summary of the changes and the related issue. (Just overwrite this session directly)]
Type of Change
Please delete options that are not relevant.
Checklist:
Licensing:
By submitting this pull request, I confirm that:
Additional Information
Add any other information about the PR here.
Note:
We hope that the pull request is mainly for the implementation or modification of the project function. If it is a documentation error or a small typo, we are very welcome to send it to [email protected]. Alternatively, you may open an issue to report these. Thank you for your understanding and cooperation!