Improve Robustness and Error Handling in ImageFolder Dataset Builder #5567
Closed
swalehmwadime wants to merge 3 commits into tensorflow:master from swalehmwadime:master
fix: Improve robustness, error handling, and performance in ImageFolder dataset builder
- Introduced uniform seeding using `hash(split_name)` in `_get_split_label_images` to ensure more consistent shuffling.
- Added validation to check if `root_dir` exists before proceeding with data extraction.
- Removed unused parameters such as `read_config` in `_as_dataset` method.
- Enhanced docstrings for better clarity.
- Improved error handling for non-existent directories in `_get_split_label_images`.
- General cleanup and performance considerations for handling large datasets.
Collaborator
Thank you for the contribution! Some tests are failing, please fix.
Overview
This PR enhances the `ImageFolder` dataset builder with several improvements to robustness, error handling, and potential performance considerations. Key changes include:
- Uses `hash(split_name)` for seed generation, intended to make shuffling consistent across runs.
- Checks that `root_dir` exists, preventing runtime errors related to non-existent directories.

Key Changes
- Uses `hash(split_name)` for consistent random seeding.
- Validates `root_dir` before attempting to load data, raising an appropriate error if it doesn't exist.
- Simplifies the `_as_dataset` method by removing the unused `read_config` parameter.

Why This Matters
These updates are intended to make the `ImageFolder` dataset builder more reliable and easier to use, especially when working with large and complex datasets. The improvements to shuffling, error handling, and code clarity should help users avoid common pitfalls and may improve overall performance.

Testing and Validation
These changes have been tested with various directory structures and image formats to ensure they work as expected. The deterministic shuffling and directory validation significantly improve the reliability of the dataset builder.
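For illustration only, the deterministic shuffling and directory validation mentioned above could look roughly like the sketch below. It is not the code from this PR: `seed_from_split_name` and `shuffled_image_paths` are hypothetical helpers, and the `root_dir/<split>/<label>/<image>` layout is an assumption. One deliberate difference: Python's built-in `hash()` on strings is salted per process unless `PYTHONHASHSEED` is set, so the sketch swaps in a `hashlib` digest to get a seed that is stable across runs.

```python
import hashlib
import os
import random
from typing import List


def seed_from_split_name(split_name: str) -> int:
    """Derive a process-independent 32-bit seed from a split name (hypothetical helper)."""
    digest = hashlib.sha256(split_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def shuffled_image_paths(root_dir: str, split_name: str) -> List[str]:
    """List files under root_dir/split_name/<label>/ in a seeded, repeatable order.

    Assumes a root_dir/<split>/<label>/<image> layout; this is an illustration,
    not the builder's actual implementation.
    """
    # Fail early with a clear message instead of a late, obscure error.
    if not os.path.isdir(root_dir):
        raise FileNotFoundError(f"root_dir does not exist: {root_dir}")
    split_dir = os.path.join(root_dir, split_name)
    if not os.path.isdir(split_dir):
        raise FileNotFoundError(f"split directory does not exist: {split_dir}")

    paths = []
    # Sort before shuffling so the result does not depend on filesystem order.
    for label in sorted(os.listdir(split_dir)):
        label_dir = os.path.join(split_dir, label)
        if not os.path.isdir(label_dir):
            continue
        for fname in sorted(os.listdir(label_dir)):
            paths.append(os.path.join(label_dir, fname))

    # A dedicated Random instance avoids touching the global random state.
    random.Random(seed_from_split_name(split_name)).shuffle(paths)
    return paths
```

Calling `shuffled_image_paths(root_dir, "train")` twice on the same directory tree should return the same order, and a missing `root_dir` raises `FileNotFoundError` before any files are read.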
Future Work
While this PR focuses on improving robustness and error handling, future work could explore further optimizations, such as caching and prefetching strategies, to enhance performance when dealing with large datasets.