is_dir() and is_file() are not working properly for gs #493

Open
shanirosen-airis opened this issue Dec 24, 2024 · 7 comments

shanirosen-airis commented Dec 24, 2024

Hey,
I tried to use the is_dir and is_file functions from both s3 and gs, but discovered that:

from cloudpathlib import AnyPath

p1 = AnyPath("gs://my-bucket/test/test_dir")
p2 = AnyPath("gs://my-bucket/test/test_dir/")

print(p1.is_dir())  # True
print(p2.is_dir())  # False

p1 = AnyPath("s3://my-bucket/test/test_dir")
p2 = AnyPath("s3://my-bucket/test/test_dir/")

print(p1.is_dir())  # True
print(p2.is_dir())  # True

It looks like in gs everything is classified as a file unless I strip the trailing "/".
Any idea why this is happening?
I'm using cloudpathlib==0.20.0.

Thanks in advance

pjbull (Member) commented Dec 24, 2024

This doesn't repro generically, so it has something to do with the configuration of your bucket/storage and the objects that actually exist in your storage. For example, I see:

In [1]: from cloudpathlib import CloudPath

In [2]: CloudPath('gs://cloudpathlib-test-bucket/performance_tests/').is_dir()
Out[2]: True

In [3]: CloudPath('gs://cloudpathlib-test-bucket/performance_tests').is_dir()
Out[3]: True

Do you have more information about your use case?

A few helpful questions:

shanirosen-airis (Author) commented Dec 29, 2024

Thanks for the response! I'm still getting this for some reason. I just created a folder using the GCS web UI:
[image]

>>> from cloudpathlib import AnyPath
>>> p1 = AnyPath("gs://airis-packages-tests/folder1")
>>> p1.exists()
True
>>> p1.is_dir()
True
>>> p2 = AnyPath("gs://airis-packages-tests/folder1/")
>>> p2.is_dir()
False

For some reason, the blob metadata function only returns a result for "folder1/"; for "folder1" I got None:

Blob: folder1/
Bucket: airis-packages-tests
Storage class: STANDARD
ID: airis-packages-tests/folder1//1735484279960870
Size: 0 bytes
Updated: 2024-12-29 14:58:00.020000+00:00
Generation: 1735484279960870
Metageneration: 1
Etag: CKbqlOCezYoDEAE=
Owner: None
Component count: None
Crc32c: AAAAAA==
md5_hash: 1B2M2Y8AsgTpgAmY7PhCfg==
Cache-control: None
Content-type: text/plain
Content-disposition: None
Content-encoding: None
Content-language: None
Metadata: None
Medialink: https://storage.googleapis.com/download/storage/v1/b/airis-packages-tests/o/folder1%2F?generation=1735484279960870&alt=media
Custom Time: None
Temporary hold:  disabled
Event based hold:  disabled
Retention mode: None
Retention retain until time: None

From this, it looks like the blob object's name ends with "/", and I don't know why. Maybe that's the issue?
The bucket itself was created with GCP's default configuration.

shanirosen-airis (Author) commented

Quick update: I tried creating a path with gsutil (gsutil cp file1.txt gs://airis-packages-tests/folder8/123.txt), and is_dir worked perfectly fine on folder8. It seems the issue is with folders created through the GCS UI; maybe it creates an empty blob or something like that. Is it possible to support that as well? Thanks in advance!

shanirosen-airis (Author) commented

Hey @pjbull, any updates on this? 😅

fafnirZ (Contributor) commented Feb 12, 2025

Similarly, I created a folder via the GUI on a simulated fs bucket (default), and the following snippet returned something interesting:

bucket.get_blob("aaa/")._properties

returns

{"kind": "storage#object", "id": "{bucket}/aaa//{some_id} "...}

It seems to me that GUI-generated folders are treated as file objects (but not really, since this object can have children?).

I also performed:

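from cloudpathlib import GSPath

p = GSPath("gs://bucket/aaa/file.txt")  # GSPath requires the gs:// scheme
p.touch()   # create an object under the GUI-created aaa/ "folder"
p.unlink()  # delete it again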

and even after that operation the folder seems to persist in gcs for some weird reason, whereas non gui created folders automatically gets deleted if all children blobs are removed.

As for why .is_dir() fails, my guess is that at
https://github.com/drivendataorg/cloudpathlib/blob/master/cloudpathlib/gs/gsclient.py#L155
is_file_or_dir considers the path a file object as long as bucket.get_blob returns a non-null value.
Note that I tested .is_file() and it returned True.

So a potential fix is to add a check inside the if block on L155 that makes an exception for names ending in /,
i.e.:

        if blob is not None:
            if blob.name.endswith("/"):
                return "dir"
            return "file"

Could something like that work? Happy to submit a fix if this solution is acceptable.
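To make the proposal easier to reason about, here is a minimal, self-contained sketch that simulates the lookup with an in-memory set of object names instead of a real bucket. The names `FAKE_BUCKET` and `classify` are hypothetical, and this only approximates the logic described above (exact-name match first, then a prefix check); it is not cloudpathlib's actual implementation.

```python
# Hypothetical in-memory stand-in for a bucket: just a set of object names.
# "aaa/" mimics the zero-byte placeholder that the console GUI creates.
FAKE_BUCKET = {"aaa/", "bbb/file.txt"}


def classify(key, names, trailing_slash_is_dir=False):
    """Return "file", "dir", or None for `key`.

    Approximates the described logic: an exact name match counts as a file;
    otherwise any object under the `key/` prefix makes it a directory.
    `trailing_slash_is_dir` toggles the proposed exception.
    """
    if key in names:
        if trailing_slash_is_dir and key.endswith("/"):
            return "dir"
        return "file"
    prefix = key if key.endswith("/") else key + "/"
    if any(name.startswith(prefix) for name in names):
        return "dir"
    return None


print(classify("aaa", FAKE_BUCKET))    # no exact match, prefix check finds "aaa/"
print(classify("aaa/", FAKE_BUCKET))   # exact match on the placeholder object
print(classify("aaa/", FAKE_BUCKET, trailing_slash_is_dir=True))
```

Under these assumptions the sketch reproduces the reported asymmetry: the path without the trailing slash is classified as a directory, while the path with it matches the placeholder object and is classified as a file unless the exception is enabled.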

jayqi (Member) commented Feb 12, 2025

Hi @shanirosen-airis, @fafnirZ,

Unfortunately, this is a tricky gotcha that we haven't totally figured out the right UX for.

It is indeed the case that when you create a folder in the web console GUI, there is a fake file that gets created with the folder's name.

This is because object stores generally have a "flat" address space. Unlike the file system on your computer, they don't have the concept of a folder at all. This is true not just of GCS, but also Amazon S3 and Azure Blob Storage. When you have an object gs://somebucket/somedir/myfile.txt, the object's identifier is the entire string somedir/myfile.txt where the / character is not special and no different than the r or the m.
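As a rough illustration of the flat address space, the toy model below treats a bucket as a single mapping from full object name to content. The names `store`, `get_object`, and `has_prefix` are made up for this sketch and do not correspond to any cloud SDK.

```python
# A toy object store: one flat mapping from full object name to content.
store = {
    "somedir/myfile.txt": b"hello",
    "somedir/other.txt": b"world",
}


def get_object(name):
    """Exact-match lookup; the "/" character gets no special treatment."""
    return store.get(name)


def has_prefix(prefix):
    """True if any object name starts with `prefix` (how a "folder" is inferred)."""
    return any(name.startswith(prefix) for name in store)


print(get_object("somedir/myfile.txt"))  # the object exists under its full name
print(get_object("somedir/"))            # no object is named "somedir/"
print(has_prefix("somedir/"))            # but some names start with "somedir/"
```

In this model there is nothing stored for `somedir` or `somedir/`; a "directory" is only something inferred from the names of other objects.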

The natural question is then: what is the console GUI doing? The web consoles for these services parse the paths and split on the / to give you the illusion of directories. That's why you see the behavior "non gui created folders automatically gets deleted if all children blobs are removed". It's because those folders weren't real and never existed, not because they get deleted when you delete the file.

Then, why the behavior when you use the "Create folder" action? The only way for the console GUI to know to show you a folder at somedir/ before you create any objects with a name somedir/{whatever} is if it creates a dummy placeholder object named somedir/. This also means that @fafnirZ's use of "children objects" is incorrect. Two objects somedir/ and somedir/myfile.txt are two sibling objects in a flat object store.
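The folder illusion and the role of the placeholder can be sketched with a small toy model. This is not the console's actual code; `list_level` is a hypothetical helper that groups flat object names by splitting on "/", which is roughly what a delimiter-based listing does.

```python
def list_level(names, prefix=""):
    """Group flat object names into pseudo-folders and files under `prefix`."""
    folders, files = set(), []
    for name in names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        if rest == "":
            continue  # the placeholder object for `prefix` itself
        head, sep, _ = rest.partition("/")
        if sep:
            folders.add(head + "/")  # everything up to the first "/" looks like a folder
        else:
            files.append(rest)
    return sorted(folders), sorted(files)


# "somedir/" is a placeholder object; "gsutil_dir/" exists only as a name prefix.
names = ["somedir/", "gsutil_dir/123.txt"]
print(list_level(names))

# Deleting the only object under "gsutil_dir/" makes that "folder" vanish,
# while "somedir/" remains because the placeholder object is still there.
names.remove("gsutil_dir/123.txt")
print(list_level(names))
```

In this model, a folder created only by uploading an object disappears with its last object, whereas a folder backed by a placeholder persists until the placeholder itself is deleted.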

So, at a basic level, cloudpathlib is doing the correct naive thing because it detects that there is a file in your bucket named somedir/. It is understandably confusing that cloudpathlib doesn't match the behavior of the console GUI, which has some logic by which it pretends that these are folders. We probably need to figure out what heuristic each cloud's console GUI uses to treat the dummy objects as folders and copy it. A quick fix like "check if there's a trailing /" may not be safe because a user can legitimately create a real object named something/ that they want to treat as a file.
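For discussion purposes only, one candidate heuristic that is stricter than a bare trailing-slash check is sketched below. It is an assumption, not a verified description of what any console does: it treats an object as a folder placeholder only if its name ends in "/" and it is empty, which matches the `Size: 0 bytes` reported for `folder1/` earlier in this thread. The function name is hypothetical.

```python
def looks_like_folder_placeholder(name, size):
    """Hypothetical heuristic: a zero-byte object whose name ends in "/".

    Assumption for illustration only; the real rule used by each cloud's
    console GUI has not been confirmed in this thread.
    """
    return name.endswith("/") and size == 0


print(looks_like_folder_placeholder("folder1/", 0))     # GUI-style placeholder
print(looks_like_folder_placeholder("something/", 12))  # real object with content
print(looks_like_folder_placeholder("folder1", 0))      # empty file, no trailing slash
```

Even this stricter check would still misclassify a legitimately empty object named something/, so it narrows the safety concern rather than removing it.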

jayqi (Member) commented Feb 12, 2025

See also #51 for discussion about this and why this is not straightforward.
