diff --git a/HISTORY.md b/HISTORY.md index cfacb629..dc733aad 100644 --- a/HISTORY.md +++ b/HISTORY.md @@ -3,6 +3,7 @@ ## Unreleased - Fixed `rmtree` fail on Azure with no `hns` and more than 256 blobs to drop (Issue [#509](https://github.com/drivendataorg/cloudpathlib/issues/509), PR [#508](https://github.com/drivendataorg/cloudpathlib/pull/508), thanks @alikefia) +- Added support for http(s) urls with `HttpClient`, `HttpPath`, `HttpsClient`, and `HttpsPath`. (Issue [#455](https://github.com/drivendataorg/cloudpathlib/issues/455 ), PR [#468](https://github.com/drivendataorg/cloudpathlib/pull/468)) ## v0.21.0 (2025-03-03) diff --git a/README.md b/README.md index 5ca8ef50..f76eb223 100644 --- a/README.md +++ b/README.md @@ -124,88 +124,97 @@ list(root_dir.glob('**/*.txt')) Most methods and properties from `pathlib.Path` are supported except for the ones that don't make sense in a cloud context. There are a few additional methods or properties that relate to specific cloud services or specifically for cloud paths. -| Methods + properties | `AzureBlobPath` | `S3Path` | `GSPath` | -|:-----------------------|:------------------|:-----------|:-----------| -| `absolute` | ✅ | ✅ | ✅ | -| `anchor` | ✅ | ✅ | ✅ | -| `as_uri` | ✅ | ✅ | ✅ | -| `drive` | ✅ | ✅ | ✅ | -| `exists` | ✅ | ✅ | ✅ | -| `glob` | ✅ | ✅ | ✅ | -| `is_absolute` | ✅ | ✅ | ✅ | -| `is_dir` | ✅ | ✅ | ✅ | -| `is_file` | ✅ | ✅ | ✅ | -| `is_relative_to` | ✅ | ✅ | ✅ | -| `iterdir` | ✅ | ✅ | ✅ | -| `joinpath` | ✅ | ✅ | ✅ | -| `match` | ✅ | ✅ | ✅ | -| `mkdir` | ✅ | ✅ | ✅ | -| `name` | ✅ | ✅ | ✅ | -| `open` | ✅ | ✅ | ✅ | -| `parent` | ✅ | ✅ | ✅ | -| `parents` | ✅ | ✅ | ✅ | -| `parts` | ✅ | ✅ | ✅ | -| `read_bytes` | ✅ | ✅ | ✅ | -| `read_text` | ✅ | ✅ | ✅ | -| `relative_to` | ✅ | ✅ | ✅ | -| `rename` | ✅ | ✅ | ✅ | -| `replace` | ✅ | ✅ | ✅ | -| `resolve` | ✅ | ✅ | ✅ | -| `rglob` | ✅ | ✅ | ✅ | -| `rmdir` | ✅ | ✅ | ✅ | -| `samefile` | ✅ | ✅ | ✅ | -| `stat` | ✅ | ✅ | ✅ | -| `stem` | ✅ | ✅ | ✅ | -| `suffix` | ✅ | ✅ | ✅ | -| `suffixes` | ✅ | ✅ | ✅ | -| `touch` | ✅ | ✅ | ✅ | -| `unlink` | ✅ | ✅ | ✅ | -| `with_name` | ✅ | ✅ | ✅ | -| `with_stem` | ✅ | ✅ | ✅ | -| `with_suffix` | ✅ | ✅ | ✅ | -| `write_bytes` | ✅ | ✅ | ✅ | -| `write_text` | ✅ | ✅ | ✅ | -| `as_posix` | ❌ | ❌ | ❌ | -| `chmod` | ❌ | ❌ | ❌ | -| `cwd` | ❌ | ❌ | ❌ | -| `expanduser` | ❌ | ❌ | ❌ | -| `group` | ❌ | ❌ | ❌ | -| `hardlink_to` | ❌ | ❌ | ❌ | -| `home` | ❌ | ❌ | ❌ | -| `is_block_device` | ❌ | ❌ | ❌ | -| `is_char_device` | ❌ | ❌ | ❌ | -| `is_fifo` | ❌ | ❌ | ❌ | -| `is_mount` | ❌ | ❌ | ❌ | -| `is_reserved` | ❌ | ❌ | ❌ | -| `is_socket` | ❌ | ❌ | ❌ | -| `is_symlink` | ❌ | ❌ | ❌ | -| `lchmod` | ❌ | ❌ | ❌ | -| `link_to` | ❌ | ❌ | ❌ | -| `lstat` | ❌ | ❌ | ❌ | -| `owner` | ❌ | ❌ | ❌ | -| `readlink` | ❌ | ❌ | ❌ | -| `root` | ❌ | ❌ | ❌ | -| `symlink_to` | ❌ | ❌ | ❌ | -| `as_url` | ✅ | ✅ | ✅ | -| `clear_cache` | ✅ | ✅ | ✅ | -| `cloud_prefix` | ✅ | ✅ | ✅ | -| `copy` | ✅ | ✅ | ✅ | -| `copytree` | ✅ | ✅ | ✅ | -| `download_to` | ✅ | ✅ | ✅ | -| `etag` | ✅ | ✅ | ✅ | -| `fspath` | ✅ | ✅ | ✅ | -| `is_junction` | ✅ | ✅ | ✅ | -| `is_valid_cloudpath` | ✅ | ✅ | ✅ | -| `rmtree` | ✅ | ✅ | ✅ | -| `upload_from` | ✅ | ✅ | ✅ | -| `validate` | ✅ | ✅ | ✅ | -| `walk` | ✅ | ✅ | ✅ | -| `with_segments` | ✅ | ✅ | ✅ | -| `blob` | ✅ | ❌ | ✅ | -| `bucket` | ❌ | ✅ | ✅ | -| `container` | ✅ | ❌ | ❌ | -| `key` | ❌ | ✅ | ❌ | -| `md5` | ✅ | ❌ | ✅ | +| Methods + properties | `AzureBlobPath` | `GSPath` | `HttpsPath` | `S3Path` | +|:-----------------------|:------------------|:-----------|:--------------|:-----------| +| `absolute` | ✅ 
| ✅ | ✅ | ✅ | +| `anchor` | ✅ | ✅ | ✅ | ✅ | +| `as_uri` | ✅ | ✅ | ✅ | ✅ | +| `drive` | ✅ | ✅ | ✅ | ✅ | +| `exists` | ✅ | ✅ | ✅ | ✅ | +| `glob` | ✅ | ✅ | ✅ | ✅ | +| `is_absolute` | ✅ | ✅ | ✅ | ✅ | +| `is_dir` | ✅ | ✅ | ✅ | ✅ | +| `is_file` | ✅ | ✅ | ✅ | ✅ | +| `is_junction` | ✅ | ✅ | ✅ | ✅ | +| `is_relative_to` | ✅ | ✅ | ✅ | ✅ | +| `iterdir` | ✅ | ✅ | ✅ | ✅ | +| `joinpath` | ✅ | ✅ | ✅ | ✅ | +| `match` | ✅ | ✅ | ✅ | ✅ | +| `mkdir` | ✅ | ✅ | ✅ | ✅ | +| `name` | ✅ | ✅ | ✅ | ✅ | +| `open` | ✅ | ✅ | ✅ | ✅ | +| `parent` | ✅ | ✅ | ✅ | ✅ | +| `parents` | ✅ | ✅ | ✅ | ✅ | +| `parts` | ✅ | ✅ | ✅ | ✅ | +| `read_bytes` | ✅ | ✅ | ✅ | ✅ | +| `read_text` | ✅ | ✅ | ✅ | ✅ | +| `relative_to` | ✅ | ✅ | ✅ | ✅ | +| `rename` | ✅ | ✅ | ✅ | ✅ | +| `replace` | ✅ | ✅ | ✅ | ✅ | +| `resolve` | ✅ | ✅ | ✅ | ✅ | +| `rglob` | ✅ | ✅ | ✅ | ✅ | +| `rmdir` | ✅ | ✅ | ✅ | ✅ | +| `samefile` | ✅ | ✅ | ✅ | ✅ | +| `stat` | ✅ | ✅ | ✅ | ✅ | +| `stem` | ✅ | ✅ | ✅ | ✅ | +| `suffix` | ✅ | ✅ | ✅ | ✅ | +| `suffixes` | ✅ | ✅ | ✅ | ✅ | +| `touch` | ✅ | ✅ | ✅ | ✅ | +| `unlink` | ✅ | ✅ | ✅ | ✅ | +| `walk` | ✅ | ✅ | ✅ | ✅ | +| `with_name` | ✅ | ✅ | ✅ | ✅ | +| `with_segments` | ✅ | ✅ | ✅ | ✅ | +| `with_stem` | ✅ | ✅ | ✅ | ✅ | +| `with_suffix` | ✅ | ✅ | ✅ | ✅ | +| `write_bytes` | ✅ | ✅ | ✅ | ✅ | +| `write_text` | ✅ | ✅ | ✅ | ✅ | +| `as_posix` | ❌ | ❌ | ❌ | ❌ | +| `chmod` | ❌ | ❌ | ❌ | ❌ | +| `cwd` | ❌ | ❌ | ❌ | ❌ | +| `expanduser` | ❌ | ❌ | ❌ | ❌ | +| `group` | ❌ | ❌ | ❌ | ❌ | +| `hardlink_to` | ❌ | ❌ | ❌ | ❌ | +| `home` | ❌ | ❌ | ❌ | ❌ | +| `is_block_device` | ❌ | ❌ | ❌ | ❌ | +| `is_char_device` | ❌ | ❌ | ❌ | ❌ | +| `is_fifo` | ❌ | ❌ | ❌ | ❌ | +| `is_mount` | ❌ | ❌ | ❌ | ❌ | +| `is_reserved` | ❌ | ❌ | ❌ | ❌ | +| `is_socket` | ❌ | ❌ | ❌ | ❌ | +| `is_symlink` | ❌ | ❌ | ❌ | ❌ | +| `lchmod` | ❌ | ❌ | ❌ | ❌ | +| `lstat` | ❌ | ❌ | ❌ | ❌ | +| `owner` | ❌ | ❌ | ❌ | ❌ | +| `readlink` | ❌ | ❌ | ❌ | ❌ | +| `root` | ❌ | ❌ | ❌ | ❌ | +| `symlink_to` | ❌ | ❌ | ❌ | ❌ | +| `as_url` | ✅ | ✅ | ✅ | ✅ | +| `clear_cache` | ✅ | ✅ | ✅ | ✅ | +| `client` | ✅ | ✅ | ✅ | ✅ | +| `cloud_prefix` | ✅ | ✅ | ✅ | ✅ | +| `copy` | ✅ | ✅ | ✅ | ✅ | +| `copytree` | ✅ | ✅ | ✅ | ✅ | +| `download_to` | ✅ | ✅ | ✅ | ✅ | +| `from_uri` | ✅ | ✅ | ✅ | ✅ | +| `fspath` | ✅ | ✅ | ✅ | ✅ | +| `full_match` | ✅ | ✅ | ✅ | ✅ | +| `is_valid_cloudpath` | ✅ | ✅ | ✅ | ✅ | +| `parser` | ✅ | ✅ | ✅ | ✅ | +| `rmtree` | ✅ | ✅ | ✅ | ✅ | +| `upload_from` | ✅ | ✅ | ✅ | ✅ | +| `validate` | ✅ | ✅ | ✅ | ✅ | +| `etag` | ✅ | ✅ | ❌ | ✅ | +| `blob` | ✅ | ✅ | ❌ | ❌ | +| `bucket` | ❌ | ✅ | ❌ | ✅ | +| `md5` | ✅ | ✅ | ❌ | ❌ | +| `container` | ✅ | ❌ | ❌ | ❌ | +| `delete` | ❌ | ❌ | ✅ | ❌ | +| `get` | ❌ | ❌ | ✅ | ❌ | +| `head` | ❌ | ❌ | ✅ | ❌ | +| `key` | ❌ | ❌ | ❌ | ✅ | +| `parsed_url` | ❌ | ❌ | ✅ | ❌ | +| `post` | ❌ | ❌ | ✅ | ❌ | +| `put` | ❌ | ❌ | ✅ | ❌ | ---- diff --git a/cloudpathlib/__init__.py b/cloudpathlib/__init__.py index da4fe28e..84ed31b2 100644 --- a/cloudpathlib/__init__.py +++ b/cloudpathlib/__init__.py @@ -4,9 +4,11 @@ from .azure.azblobclient import AzureBlobClient from .azure.azblobpath import AzureBlobPath from .cloudpath import CloudPath, implementation_registry -from .s3.s3client import S3Client -from .gs.gspath import GSPath from .gs.gsclient import GSClient +from .gs.gspath import GSPath +from .http.httpclient import HttpClient, HttpsClient +from .http.httppath import HttpPath, HttpsPath +from .s3.s3client import S3Client from .s3.s3path import S3Path @@ -27,6 +29,10 @@ "implementation_registry", "GSClient", "GSPath", + "HttpClient", + "HttpsClient", + "HttpPath", + "HttpsPath", "S3Client", "S3Path", ] 
diff --git a/cloudpathlib/cloudpath.py b/cloudpathlib/cloudpath.py index 5845e929..f7621c5b 100644 --- a/cloudpathlib/cloudpath.py +++ b/cloudpathlib/cloudpath.py @@ -27,7 +27,6 @@ Generator, List, Optional, - Sequence, Tuple, Type, TYPE_CHECKING, @@ -299,11 +298,11 @@ def __setstate__(self, state: Dict[str, Any]) -> None: @property def _no_prefix(self) -> str: - return self._str[len(self.cloud_prefix) :] + return self._str[len(self.anchor) :] @property def _no_prefix_no_drive(self) -> str: - return self._str[len(self.cloud_prefix) + len(self.drive) :] + return self._str[len(self.anchor) + len(self.drive) :] @overload @classmethod @@ -909,9 +908,9 @@ def relative_to(self, other: Self, walk_up: bool = False) -> PurePosixPath: # absolute) if not isinstance(other, CloudPath): raise ValueError(f"{self} is a cloud path, but {other} is not") - if self.cloud_prefix != other.cloud_prefix: + if self.anchor != other.anchor: raise ValueError( - f"{self} is a {self.cloud_prefix} path, but {other} is a {other.cloud_prefix} path" + f"{self} is a {self.anchor} path, but {other} is a {other.anchor} path" ) kwargs = dict(walk_up=walk_up) @@ -939,6 +938,9 @@ def full_match(self, pattern: str, case_sensitive: Optional[bool] = None) -> boo # strip scheme from start of pattern before testing if pattern.startswith(self.anchor + self.drive): pattern = pattern[len(self.anchor + self.drive) :] + elif pattern.startswith(self.anchor): + # for http paths, keep leading slash + pattern = pattern[len(self.anchor) - 1 :] # remove drive, which is kept on normal dispatch to pathlib return PurePosixPath(self._no_prefix_no_drive).full_match( # type: ignore[attr-defined] @@ -969,7 +971,7 @@ def parent(self) -> Self: return self._dispatch_to_path("parent") @property - def parents(self) -> Sequence[Self]: + def parents(self) -> Tuple[Self, ...]: return self._dispatch_to_path("parents") @property @@ -1224,7 +1226,7 @@ def copytree(self, destination, force_overwrite_to_cloud=None, ignore=None): ) elif subpath.is_dir(): subpath.copytree( - destination / subpath.name, + destination / (subpath.name + ("" if subpath.name.endswith("/") else "/")), force_overwrite_to_cloud=force_overwrite_to_cloud, ignore=ignore, ) @@ -1258,8 +1260,8 @@ def _new_cloudpath(self, path: Union[str, os.PathLike]) -> Self: path = path[1:] # add prefix/anchor if it is not already - if not path.startswith(self.cloud_prefix): - path = f"{self.cloud_prefix}{path}" + if not path.startswith(self.anchor): + path = f"{self.anchor}{path}" return self.client.CloudPath(path) diff --git a/cloudpathlib/http/__init__.py b/cloudpathlib/http/__init__.py new file mode 100644 index 00000000..ccf7452e --- /dev/null +++ b/cloudpathlib/http/__init__.py @@ -0,0 +1,9 @@ +from .httpclient import HttpClient, HttpsClient +from .httppath import HttpPath, HttpsPath + +__all__ = [ + "HttpClient", + "HttpPath", + "HttpsClient", + "HttpsPath", +] diff --git a/cloudpathlib/http/httpclient.py b/cloudpathlib/http/httpclient.py new file mode 100644 index 00000000..4f1fe87a --- /dev/null +++ b/cloudpathlib/http/httpclient.py @@ -0,0 +1,201 @@ +from datetime import datetime, timezone +import http +import os +import re +import urllib.request +import urllib.parse +import urllib.error +from pathlib import Path +from typing import Iterable, Optional, Tuple, Union, Callable +import shutil +import mimetypes + +from cloudpathlib.client import Client, register_client_class +from cloudpathlib.enums import FileCacheMode + +from .httppath import HttpPath + + +@register_client_class("http") +class 
HttpClient(Client):
+    def __init__(
+        self,
+        file_cache_mode: Optional[Union[str, FileCacheMode]] = None,
+        local_cache_dir: Optional[Union[str, os.PathLike]] = None,
+        content_type_method: Optional[Callable] = mimetypes.guess_type,
+        auth: Optional[urllib.request.BaseHandler] = None,
+        custom_list_page_parser: Optional[Callable[[str], Iterable[str]]] = None,
+        custom_dir_matcher: Optional[Callable[[str], bool]] = None,
+        write_file_http_method: Optional[str] = "PUT",
+    ):
+        """Class constructor. Creates an HTTP client that can be used to interact with HTTP servers
+        using the cloudpathlib library.
+
+        Args:
+            file_cache_mode (Optional[Union[str, FileCacheMode]]): How often to clear the file cache; see
+                [the caching docs](https://cloudpathlib.drivendata.org/stable/caching/) for more information
+                about the options in cloudpathlib.enums.FileCacheMode.
+            local_cache_dir (Optional[Union[str, os.PathLike]]): Path to directory to use as cache
+                for downloaded files. If None, will use a temporary directory. Default can be set with
+                the `CLOUDPATHLIB_LOCAL_CACHE_DIR` environment variable.
+            content_type_method (Optional[Callable]): Function to call to guess media type (mimetype) when
+                uploading files. Defaults to `mimetypes.guess_type`.
+            auth (Optional[urllib.request.BaseHandler]): Authentication handler to use for the client.
+                Defaults to None, which will use the default handler.
+            custom_list_page_parser (Optional[Callable[[str], Iterable[str]]]): Function to call to parse
+                pages that list directories. Defaults to looking for `<a>` tags with `href`.
+            custom_dir_matcher (Optional[Callable[[str], bool]]): Function to call to identify a URL that
+                is a directory. Defaults to a lambda that checks if the path ends with a `/`.
+            write_file_http_method (Optional[str]): HTTP method to use when writing files. Defaults to
+                "PUT", but some servers may want "POST".
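+
+        Example (a minimal sketch; the host and the server's preference for POST
+        are assumptions for illustration):
+
+            >>> client = HttpClient(write_file_http_method="POST")
+            >>> path = client.CloudPath("http://example.com/data/file.txt")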
+ """ + super().__init__(file_cache_mode, local_cache_dir, content_type_method) + self.auth = auth + + if self.auth is None: + self.opener = urllib.request.build_opener() + else: + self.opener = urllib.request.build_opener(self.auth) + + self.custom_list_page_parser = custom_list_page_parser + + self.dir_matcher = ( + custom_dir_matcher if custom_dir_matcher is not None else lambda x: x.endswith("/") + ) + + self.write_file_http_method = write_file_http_method + + def _get_metadata(self, cloud_path: HttpPath) -> dict: + with self.opener.open(cloud_path.as_url()) as response: + last_modified = response.headers.get("Last-Modified", None) + + if last_modified is not None: + # per https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Last-Modified + last_modified = datetime.strptime(last_modified, "%a, %d %b %Y %H:%M:%S %Z") + + # should always be utc https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Last-Modified#gmt + last_modified = last_modified.replace(tzinfo=timezone.utc) + + return { + "size": int(response.headers.get("Content-Length", 0)), + "last_modified": last_modified, + "content_type": response.headers.get("Content-Type", None), + } + + def _download_file(self, cloud_path: HttpPath, local_path: Union[str, os.PathLike]) -> Path: + local_path = Path(local_path) + with self.opener.open(cloud_path.as_url()) as response: + # Ensure parent directory exists before opening file + local_path.parent.mkdir(parents=True, exist_ok=True) + with local_path.open("wb") as out_file: + shutil.copyfileobj(response, out_file) + return local_path + + def _exists(self, cloud_path: HttpPath) -> bool: + request = urllib.request.Request(cloud_path.as_url(), method="HEAD") + try: + with self.opener.open(request) as response: + return response.status == 200 + except (urllib.error.HTTPError, urllib.error.URLError) as e: + if isinstance(e, urllib.error.URLError) or e.code == 404: + return False + raise + + def _move_file(self, src: HttpPath, dst: HttpPath, remove_src: bool = True) -> HttpPath: + # .fspath will download the file so the local version can be uploaded + self._upload_file(src.fspath, dst) + if remove_src: + self._remove(src) + return dst + + def _remove(self, cloud_path: HttpPath, missing_ok: bool = True) -> None: + request = urllib.request.Request(cloud_path.as_url(), method="DELETE") + try: + with self.opener.open(request) as response: + if response.status != 204: + raise Exception(f"Failed to delete {cloud_path}.") + except urllib.error.HTTPError as e: + if e.code == 404 and missing_ok: + pass + else: + raise FileNotFoundError(f"Failed to delete {cloud_path}.") + + def _list_dir(self, cloud_path: HttpPath, recursive: bool) -> Iterable[Tuple[HttpPath, bool]]: + try: + with self.opener.open(cloud_path.as_url()) as response: + # Parse the directory listing + for path, is_dir in self._parse_list_dir_response( + response.read().decode(), base_url=str(cloud_path) + ): + yield path, is_dir + + # If it's a directory and recursive is True, list the contents of the directory + if recursive and is_dir: + yield from self._list_dir(path, recursive=True) + + except Exception as e: # noqa E722 + raise NotImplementedError( + f"Unable to parse response as a listing of files; please provide a custom parser as `custom_list_page_parser`. 
Error raised: {e}"
+            )
+
+    def _upload_file(self, local_path: Union[str, os.PathLike], cloud_path: HttpPath) -> HttpPath:
+        local_path = Path(local_path)
+
+        content_type = None
+        if self.content_type_method is not None:
+            content_type, _ = self.content_type_method(local_path)
+
+        headers = {"Content-Type": content_type or "application/octet-stream"}
+
+        with local_path.open("rb") as file_data:
+            request = urllib.request.Request(
+                cloud_path.as_url(),
+                data=file_data.read(),
+                method=self.write_file_http_method,
+                headers=headers,
+            )
+            with self.opener.open(request) as response:
+                if response.status != 201 and response.status != 200:
+                    raise Exception(f"Failed to upload {local_path} to {cloud_path}.")
+        return cloud_path
+
+    def _get_public_url(self, cloud_path: HttpPath) -> str:
+        return cloud_path.as_url()
+
+    def _generate_presigned_url(self, cloud_path: HttpPath, expire_seconds: int = 60 * 60) -> str:
+        raise NotImplementedError("Presigned URLs are not supported using urllib.")
+
+    def _parse_list_dir_response(
+        self, response: str, base_url: str
+    ) -> Iterable[Tuple[HttpPath, bool]]:
+        # Ensure base_url ends with a trailing slash so joining works
+        if not base_url.endswith("/"):
+            base_url += "/"
+
+        def _simple_links(html: str) -> Iterable[str]:
+            # default parser: pull the href target out of <a> tags on the listing page
+            return re.findall(r'<a\s+[^>]*href="([^"]+)"', html)
+
+        parser = (
+            self.custom_list_page_parser
+            if self.custom_list_page_parser is not None
+            else _simple_links
+        )
+
+        yield from (
+            (
+                self.CloudPath(urllib.parse.urljoin(base_url, link)),
+                self.dir_matcher(urllib.parse.urljoin(base_url, link)),
+            )
+            for link in parser(response)
+        )
+
+    def request(
+        self, url: HttpPath, method: str, **kwargs
+    ) -> Tuple[http.client.HTTPResponse, bytes]:
+        request = urllib.request.Request(url.as_url(), method=method, **kwargs)
+        with self.opener.open(request) as response:
+            # eager read of response content, which is not available after
+            # the connection is closed when we exit the context manager.
+            return response, response.read()
+
+
+HttpClient.HttpPath = HttpClient.CloudPath  # type: ignore
+
+
+@register_client_class("https")
+class HttpsClient(HttpClient):
+    pass
+
+
+HttpsClient.HttpsPath = HttpsClient.CloudPath  # type: ignore
diff --git a/cloudpathlib/http/httppath.py b/cloudpathlib/http/httppath.py
new file mode 100644
index 00000000..3f42a82d
--- /dev/null
+++ b/cloudpathlib/http/httppath.py
@@ -0,0 +1,163 @@
+import datetime
+import http
+import os
+from pathlib import Path, PurePosixPath
+from tempfile import TemporaryDirectory
+from typing import Any, Tuple, TYPE_CHECKING, Union, Optional
+import urllib.parse
+
+from ..cloudpath import CloudPath, NoStatError, register_path_class
+
+
+if TYPE_CHECKING:
+    from .httpclient import HttpClient, HttpsClient
+
+
+@register_path_class("http")
+class HttpPath(CloudPath):
+    cloud_prefix = "http://"
+    client: "HttpClient"
+
+    def __init__(
+        self,
+        cloud_path: Union[str, "HttpPath"],
+        client: Optional["HttpClient"] = None,
+    ) -> None:
+        super().__init__(cloud_path, client)
+
+        self._path = (
+            PurePosixPath(self._url.path)
+            if self._url.path.startswith("/")
+            else PurePosixPath(f"/{self._url.path}")
+        )
+
+    @property
+    def _local(self) -> Path:
+        """Cached local version of the file."""
+        # remove params, query, fragment to get local path
+        return self.client._local_cache_dir / self._url.path.lstrip("/")
+
+    def _dispatch_to_path(self, func: str, *args, **kwargs) -> Any:
+        sup = super()._dispatch_to_path(func, *args, **kwargs)
+
+        # some dispatch methods like "__truediv__" strip trailing slashes;
+        # for http paths, we need to keep them to indicate directories
+        if func == "__truediv__" and str(args[0]).endswith("/"):
+            return self._new_cloudpath(str(sup) + "/")
+
+        else:
+            return sup
+
+    @property
+    def parsed_url(self) -> urllib.parse.ParseResult:
+        return self._url
+
+    @property
+    def drive(self) -> str:
+        # For HTTP paths, no drive; use .anchor for scheme + netloc
+        return self._url.netloc
+
+    @property
+    def anchor(self) -> str:
+        return f"{self._url.scheme}://{self._url.netloc}/"
+
+    @property
+    def _no_prefix_no_drive(self) -> str:
+        # netloc appears in anchor and drive for httppath; so don't double count
+        return self._str[len(self.anchor) - 1 :]
+
+    def is_dir(self, follow_symlinks: bool = True) -> bool:
+        if not self.exists():
+            return False
+
+        # Use client default to identify directories
+        return self.client.dir_matcher(str(self))
+
+    def is_file(self, follow_symlinks: bool = True) -> bool:
+        if not self.exists():
+            return False
+
+        return not self.client.dir_matcher(str(self))
+
+    def mkdir(self, parents: bool = False, exist_ok: bool = False) -> None:
+        pass  # no-op for HTTP Paths
+
+    def touch(self, exist_ok: bool = True) -> None:
+        if self.exists():
+            if not exist_ok:
+                raise FileExistsError(f"File already exists: {self}")
+
+            raise NotImplementedError(
+                "Touch not implemented for existing HTTP files since we can't update the modified time; "
+                "use `put()` or write to the file instead."
+            )
+        else:
+            empty_file = Path(TemporaryDirectory().name) / "empty_file.txt"
+            empty_file.parent.mkdir(parents=True, exist_ok=True)
+            empty_file.write_text("")
+            self.client._upload_file(empty_file, self)
+
+    def stat(self, follow_symlinks: bool = True) -> os.stat_result:
+        try:
+            meta = self.client._get_metadata(self)
+        except:  # noqa E722
+            raise NoStatError(f"Could not get metadata for {self}")
+
+        return os.stat_result(
+            (  # type: ignore
+                None,  # mode
+                None,  # ino
+                self.cloud_prefix,  # dev,
+                None,  # nlink,
+                None,  # uid,
+                None,  # gid,
+                meta.get("size", 0),  # size,
+                None,  # atime,
+                meta.get(
+                    "last_modified", datetime.datetime.fromtimestamp(0)
+                ).timestamp(),  # mtime,
+                None,  # ctime,
+            )
+        )
+
+    def as_url(self, presign: bool = False, expire_seconds: int = 60 * 60) -> str:
+        if presign:
+            raise NotImplementedError("Presigning not supported for HTTP paths")
+
+        return (
+            self._url.geturl()
+        )  # recreate from what was initialized so we have the same query params, etc.
+ + @property + def name(self) -> str: + return self._path.name + + @property + def parents(self) -> Tuple["HttpPath", ...]: + return super().parents + (self._new_cloudpath(""),) + + def get(self, **kwargs) -> Tuple[http.client.HTTPResponse, bytes]: + """Issue a get request with `urllib.request.Request`""" + return self.client.request(self, "GET", **kwargs) + + def put(self, **kwargs) -> Tuple[http.client.HTTPResponse, bytes]: + """Issue a put request with `urllib.request.Request`""" + return self.client.request(self, "PUT", **kwargs) + + def post(self, **kwargs) -> Tuple[http.client.HTTPResponse, bytes]: + """Issue a post request with `urllib.request.Request`""" + return self.client.request(self, "POST", **kwargs) + + def delete(self, **kwargs) -> Tuple[http.client.HTTPResponse, bytes]: + """Issue a delete request with `urllib.request.Request`""" + return self.client.request(self, "DELETE", **kwargs) + + def head(self, **kwargs) -> Tuple[http.client.HTTPResponse, bytes]: + """Issue a head request with `urllib.request.Request`""" + return self.client.request(self, "HEAD", **kwargs) + + +@register_path_class("https") +class HttpsPath(HttpPath): + cloud_prefix: str = "https://" + client: "HttpsClient" diff --git a/docs/docs/http.md b/docs/docs/http.md new file mode 100644 index 00000000..ce1846cf --- /dev/null +++ b/docs/docs/http.md @@ -0,0 +1,208 @@ +# HTTP Support in CloudPathLib + +We support `http://` and `https://` URLs with `CloudPath`, but these behave somewhat differently from typical cloud provider URIs (e.g., `s3://`, `gs://`) or local file paths. This document describes those differences, caveats, and the additional configuration options available. + + > **Note:** We don't currently automatically detect `http` links to cloud storage providers (for example, `http://s3.amazonaws.com/bucket/key`) and treat those as `S3Path`, `GSPath`, etc. They will be treated as normal urls (i.e., `HttpPath` objects). + +## Basic Usage + +```python +from cloudpathlib import CloudPath + +# Create a path object +path = CloudPath("https://example.com/data/file.txt") + +# Read file contents +text = path.read_text() +binary = path.read_bytes() + +# Get parent directory +parent = path.parent # https://example.com/data/ + +# Join paths +subpath = path.parent / "other.txt" # https://example.com/data/other.txt + +# Check if file exists +if path.exists(): + print("File exists!") + +# Get file name and suffix +print(path.name) # "file.txt" +print(path.suffix) # ".txt" + +# List directory contents (if server supports directory listings) +data_dir = CloudPath("https://example.com/data/") +for child_path in data_dir.iterdir(): + print(child_path) +``` + +## How HTTP Paths Differ + + - HTTP servers are not necessarily structured like file systems. Operations such as listing directories, removing files, or creating folders depend on whether the server supports them. + - For many operations (e.g., uploading, removing files), this implementation relies on specific HTTP verbs like `PUT` or `DELETE`. If the server does not allow these verbs, those operations will fail. + - While some cloud storage backends (e.g., AWS S3) provide robust directory emulation, a basic HTTP server may only partially implement these concepts (e.g., listing a directory might just be an HTML page with links). + - HTTP URLs often include more than just a path, for example query strings, fragments, and other URL modifiers that are not part of the path. These are handled differently than with other cloud storage providers. 
+
+## URL components
+
+You can access the various components of a URL via the `HttpPath.parsed_url` property, which is a [`urllib.parse.ParseResult`](https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlparse) object.
+
+For example, for the following URL:
+
+```
+https://username:password@www.example.com:8080/path/to/resource?query=param#fragment
+```
+
+The components are:
+
+```mermaid
+flowchart LR
+
+    %% Define colors for each block
+    classDef scheme fill:#FFD700,stroke:#000,stroke-width:1px,color:#000
+    classDef netloc fill:#ADD8E6,stroke:#000,stroke-width:1px,color:#000
+    classDef path fill:#98FB98,stroke:#000,stroke-width:1px,color:#000
+    classDef query fill:#EE82EE,stroke:#000,stroke-width:1px,color:#000
+    classDef fragment fill:#FFB6C1,stroke:#000,stroke-width:1px,color:#000
+
+    A[".scheme<br>https"]:::scheme
+    B[".netloc<br>username:password@www.example.com:8080"]:::netloc
+    C[".path<br>/path/to/resource"]:::path
+    D[".query<br>query=param"]:::query
+    E[".fragment<br>fragment"]:::fragment
+
+    A --> B --> C --> D --> E
+```
+
+To access the components of the URL, you can use the `HttpPath.parsed_url` property:
+
+```python
+from cloudpathlib import HttpPath
+
+my_path = HttpPath("http://username:password@www.example.com:8080/path/to/resource?query=param#fragment")
+
+print(my_path.parsed_url.scheme)    # "http"
+print(my_path.parsed_url.netloc)    # "username:password@www.example.com:8080"
+print(my_path.parsed_url.path)      # "/path/to/resource"
+print(my_path.parsed_url.query)     # "query=param"
+print(my_path.parsed_url.fragment)  # "fragment"
+
+# extra properties that are subcomponents of `netloc`
+print(my_path.parsed_url.username)  # "username"
+print(my_path.parsed_url.password)  # "password"
+print(my_path.parsed_url.hostname)  # "www.example.com"
+print(my_path.parsed_url.port)      # 8080 (an int)
+```
+
+### Preservation and Joining Behavior
+
+ - **Params, query, and fragment** are part of the URL, but be aware that when you perform operations that return a new path (e.g., joining `my_path / "subdir"`, walking directories, fetching parents, etc.), these modifiers will be discarded unless you explicitly preserve them, since we operate under the assumption that these modifiers are tied to the specific URL.
+ - **netloc (including its subcomponents: username, password, hostname, and port) and scheme** are preserved when joining. They are derived from the main portion of the URL (e.g., `http://username:password@www.example.com:8080`).
+
+### The `HttpPath.anchor` Property
+
+Because of naming conventions inherited from Python's `pathlib`, the "anchor" in a CloudPath (e.g., `my_path.anchor`) refers to `<scheme>://<netloc>/`. It does **not** include the "fragment" portion of a URL (which is sometimes also called the "anchor" in HTML contexts since it can refer to an `<a>` tag). In other words, `.anchor` returns something like `https://www.example.com/`, not `...#fragment`. To get the fragment, use `my_path.parsed_url.fragment`.
+
+## Required server-side HTTP verb support
+
+Some operations require that the server support specific HTTP verbs; if your server does not allow them, those operations will fail. Specifically:
+
+ - If your server does not allow `DELETE`, you will not be able to remove files via `HttpPath.unlink()` or `HttpPath.rmtree()`.
+ - If your server does not allow `PUT` (or `POST`, see next bullet), you won't be able to upload files.
+ - By default, we use `PUT` for creating or replacing a file. If you need `POST` for uploads, you can override the behavior by passing `write_file_http_method="POST"` to the `HttpClient` constructor.
+
+### Making requests with the `HttpPath` object
+
+`HttpPath` and `HttpsPath` expose direct methods to perform the relevant HTTP verbs:
+
+```python
+response, content = my_path.get()     # issues a GET
+response, content = my_path.put()     # issues a PUT
+response, content = my_path.post()    # issues a POST
+response, content = my_path.delete()  # issues a DELETE
+response, content = my_path.head()    # issues a HEAD
+```
+
+These methods are thin wrappers around the client's underlying `request(...)` method, so you can pass any arguments that [`urllib.request.Request`](https://docs.python.org/3/library/urllib.request.html#urllib.request.Request) supports; for example, content via `data=` and headers via `headers=`.
+
+## Authentication
+
+By default, `HttpClient` will build a simple opener with `urllib.request.build_opener()`, which typically handles no or basic system-wide HTTP auth. However, you can pass an implementation of `urllib.request.BaseHandler` (e.g., an `HTTPBasicAuthHandler`) to the `HttpClient` or `HttpsClient` constructors to handle authentication:
+
+```python
+import urllib.request
+
+from cloudpathlib import HttpClient
+
+auth_handler = urllib.request.HTTPBasicAuthHandler()
+auth_handler.add_password(
+    realm="Some Realm",
+    uri="http://www.example.com",
+    user="username",
+    passwd="password"
+)
+
+client = HttpClient(auth=auth_handler)
+my_path = client.CloudPath("http://www.example.com/secret/data.txt")
+
+# Now GET requests will include basic auth headers
+content = my_path.read_text()
+```
+
+This can be extended to more sophisticated authentication approaches (e.g., OAuth, custom headers) by providing your own `BaseHandler` implementation. There are examples on the internet of handlers for most common authentication schemes.
+
+## Directory Assumptions
+
+Directories are handled differently from other `CloudPath` implementations:
+
+ - By default, a URL is considered a directory if it **ends with a slash**. For example, `http://example.com/somedir/`.
+ - If you call `HttpPath.is_dir()`, it checks `my_url.endswith("/")` by default. You can override this with a custom function by passing `custom_dir_matcher` to `HttpClient` (see the example sketch at the end of this page). This allows you to implement custom logic for determining if a URL is a directory. The `custom_dir_matcher` receives a string representing the URL, so if you need to interact with the server to decide, you will need to make those requests within your `custom_dir_matcher` implementation.
+
+### Listing the Contents of a Directory
+
+We attempt to parse directory listings by calling `GET` on the directory URL (which presumably returns an HTML page that has a directory index). Our default parser looks for `<a>` tags and yields them, assuming they are children. You can override this logic with `custom_list_page_parser` if your server's HTML or API returns a different listing format. For example:
+
+```python
+from typing import Iterable
+
+from bs4 import BeautifulSoup
+from cloudpathlib import HttpClient
+
+
+def my_parser(html_content: str) -> Iterable[str]:
+    # for example, just get <a> tags with href and class "file-link"
+    # using beautifulsoup
+    soup = BeautifulSoup(html_content, "html.parser")
+    for link in soup.find_all("a", class_="file-link"):
+        yield link.get("href")
+
+
+client = HttpClient(custom_list_page_parser=my_parser)
+my_dir = client.CloudPath("http://example.com/public/")
+
+for subpath in my_dir.iterdir():
+    print(subpath, "dir" if subpath.is_dir() else "file")
+```
+
+**Note**: If your server doesn't provide an HTML index or a suitable listing format that we can parse, you will see:
+
+```
+NotImplementedError("Unable to parse response as a listing of files; please provide a custom parser as `custom_list_page_parser`.")
+```
+
+In that case, you must provide a custom parser or avoid directory-listing operations altogether.
+
+## HTTP or HTTPS
+
+There are separate classes: `HttpClient`/`HttpPath` for `http://` and `HttpsClient`/`HttpsPath` for `https://`. However, from a usage standpoint, you can use either `CloudPath` or `AnyPath` to dispatch to the right subclass.
+
+```python
+from cloudpathlib import AnyPath, CloudPath
+
+# AnyPath will automatically detect "http://" or "https://" (or local file paths)
+my_path = AnyPath("https://www.example.com/files/info.txt")
+
+# CloudPath will dispatch to the correct subclass
+my_path = CloudPath("https://www.example.com/files/info.txt")
+```
+
+If you explicitly instantiate an `HttpClient`, it will only handle `http://` paths. If you instantiate an `HttpsClient`, it will only handle `https://` paths. But `AnyPath` and `CloudPath` will route to the correct client class automatically.
+
+In general, you should use `HttpsClient` and work with `https://` URLs wherever possible.
+
+## Additional Notes
+
+ - **Caching**: This implementation uses the same local file caching mechanics as other CloudPathLib providers, controlled by `file_cache_mode` and `local_cache_dir`. However, for static HTTP servers, re-downloading or re-checking may not be as efficient as with typical cloud storages that return robust metadata.
+ - **"Move" or "Rename"**: The `_move_file` operation is implemented as an upload followed by a delete. This will fail if your server does not allow both `PUT` and `DELETE`.
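+
+### Example: custom `dir_matcher`
+
+A sketch of the `custom_dir_matcher` hook described above (the no-file-extension heuristic is an assumption for illustration, not a recommendation; `example.com` is a placeholder):
+
+```python
+from cloudpathlib import HttpsClient
+
+
+def looks_like_dir(url: str) -> bool:
+    # treat URLs that end in a slash, or whose last segment has no
+    # extension, as directories
+    last_segment = url.rstrip("/").rsplit("/", 1)[-1]
+    return url.endswith("/") or "." not in last_segment
+
+
+client = HttpsClient(custom_dir_matcher=looks_like_dir)
+my_dir = client.CloudPath("https://example.com/data")  # no trailing slash, but matched as a dir
+```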
+ diff --git a/docs/make_support_table.py b/docs/make_support_table.py index ad06142a..eb3a34f2 100644 --- a/docs/make_support_table.py +++ b/docs/make_support_table.py @@ -12,6 +12,7 @@ def print_table(): lib_methods = { v.path_class.__name__: {m for m in dir(v.path_class) if not m.startswith("_")} for k, v in cloudpathlib.cloudpath.implementation_registry.items() + if k not in ["http"] # just list https in table since they are the same } all_methods = copy(path_base) diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml index 5d710441..29743fb4 100644 --- a/docs/mkdocs.yml +++ b/docs/mkdocs.yml @@ -18,6 +18,7 @@ nav: - Home: "index.md" - Why cloudpathlib?: "why_cloudpathlib.ipynb" - Authentication: "authentication.md" + - HTTP URLs: "http.md" - Caching: "caching.ipynb" - AnyPath: "anypath-polymorphism.md" - Other Client settings: "other_client_settings.md" @@ -46,7 +47,11 @@ nav: markdown_extensions: - admonition - pymdownx.highlight - - pymdownx.superfences + - pymdownx.superfences: + custom_fences: + - name: mermaid + class: mermaid + format: !!python/name:pymdownx.superfences.fence_code_format - toc: permalink: True toc_depth: 3 diff --git a/tests/conftest.py b/tests/conftest.py index 301ffe87..9fc6e5ea 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -1,8 +1,13 @@ +from functools import wraps import os from pathlib import Path, PurePosixPath import shutil +import ssl +import time from tempfile import TemporaryDirectory from typing import Dict, Optional +from urllib.parse import urlparse +from urllib.request import HTTPSHandler from azure.storage.blob import BlobServiceClient from azure.storage.filedatalake import ( @@ -18,6 +23,8 @@ from cloudpathlib import AzureBlobClient, AzureBlobPath, GSClient, GSPath, S3Client, S3Path from cloudpathlib.cloudpath import implementation_registry +from cloudpathlib.http.httpclient import HttpClient, HttpsClient +from cloudpathlib.http.httppath import HttpPath, HttpsPath from cloudpathlib.local import ( local_azure_blob_implementation, LocalAzureBlobClient, @@ -32,6 +39,7 @@ import cloudpathlib.azure.azblobclient from cloudpathlib.azure.azblobclient import _hns_rmtree import cloudpathlib.s3.s3client +from .http_fixtures import http_server, https_server, utilities_dir # noqa: F401 from .mock_clients.mock_azureblob import MockBlobServiceClient, DEFAULT_CONTAINER_NAME from .mock_clients.mock_adls_gen2 import MockedDataLakeServiceClient from .mock_clients.mock_gs import ( @@ -40,6 +48,7 @@ MockTransferManager, ) from .mock_clients.mock_s3 import mocked_session_class_factory, DEFAULT_S3_BUCKET_NAME +from .utils import _sync_filesystem if os.getenv("USE_LIVE_CLOUD") == "1": @@ -115,6 +124,28 @@ def create_test_dir_name(request) -> str: return test_dir +@fixture +def wait_for_mkdir(monkeypatch): + """Fixture that patches os.mkdir to wait for directory creation for tests that sometimes are flaky.""" + original_mkdir = os.mkdir + + @wraps(original_mkdir) + def wrapped_mkdir(path, *args, **kwargs): + result = original_mkdir(path, *args, **kwargs) + _sync_filesystem() + + start = time.time() + + while not os.path.exists(path) and time.time() - start < 5: + time.sleep(0.01) + _sync_filesystem() + + assert os.path.exists(path), f"Directory {path} was not created" + return result + + monkeypatch.setattr(os, "mkdir", wrapped_mkdir) + + def _azure_fixture(conn_str_env_var, adls_gen2, request, monkeypatch, assets_dir): drive = os.getenv("LIVE_AZURE_CONTAINER", DEFAULT_CONTAINER_NAME) test_dir = create_test_dir_name(request) @@ -469,6 +500,82 @@ def 
local_s3_rig(request, monkeypatch, assets_dir): rig.client_class.reset_default_storage_dir() # reset local storage directory +class HttpProviderTestRig(CloudProviderTestRig): + def create_cloud_path(self, path: str, client=None): + """Http version needs to include netloc as well""" + if client: + return client.CloudPath( + cloud_path=f"{self.path_class.cloud_prefix}{self.drive}/{self.test_dir}/{path}" + ) + else: + return self.path_class( + cloud_path=f"{self.path_class.cloud_prefix}{self.drive}/{self.test_dir}/{path}" + ) + + +@fixture() +def http_rig(request, assets_dir, http_server): # noqa: F811 + test_dir = create_test_dir_name(request) + + host, server_dir = http_server + drive = urlparse(host).netloc + + # copy test assets + shutil.copytree(assets_dir, server_dir / test_dir) + _sync_filesystem() + + rig = CloudProviderTestRig( + path_class=HttpPath, + client_class=HttpClient, + drive=drive, + test_dir=test_dir, + ) + + rig.http_server_dir = server_dir + rig.client_class(**rig.required_client_kwargs).set_as_default_client() # set default client + + yield rig + + rig.client_class._default_client = None # reset default client + shutil.rmtree(server_dir) + _sync_filesystem() + + +@fixture() +def https_rig(request, assets_dir, https_server): # noqa: F811 + test_dir = create_test_dir_name(request) + + host, server_dir = https_server + drive = urlparse(host).netloc + + # copy test assets + shutil.copytree(assets_dir, server_dir / test_dir) + _sync_filesystem() + + skip_verify_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT) + skip_verify_ctx.check_hostname = False + skip_verify_ctx.load_verify_locations(utilities_dir / "insecure-test.pem") + + rig = CloudProviderTestRig( + path_class=HttpsPath, + client_class=HttpsClient, + drive=drive, + test_dir=test_dir, + required_client_kwargs=dict( + auth=HTTPSHandler(context=skip_verify_ctx, check_hostname=False) + ), + ) + + rig.http_server_dir = server_dir + rig.client_class(**rig.required_client_kwargs).set_as_default_client() # set default client + + yield rig + + rig.client_class._default_client = None # reset default client + shutil.rmtree(server_dir) + _sync_filesystem() + + # create azure fixtures for both blob and gen2 storage azure_rigs = fixture_union( "azure_rigs", @@ -478,6 +585,7 @@ def local_s3_rig(request, monkeypatch, assets_dir): ], ) + rig = fixture_union( "rig", [ @@ -489,6 +597,8 @@ def local_s3_rig(request, monkeypatch, assets_dir): local_azure_rig, local_s3_rig, local_gs_rig, + http_rig, + https_rig, ], ) @@ -500,3 +610,12 @@ def local_s3_rig(request, monkeypatch, assets_dir): custom_s3_rig, ], ) + +# run some http-specific tests on http and https +http_like_rig = fixture_union( + "http_like_rig", + [ + http_rig, + https_rig, + ], +) diff --git a/tests/http_fixtures.py b/tests/http_fixtures.py new file mode 100644 index 00000000..d43ce236 --- /dev/null +++ b/tests/http_fixtures.py @@ -0,0 +1,214 @@ +from datetime import datetime +from functools import partial +from http.server import HTTPServer, SimpleHTTPRequestHandler +import os +from pathlib import Path +import shutil +import ssl +import threading +import time +from urllib.request import urlopen +import socket + +from pytest import fixture +from tenacity import retry, stop_after_attempt, wait_fixed + +from .utils import _sync_filesystem + +utilities_dir = Path(__file__).parent / "utilities" + + +class TestHTTPRequestHandler(SimpleHTTPRequestHandler): + """Also allows PUT and DELETE requests for testing.""" + + @retry(stop=stop_after_attempt(5), wait=wait_fixed(0.1)) + def 
do_PUT(self): + length = int(self.headers["Content-Length"]) + path = Path(self.translate_path(self.path)) + + if path.is_dir(): + path.mkdir(parents=True, exist_ok=True) + else: + path.parent.mkdir(parents=True, exist_ok=True) + + _sync_filesystem() + + with path.open("wb") as f: + f.write(self.rfile.read(length)) + + # Ensure the file is flushed and synced to disk before returning + # The perf hit is ok here since this is a test server + f.flush() + os.fsync(f.fileno()) + + now = datetime.now().timestamp() + os.utime(path, (now, now)) + + self.send_response(201) + self.end_headers() + + @retry(stop=stop_after_attempt(5), wait=wait_fixed(0.1)) + def do_DELETE(self): + path = Path(self.translate_path(self.path)) + + try: + if path.is_dir(): + shutil.rmtree(path) + else: + path.unlink() + self.send_response(204) + except FileNotFoundError: + self.send_response(404) + + self.end_headers() + + @retry(stop=stop_after_attempt(5), wait=wait_fixed(0.1)) + def do_POST(self): + # roundtrip any posted JSON data for testing + content_length = int(self.headers["Content-Length"]) + post_data = self.rfile.read(content_length) + self.send_response(200) + self.send_header("Content-type", "application/json") + self.send_header("Content-Length", self.headers["Content-Length"]) + self.end_headers() + self.wfile.write(post_data) + + @retry(stop=stop_after_attempt(5), wait=wait_fixed(0.1)) + def do_GET(self): + super().do_GET() + + @retry(stop=stop_after_attempt(5), wait=wait_fixed(0.1)) + def do_HEAD(self): + super().do_HEAD() + + +def _http_server( + root_dir, + port=None, + hostname="127.0.0.1", + use_ssl=False, + certfile=None, + keyfile=None, + threaded=True, +): + root_dir.mkdir(exist_ok=True) + + scheme = "http" if not use_ssl else "https" + + # Find a free port if not specified + if port is None: + with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: + s.bind((hostname, 0)) + port = s.getsockname()[1] + + def start_server(server_ready_event): + handler = partial(TestHTTPRequestHandler, directory=str(root_dir)) + httpd = HTTPServer((hostname, port), handler) + + if use_ssl: + if not certfile or not keyfile: + raise ValueError("certfile and keyfile must be provided if `ssl=True`") + + context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER) + context.load_cert_chain(certfile=certfile, keyfile=keyfile) + context.check_hostname = False + httpd.socket = context.wrap_socket(httpd.socket, server_side=True) + + server_ready_event.set() + httpd.serve_forever() + + server_ready_event = threading.Event() + if threaded: + server_thread = threading.Thread( + target=start_server, args=(server_ready_event,), daemon=True + ) + server_thread.start() + server_ready_event.wait() + else: + start_server(server_ready_event) + + # Wait for server to be ready to accept connections + max_attempts = 100 + wait_time = 0.2 + + for attempt in range(max_attempts): + try: + if use_ssl: + req_context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT) + req_context.check_hostname = False + req_context.verify_mode = ssl.CERT_NONE + else: + req_context = None + + with urlopen( + f"{scheme}://{hostname}:{port}", context=req_context, timeout=1.0 + ) as response: + if response.status == 200: + break + except Exception: + if attempt == max_attempts - 1: + raise RuntimeError(f"Server failed to start after {max_attempts} attempts") + time.sleep(wait_time) + + return f"{scheme}://{hostname}:{port}", server_thread + + +@fixture(scope="module") +def http_server(tmp_path_factory, worker_id): + # port is now None, so OS will pick a free port + port = None 
+ server_dir = tmp_path_factory.mktemp("server_files").resolve() + host, server_thread = _http_server(server_dir, port) + yield host, server_dir + server_thread.join(0) + if server_dir.exists(): + shutil.rmtree(server_dir) + + +@fixture(scope="module") +def https_server(tmp_path_factory, worker_id): + port = None + server_dir = tmp_path_factory.mktemp("server_files").resolve() + + # # Self‑signed cert for 127.0.0.1 (≈273 years validity) + # openssl req -x509 -out 127.0.0.1.crt -keyout 127.0.0.1.key \ + # -newkey rsa:2048 -nodes -sha256 -days 99999 \ + # -subj '/CN=127.0.0.1' \ + # -extensions EXT -config <( \ + # printf "[dn]\nCN=127.0.0.1\n\ + # [req]\ndistinguished_name = dn\n\ + # [EXT]\nsubjectAltName=IP:127.0.0.1\n\ + # keyUsage=digitalSignature\nextendedKeyUsage=serverAuth" ) + # # Convert to PEM (optional) + # openssl x509 -in 127.0.0.1.crt -out 127.0.0.1.pem -outform PEM + + host, server_thread = _http_server( + server_dir, + port, + use_ssl=True, + certfile=utilities_dir / "insecure-test.pem", + keyfile=utilities_dir / "insecure-test.key", + ) + + # Add this self-signed cert at the library level so it is used in tests + _original_create_context = ssl._create_default_https_context + + def _create_context_with_self_signed_cert(*args, **kwargs): + context = _original_create_context(*args, **kwargs) + context.load_cert_chain( + certfile=utilities_dir / "insecure-test.pem", + keyfile=utilities_dir / "insecure-test.key", + ) + context.load_verify_locations(cafile=utilities_dir / "insecure-test.pem") + return context + + ssl._create_default_https_context = _create_context_with_self_signed_cert + + yield host, server_dir + + ssl._create_default_https_context = _original_create_context + + server_thread.join(0) + + if server_dir.exists(): + shutil.rmtree(server_dir) diff --git a/tests/test_caching.py b/tests/test_caching.py index aefe912e..4fce4f6f 100644 --- a/tests/test_caching.py +++ b/tests/test_caching.py @@ -19,6 +19,7 @@ OverwriteNewerLocalError, ) from tests.conftest import CloudProviderTestRig +from tests.utils import _sync_filesystem def test_defaults_work_as_expected(rig: CloudProviderTestRig): @@ -189,7 +190,7 @@ def test_persistent_mode(rig: CloudProviderTestRig, tmpdir): assert client_cache_dir.exists() -def test_loc_dir(rig: CloudProviderTestRig, tmpdir): +def test_loc_dir(rig: CloudProviderTestRig, tmpdir, wait_for_mkdir): """Tests that local cache dir is used when specified and works' with the different cache modes. 
@@ -250,6 +251,7 @@ def test_loc_dir(rig: CloudProviderTestRig, tmpdir): assert cp.client.file_cache_mode == FileCacheMode.tmp_dir # download from cloud into the cache + _sync_filesystem() with cp.open("r") as f: _ = f.read() diff --git a/tests/test_client.py b/tests/test_client.py index fd58535b..3eceafc8 100644 --- a/tests/test_client.py +++ b/tests/test_client.py @@ -9,6 +9,7 @@ from cloudpathlib import CloudPath from cloudpathlib.client import register_client_class from cloudpathlib.cloudpath import implementation_registry, register_path_class +from cloudpathlib.http.httpclient import HttpClient, HttpsClient from cloudpathlib.s3.s3client import S3Client from cloudpathlib.s3.s3path import S3Path @@ -96,6 +97,10 @@ def _test_write_content_type(suffix, expected, rig_ref, check=True): for suffix, content_type in mimes: _test_write_content_type(suffix, content_type, rig, check=False) + if rig.client_class in [HttpClient, HttpsClient]: + # HTTP client doesn't support custom content types + return + # custom mime type method def my_content_type(path): # do lookup for content types I define; fallback to diff --git a/tests/test_cloudpath_file_io.py b/tests/test_cloudpath_file_io.py index 16c835f9..a7f6f0e9 100644 --- a/tests/test_cloudpath_file_io.py +++ b/tests/test_cloudpath_file_io.py @@ -14,17 +14,25 @@ CloudPathNotImplementedError, DirectoryNotEmptyError, ) +from cloudpathlib.http.httpclient import HttpClient, HttpsClient +from cloudpathlib.http.httppath import HttpPath, HttpsPath def test_file_discovery(rig): p = rig.create_cloud_path("dir_0/file0_0.txt") assert p.exists() - p2 = rig.create_cloud_path("dir_0/not_a_file") + p2 = rig.create_cloud_path("dir_0/not_a_file_yet.file") assert not p2.exists() p2.touch() assert p2.exists() - p2.touch(exist_ok=True) + + if rig.client_class not in [HttpClient, HttpsClient]: # not supported to touch existing + p2.touch(exist_ok=True) + else: + with pytest.raises(NotImplementedError): + p2.touch(exist_ok=True) + with pytest.raises(FileExistsError): p2.touch(exist_ok=False) p2.unlink(missing_ok=False) @@ -83,19 +91,19 @@ def glob_test_dirs(rig, tmp_path): def _make_glob_directory(root): (root / "dirB").mkdir() - (root / "dirB" / "fileB").write_text("fileB") + (root / "dirB" / "fileB.txt").write_text("fileB") (root / "dirC").mkdir() (root / "dirC" / "dirD").mkdir() - (root / "dirC" / "dirD" / "fileD").write_text("fileD") - (root / "dirC" / "fileC").write_text("fileC") - (root / "fileA").write_text("fileA") + (root / "dirC" / "dirD" / "fileD.txt").write_text("fileD") + (root / "dirC" / "fileC.txt").write_text("fileC") + (root / "fileA.txt").write_text("fileA") - cloud_root = rig.create_cloud_path("glob-tests") + cloud_root = rig.create_cloud_path("glob-tests/") cloud_root.mkdir() _make_glob_directory(cloud_root) - local_root = tmp_path / "glob-tests" + local_root = tmp_path / "glob-tests/" local_root.mkdir() _make_glob_directory(local_root) @@ -108,7 +116,7 @@ def _make_glob_directory(root): def _lstrip_path_root(path, root): rel_path = str(path)[len(str(root)) :] - return rel_path.rstrip("/") # agnostic to trailing slash + return rel_path.strip("/") def _assert_glob_results_match(cloud_results, local_results, cloud_root, local_root): @@ -181,6 +189,9 @@ def test_walk(glob_test_dirs): def test_list_buckets(rig): + if rig.path_class in [HttpPath, HttpsPath]: + return # no bucket listing for HTTP + # test we can list buckets buckets = list(rig.path_class(f"{rig.path_class.cloud_prefix}").iterdir()) assert len(buckets) > 0 @@ -331,6 +342,10 @@ def 
test_is_dir_is_file(rig, tmp_path): dir_nested_no_slash = rig.create_cloud_path("dir_1/dir_1_0") for test_case in [dir_slash, dir_no_slash, dir_nested_slash, dir_nested_no_slash]: + # skip no-slash cases, which are interpreted as files for http paths + if not str(test_case).endswith("/") and rig.path_class in [HttpPath, HttpsPath]: + continue + assert test_case.is_dir() assert not test_case.is_file() @@ -349,7 +364,7 @@ def test_is_dir_is_file(rig, tmp_path): def test_file_read_writes(rig, tmp_path): p = rig.create_cloud_path("dir_0/file0_0.txt") - p2 = rig.create_cloud_path("dir_0/not_a_file") + p2 = rig.create_cloud_path("dir_0/not_a_file.txt") p3 = rig.create_cloud_path("") text = "lalala" * 10_000 @@ -367,16 +382,20 @@ def test_file_read_writes(rig, tmp_path): before_touch = datetime.now() sleep(1) - p.touch() - if not getattr(rig, "is_custom_s3", False): - # Our S3Path.touch implementation does not update mod time for MinIO - assert datetime.fromtimestamp(p.stat().st_mtime) > before_touch + + if rig.path_class not in [HttpPath, HttpsPath]: # not supported to touch existing + p.touch() + + if not getattr(rig, "is_custom_s3", False): + # Our S3Path.touch implementation does not update mod time for MinIO + assert datetime.fromtimestamp(p.stat().st_mtime) > before_touch # no-op if not getattr(rig, "is_adls_gen2", False): p.mkdir() - assert p.etag is not None + if rig.path_class not in [HttpPath, HttpsPath]: # not supported to touch existing + assert p.etag is not None dest = rig.create_cloud_path("dir2/new_file0_0.txt") assert not dest.exists() @@ -414,6 +433,25 @@ def test_file_read_writes(rig, tmp_path): (p / "not_exists_file").download_to(dl_file) +def test_filenames(rig): + # test that we can handle filenames with special characters + p = rig.create_cloud_path("dir_0/new_file.txt") # real extension + p.write_text("hello") + assert p.read_text() == "hello" + + p2 = rig.create_cloud_path("dir_0/new_file") # no extension + p2.write_text("hello") + assert p2.read_text() == "hello" + + p3 = rig.create_cloud_path("dir_0/new_file.textfile") # long extension + p3.write_text("hello") + assert p3.read_text() == "hello" + + p4 = rig.create_cloud_path("dir_0/new_file.abc.def.txt") # multiple suffixes + p4.write_text("hello") + assert p4.read_text() == "hello" + + def test_dispatch_to_local_cache(rig): p = rig.create_cloud_path("dir_0/file0_1.txt") stat = p._dispatch_to_local_cache_path("stat") @@ -457,7 +495,7 @@ def test_cloud_path_download_to(rig, tmp_path): def test_fspath(rig): - p = rig.create_cloud_path("dir_0") + p = rig.create_cloud_path("dir_0/") assert os.fspath(p) == p.fspath diff --git a/tests/test_cloudpath_instantiation.py b/tests/test_cloudpath_instantiation.py index 4be6085c..4f7cdf5d 100644 --- a/tests/test_cloudpath_instantiation.py +++ b/tests/test_cloudpath_instantiation.py @@ -7,6 +7,7 @@ from cloudpathlib import AzureBlobPath, CloudPath, GSPath, S3Path from cloudpathlib.exceptions import InvalidPrefixError, MissingDependenciesError +from cloudpathlib.http.httppath import HttpPath, HttpsPath @pytest.mark.parametrize( @@ -45,6 +46,9 @@ def test_dispatch_error(): @pytest.mark.parametrize("path", ["b/k", "b/k", "b/k.file", "b/k", "b"]) def test_instantiation(rig, path): + if rig.path_class in [HttpPath, HttpsPath]: + path = "example-url.com/" + path + # check two cases of prefix for prefix in [rig.cloud_prefix.lower(), rig.cloud_prefix.upper()]: expected = prefix + path @@ -52,13 +56,17 @@ def test_instantiation(rig, path): assert repr(p) == 
f"{rig.path_class.__name__}('{expected}')" assert str(p) == expected - assert p._no_prefix == expected.split("://", 1)[-1] + if rig.path_class in [HttpPath, HttpsPath]: + assert p._no_prefix == path.replace("example-url.com/", "") + assert str(p._path) == path.replace("example-url.com", "") + + else: + assert p._no_prefix == expected.split("://", 1)[-1] + assert str(p._path) == expected.split(":/", 1)[-1] assert p._url.scheme == expected.split("://", 1)[0].lower() assert p._url.netloc == expected.split("://", 1)[-1].split("/")[0] - assert str(p._path) == expected.split(":/", 1)[-1] - def test_default_client_lazy(rig): cp = rig.path_class(rig.cloud_prefix + "testing/file.txt") @@ -106,7 +114,7 @@ def test_dependencies_not_loaded(rig, monkeypatch): def test_is_pathlike(rig): - p = rig.create_cloud_path("dir_0") + p = rig.create_cloud_path("dir_0/") assert isinstance(p, os.PathLike) diff --git a/tests/test_cloudpath_manipulation.py b/tests/test_cloudpath_manipulation.py index 9e314299..9e392881 100644 --- a/tests/test_cloudpath_manipulation.py +++ b/tests/test_cloudpath_manipulation.py @@ -5,6 +5,7 @@ import pytest from cloudpathlib import CloudPath +from cloudpathlib.http.httppath import HttpPath, HttpsPath def test_properties(rig): @@ -84,16 +85,27 @@ def test_joins(rig): if sys.version_info >= (3, 12): assert rig.create_cloud_path("a/b/c/d").match("A/*/C/D", case_sensitive=False) - assert rig.create_cloud_path("a/b/c/d").anchor == rig.cloud_prefix + if rig.path_class not in [HttpPath, HttpsPath]: + assert rig.create_cloud_path("a/b/c/d").anchor == rig.cloud_prefix + assert rig.create_cloud_path("a/b/c/d").parent == rig.create_cloud_path("a/b/c") - assert rig.create_cloud_path("a/b/c/d").parents == ( - rig.create_cloud_path("a/b/c"), - rig.create_cloud_path("a/b"), - rig.create_cloud_path("a"), - rig.path_class(f"{rig.cloud_prefix}{rig.drive}/{rig.test_dir}"), - rig.path_class(f"{rig.cloud_prefix}{rig.drive}"), - ) + if rig.path_class not in [HttpPath, HttpsPath]: + assert rig.create_cloud_path("a/b/c/d").parents == ( + rig.create_cloud_path("a/b/c"), + rig.create_cloud_path("a/b"), + rig.create_cloud_path("a"), + rig.path_class(f"{rig.cloud_prefix}{rig.drive}/{rig.test_dir}"), + rig.path_class(f"{rig.cloud_prefix}{rig.drive}"), + ) + else: + assert rig.create_cloud_path("a/b/c/d").parents == ( + rig.create_cloud_path("a/b/c"), + rig.create_cloud_path("a/b"), + rig.create_cloud_path("a"), + rig.path_class(f"{rig.cloud_prefix}{rig.drive}/{rig.test_dir}"), + rig.path_class(f"{rig.cloud_prefix}{rig.drive}/"), + ) assert rig.create_cloud_path("a").joinpath("b", "c") == rig.create_cloud_path("a/b/c") assert rig.create_cloud_path("a").joinpath(PurePosixPath("b"), "c") == rig.create_cloud_path( @@ -107,21 +119,32 @@ def test_joins(rig): == f"{rig.cloud_prefix}{rig.drive}/{rig.test_dir}/a/b/c" ) - assert rig.create_cloud_path("a/b/c/d").parts == ( - rig.cloud_prefix, - rig.drive, - rig.test_dir, - "a", - "b", - "c", - "d", - ) + if rig.path_class in [HttpPath, HttpsPath]: + assert rig.create_cloud_path("a/b/c/d").parts == ( + rig.cloud_prefix + rig.drive + "/", + rig.test_dir, + "a", + "b", + "c", + "d", + ) + else: + assert rig.create_cloud_path("a/b/c/d").parts == ( + rig.cloud_prefix, + rig.drive, + rig.test_dir, + "a", + "b", + "c", + "d", + ) def test_with_segments(rig): - assert rig.create_cloud_path("a/b/c/d").with_segments("x", "y", "z") == rig.client_class( - **rig.required_client_kwargs - ).CloudPath(f"{rig.cloud_prefix}x/y/z") + to_test = 
+    if rig.path_class in [HttpPath, HttpsPath]:
+        assert rig.create_cloud_path("a/b/c/d").parts == (
+            rig.cloud_prefix + rig.drive + "/",
+            rig.test_dir,
+            "a",
+            "b",
+            "c",
+            "d",
+        )
+    else:
+        assert rig.create_cloud_path("a/b/c/d").parts == (
+            rig.cloud_prefix,
+            rig.drive,
+            rig.test_dir,
+            "a",
+            "b",
+            "c",
+            "d",
+        )
 
 
 def test_with_segments(rig):
-    assert rig.create_cloud_path("a/b/c/d").with_segments("x", "y", "z") == rig.client_class(
-        **rig.required_client_kwargs
-    ).CloudPath(f"{rig.cloud_prefix}x/y/z")
+    to_test = rig.create_cloud_path("a/b/c/d").with_segments("x", "y", "z")
+    assert to_test == rig.client_class(**rig.required_client_kwargs).CloudPath(
+        f"{to_test.anchor}x/y/z"
+    )
 
 
 def test_is_junction(rig):
diff --git a/tests/test_cloudpath_upload_copy.py b/tests/test_cloudpath_upload_copy.py
index acf5e5ec..110537b8 100644
--- a/tests/test_cloudpath_upload_copy.py
+++ b/tests/test_cloudpath_upload_copy.py
@@ -4,12 +4,14 @@
 
 import pytest
 
+from cloudpathlib.http.httppath import HttpPath, HttpsPath
 from cloudpathlib.local import LocalGSPath, LocalS3Path, LocalS3Client
 from cloudpathlib.exceptions import (
     CloudPathFileExistsError,
     CloudPathNotADirectoryError,
     OverwriteNewerCloudError,
 )
+from tests.utils import _sync_filesystem
 
 
 @pytest.fixture
@@ -64,19 +66,21 @@ def test_upload_from_file(rig, upload_assets_dir):
     assert p.read_text() == "Hello from 2"
 
     # to file, file exists and is newer
-    p.touch()
+    sleep(1.1)
+    p.write_text("newer")
     with pytest.raises(OverwriteNewerCloudError):
         p.upload_from(upload_assets_dir / "upload_1.txt")
 
     # to file, file exists and is newer; overwrite
-    p.touch()
+    sleep(1.1)
+    p.write_text("even newer")
     sleep(1.1)
     p.upload_from(upload_assets_dir / "upload_1.txt", force_overwrite_to_cloud=True)
     assert p.exists()
     assert p.read_text() == "Hello from 1"
 
     # to dir, dir exists
-    p = rig.create_cloud_path("dir_0")  # created by fixtures
+    p = rig.create_cloud_path("dir_0/")  # created by fixtures
     assert p.exists()
     p.upload_from(upload_assets_dir / "upload_1.txt")
     assert (p / "upload_1.txt").exists()
@@ -92,7 +96,7 @@ def test_upload_from_dir(rig, upload_assets_dir):
     assert assert_mirrored(p, upload_assets_dir)
 
     # to dir, dir exists
-    p2 = rig.create_cloud_path("dir_0")  # created by fixtures
+    p2 = rig.create_cloud_path("dir_0/")  # created by fixtures
     assert p2.exists()
 
     p2.upload_from(upload_assets_dir)
@@ -100,12 +104,15 @@
 
     # a newer file exists on cloud
     sleep(1)
-    (p / "upload_1.txt").touch()
+    (p / "upload_1.txt").write_text("newer")
     with pytest.raises(OverwriteNewerCloudError):
         p.upload_from(upload_assets_dir)
 
+    _sync_filesystem()
+
     # force overwrite
-    (p / "upload_1.txt").touch()
+    sleep(1)
+    (p / "upload_1.txt").write_text("even newer")
     (p / "upload_2.txt").unlink()
     p.upload_from(upload_assets_dir, force_overwrite_to_cloud=True)
     assert assert_mirrored(p, upload_assets_dir)
@@ -135,9 +142,11 @@ def test_copy(rig, upload_assets_dir, tmpdir):
     # cloud to cloud -> make sure no local cache
     p_new = p.copy(p.parent / "new_upload_1.txt")
     assert p_new.exists()
-    assert not p_new._local.exists()  # cache should never have been downloaded
-    assert not p._local.exists()  # cache should never have been downloaded
-    assert p_new.read_text() == "Hello from 1"
+
+    if rig.path_class not in [HttpPath, HttpsPath]:
+        assert not p_new._local.exists()  # cache should never have been downloaded
+        assert not p._local.exists()  # cache should never have been downloaded
+    assert p_new.read_text() == "Hello from 1"
 
     # cloud to cloud path as string
     cloud_dest = str(p.parent / "new_upload_0.txt")
@@ -146,14 +155,15 @@
     assert p_new.read_text() == "Hello from 1"
 
     # cloud to cloud directory
-    cloud_dest = rig.create_cloud_path("dir_1")  # created by fixtures
+    cloud_dest = rig.create_cloud_path("dir_1/")  # created by fixtures
     p_new = p.copy(cloud_dest)
     assert str(p_new) == str(p_new.parent / p.name)  # file created
     assert p_new.exists()
     assert p_new.read_text() == "Hello from 1"
 
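+    # write_text is used below instead of touch because touching an existing
+    # file is not supported for http paths and writing guarantees a newer mtime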
     # cloud to cloud overwrite
-    p_new.touch()
+    sleep(1.1)
+    p_new.write_text("p_new")
     with pytest.raises(OverwriteNewerCloudError):
         p_new = p.copy(p_new)
 
@@ -193,7 +203,7 @@ def test_copy(rig, upload_assets_dir, tmpdir):
         (other_dir / p2.name).unlink()
 
     # cloud dir raises
-    cloud_dir = rig.create_cloud_path("dir_1")  # created by fixtures
+    cloud_dir = rig.create_cloud_path("dir_1/")  # created by fixtures
     with pytest.raises(ValueError) as e:
         p_new = cloud_dir.copy(Path(tmpdir.mkdir("test_copy_dir_fails")))
         assert "use the method copytree" in str(e)
@@ -207,12 +217,12 @@ def test_copytree(rig, tmpdir):
         p.copytree(local_out)
 
     with pytest.raises(CloudPathFileExistsError):
-        p = rig.create_cloud_path("dir_0")
+        p = rig.create_cloud_path("dir_0/")
         p_out = rig.create_cloud_path("dir_0/file0_0.txt")
         p.copytree(p_out)
 
     # cloud dir to local dir that exists
-    p = rig.create_cloud_path("dir_1")
+    p = rig.create_cloud_path("dir_1/")
     local_out = Path(tmpdir.mkdir("copytree_from_cloud"))
     p.copytree(local_out)
     assert assert_mirrored(p, local_out)
@@ -228,12 +238,12 @@
     assert assert_mirrored(p, local_out)
 
     # cloud dir to cloud dir that does not exist
-    p2 = rig.create_cloud_path("new_dir")
+    p2 = rig.create_cloud_path("new_dir/")
     p.copytree(p2)
     assert assert_mirrored(p2, p)
 
     # cloud dir to cloud dir that exists
-    p2 = rig.create_cloud_path("new_dir2")
+    p2 = rig.create_cloud_path("new_dir2/")
     (p2 / "existing_file.txt").write_text("asdf")  # ensures p2 exists
     p.copytree(p2)
     assert assert_mirrored(p2, p, check_no_extra=False)
@@ -251,7 +261,7 @@
     (p / "dir2" / "file2.txt").write_text("ignore")
 
     # cloud dir to local dir but ignoring files (shutil.ignore_patterns)
-    p3 = rig.create_cloud_path("new_dir3")
+    p3 = rig.create_cloud_path("new_dir3/")
     p.copytree(p3, ignore=ignore_patterns("*.py", "dir*"))
     assert assert_mirrored(p, p3, check_no_extra=False)
     assert not (p3 / "ignored.py").exists()
     assert not (p3 / "dir2").exists()
 
     # cloud dir to local dir but ignoring files (custom function)
-    p4 = rig.create_cloud_path("new_dir4")
+    p4 = rig.create_cloud_path("new_dir4/")
 
     def _custom_ignore(path, names):
         ignore = []
diff --git a/tests/test_http.py b/tests/test_http.py
new file mode 100644
index 00000000..4dbf30a2
--- /dev/null
+++ b/tests/test_http.py
@@ -0,0 +1,128 @@
+import json
+import urllib.parse
+
+from tests.conftest import CloudProviderTestRig
+
+
+def test_https(https_rig: CloudProviderTestRig):
+    """Basic tests for https"""
+    existing_file = https_rig.create_cloud_path("dir_0/file0_0.txt")
+
+    # existence and listing
+    assert existing_file.exists()
+    assert existing_file.parent.exists()
+    assert existing_file.name in [f.name for f in existing_file.parent.iterdir()]
+
+    # root level checks
+    root = list(existing_file.parents)[-1]
+    assert root.exists()
+    assert len(list(root.iterdir())) > 0
+
+    # reading and writing
+    existing_file.write_text("Hello from 0")
+    assert existing_file.read_text() == "Hello from 0"
+
+    # creating new files
+    not_existing_file = https_rig.create_cloud_path("dir_0/new_file.txt")
+
+    assert not not_existing_file.exists()
+
+    not_existing_file.upload_from(existing_file)
+
+    assert not_existing_file.read_text() == "Hello from 0"
+
+    # deleting
+    not_existing_file.unlink()
+    assert not not_existing_file.exists()
+
+    # metadata
+    assert existing_file.stat().st_mtime != 0
+
+
+def test_http_verbs(http_like_rig: CloudProviderTestRig):
+    """Test that the http verbs work"""
+    p = http_like_rig.create_cloud_path("dir_0/file0_0.txt")
+
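+    # note: get/post/head are used below as `resp, data = ...`, i.e., they
+    # return the response object and the response body as bytes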
+    # put
+    p.put(data="Hello from 0".encode("utf-8"), headers={"Content-Type": "text/plain"})
+
+    # get
+    resp, data = p.get()
+    assert resp.status == 200
+    assert data.decode("utf-8") == "Hello from 0"
+
+    # post
+    post_payload = {"key": "value"}
+    resp, data = p.post(
+        data=json.dumps(post_payload).encode(), headers={"Content-Type": "application/json"}
+    )
+    assert resp.status == 200
+    assert json.loads(data.decode("utf-8")) == post_payload
+
+    # head
+    resp, data = p.head()
+    assert resp.status == 200
+    assert data == b""
+
+    # delete
+    p.delete()
+    assert not p.exists()
+
+
+def test_http_parsed_url(http_like_rig: CloudProviderTestRig):
+    """Test that the parsed_url property works"""
+    p = http_like_rig.create_cloud_path("dir_0/file0_0.txt")
+    assert p.parsed_url.scheme == http_like_rig.cloud_prefix.split("://")[0]
+    assert p.parsed_url.netloc == http_like_rig.drive
+    assert p.parsed_url.path == str(p).split(http_like_rig.drive)[1]
+
+
+def test_http_url_decorations(http_like_rig: CloudProviderTestRig):
+    def _test_preserved_properties(base_url, returned_url):
+        parsed_base = urllib.parse.urlparse(str(base_url))
+        parsed_returned = urllib.parse.urlparse(str(returned_url))
+
+        assert parsed_returned.scheme == parsed_base.scheme
+        assert parsed_returned.netloc == parsed_base.netloc
+        assert parsed_returned.username == parsed_base.username
+        assert parsed_returned.password == parsed_base.password
+        assert parsed_returned.hostname == parsed_base.hostname
+        assert parsed_returned.port == parsed_base.port
+
+    p = http_like_rig.create_cloud_path("dir_0/file0_0.txt")
+    p.write_text("Hello!")
+
+    # add some properties to the url
+    new_url = p.parsed_url._replace(
+        params="param=value", query="query=value&query2=value2", fragment="fragment-value"
+    )
+    p = http_like_rig.path_class(urllib.parse.urlunparse(new_url))
+
+    # operations that should preserve properties of the original url and need to hit the server:
+    # glob, iterdir, walk
+    _test_preserved_properties(p, next(p.parent.glob("*.txt")))
+    _test_preserved_properties(p, next(p.parent.iterdir()))
+    _test_preserved_properties(p, next(p.parent.walk())[0])
+
+    # rename and replace
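+    # (both return the destination path, which should keep the query/fragment
+    # decorations of the path it was derived from)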
+    new_location = p.with_name("other_file.txt")
+    _test_preserved_properties(p, p.rename(new_location))
+    _test_preserved_properties(p, new_location.replace(p))
+
+    # operations that should preserve properties of the original url and don't hit the server,
+    # so that we can add some other properties (e.g., username, password)
+    new_url = p.parsed_url._replace(netloc="user:pass@example.com:8000")
+    p = http_like_rig.path_class(urllib.parse.urlunparse(new_url))
+
+    # parent
+    _test_preserved_properties(p, p.parent)
+
+    # joining with / and joinpath
+    _test_preserved_properties(p, p.parent / "other_file.txt")
+    _test_preserved_properties(p, p.parent.joinpath("other_file.txt"))
+
+    # with_name, with_suffix, with_stem
+    _test_preserved_properties(p, p.with_name("other_file.txt"))
+    _test_preserved_properties(p, p.with_suffix(".txt"))
+    _test_preserved_properties(p, p.with_stem("other_file"))
diff --git a/tests/test_s3_specific.py b/tests/test_s3_specific.py
index d9edc94e..58b2e21a 100644
--- a/tests/test_s3_specific.py
+++ b/tests/test_s3_specific.py
@@ -176,7 +176,7 @@ def test_directories(s3_like_rig):
     assert super_path.exists()
     assert not super_path.is_dir()
 
-    super_path = s3_like_rig.create_cloud_path("dir_0")
+    super_path = s3_like_rig.create_cloud_path("dir_0/")
     assert super_path.exists()
     assert super_path.is_dir()
 
diff --git a/tests/utilities/insecure-test.crt b/tests/utilities/insecure-test.crt
new file mode 100644
index 00000000..9bb5d9e4
--- /dev/null
+++ b/tests/utilities/insecure-test.crt
@@ -0,0 +1,19 @@
+-----BEGIN CERTIFICATE-----
+MIIDDDCCAfSgAwIBAgIUZn3DPy1MuLcPNQGuGU8JfzvCpEIwDQYJKoZIhvcNAQEL
+BQAwFDESMBAGA1UEAwwJMTI3LjAuMC4xMCAXDTI1MDQyMTAzMTQ1OVoYDzIyOTkw
+MjAzMDMxNDU5WjAUMRIwEAYDVQQDDAkxMjcuMC4wLjEwggEiMA0GCSqGSIb3DQEB
+AQUAA4IBDwAwggEKAoIBAQDlxF2z2I2XaDnLgV3exFPtjs9upFuUPTPthubaxRMz
+PWGfNRg8fLqXDOe8E+KHgdXYeqTd0xkWZzfx+xwz1flvTBubgtan0yvri0bZIemk
+gv7f8ABRAjNIQzpehIjXI9RZyU2JoPIN4+Q8WHZ8uc8uZtHOHsyMYoj2j0akUoic
+ukoYlo6W8nN1ykBvhwnO9sRooPrYV9ViBhG9eaH/L0NzVv6cU3vHj3pKyO3cMQqW
+4AfaSz+aFXx7ulRzxR5bphCy5281FqBgG76Y1lqOSUMTxfJQSnCCUe58DXy4CpfQ
+rGrNiLV/yWz7xYKSeutcJxWsCMFLrI+S79IW6ntILS6pAgMBAAGjVDBSMA8GA1Ud
+EQQIMAaHBH8AAAEwCwYDVR0PBAQDAgeAMBMGA1UdJQQMMAoGCCsGAQUFBwMBMB0G
+A1UdDgQWBBSyGv/zfxIBK9Tm4/5uOuhh6pB3CDANBgkqhkiG9w0BAQsFAAOCAQEA
+UWg3vZNCCUjPAqKAXEYZeBI9VNXim4egkmxn9FHgiraxapKc4RHpCmVdjpF5miFe
+4hbcvHOxb9JclLVKP2oC7vkdYDtgkT8o264gy0eASHE8GP1YawjJlLeFFuJuxatu
+NxZXKnMFQRPoZbD4KSImLy8xEy1FMslnBxcgxgqIKoyqwtt+HGO6ZnvdxDbRLZSQ
+FNDNlqQYgnxf4zzNro9mtWHH/A/UA/vuRWRlppn9vy8k7X5VXlhEIAMmI4nPihhS
+YmgpRntt8A0BLQcNNWcNw0b0IWLhpSWiREunkZDEMWDjoBwRhQpYxEC0zrKlQmwb
+jhnl/rtIL+2Shly8zkxWew==
+-----END CERTIFICATE-----
diff --git a/tests/utilities/insecure-test.key b/tests/utilities/insecure-test.key
new file mode 100644
index 00000000..b4adc213
--- /dev/null
+++ b/tests/utilities/insecure-test.key
@@ -0,0 +1,28 @@
+-----BEGIN PRIVATE KEY-----
+MIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwggSkAgEAAoIBAQDlxF2z2I2XaDnL
+gV3exFPtjs9upFuUPTPthubaxRMzPWGfNRg8fLqXDOe8E+KHgdXYeqTd0xkWZzfx
++xwz1flvTBubgtan0yvri0bZIemkgv7f8ABRAjNIQzpehIjXI9RZyU2JoPIN4+Q8
+WHZ8uc8uZtHOHsyMYoj2j0akUoicukoYlo6W8nN1ykBvhwnO9sRooPrYV9ViBhG9
+eaH/L0NzVv6cU3vHj3pKyO3cMQqW4AfaSz+aFXx7ulRzxR5bphCy5281FqBgG76Y
+1lqOSUMTxfJQSnCCUe58DXy4CpfQrGrNiLV/yWz7xYKSeutcJxWsCMFLrI+S79IW
+6ntILS6pAgMBAAECggEAAY42QdqoPt+lrkC0m4jUB10kS8zYWr2dRAAeODfWtOvQ
+xyBvE7iOQF0sUbjDEylHHH8G3OBSvFcb2gkNH4tQwL1Kan19UivozSB6pG1g1NcK
+QpfSNlPJb6i4uRcfYIHj6CBOLRg8mJwtcNYle1dzsQnYdkaW78Eaa6Ozk9jqdibj
+w0fcsfp1Od5UHVqSsuHpN7N7MP78lD7nZ4h1oAUAHKJw3o5Np24cdgzfwsjmaV/M
+RcTIVoRLoCiPj7ZrGMgCq3PsI14E3C02oYGVHqBsVzCkgzdBwqckuX8eTWs4Ae9/
+adV2cMIBe0EC3WA6cHVh/NS/fgSlRDw6/chz50WcPQKBgQDzo9qw+G9a2Vtkylyd
+cnbY3oQVH+gygdULKAf1IxlRPMvuSEm1DqA3YKkQXO4ypf8llKZznyk5xIDfG28k
+SIRUBQGoOeLMVley/EydXd6GslsHoK5kLmLerbqZHRdo7hdYsIIbHqU14OrLwLwK
+3CJlSzpR1ProYufmDFRGt5SxpQKBgQDxbFfx+5aMNvKh7/NRxzor+2owcaSIE0FQ
+4OV9xTZw+fU3vQl0BUzB+t2cZezOm8vJh2Xwkjp3Uz/3h2kZPe42HjZ733vSFDSS
+rE+aKSG8ptu08bsVqOmQgkfjcIdxugbQoFLY/XWWHCglD3Lq1fUkMOBnne0yjQiW
+5iTL6e8xtQKBgQDfsA528ID8PhclAI3rmE35asKFypea14zMA2La8/Con1L0YLYb
+X2RFs59FAK1JHxKUZFg2S2jEOt++9ychftrPcRFGbG8IADXghLequ6Y0sMfWxvWV
+0OjBXWu2a/k0Q3R33wZ087vnLaskir2akuWZbmoK+6mpdjVHBwbRLnd8aQKBgQC6
+/AYVhp2wlbJQ2C7ljN+yRvSU9r/PINK62KUGR2OGFyLk+8XBlYVAzJMt2geScjph
+KTw8GpWr68+kYL127m98fOQIByy4piud2lWA+hCGM9oBCCS1fvD/mtghAPv2inVS
+yonARHb5P2+cXJ3N4s8OK8jyl++p8m/PqAqh4NsA7QKBgCmFHpm+loiqG0is9v4l
+/iBJUVjBrQlgjlyIYEJqLjNQ2w/vmZT067YSVCON88JWJEjKpE2zAc5C0miTJa7D
+cRn2yIWPFm8emlLHjx+4CVXlfLR6lTiekbZWK2bs9KNZrCQXL/K/3lNEE/3MvqUD
+dIELjg1KulUVY+7r07Pd54Ze
+-----END PRIVATE KEY-----
diff --git a/tests/utilities/insecure-test.pem b/tests/utilities/insecure-test.pem
new file mode 100644
index 00000000..9bb5d9e4
--- /dev/null
+++ b/tests/utilities/insecure-test.pem
@@ -0,0 +1,19 @@
+-----BEGIN CERTIFICATE-----
+MIIDDDCCAfSgAwIBAgIUZn3DPy1MuLcPNQGuGU8JfzvCpEIwDQYJKoZIhvcNAQEL
+BQAwFDESMBAGA1UEAwwJMTI3LjAuMC4xMCAXDTI1MDQyMTAzMTQ1OVoYDzIyOTkw
+MjAzMDMxNDU5WjAUMRIwEAYDVQQDDAkxMjcuMC4wLjEwggEiMA0GCSqGSIb3DQEB
+AQUAA4IBDwAwggEKAoIBAQDlxF2z2I2XaDnLgV3exFPtjs9upFuUPTPthubaxRMz
+PWGfNRg8fLqXDOe8E+KHgdXYeqTd0xkWZzfx+xwz1flvTBubgtan0yvri0bZIemk
+gv7f8ABRAjNIQzpehIjXI9RZyU2JoPIN4+Q8WHZ8uc8uZtHOHsyMYoj2j0akUoic
+ukoYlo6W8nN1ykBvhwnO9sRooPrYV9ViBhG9eaH/L0NzVv6cU3vHj3pKyO3cMQqW
+4AfaSz+aFXx7ulRzxR5bphCy5281FqBgG76Y1lqOSUMTxfJQSnCCUe58DXy4CpfQ
+rGrNiLV/yWz7xYKSeutcJxWsCMFLrI+S79IW6ntILS6pAgMBAAGjVDBSMA8GA1Ud
+EQQIMAaHBH8AAAEwCwYDVR0PBAQDAgeAMBMGA1UdJQQMMAoGCCsGAQUFBwMBMB0G
+A1UdDgQWBBSyGv/zfxIBK9Tm4/5uOuhh6pB3CDANBgkqhkiG9w0BAQsFAAOCAQEA
+UWg3vZNCCUjPAqKAXEYZeBI9VNXim4egkmxn9FHgiraxapKc4RHpCmVdjpF5miFe
+4hbcvHOxb9JclLVKP2oC7vkdYDtgkT8o264gy0eASHE8GP1YawjJlLeFFuJuxatu
+NxZXKnMFQRPoZbD4KSImLy8xEy1FMslnBxcgxgqIKoyqwtt+HGO6ZnvdxDbRLZSQ
+FNDNlqQYgnxf4zzNro9mtWHH/A/UA/vuRWRlppn9vy8k7X5VXlhEIAMmI4nPihhS
+YmgpRntt8A0BLQcNNWcNw0b0IWLhpSWiREunkZDEMWDjoBwRhQpYxEC0zrKlQmwb
+jhnl/rtIL+2Shly8zkxWew==
+-----END CERTIFICATE-----
diff --git a/tests/utils.py b/tests/utils.py
new file mode 100644
index 00000000..34fe8e1f
--- /dev/null
+++ b/tests/utils.py
@@ -0,0 +1,15 @@
+import platform
+import os
+import time
+
+
+def _sync_filesystem():
+    """Try to force sync of the filesystem to stabilize tests.
+
+    On Windows, give the filesystem a moment to catch up since sync is not available.
+    """
+    if platform.system() != "Windows":
+        os.sync()
+    else:
+        # On Windows, give the filesystem a moment to catch up
+        time.sleep(0.05)
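+
+
+# illustrative usage: call between a write and a time-sensitive check, as in
+# test_upload_from_dir above, e.g.
+#
+#     (p / "upload_1.txt").write_text("newer")
+#     _sync_filesystem()
+#     p.upload_from(upload_assets_dir, force_overwrite_to_cloud=True)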