feat(util): Add Dataform workspace exposed Methods#466
Conversation
aa9326c to
6822c83
Compare
|
is this WIP or ready ? |
Hey @IsmailMehdi, it should be ready now! |
6822c83 to
fbb21d7
Compare
IsmailMehdi
left a comment
There was a problem hiding this comment.
Now that [WIP] is dropped from the title, this is putatively merge-ready. The rename DataformHelper → DataformWorkspaceManager is a real improvement (matches the module name + intent). Three concrete points from the prior round are still open and worth landing in this revision rather than punting to follow-ups — inline. The longer list from the previous review (asymmetric teardown_workspace naming, no AlreadyExists handling for re-entry, download_and_zip directory-entry filtering, env_files overwrite contract, repo-quota awareness, test gaps, sparser docstrings on new public methods) is still applicable but secondary.
|
|
||
| from google.api_core import exceptions as api_exceptions | ||
| from google.cloud import dataform_v1beta1 | ||
| from google.cloud import storage |
There was a problem hiding this comment.
Unused import — storage isn't referenced anywhere in the module. Looks like a leftover from a planned GCS upload path. Remove it (CodeQL or pyflakes will eventually flag it anyway):
from google.api_core import exceptions as api_exceptions
from google.cloud import dataform_v1beta1| env_files: dict[str, str | bytes] | None = None, | ||
| ) -> str: | ||
| """Dynamically creates a Dataform repository and workspace.""" | ||
| job_id = session_dir.rstrip("/").split("/")[-1] |
There was a problem hiding this comment.
Deriving the repository ID from a path string is a leaky abstraction. session_dir.rstrip("/").split("/")[-1] couples this helper to filesystem-path semantics for what's conceptually just an ID:
- A caller passing a
pathlib.Path, a Windows path with\\, or a job UUID without any path component gets surprising results. - If
job_idcontains characters Dataform doesn't allow in repo names (A-Z,., length > 1024),_create_repositoryfails downstream with a cryptic API error rather than at the call site where the bad value originated.
Cleaner: take job_id directly. The caller (which already named the session dir after the job_id) knows it:
def setup_workspace(
self,
job_id: str,
dataset_domain: str = "default",
env_files: dict[str, str | bytes] | None = None,
) -> str:
repository_id = f"evalbench-{job_id}"
...If you genuinely need the session_dir for something else later, take both — but don't conflate "path string" with "ID extraction".
|
|
||
| def teardown_workspace(self, workspace_uri: str) -> None: | ||
| """Deletes the parent Dataform repository and all child workspaces.""" | ||
| repository_id = workspace_uri.split("/repositories/")[1].split("/")[0] |
There was a problem hiding this comment.
Brittle URI parser. workspace_uri.split("/repositories/")[1].split("/")[0] has two failure modes that produce confusing errors:
- If
workspace_uridoesn't contain/repositories/,split(...)[1]raisesIndexError: list index out of range— no hint that the input shape is wrong. - If a repository is named
repositories(unusual but allowed), the split returns the wrong segment.
Regex tightens both:
import re
_WORKSPACE_RE = re.compile(
r"^projects/[^/]+/locations/[^/]+/repositories/([^/]+)/workspaces/[^/]+$"
)
def teardown_workspace(self, workspace_uri: str) -> None:
m = _WORKSPACE_RE.match(workspace_uri)
if not m:
raise ValueError(f"Invalid workspace URI: {workspace_uri!r}")
repository_id = m.group(1)
self._delete_repository(repository_id)Bonus: this also catches the asymmetric-naming bug — the docstring says "deletes the parent Dataform repository and all child workspaces" (truthful) but the method is named teardown_workspace (misleading: it deletes the whole repo). Either rename to teardown_repository, or actually scope teardown to the named workspace and delete the repo only when it's empty.
| if isinstance(content, str) | ||
| else content | ||
| ) | ||
| self.client.write_file( |
There was a problem hiding this comment.
will this also create the directory if it's missing? do we need to check if a file path is a directory or a file?
There was a problem hiding this comment.
currently my plan is:
- this variable is a directory path (not name path)
- this directory will contain all files needed to upload to dataform cloud
- user can specify this information under config, so the
setup_dataform.shfile can pick up and callDataformWorkspaceManager
| @@ -1,35 +1,43 @@ | |||
| """Unit tests for DataformHelper utility.""" | |||
There was a problem hiding this comment.
can we rename this file to dataform_workspace_test.py to align with the new file name?
e4fd481 to
66b4639
Compare
| self.client.write_file( | ||
| request={ | ||
| "workspace": workspace_path, | ||
| "path": str(relative_path), |
There was a problem hiding this comment.
Windows path separator leaks to Dataform. On Windows, str(WindowsPath("definitions/my_view.sqlx")) returns "definitions\\my_view.sqlx". Dataform almost certainly rejects backslash-separated paths (or treats the backslash as a literal character in the filename).
The test acknowledges the platform difference at line 299 (normalized_paths = [p.replace("\\\\", "/") for p in paths_called]), so the assertion passes — but the actual request sent to Dataform still carries the backslash form. The test masks the bug rather than catching it.
One-character fix:
"path": relative_path.as_posix(),as_posix() is WindowsPath-aware and always returns forward-slash form. After the fix, the test can drop the normalize workaround and assert against "definitions/my_view.sqlx" directly, which proves the right value is being sent.
There was a problem hiding this comment.
has this been resolved? i don't see as_posix()
There was a problem hiding this comment.
Sorry I added it but some how it got reverted, added it back again!
73957b7 to
658452d
Compare
658452d to
9b6987d
Compare
|
/gcbrun |
|
/gcbrun |
dataform_workspace.pysetup_workspace,download_and_zip, andteardown_workspace