feat(util): Add Dataform workspace exposed Methods #466

Merged
IsmailMehdi merged 2 commits into GoogleCloudPlatform:main from minzznguyen:dataform_workspace_manager on Jul 2, 2026

@minzznguyen

Contributor
  • rename file to dataform_workspace.py
  • change current functions to private, only expose 3 newly added functions - setup_workspace, download_and_zip, and teardown_workspace

@minzznguyen minzznguyen requested a review from IsmailMehdi as a code owner June 29, 2026 17:50
@minzznguyen minzznguyen force-pushed the dataform_workspace_manager branch 2 times, most recently from aa9326c to 6822c83 Compare June 29, 2026 18:18
@IsmailMehdi

Collaborator

is this WIP or ready ?

@minzznguyen minzznguyen changed the title [WIP] feat(util): Add Dataform workspace exposed Methods feat(util): Add Dataform workspace exposed Methods Jun 29, 2026
@minzznguyen

Contributor Author

> is this WIP or ready ?

Hey @IsmailMehdi, it should be ready now!

@minzznguyen minzznguyen force-pushed the dataform_workspace_manager branch from 6822c83 to fbb21d7 Compare June 29, 2026 19:21

@IsmailMehdi IsmailMehdi left a comment

Collaborator

Now that [WIP] is dropped from the title, this is putatively merge-ready. The rename from DataformHelper to DataformWorkspaceManager is a real improvement (it matches the module name and intent). Three concrete points from the prior round are still open and worth landing in this revision rather than punting to follow-ups; see the inline comments. The longer list from the previous review (asymmetric teardown_workspace naming, no AlreadyExists handling for re-entry, download_and_zip directory-entry filtering, env_files overwrite contract, repo-quota awareness, test gaps, sparser docstrings on new public methods) is still applicable but secondary.

Comment thread evalbench/util/dataform_workspace.py Outdated

from google.api_core import exceptions as api_exceptions
from google.cloud import dataform_v1beta1
from google.cloud import storage

Collaborator

Unused import: storage isn't referenced anywhere in the module. It looks like a leftover from a planned GCS upload path. Remove it (CodeQL or pyflakes will eventually flag it anyway):

from google.api_core import exceptions as api_exceptions
from google.cloud import dataform_v1beta1

Contributor Author

addressed

Comment thread evalbench/util/dataform_workspace.py Outdated
    env_files: dict[str, str | bytes] | None = None,
) -> str:
    """Dynamically creates a Dataform repository and workspace."""
    job_id = session_dir.rstrip("/").split("/")[-1]

Collaborator

Deriving the repository ID from a path string is a leaky abstraction. session_dir.rstrip("/").split("/")[-1] couples this helper to filesystem-path semantics for what's conceptually just an ID:

  • A caller passing a pathlib.Path, a Windows path with \\, or a job UUID without any path component gets surprising results.
  • If job_id contains characters Dataform doesn't allow in repo names (A-Z, ., length > 1024), _create_repository fails downstream with a cryptic API error rather than at the call site where the bad value originated.

Cleaner: take job_id directly. The caller (which already named the session dir after the job_id) knows it:

def setup_workspace(
    self,
    job_id: str,
    dataset_domain: str = "default",
    env_files: dict[str, str | bytes] | None = None,
) -> str:
    repository_id = f"evalbench-{job_id}"
    ...

If you genuinely need the session_dir for something else later, take both — but don't conflate "path string" with "ID extraction".
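
A minimal sketch of failing fast at the call site, under stated assumptions: the helper name build_repository_id and the ID pattern below are hypothetical and are not Dataform's documented constraints, so the pattern should be checked against the API reference before use.

```python
import re

# Assumed rule, for illustration only: lowercase letters, digits, hyphens and
# underscores, 1 to 63 characters. Swap in Dataform's documented limits.
_JOB_ID_RE = re.compile(r"[a-z0-9_-]{1,63}")


def build_repository_id(job_id: str) -> str:
    """Return the repository ID for a job, rejecting a bad job_id up front."""
    if not isinstance(job_id, str) or not _JOB_ID_RE.fullmatch(job_id):
        raise ValueError(f"Invalid job_id: {job_id!r}")
    return f"evalbench-{job_id}"
```

With this shape, a path-like value such as "sessions/abc" raises ValueError where it is passed in, instead of surfacing later as an API error.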

Contributor Author

addressed

Comment thread evalbench/util/dataform_workspace.py Outdated

def teardown_workspace(self, workspace_uri: str) -> None:
    """Deletes the parent Dataform repository and all child workspaces."""
    repository_id = workspace_uri.split("/repositories/")[1].split("/")[0]

Collaborator

Brittle URI parser. workspace_uri.split("/repositories/")[1].split("/")[0] has two failure modes that produce confusing errors:

  • If workspace_uri doesn't contain /repositories/, split(...)[1] raises IndexError: list index out of range — no hint that the input shape is wrong.
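  • The split also accepts malformed input silently: an empty repository segment (projects/p/locations/l/repositories//workspaces/w) yields an empty string, and a repository-level path with no /workspaces/ part is treated as valid.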

Regex tightens both:

import re

_WORKSPACE_RE = re.compile(
    r"^projects/[^/]+/locations/[^/]+/repositories/([^/]+)/workspaces/[^/]+$"
)

def teardown_workspace(self, workspace_uri: str) -> None:
    m = _WORKSPACE_RE.match(workspace_uri)
    if not m:
        raise ValueError(f"Invalid workspace URI: {workspace_uri!r}")
    repository_id = m.group(1)
    self._delete_repository(repository_id)

Related: the docstring says "deletes the parent Dataform repository and all child workspaces" (truthful), but the method is named teardown_workspace, which is misleading because it deletes the whole repo. Either rename it to teardown_repository, or actually scope teardown to the named workspace and delete the repo only when it's empty.
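
To make the difference between the two parsers concrete, here is a small standalone comparison. The function names split_parse and regex_parse and the example paths are made up for illustration; only the split expression and the regex come from this thread.

```python
import re

_WORKSPACE_RE = re.compile(
    r"^projects/[^/]+/locations/[^/]+/repositories/([^/]+)/workspaces/[^/]+$"
)


def split_parse(workspace_uri: str) -> str:
    """Positional split as in the current code: no validation of the shape."""
    return workspace_uri.split("/repositories/")[1].split("/")[0]


def regex_parse(workspace_uri: str) -> str:
    """Regex parse: reject anything that is not a full workspace path."""
    m = _WORKSPACE_RE.match(workspace_uri)
    if not m:
        raise ValueError(f"Invalid workspace URI: {workspace_uri!r}")
    return m.group(1)


good = "projects/p/locations/us/repositories/evalbench-1/workspaces/ws"
empty_repo = "projects/p/locations/us/repositories//workspaces/ws"

print(regex_parse(good))              # evalbench-1
print(repr(split_parse(empty_repo)))  # '' (accepted without complaint)
```

Calling regex_parse(empty_repo) raises ValueError naming the offending input, while split_parse("not-a-uri") raises a bare IndexError.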

Contributor Author

addressed

Comment thread evalbench/util/dataform_workspace.py Outdated
if isinstance(content, str)
else content
)
self.client.write_file(

Collaborator

will this also create the directory if it's missing? do we need to check if a file path is a directory or a file?

Contributor Author

currently my plan is:

  • this variable is a directory path (not name path)
  • this directory will contain all files needed to upload to dataform cloud
  • user can specify this information under config, so the setup_dataform.sh file can pick up and call DataformWorkspaceManager
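
For the file-versus-directory part of the question, one possible shape is to enumerate only regular files under the configured directory before uploading. This is a sketch, not the code in this PR: iter_upload_files is a hypothetical helper, and it does not settle whether write_file creates missing parent directories on the Dataform side, which still needs checking against the API.

```python
from collections.abc import Iterator
from pathlib import Path


def iter_upload_files(root: Path) -> Iterator[tuple[str, bytes]]:
    """Yield (workspace-relative POSIX path, content) for each file under root."""
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue  # skip directories; only files carry content to write
        yield path.relative_to(root).as_posix(), path.read_bytes()
```

Each yielded pair could then be passed to the existing write loop, so directory entries never reach the API as file paths.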

@@ -1,35 +1,43 @@
"""Unit tests for DataformHelper utility."""

Collaborator

can we rename this file to dataform_workspace_test.py to align with the new file name?

Contributor Author

addressed

@minzznguyen minzznguyen force-pushed the dataform_workspace_manager branch from e4fd481 to 66b4639 Compare June 30, 2026 01:52
Comment thread evalbench/util/dataform_workspace.py Outdated
self.client.write_file(
    request={
        "workspace": workspace_path,
        "path": str(relative_path),

Collaborator

Windows path separator leaks to Dataform. On Windows, str(WindowsPath("definitions/my_view.sqlx")) returns "definitions\\my_view.sqlx". Dataform almost certainly rejects backslash-separated paths (or treats the backslash as a literal character in the filename).

The test acknowledges the platform difference at line 299 (normalized_paths = [p.replace("\\", "/") for p in paths_called]), so the assertion passes, but the actual request sent to Dataform still carries the backslash form. The test masks the bug rather than catching it.

One-line fix:

"path": relative_path.as_posix(),

as_posix() is WindowsPath-aware and always returns forward-slash form. After the fix, the test can drop the normalize workaround and assert against "definitions/my_view.sqlx" directly, which proves the right value is being sent.
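
The behaviour is easy to reproduce on any OS with PureWindowsPath, which applies Windows path rules without needing a Windows machine. The variable name mirrors the snippet above; the rest is a standalone illustration.

```python
from pathlib import PureWindowsPath

# PureWindowsPath reproduces Windows path semantics on any platform.
relative_path = PureWindowsPath("definitions/my_view.sqlx")

print(str(relative_path))        # definitions\my_view.sqlx
print(relative_path.as_posix())  # definitions/my_view.sqlx
```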

Contributor Author

addressed

Collaborator

has this been resolved? i don't see as_posix()

Contributor Author

Sorry, I added it but somehow it got reverted. Added it back again!

@minzznguyen minzznguyen force-pushed the dataform_workspace_manager branch 4 times, most recently from 73957b7 to 658452d Compare July 1, 2026 18:06
graceqi-g previously approved these changes Jul 1, 2026
@graceqi-g

Collaborator

/gcbrun

@IsmailMehdi

Collaborator

/gcbrun

@IsmailMehdi IsmailMehdi merged commit 9a75a7a into GoogleCloudPlatform:main Jul 2, 2026
9 checks passed