Skip to content

Conversation

erindru
Copy link
Collaborator

@erindru erindru commented Aug 22, 2025

I notice the Fabric tests failing on main with:

FAILED tests/core/engine_adapter/integration/test_integration.py::test_janitor[fabric-catalog] -
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://api.fabric.microsoft.com/v1/workspaces/**/warehouses

The cause was essentially the create warehouse API calls failing if the warehouse already existed.

SQLMesh code generally expects CREATE IF NOT EXISTS / DROP IF EXISTS semantics, so I implemented these.

In addition, some other tests are starting to become more flaky with issues like:

teardown failed on attempt 2! Exiting immediately!
        KeyError: <_pytest.stash.StashKey object at 0x7f1666072320>

and

duckdb.duckdb.IOException: IO Error: Could not set lock on file "testing.duckdb": Conflicting lock is held in /home/circleci/.pyenv/versions/3.12.11/bin/python3.12 (PID 3249).

so I had a go at addressing these too. The general theme is we run tests concurrently for speed but aren't always good at ensuring each test gets its own unique copy of things and can work in isolation from other tests

- bigquery
- clickhouse-cloud
- athena
#- snowflake
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: revert prior to merge

@erindru erindru force-pushed the erin/fix-fabric-janitor-test branch from dcff7e8 to 85c2dbf Compare August 24, 2025 23:21
@@ -5,7 +5,7 @@ gateways:
type: duckdb
catalogs:
memory: ':memory:'
testing: 'testing.duckdb'
testing: "{{ var('tmp_path') }}/testing.duckdb"
Copy link
Collaborator Author

@erindru erindru Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was causing flakiness in style_and_cicd_tests, example

duckdb.duckdb.IOException: IO Error: Could not set lock on file "testing.duckdb": Conflicting lock is held in /home/circleci/.pyenv/versions/3.12.11/bin/python3.12 (PID 3249). See also https://duckdb.org/docs/stable/connect/concurrency

The duckdb integration tests get run as part of that and without prefixing the path, testing.duckdb gets created in the sqlmesh root dir rather than the unique dir for each test.

This causes it to be re-used between tests and potentially accessed in parallel from multiple workers although there is an xdist_group that forces most of the tests to run sequentially

sushi_test_schema = ctx.add_test_suffix("sushi")
sushi_state_schema = ctx.add_test_suffix("sushi_state")
raw_test_schema = ctx.add_test_suffix("raw")

config = load_config_from_paths(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was duplicating logic already in TestContext, since it pre-dated TestContext.create_context()

@@ -2383,33 +2362,15 @@ def _normalize_snowflake(name: str, prefix_regex: str = "(sqlmesh__)(.*)"):

init_example_project(tmp_path, ctx.engine_type, schema_name=schema_name)

config = load_config_from_paths(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was duplicating logic already in TestContext, since it pre-dated TestContext.create_context()

@@ -27,15 +27,15 @@
logger = logging.getLogger(__name__)


@pytest.fixture(scope="session")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was session-scoped because its predecessor was session scoped and the session scope got retained through an earlier refactor.

However, I was encountering some very hard-to-pin-down issues that I have seen a bunch of times regarding StashKey, example:

teardown failed on attempt 2! Exiting immediately!
        Traceback (most recent call last):
          File "/home/********/.pyenv/versions/3.12.11/lib/python3.12/site-packages/_pytest/runner.py", line 344, in from_call
            result: TResult | None = func()
                                     ^^^^^^
        KeyError: <_pytest.stash.StashKey object at 0x7f1666072320>

That looks like a concurrency issue and I had a theory it was related to session and function scoped fixtures being mixed.

So i've made this function scoped because I couldnt see a strong reason for it to be session scoped

@erindru erindru changed the title Fix(fabric): Fix failing janitor test Fix(tests): Address some test flakiness Aug 25, 2025
@erindru erindru force-pushed the erin/fix-fabric-janitor-test branch from fd3dfe1 to 5f5a27d Compare August 25, 2025 06:05
@@ -212,6 +212,38 @@ def pytest_collection_modifyitems(items, *args, **kwargs):
item.add_marker("fast")


@pytest.hookimpl(hookwrapper=True, tryfirst=True)
def pytest_runtest_makereport(item: pytest.Item, call: pytest.CallInfo):
Copy link
Collaborator Author

@erindru erindru Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be the only hook that can catch the StashKey errors. Fixtures that yield their values (like tmp_path) appear to hit different codepaths than other types of fixtures.

I tried both pytest_fixture_post_finalizer and pytest_runtest_teardown before resorting to this

@erindru erindru force-pushed the erin/fix-fabric-janitor-test branch from 5f5a27d to 2ad5091 Compare August 25, 2025 06:15
@@ -51,7 +51,7 @@ databricks_init() {

# Note: the cluster doesnt need to be running to create / drop catalogs, but it does need to be running to run the integration tests
echo "Ensuring cluster is running"
databricks clusters start $CLUSTER_ID || true
databricks clusters start $CLUSTER_ID
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to make Databricks fail early with the cluster start error rather than time out for 40mins trying to run tests

Copy link
Contributor

@georgesittas georgesittas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work 👍

@erindru erindru force-pushed the erin/fix-fabric-janitor-test branch from 6a024ea to e5f4287 Compare August 25, 2025 20:13
@erindru erindru merged commit ae4cd03 into main Aug 25, 2025
26 of 28 checks passed
@erindru erindru deleted the erin/fix-fabric-janitor-test branch August 25, 2025 20:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants