test: make xdist the default Makefile test path by Pouyanpi · Pull Request #1970 · NVIDIA-NeMo/Guardrails

Pouyanpi · 2026-06-02T10:32:49Z

Summary

Make the default Makefile test path run the suite in parallel with pytest-xdist, while keeping an explicit serial pytest escape hatch.

What Changed

make test now runs non-live pytest with xdist.
make test-parallel remains available as an alias for make test.
make test-serial runs plain pytest without xdist or live-env filtering.
Live provider keys and live-mode flags are cleared for the non-live parallel path.
Default FastEmbed usage in tests is replaced with deterministic embeddings unless a test explicitly opts into real embeddings.
Shared test state is isolated so the suite behaves correctly under xdist.
Real language-detection checks are gated so they do not run in normal unit-test mode.

Why

The recommended local and agent test path should be fast, deterministic, and safe to run without live provider credentials. This makes make test the project-level test command, while preserving make test-serial for debugging raw pytest behavior.

Impact

Local full-suite timing:

Serial pytest path: 138.24s
Parallel Makefile path: 29s

This makes the recommended local and agent test command about 4.8x faster on the measured machine.

The same change also reduces CI wall time by switching the reusable workflow from raw poetry run pytest to Makefile targets:

non-coverage jobs run make test
the coverage job runs make test-coverage

The current CI matrix runs PR tests across 4 Ubuntu/Python jobs, full tests across 8 Windows/macOS Python jobs, and latest-deps tests across 12 OS/Python jobs.

CI timing comparison:

Baseline raw pytest run: https://github.com/Pouyanpi/NeMo-Guardrails/actions/runs/26810039479
Current Makefile xdist run: https://github.com/Pouyanpi/NeMo-Guardrails/actions/runs/26812554221

Job	Baseline job	Current job	Baseline test step	Current test step
Ubuntu / Python 3.10 non-coverage	4m55s	2m18s	4m01s	1m49s
Ubuntu / Python 3.11 coverage	7m43s	3m20s	6m55s	2m44s
Ubuntu / Python 3.12 non-coverage	3m33s	1m57s	2m39s	1m31s
Ubuntu / Python 3.13 non-coverage	4m09s	2m01s	3m09s	1m31s

Note: the measured coverage run completed pytest successfully and failed later during Codecov upload.

Validation

make help

env LIVE_TEST=1 OPENAI_API_KEY=sk-test make test TEST=tests/test_embeddings_openai.py WORKERS=1 ARGS='-q --tb=short'

make test TEST=tests/test_utils.py WORKERS=1 ARGS='-q'

Summary by CodeRabbit

Release Notes

Tests
- Added support for parallel test execution with configurable workers
- Introduced pytest markers (serial, slow, live, real_embeddings) for organizing and filtering tests
- Enhanced test determinism with deterministic embedding provider
- Improved test isolation with automated state cleanup between tests
Chores
- Refactored test targets and GitHub Actions workflows for consistency

codecov · 2026-06-02T10:38:25Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

greptile-apps · 2026-06-02T10:49:34Z

Greptile Summary

This PR makes make test the canonical local and CI test command by switching it to parallel execution via pytest-xdist, while keeping make test-serial as a debugging escape hatch. Live provider credentials are cleared via env -u in UNIT_TEST_ENV, and a new DeterministicEmbeddingSearchProvider replaces FastEmbed in unit tests for reproducibility and speed.

Makefile introduces test, test-parallel, test-serial, test-coverage, and test-benchmark targets with configurable WORKERS, DIST, and ARGS variables; make test reports a 4.8× speedup over the serial path.
tests/conftest.py adds autouse fixtures that reset context variables between tests and intercept LLMRails._get_embeddings_search_provider_instance via monkeypatch so deterministic hash-embeddings are used by default, with an opt-out via @pytest.mark.real_embeddings.
CI workflow updates replace bare poetry run pytest -v calls with make test / make test-coverage, and lint.yml is aligned to the renamed pre-commit target.

Confidence Score: 5/5

Safe to merge. All test infrastructure changes are additive and isolated to the test harness; no production code paths are modified.

The changes are well-scoped to the test toolchain: Makefile targets, conftest fixtures, and CI YAML. The deterministic embedding provider is gated behind an autouse monkeypatch that is automatically undone per-test, and live-credential clearing is handled by env -u in UNIT_TEST_ENV. No new behavioral regressions were identified in the changed files.

No files require special attention. tests/v2_x/test_state_serialization.py still carries a skipped performance test, but that decision is tracked and acceptable given the discussion in the PR thread.

Important Files Changed

Filename	Overview
Makefile	Restructured test targets: `test` now runs xdist in parallel with API keys stripped; adds `test-serial`, `test-benchmark`, and other named targets. `UNIT_TEST_ENV` uses `env -u` (requires POSIX shell, acknowledged in comments). `TEST=` defaulting to empty correctly delegates to pytest.ini testpaths.
tests/conftest.py	Adds three autouse fixtures to reset context variables between tests, plus `use_deterministic_embeddings_for_default_fastembed` that patches `LLMRails._get_embeddings_search_provider_instance` via monkeypatch. Correctly opts out for `real_embeddings`-marked tests and live env vars. No yield needed since monkeypatch finalizer handles cleanup.
tests/testing/embeddings.py	New `DeterministicEmbeddingSearchProvider` using hash-based 64-dim embeddings for reproducible test results. Class docstring explicitly documents the float('inf') default threshold behavior. strict=True on zip is compatible with Python >= 3.10 as required by the project.
pytest.ini	Adds four pytest markers. `serial` is explicitly labeled documentation-only (not enforced by xdist). `real_embeddings` gates tests that opt out of the deterministic provider. Testpaths unchanged from baseline.
.github/workflows/_test.yml	Replaces bare `poetry run pytest` invocations with `make test` / `make test-coverage`. Shell is `bash` globally so `env -u` in the Makefile works on all runner OSes.
tests/v2_x/test_state_serialization.py	`test_serialization_performance` remains skipped with the original tight `avg_time < 0.2` budget. The module-level `config` object is not mutated by the new fixture approach, which is an improvement.
pyproject.toml	Adds `pytest-xdist = "^3.8.0"` to dev dependencies. Existing `pytest-cov >= 4.1.0` is sufficient for xdist-aware coverage collection.

_{Reviews (6): Last reviewed commit: "ci: preserve pytest testpaths in make te..." | Re-trigger Greptile}

coderabbitai · 2026-06-02T10:50:13Z

📝 Walkthrough

Walkthrough

This PR modernizes the test infrastructure by introducing deterministic embeddings for reproducible tests, refactoring CI/CD to use make targets, improving test isolation through fixture cleanup, and updating tests to be resilient to pre-existing LLM activity. The changes span build configuration, test utilities, fixtures, and individual test updates across the test suite.

Changes

Test Infrastructure Modernization

Layer / File(s)	Summary
Build system and CI/CD targets `Makefile`, `.github/workflows/_test.yml`, `.github/workflows/lint.yml`, `pyproject.toml`, `pytest.ini`	Makefile refactored with new test targets (`test`, `test-coverage`, `test-watch`, etc.), configurable variables for parallelism and coverage, and a `warm-fastembed-cache` target. GitHub Actions workflows updated to invoke make targets instead of direct poetry commands. pyproject.toml adds `pytest-xdist` dependency for parallel execution. pytest.ini registers custom markers (`serial`, `slow`, `live`, `real_embeddings`).
Deterministic test embeddings provider `tests/testing/embeddings.py`	New module implements `DeterministicEmbeddingSearchProvider` with token-hash-based 64-dimensional embeddings, in-memory item storage, and similarity-based search for reproducible embedding behavior in tests.
Conftest fixtures and context cleanup `tests/conftest.py`	Imports deterministic embedding utilities and adds three autouse fixtures: `reset_tool_calls_var()` and `reset_explain_info_var()` reset context variables per test, and `use_deterministic_embeddings_for_default_fastembed()` conditionally monkeypatches `LLMRails` to replace default FastEmbed providers with the deterministic test provider unless tests are marked `@pytest.mark.real_embeddings` or real embedding environment variables are set.
IORails and request-id fixtures `tests/guardrails/test_iorails.py`, `tests/guardrails/test_request_id.py`	IORails fixture split into synchronous `iorails_sync` and asynchronous `iorails` variants with proper teardown. New autouse fixture `clean_request_id_context()` resets request ID ContextVar to `"no-req-id"` per test.
Test updates for pre-existing LLM call resilience `tests/test_bug_rail_flows_in_prompt.py`, `tests/test_embeddings_only_user_messages.py`, `tests/test_general_instructions.py`, `tests/test_retrieve_relevant_chunks.py`	Tests refactored to capture baseline `llm_calls` count before execution, then slice `llm_calls` to only inspect newly added calls, making tests resilient to fixture-level LLM activity.
Content safety tests separation `tests/test_content_safety_actions.py`	Refactored to separate unit tests (validating `_detect_language` via sys.modules patching) from live tests. New `requires_live_test` marker gates live test classes; only runs when `LIVE_TEST` environment variable is set.
Test marker registration and config updates `tests/test_actions_llm_embedding_lazy_init.py`, `tests/test_configs/simple_server/config_1/config.py`, `tests/test_configs/simple_server_2_x/config_2/config.py`, `tests/test_configs/with_custom_embedding_search_provider/config.py`, `tests/v2_x/test_llm_embedding_lazy_init.py`	Test classes marked with `@pytest.mark.real_embeddings` to categorize embedding-related tests. New test config modules added for custom LLM providers. Existing embedding search provider config updated to accept optional `threshold` parameter.
State serialization test refactoring `tests/v2_x/test_state_serialization.py`	Initial event processing extracted into `_process_initial_events()` helper for test reuse. Performance test marked with `@pytest.mark.skip` due to flakiness.
TestChat constructor isolation test `tests/test_testing_chat_harness.py`	New test `test_testchat_constructor_does_not_replace_active_explain_context()` verifies that constructing multiple TestChat instances from the same config does not replace explain context for already-active instances.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

enhancement

Suggested reviewers

tgasser-nv

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'test: make xdist the default Makefile test path' directly and clearly describes the main change: making xdist (parallel testing) the default in the Makefile's test target.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Test Results For Major Changes	✅ Passed	Major test infrastructure changes are well-documented with performance results (4.8x speedup: 138.24s→29s) and CI validation details in PR description.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

tests/test_content_safety_actions.py (1)
322-325: ⚡ Quick win

Replace setattr with direct attribute assignment.

Ruff flags this as B010 (setattr with a constant attribute name), which may fail the lint/pre-commit job in CI.
♻️ Proposed fix
 def _fast_langdetect_module(detect):
     module = ModuleType("fast_langdetect")
-    setattr(module, "detect", detect)
+    module.detect = detect
     return module
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_content_safety_actions.py` around lines 322 - 325, In
_fast_langdetect_module, avoid using setattr with a constant name to satisfy
Ruff B010: replace the setattr(module, "detect", detect) call with a direct
attribute assignment on the ModuleType instance (assign module.detect = detect)
so the function still exposes detect but no longer triggers the lint rule.
Makefile (2)
22-22: 💤 Low value

Consider documenting the worksteal distribution strategy.

The DIST ?= worksteal default may not be familiar to all users. The worksteal strategy can provide better load balancing but may behave differently than the default load strategy for certain test patterns.

Consider adding a brief comment above line 11 explaining when users might want to override DIST:
+# DIST controls pytest-xdist scheduling: worksteal (default, better load balancing) or load (simpler, predictable order)
 DIST ?= worksteal
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Makefile` at line 22, Add a short explanatory comment near the Makefile
variable definition for DIST (currently defaulting to "worksteal") describing
that this sets pytest-xdist's --dist strategy, briefly stating what "worksteal"
does (dynamic task stealing for better load balance) and when to override it
(e.g., use "load" for deterministic distribution or when tests vary widely in
duration); reference the Makefile variables UNIT_TEST_ENV, PYTEST, WORKERS,
DIST, ARGS, TEST so readers understand this affects the pytest invocation `-n
$(WORKERS) --dist $(DIST)`.
14-15: ⚡ Quick win

Makefile: env -u is compatible with this repo’s Windows CI, but document the shell requirement

UNIT_TEST_ENV uses env -u (Makefile lines 14-15). The repo’s Windows test workflows run steps with defaults.run.shell: bash, and they execute make test / make test-coverage on Windows runners, so the Unix-style env usage matches CI. To avoid local Windows surprises (PowerShell/CMD), document that make targets assume a Unix-like shell such as Git Bash/WSL (or add a fallback for non-bash environments).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Makefile` around lines 14 - 15, The Makefile uses UNIT_TEST_ENV (the variable
defined with "env -u") which requires a Unix-like shell; update the Makefile to
either document this requirement at the top or add a portable fallback: add a
brief comment above UNIT_TEST_ENV stating that make targets assume a Unix-like
shell (e.g., bash/Git Bash/WSL) for local Windows usage, or implement a
conditional fallback for non-Unix shells (detect shell/platform and use an
equivalent approach or skip env -u) so make test and make test-coverage work or
clearly instruct developers to run make under bash-compatible shells.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/testing/embeddings.py`:
- Line 82: The zip call over text_embedding and item_embedding should use
strict=True to fail fast on length mismatches; update the generator expression
in the return statement (the zip(text_embedding, item_embedding) call) to
zip(text_embedding, item_embedding, strict=True) so mismatched lengths raise
immediately while keeping behavior the same when lengths match.

---

Nitpick comments:
In `@Makefile`:
- Line 22: Add a short explanatory comment near the Makefile variable definition
for DIST (currently defaulting to "worksteal") describing that this sets
pytest-xdist's --dist strategy, briefly stating what "worksteal" does (dynamic
task stealing for better load balance) and when to override it (e.g., use "load"
for deterministic distribution or when tests vary widely in duration); reference
the Makefile variables UNIT_TEST_ENV, PYTEST, WORKERS, DIST, ARGS, TEST so
readers understand this affects the pytest invocation `-n $(WORKERS) --dist
$(DIST)`.
- Around line 14-15: The Makefile uses UNIT_TEST_ENV (the variable defined with
"env -u") which requires a Unix-like shell; update the Makefile to either
document this requirement at the top or add a portable fallback: add a brief
comment above UNIT_TEST_ENV stating that make targets assume a Unix-like shell
(e.g., bash/Git Bash/WSL) for local Windows usage, or implement a conditional
fallback for non-Unix shells (detect shell/platform and use an equivalent
approach or skip env -u) so make test and make test-coverage work or clearly
instruct developers to run make under bash-compatible shells.

In `@tests/test_content_safety_actions.py`:
- Around line 322-325: In _fast_langdetect_module, avoid using setattr with a
constant name to satisfy Ruff B010: replace the setattr(module, "detect",
detect) call with a direct attribute assignment on the ModuleType instance
(assign module.detect = detect) so the function still exposes detect but no
longer triggers the lint rule.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c971d245-a681-452b-aabd-144041bb217b

📥 Commits

Reviewing files that changed from the base of the PR and between 8082e74 and 8128bd8.

⛔ Files ignored due to path filters (1)

poetry.lock is excluded by !**/*.lock

📒 Files selected for processing (21)

.github/workflows/_test.yml
.github/workflows/lint.yml
Makefile
pyproject.toml
pytest.ini
tests/conftest.py
tests/guardrails/test_iorails.py
tests/guardrails/test_request_id.py
tests/test_actions_llm_embedding_lazy_init.py
tests/test_bug_rail_flows_in_prompt.py
tests/test_configs/simple_server/config_1/config.py
tests/test_configs/simple_server_2_x/config_2/config.py
tests/test_configs/with_custom_embedding_search_provider/config.py
tests/test_content_safety_actions.py
tests/test_embeddings_only_user_messages.py
tests/test_general_instructions.py
tests/test_retrieve_relevant_chunks.py
tests/test_testing_chat_harness.py
tests/testing/embeddings.py
tests/v2_x/test_llm_embedding_lazy_init.py
tests/v2_x/test_state_serialization.py

tgasser-nv

Needs some updates to add missing tests and remove performance-test running as a unit-test.

Also could you run on Windows and make sure there are no issues with the Windows Makefile operation vs MacOS or Linux? The Windows checks aren't part of the PR tests

tgasser-nv · 2026-06-03T21:52:08Z

      - name: Run pytest without coverage
        if: inputs.with-coverage == false
-        run: poetry run pytest -v
+        run: make test


The poetry run pytest -v command this replaces uses the pytest.ini testpaths: tests, docs/colang-2/examples, benchmark/tests ).

The new Makefile target specifies tests/ if nothing is provided (line 8). So the Makefile overrides pytest's default behaviour of pulling the test paths from the pytest.ini file.

The net effect is the docs/colang-2/examples and benchmark/tests dirs are dropped. The docs/colang-2/examples directory doesn't seem to have any tests, but the benchmark/tests one does. This drops 239 tests from CI coverage.

Recommend not providing the TESTS arg in the Makefile to pytest to remain compatible with the previous behaviour

I won't include docs/colang-2 .

I added a separate make test-benchmark target and a path-filtered benchmark workflow instead of folding benchmark/tests into make test (see a3cdef2).

benchmark/tests only cover benchmark tooling and do not import core nemoguardrails code.
keeping them separate preserves make test as the core library suite.
while still ensuring benchmark tooling is tested whenever benchmark code, dependencies, the Makefile, or the workflow changes. we also don't need the full matrix for benchmarking. What do you think Tim?

the new target passed localy: 239 passed.

I'd rather have tests for everything in the repo under the same Makefile for simplicity.

tgasser-nv · 2026-06-04T01:57:45Z



+@pytest.mark.asyncio
+async def test_serialization_performance():


We shouldn't have performance tests as part of our unit-tests. This threshold magic-number (2.0) has been adjusted upwards many times, and can cause flaky CI results. If we have to make the threshold so high that it always passes, it neither verifies functionality nor performance. Recommend removing this test

Yes, I'll drop it once more 👍🏻

I need to let @greptileai know about this. It's a bad code change suggestion @greptileai, originally this flaky test was skipped and your review kind of cause this issues. see 8128bd8

You're right to flag that, and I apologize for the confusion my earlier suggestion caused. The sequence here is clear: the test was flaky, you skipped it, and a prior bot review comment led to it being unskipped — which is the wrong direction. tgasser-nv's feedback is correct.

Dropping test_serialization_performance entirely is the right call. A performance threshold that has been bumped to 2.0 to avoid flakiness provides no signal — it's neither a functional test nor a meaningful perf guard at that point.

@CodeRabbit also note this down about flaky tests, specially performance tests being part of unit-tests.

@Pouyanpi, noted! I've recorded this guidance for future reviews.

(ᓀ ᓀ) 🐇

✏️ Learnings added

Learnt from: Pouyanpi Repo: NVIDIA-NeMo/Guardrails PR: 1970 File: tests/v2_x/test_state_serialization.py:144-144 Timestamp: 2026-06-04T08:31:21.557Z Learning: In the NVIDIA-NeMo/Guardrails repository, performance/wall-clock timing tests must NOT be included as part of the unit-test suite. Such tests are inherently flaky on shared CI infrastructure, and thresholds that get repeatedly bumped (e.g., `test_serialization_performance` in `tests/v2_x/test_state_serialization.py` where the threshold was raised to `2.0`) provide neither functional coverage nor meaningful performance signal. Any existing performance test of this kind should be permanently removed or kept skipped (e.g., `pytest.mark.skip(reason="Flaky wall-clock performance assertion.")`). Do not suggest re-enabling or unskipping such tests during code review.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

🧠 Learnings used

Learnt from: Pouyanpi Repo: NVIDIA-NeMo/Guardrails PR: 1814 File: tests/integrations/langchain/test_openai_param_filter.py:51-52 Timestamp: 2026-04-23T09:43:00.117Z Learning: In NVIDIA-NeMo/Guardrails tests, recognize that env-based feature flags may be intentionally implemented using the truthiness pattern `os.environ.get("VAR_NAME") or os.environ.get("OTHER_VAR")` (where the env var values are treated as non-empty strings, e.g., `LIVE_TEST_MODE`, `TEST_LIVE_MODE`). During code review, avoid requesting per-file boolean normalization changes for this pattern; any correction should be handled as a repository-wide cleanup rather than a localized refactor in a single test file.

Looks like we still have this test in as of the latest commit, can you remove it @Pouyanpi ?

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Pouyanpi · 2026-06-04T08:26:20Z

@tgasser-nv full tests were run:

https://github.com/NVIDIA-NeMo/Guardrails/actions/runs/26814510332
https://github.com/NVIDIA-NeMo/Guardrails/actions/runs/26814315586

tgasser-nv

Thanks for running the Windows full-tests as well. Approving with two blocking items before merging:

Adding benchmark/tests back into the main Makefile.
Removing the test_serialization_performance test

Pouyanpi · 2026-06-05T08:35:25Z

    assert output_events[0]["script"] == "Hello again!"


+@pytest.mark.skip(reason="Flaky wall-clock performance assertion.")


@tgasser-nv this is already skipped.

Pouyanpi · 2026-06-05T09:03:34Z

@tgasser-nv

I agree that the Makefile should be the central CI entrypoint, and that is a goal of this PR.

I do not think benchmark/tests should be folded into the core make test target or the full unittest matrix. The previous bare poetry run pytest -v command picked up benchmark/tests through pytest.ini testpaths, but that looks like accidental coupling rather than an intentional CI contract. These tests validate benchmark
tooling, not core library behavior.

these tests validate benchmark tooling, not core library behavior. they also do not appear to be reviewed with the same rigor or stability expectations as the core, and the benchmark directory is not part of the package we ship. treating them as part of the core test gate would blur that quality boundary and add matrix cost without much signal.

the compromise II’m aiming for is:

keep make test focused on the core tests/ suite,
keep make test-benchmark in the same Makefile,
run benchmark tests through a dedicated workflow when benchmark-related files, dependencies, pytest config, the makefile, or the workflow itself change.

devs can still run bare poetry run pytest if they want pytest’s default testpaths behavior locally. For CI, the Makefile gives us explicit, named targets for the suites we intentionally run. What do you think?

the performance test is actually skipped.

tgasser-nv · 2026-06-05T14:53:47Z

@tgasser-nv

I agree that the Makefile should be the central CI entrypoint, and that is a goal of this PR.

I do not think benchmark/tests should be folded into the core make test target or the full unittest matrix. The previous bare poetry run pytest -v command picked up benchmark/tests through pytest.ini testpaths, but that looks like accidental coupling rather than an intentional CI contract. These tests validate benchmark tooling, not core library behavior.

No, it was intentional.

these tests validate benchmark tooling, not core library behavior. they also do not appear to be reviewed with the same rigor or stability expectations as the core

Provide evidence for this statement. I've never asked for relaxed or less rigorous reviews for benchmark PRs. Tests are tests.

... the benchmark directory is not part of the package we ship.

We do ship the benchmarking tools via git clone, not via PyPI. They still have to work just as our PyPI package. We should have the same high standards for both.

treating them as part of the core test gate would blur that quality boundary and add matrix cost without much signal.

the compromise II’m aiming for is:

keep make test focused on the core tests/ suite,

keep make test-benchmark in the same Makefile,

run benchmark tests through a dedicated workflow when benchmark-related files, dependencies, pytest config, the makefile, or the workflow itself change.

devs can still run bare poetry run pytest if they want pytest’s default testpaths behavior locally. For CI, the Makefile gives us explicit, named targets for the suites we intentionally run. What do you think?

Your compromise is predicated on 1. benchmark tests being added to CI unintentionally, 2. Our not shipping benchmarking tools via PyPI packages justifying a lower code-quality. Neither are true.

There's no need to compromise, you're trying to justify a regression in baheviour.

tgasser-nv

See comment #1970 (comment)

Pouyanpi · 2026-06-05T15:03:16Z

@tgasser-nv do you see any cases that we need to run benchmark related tests except

when benchmark-related files, dependencies, pytest config, the makefile, or the workflow itself change.

?

I don't understand why we should run those tests as part of the core suite given the full matrix. Do we really need to run the benchmarking test on 4 python versions * 3 different OSs when the change is made to a file under nemoguardrails/? They don't depend on a single line of code in nemoguardrails.

Do we also need codecov coverage reports for those?

Pouyanpi · 2026-06-05T15:22:34Z

@tgasser-nv

make test and `make test-coverage`` no longer force TEST ?= tests/, so pytest now uses pytest.ini testpaths again. that means benchmark/tests is back in the main CI test path. I also removed the separate benchmark workflow so those tests do not run twice.

committed as:

c7ed441 ci: preserve pytest testpaths in make test

validation:

make test TEST=benchmark/tests WORKERS=2 ARGS='-q --maxfail=1'
239 passed

make test WORKERS=2 ARGS='--collect-only -q'
4661 tests collected

make test ARGS='-q --maxfail=1'
4481 passed, 180 skipped

If you think this is the right call please approve and merge the PR (or I'll merge when I'm back). ow I can revert this commit. TY!

tgasser-nv

Thanks for making the updates, approving

Pouyanpi self-assigned this Jun 2, 2026

Pouyanpi added this to the v0.23.0 milestone Jun 2, 2026

Pouyanpi requested a review from tgasser-nv June 2, 2026 10:37

Pouyanpi marked this pull request as ready for review June 2, 2026 10:41

greptile-apps Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread tests/conftest.py Outdated

Comment thread tests/testing/embeddings.py

Comment thread tests/v2_x/test_state_serialization.py

coderabbitai Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread tests/testing/embeddings.py Outdated

NVIDIA-NeMo deleted a comment from greptile-apps Bot Jun 2, 2026

tgasser-nv reviewed Jun 4, 2026

View reviewed changes

Pouyanpi force-pushed the pouyanpi/test/parallel-unit-test-target branch from a3cdef2 to 5d3cf1f Compare June 4, 2026 08:27

tgasser-nv approved these changes Jun 4, 2026

View reviewed changes

Pouyanpi commented Jun 5, 2026

View reviewed changes

tgasser-nv requested changes Jun 5, 2026

View reviewed changes

Pouyanpi added 11 commits June 5, 2026 17:21

test: make xdist the default Makefile test path

2b5d855

test: isolate shared state for parallel runs

cd81bb5

test: use deterministic default embeddings

ee3606e

test: gate real language detection checks

de29e12

ci: run non-coverage tests through make target

02ed035

ci: use hyphenated pre-commit target

d7bae36

make coverage parallel

130f8a8

skip flaky test

ed7e6f0

apply review suggestions

f86cb0e

add docstring

0186987

ci: add benchmark test workflow

6980e5d

Pouyanpi added 3 commits June 5, 2026 17:21

test: skip flaky serialization performance test

a538a47

ci: run benchmark tests on pytest config changes

06079ca

ci: preserve pytest testpaths in make test

5f73784

Pouyanpi force-pushed the pouyanpi/test/parallel-unit-test-target branch from c7ed441 to 5f73784 Compare June 5, 2026 15:22

tgasser-nv approved these changes Jun 5, 2026

View reviewed changes

Pouyanpi merged commit 1839dd2 into NVIDIA-NeMo:develop Jun 5, 2026
13 of 14 checks passed



		@pytest.mark.asyncio
		async def test_serialization_performance():

		assert output_events[0]["script"] == "Hello again!"


		@pytest.mark.skip(reason="Flaky wall-clock performance assertion.")

Conversation

Pouyanpi commented Jun 2, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What Changed

Why

Impact

Validation

Summary by CodeRabbit

Release Notes

Uh oh!

codecov Bot commented Jun 2, 2026

Codecov Report

Uh oh!

greptile-apps Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot commented Jun 2, 2026

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

tgasser-nv left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Pouyanpi commented Jun 4, 2026

Uh oh!

tgasser-nv left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Pouyanpi commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tgasser-nv commented Jun 5, 2026

Uh oh!

tgasser-nv left a comment

Choose a reason for hiding this comment

Uh oh!

Pouyanpi commented Jun 5, 2026

Uh oh!

Pouyanpi commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pouyanpi commented Jun 2, 2026 •

edited by coderabbitai Bot

Loading

greptile-apps Bot commented Jun 2, 2026 •

edited

Loading

Pouyanpi commented Jun 5, 2026 •

edited

Loading

Pouyanpi commented Jun 5, 2026 •

edited

Loading