test: make xdist the default Makefile test path #1970
Codecov Report
✅ All modified and coverable lines are covered by tests.
Greptile Summary
This PR makes
| Filename | Overview |
|---|---|
| Makefile | Restructured test targets: test now runs xdist in parallel with API keys stripped; adds test-serial, test-benchmark, and other named targets. UNIT_TEST_ENV uses env -u (requires POSIX shell, acknowledged in comments). TEST= defaulting to empty correctly delegates to pytest.ini testpaths. |
| tests/conftest.py | Adds three autouse fixtures to reset context variables between tests, plus use_deterministic_embeddings_for_default_fastembed that patches LLMRails._get_embeddings_search_provider_instance via monkeypatch. Correctly opts out for real_embeddings-marked tests and live env vars. No yield needed since monkeypatch finalizer handles cleanup. |
| tests/testing/embeddings.py | New DeterministicEmbeddingSearchProvider using hash-based 64-dim embeddings for reproducible test results. Class docstring explicitly documents the float('inf') default threshold behavior. strict=True on zip is compatible with Python >= 3.10 as required by the project. |
| pytest.ini | Adds four pytest markers. serial is explicitly labeled documentation-only (not enforced by xdist). real_embeddings gates tests that opt out of the deterministic provider. Testpaths unchanged from baseline. |
| .github/workflows/_test.yml | Replaces bare poetry run pytest invocations with make test / make test-coverage. Shell is bash globally so env -u in the Makefile works on all runner OSes. |
| tests/v2_x/test_state_serialization.py | test_serialization_performance remains skipped with the original tight avg_time < 0.2 budget. The module-level config object is not mutated by the new fixture approach, which is an improvement. |
| pyproject.toml | Adds pytest-xdist = "^3.8.0" to dev dependencies. Existing pytest-cov >= 4.1.0 is sufficient for xdist-aware coverage collection. |
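As a rough illustration of the hash-based approach described for tests/testing/embeddings.py, a deterministic 64-dimensional vector can be derived from a text hash. This is a minimal sketch, not the PR's implementation: the function name `hash_embedding`, the choice of SHA-256, and the byte-to-float scaling are all assumptions made for the example.

```python
import hashlib

DIM = 64  # dimension named in the review summary


def hash_embedding(text: str, dim: int = DIM) -> list[float]:
    """Map text to a fixed pseudo-random vector (hypothetical helper)."""
    values: list[float] = []
    counter = 0
    # Hash the text together with a counter until enough bytes are collected.
    while len(values) < dim:
        digest = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
        # Scale each byte from [0, 255] to [-1.0, 1.0].
        values.extend(b / 127.5 - 1.0 for b in digest)
        counter += 1
    return values[:dim]
```

Because the vector depends only on the input string, the same text yields the same embedding on every xdist worker and on every run, which is the property the tests rely on.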
Reviews (6): Last reviewed commit: "ci: preserve pytest testpaths in make te..."
📝 Walkthrough
This PR modernizes the test infrastructure by introducing deterministic embeddings for reproducible tests, refactoring CI/CD to use make targets, improving test isolation through fixture cleanup, and updating tests to be resilient to pre-existing LLM activity. The changes span build configuration, test utilities, fixtures, and individual test updates across the test suite.
Changes: Test Infrastructure Modernization
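The test-isolation idea mentioned in the walkthrough can be illustrated with the standard `contextvars` module. This is a hedged sketch only: `request_id_var`, `isolated`, and `demo` are hypothetical names, and the PR's actual fixtures in tests/conftest.py are autouse pytest fixtures rather than a plain context manager.

```python
import contextvars
from contextlib import contextmanager

# Hypothetical context variable standing in for the library's real ones.
request_id_var = contextvars.ContextVar("request_id", default=None)


@contextmanager
def isolated(var, value=None):
    """Set a context variable for a block and restore the old value after."""
    token = var.set(value)
    try:
        yield
    finally:
        var.reset(token)


def demo():
    """Return the value seen inside the block and the value seen afterwards."""
    request_id_var.set("left-over-from-earlier-test")
    with isolated(request_id_var):
        inside = request_id_var.get()
    return inside, request_id_var.get()
```

The point of the pattern is that code running inside the block never sees state left behind by an earlier caller, and the previous value is restored even if the block raises.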
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 5 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests/test_content_safety_actions.py (1)
322-325: ⚡ Quick win: Replace `setattr` with direct attribute assignment.
Ruff flags this as B010 (`setattr` with a constant attribute name), which may fail the lint/pre-commit job in CI.
♻️ Proposed fix
 def _fast_langdetect_module(detect):
     module = ModuleType("fast_langdetect")
-    setattr(module, "detect", detect)
+    module.detect = detect
     return module
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_content_safety_actions.py` around lines 322 - 325, In _fast_langdetect_module, avoid using setattr with a constant name to satisfy Ruff B010: replace the setattr(module, "detect", detect) call with a direct attribute assignment on the ModuleType instance (assign module.detect = detect) so the function still exposes detect but no longer triggers the lint rule.
Makefile (2)
22-22: 💤 Low value: Consider documenting the worksteal distribution strategy.
The `DIST ?= worksteal` default may not be familiar to all users. The worksteal strategy can provide better load balancing but may behave differently than the default `load` strategy for certain test patterns.
Consider adding a brief comment above line 11 explaining when users might want to override `DIST`:
+# DIST controls pytest-xdist scheduling: worksteal (default, better load balancing) or load (simpler, predictable order)
 DIST ?= worksteal
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Makefile` at line 22, Add a short explanatory comment near the Makefile variable definition for DIST (currently defaulting to "worksteal") describing that this sets pytest-xdist's --dist strategy, briefly stating what "worksteal" does (dynamic task stealing for better load balance) and when to override it (e.g., use "load" for deterministic distribution or when tests vary widely in duration); reference the Makefile variables UNIT_TEST_ENV, PYTEST, WORKERS, DIST, ARGS, TEST so readers understand this affects the pytest invocation `-n $(WORKERS) --dist $(DIST)`.
14-15: ⚡ Quick win: Makefile: `env -u` is compatible with this repo’s Windows CI, but document the shell requirement.
`UNIT_TEST_ENV` uses `env -u` (Makefile lines 14-15). The repo’s Windows test workflows run steps with `defaults.run.shell: bash`, and they execute `make test` / `make test-coverage` on Windows runners, so the Unix-style `env` usage matches CI. To avoid local Windows surprises (PowerShell/CMD), document that `make` targets assume a Unix-like shell such as Git Bash/WSL (or add a fallback for non-bash environments).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Makefile` around lines 14 - 15, The Makefile uses UNIT_TEST_ENV (the variable defined with "env -u") which requires a Unix-like shell; update the Makefile to either document this requirement at the top or add a portable fallback: add a brief comment above UNIT_TEST_ENV stating that make targets assume a Unix-like shell (e.g., bash/Git Bash/WSL) for local Windows usage, or implement a conditional fallback for non-Unix shells (detect shell/platform and use an equivalent approach or skip env -u) so make test and make test-coverage work or clearly instruct developers to run make under bash-compatible shells.
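For contributors without a POSIX shell, the effect of `env -u` can be approximated from Python. This is a sketch under stated assumptions: the names in `STRIPPED_VARS` are placeholders (the actual list is whatever the Makefile's `UNIT_TEST_ENV` unsets), and `scrubbed_env` and `run_without_secrets` are hypothetical helpers, not part of the repository.

```python
import os
import subprocess

# Placeholder names; the real set is whatever UNIT_TEST_ENV unsets.
STRIPPED_VARS = ("OPENAI_API_KEY", "NVIDIA_API_KEY")


def scrubbed_env(environ=None, stripped=STRIPPED_VARS):
    """Return a copy of the environment without the listed variables."""
    source = os.environ if environ is None else environ
    return {k: v for k, v in source.items() if k not in stripped}


def run_without_secrets(argv):
    """Run a command with the scrubbed environment (like `env -u VAR ... cmd`)."""
    return subprocess.run(argv, env=scrubbed_env(), check=False)
```

A call such as `run_without_secrets([sys.executable, "-m", "pytest"])` (after `import sys`) would then start pytest without the listed credentials, regardless of which shell launched Python.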
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/testing/embeddings.py`:
- Line 82: The zip call over text_embedding and item_embedding should use
strict=True to fail fast on length mismatches; update the generator expression
in the return statement (the zip(text_embedding, item_embedding) call) to
zip(text_embedding, item_embedding, strict=True) so mismatched lengths raise
immediately while keeping behavior the same when lengths match.
---
Nitpick comments:
In `@Makefile`:
- Line 22: Add a short explanatory comment near the Makefile variable definition
for DIST (currently defaulting to "worksteal") describing that this sets
pytest-xdist's --dist strategy, briefly stating what "worksteal" does (dynamic
task stealing for better load balance) and when to override it (e.g., use "load"
for deterministic distribution or when tests vary widely in duration); reference
the Makefile variables UNIT_TEST_ENV, PYTEST, WORKERS, DIST, ARGS, TEST so
readers understand this affects the pytest invocation `-n $(WORKERS) --dist
$(DIST)`.
- Around line 14-15: The Makefile uses UNIT_TEST_ENV (the variable defined with
"env -u") which requires a Unix-like shell; update the Makefile to either
document this requirement at the top or add a portable fallback: add a brief
comment above UNIT_TEST_ENV stating that make targets assume a Unix-like shell
(e.g., bash/Git Bash/WSL) for local Windows usage, or implement a conditional
fallback for non-Unix shells (detect shell/platform and use an equivalent
approach or skip env -u) so make test and make test-coverage work or clearly
instruct developers to run make under bash-compatible shells.
In `@tests/test_content_safety_actions.py`:
- Around line 322-325: In _fast_langdetect_module, avoid using setattr with a
constant name to satisfy Ruff B010: replace the setattr(module, "detect",
detect) call with a direct attribute assignment on the ModuleType instance
(assign module.detect = detect) so the function still exposes detect but no
longer triggers the lint rule.
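Both forms produce the same module object; only the lint result differs. A small self-contained check follows, where the stand-in `fake_detect` function and its return value are assumptions made for illustration:

```python
from types import ModuleType


def _fast_langdetect_module(detect):
    """Build a stub module exposing `detect`, using direct assignment."""
    module = ModuleType("fast_langdetect")
    module.detect = detect  # equivalent to setattr(module, "detect", detect)
    return module


def fake_detect(text):
    # Hypothetical stand-in; the real detector's return shape may differ.
    return {"lang": "en", "score": 1.0}


stub = _fast_langdetect_module(fake_detect)
```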
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c971d245-a681-452b-aabd-144041bb217b
⛔ Files ignored due to path filters (1)
- poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (21)
- .github/workflows/_test.yml
- .github/workflows/lint.yml
- Makefile
- pyproject.toml
- pytest.ini
- tests/conftest.py
- tests/guardrails/test_iorails.py
- tests/guardrails/test_request_id.py
- tests/test_actions_llm_embedding_lazy_init.py
- tests/test_bug_rail_flows_in_prompt.py
- tests/test_configs/simple_server/config_1/config.py
- tests/test_configs/simple_server_2_x/config_2/config.py
- tests/test_configs/with_custom_embedding_search_provider/config.py
- tests/test_content_safety_actions.py
- tests/test_embeddings_only_user_messages.py
- tests/test_general_instructions.py
- tests/test_retrieve_relevant_chunks.py
- tests/test_testing_chat_harness.py
- tests/testing/embeddings.py
- tests/v2_x/test_llm_embedding_lazy_init.py
- tests/v2_x/test_state_serialization.py
tgasser-nv left a comment
Needs some updates to add missing tests and remove performance-test running as a unit-test.
Also could you run on Windows and make sure there are no issues with the Windows Makefile operation vs MacOS or Linux? The Windows checks aren't part of the PR tests
  - name: Run pytest without coverage
    if: inputs.with-coverage == false
-   run: poetry run pytest -v
+   run: make test
The poetry run pytest -v command this replaces uses the pytest.ini testpaths (tests, docs/colang-2/examples, benchmark/tests).
The new Makefile target specifies tests/ if nothing is provided (line 8), so the Makefile overrides pytest's default behaviour of pulling the test paths from the pytest.ini file.
The net effect is that the docs/colang-2/examples and benchmark/tests dirs are dropped. The docs/colang-2/examples directory doesn't seem to have any tests, but benchmark/tests does, so this drops 239 tests from CI coverage.
Recommend not providing the TESTS arg in the Makefile to pytest, to remain compatible with the previous behaviour.
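The behaviour at issue can be sketched as a command builder: when no test path is supplied, nothing positional is passed, so pytest falls back to `testpaths` from pytest.ini. The function name and its defaults are illustrative assumptions, not the Makefile's actual recipe; `-n` and `--dist` are pytest-xdist options.

```python
def build_pytest_argv(test="", workers="auto", dist="worksteal", extra=()):
    """Assemble a pytest command line (hypothetical helper)."""
    argv = ["pytest", "-n", str(workers), "--dist", dist, *extra]
    if test:
        # An explicit path overrides pytest.ini's testpaths.
        argv.append(test)
    # With no positional path, pytest uses testpaths from its config.
    return argv
```

Defaulting the path to an empty string rather than `tests/` is what keeps every directory listed in `testpaths` in the run.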
I won't include docs/colang-2.
I added a separate make test-benchmark target and a path-filtered benchmark workflow instead of folding benchmark/tests into make test (see a3cdef2).
benchmark/tests only covers benchmark tooling and does not import core nemoguardrails code.
Keeping them separate preserves make test as the core library suite, while still ensuring benchmark tooling is tested whenever benchmark code, dependencies, the Makefile, or the workflow changes. We also don't need the full matrix for benchmarking. What do you think, Tim?
The new target passed locally: 239 passed.
I'd rather have tests for everything in the repo under the same Makefile for simplicity.
@pytest.mark.asyncio
async def test_serialization_performance():
We shouldn't have performance tests as part of our unit-tests. This threshold magic-number (2.0) has been adjusted upwards many times, and can cause flaky CI results. If we have to make the threshold so high that it always passes, it neither verifies functionality nor performance. Recommend removing this test
Yes, I'll drop it once more 👍🏻
I need to let @greptileai know about this. It was a bad code change suggestion, @greptileai: originally this flaky test was skipped, and your review more or less caused this issue. See 8128bd8.
You're right to flag that, and I apologize for the confusion my earlier suggestion caused. The sequence here is clear: the test was flaky, you skipped it, and a prior bot review comment led to it being unskipped — which is the wrong direction. tgasser-nv's feedback is correct.
Dropping test_serialization_performance entirely is the right call. A performance threshold that has been bumped to 2.0 to avoid flakiness provides no signal — it's neither a functional test nor a meaningful perf guard at that point.
@CodeRabbit also note this down about flaky tests, especially performance tests being part of unit-tests.
@Pouyanpi, noted! I've recorded this guidance for future reviews.
(ᓀ ᓀ) 🐇
✏️ Learnings added
Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1970
File: tests/v2_x/test_state_serialization.py:144-144
Timestamp: 2026-06-04T08:31:21.557Z
Learning: In the NVIDIA-NeMo/Guardrails repository, performance/wall-clock timing tests must NOT be included as part of the unit-test suite. Such tests are inherently flaky on shared CI infrastructure, and thresholds that get repeatedly bumped (e.g., `test_serialization_performance` in `tests/v2_x/test_state_serialization.py` where the threshold was raised to `2.0`) provide neither functional coverage nor meaningful performance signal. Any existing performance test of this kind should be permanently removed or kept skipped (e.g., `pytest.mark.skip(reason="Flaky wall-clock performance assertion.")`). Do not suggest re-enabling or unskipping such tests during code review.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: Pouyanpi
Repo: NVIDIA-NeMo/Guardrails PR: 1814
File: tests/integrations/langchain/test_openai_param_filter.py:51-52
Timestamp: 2026-04-23T09:43:00.117Z
Learning: In NVIDIA-NeMo/Guardrails tests, recognize that env-based feature flags may be intentionally implemented using the truthiness pattern `os.environ.get("VAR_NAME") or os.environ.get("OTHER_VAR")` (where the env var values are treated as non-empty strings, e.g., `LIVE_TEST_MODE`, `TEST_LIVE_MODE`). During code review, avoid requesting per-file boolean normalization changes for this pattern; any correction should be handled as a repository-wide cleanup rather than a localized refactor in a single test file.
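A short sketch of what this truthiness pattern implies, using the variable names quoted in the learning. The helper name `live_mode_enabled` is hypothetical. Note that any non-empty string counts as enabled, including values such as "false".

```python
import os


def live_mode_enabled(environ=None):
    """Truthiness-based flag: any non-empty value enables live mode."""
    env = os.environ if environ is None else environ
    return bool(env.get("LIVE_TEST_MODE") or env.get("TEST_LIVE_MODE"))
```

This is why a localized switch to boolean parsing in one test file would make that file disagree with the rest of the suite, and why the learning defers it to a repository-wide cleanup.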
Looks like we still have this test in as of the latest commit, can you remove it @Pouyanpi ?
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
a3cdef2 to 5d3cf1f
tgasser-nv left a comment
Thanks for running the Windows full-tests as well. Approving with two blocking items before merging:
- Adding benchmark/tests back into the main Makefile.
- Removing the test_serialization_performance test
assert output_events[0]["script"] == "Hello again!"
@pytest.mark.skip(reason="Flaky wall-clock performance assertion.")
I agree that the Makefile should be the central CI entrypoint, and that is a goal of this PR. I do not think these tests validate benchmark tooling, not core library behavior. They also do not appear to be reviewed with the same rigor or stability expectations as the core, and the benchmark directory is not part of the package we ship. Treating them as part of the core test gate would blur that quality boundary and add matrix cost without much signal. The compromise I’m aiming for is:
devs can still run bare the performance test is actually skipped.
No, it was intentional.
Provide evidence for this statement. I've never asked for relaxed or less rigorous reviews for benchmark PRs. Tests are tests.
We do ship the benchmarking tools via git clone, not via PyPI. They still have to work just as our PyPI package. We should have the same high standards for both.
Your compromise is predicated on (1) benchmark tests having been added to CI unintentionally, and (2) our not shipping benchmarking tools via PyPI packages justifying lower code quality. Neither is true. There's no need to compromise; you're trying to justify a regression in behaviour.
tgasser-nv left a comment
See comment #1970 (comment)
@tgasser-nv do you see any cases where we need to run benchmark-related tests except
? I don't understand why we should run those tests as part of the core suite given the full matrix. Do we really need to run the benchmarking tests on 4 Python versions * 3 different OSs when the change is made to a file under Do we also need Codecov coverage reports for those?
c7ed441 to 5f73784
committed as: c7ed441 ci: preserve pytest testpaths in make test
validation:
If you think this is the right call, please approve and merge the PR (or I'll merge when I'm back). Otherwise I can revert this commit. TY!
tgasser-nv left a comment
Thanks for making the updates, approving
Summary
Make the default Makefile test path run the suite in parallel with pytest-xdist, while keeping an explicit serial pytest escape hatch.
What Changed
- `make test` now runs non-live pytest with xdist.
- `make test-parallel` remains available as an alias for `make test`.
- `make test-serial` runs plain pytest without xdist or live-env filtering.
Why
The recommended local and agent test path should be fast, deterministic, and safe to run without live provider credentials. This makes `make test` the project-level test command, while preserving `make test-serial` for debugging raw pytest behavior.
Impact
Local full-suite timing:
This makes the recommended local and agent test command about 4.8x faster on the measured machine.
The same change also reduces CI wall time by switching the reusable workflow from raw `poetry run pytest` to Makefile targets:
- `make test`
- `make test-coverage`
The current CI matrix runs PR tests across 4 Ubuntu/Python jobs, full tests across 8 Windows/macOS Python jobs, and latest-deps tests across 12 OS/Python jobs.
CI timing comparison:
Note: the measured coverage run completed pytest successfully and failed later during Codecov upload.
Validation
- `make help`
Summary by CodeRabbit
Release Notes
Tests
Chores