Skip to content

Conversation

@sanity
Copy link
Collaborator

@sanity sanity commented Oct 28, 2025

Summary

This PR converts the test_ping_partially_connected_network test to use the #[freenet_test] macro and extends the macro with support for partial connectivity networks.

Key Changes

  1. Extended #[freenet_test] macro with peer_connectivity_ratio attribute

    • Added new peer_connectivity_ratio parameter (0.0-1.0) to control network topology
    • Implemented deterministic blocking formula: (a * 17 + b * 31 + a * b * 7) % 100
    • Uses relative peer indices (not absolute node indices) for consistent blocking patterns
    • Documented in macro README and testing guide
  2. Converted test from manual setup to macro-based

    • Reduced test code from ~376 lines to ~179 lines (52% reduction)
    • Macro now handles: socket binding, configuration, node startup, lifecycle management
    • Test focuses purely on business logic: contract operations and subscription propagation
    • Improved readability and maintainability
  3. Fixed critical bugs

    • Peer indexing bug: Was using absolute node indices (0-9) instead of relative peer indices (0-6), causing incorrect blocking patterns
    • Stack overflow: Changed from large select! block with boxed futures to tokio::task::spawn_local with LocalSet
    • Contract compilation: Fixed to use correct build directory based on debug/release mode
    • Chrono version conflict: Use re-exported chrono from freenet_ping_types
  4. Core fixes from original PR

    • Treat handshake enqueue timeouts as recoverable so gateways keep running
    • Ensure test harness compiles the ping contract with the contract feature
    • Use the same contract-loading flow in multi-gateway test as the single-gateway variant

Test Results

The test now passes consistently in ~129s with the correct network topology:

  • 3 gateways (gw-0, gw-1, gw-2)
  • 7 peer nodes (node-0 through node-6)
  • 30% connectivity ratio between peers (matches original test)
  • All peers auto-connect to all gateways

Testing

cargo test -p freenet-ping-app --test run_app test_ping_partially_connected_network -- --nocapture

Closes #2029

🤖 Generated with Claude Code

Co-Authored-By: Claude [email protected]

@sanity
Copy link
Collaborator Author

sanity commented Oct 28, 2025

Review of PR #2031

Reviewed the changes - this addresses multiple issues discovered during #2022 investigation. Here's my analysis:

✅ Strong Points

1. Handshake Timeout Handling (p2p_protoc.rs)

  • Converts handshake timeout from fatal error to graceful recovery
  • Critical fix: Previously timeout(...).await? would propagate Elapsed error and crash the gateway
  • Now treats timeout as connection failure, notifies callback, but keeps gateway running
  • This directly addresses the "channel closed" crashes seen in CI

2. Contract Compilation (common/mod.rs)

  • Adds contract feature to compilation pipeline via extra_features()
  • Ensures freenet_stdlib::time is available when compiling ping contract
  • Includes ensure_target_dir_env() fallback for when CARGO_TARGET_DIR isn't set

3. Test Consistency (run_app.rs)

  • Changes from std::fs::read() + ContractContainer::try_from() to common::load_contract()
  • Now uses same contract loading path as single-gateway test
  • Ensures contract compiles with proper features at test time

🔍 Observations

Handshake Error Handling
The timeout handling change is interesting:

Err(elapsed) => {
    // ...logs warning...
    // ...notifies callback...
    Ok(())  // <-- Returns Ok, gateway continues
}

This makes timeouts non-fatal, which is appropriate for CI environments where resource contention can cause temporary delays during multi-gateway initialization.

Contract Loading Pattern
The test now compiles the contract twice:

  1. First with temp params to get code hash
  2. Second with real params to get final container

This matches the single-gateway test pattern and ensures consistent behavior.

⚠️ Potential Concerns

1. Timeout Semantics
Treating handshake queue timeout as Ok(()) means:

  • Gateway continues running even if it couldn't queue the handshake
  • Connection attempt is silently abandoned (callback gets notified but gateway doesn't know about failure at return site)

Question: Should we log at ERROR level instead of WARN since this represents a failed connection attempt?

2. CARGO_TARGET_DIR Fallback
The ensure_target_dir_env() adds fallback logic:

if std::env::var(TARGET_DIR_VAR).is_err() {
    // ...find workspace and set target dir...
}

Question: When would CARGO_TARGET_DIR not be set during testing? Is this for IDE/manual runs?

📋 Testing Recommendation

Before merge, verify:

  1. Multi-gateway CI test: Does it pass consistently now? (most important)
  2. Single-gateway test: Still passes (regression check)
  3. Local testing: Both variants pass locally (sanity check)

The PR description says:

RUST_LOG=info cargo test -p freenet-ping-app --test run_app test_ping_partially_connected_network -- --ignored --nocapture

This should be run in CI to confirm the fix works in the resource-constrained environment where the original failure occurred.

🎯 Verdict

Changes look reasonable and address root causes identified in #2029:

  • ✅ Fixes gateway crash on handshake timeout
  • ✅ Ensures contract compiles with proper features
  • ✅ Aligns test with working single-gateway pattern

Recommend: Approve pending successful CI run, particularly the Test job which includes the ignored tests.


[AI-assisted debugging and comment]

@sanity
Copy link
Collaborator Author

sanity commented Oct 28, 2025

Thanks for the thorough read-through!

On the timeout logging level – I kept it at warn because the timeout here usually means “we never heard back from the handshake task in time” rather than a hard failure of the node. The connect op still receives the failure via the callback and will retry using its existing backoff, so the warning serves as a heads-up without spamming error when CI is under load. If we find it gets too chatty we can always tune the level in follow-up.

On CARGO_TARGET_DIR – integration tests launched via cargo test don’t inherit that env var (it’s only set when you call cargo yourself). In the multi-gateway test we fork a fresh cargo build to compile the contract inline; without explicitly seeding the env var that inner build fell back to $PWD/target, which conflicted with other runs. The helper just replicates what our build scripts do by default so the path is stable whether you run from CI, IDE, or a developer shell.

CI’s still running – I’ll keep an eye on the workflow to make sure the multi-gateway job makes it across the finish line.

@sanity
Copy link
Collaborator Author

sanity commented Oct 29, 2025

CI Test Failed

The multi-gateway test still fails in CI with the same error pattern:

error: anyhow::Error - Gateway node failed: deadline has elapsed

The logs show the familiar "channel closed" errors during gateway initialization, despite the handshake timeout handling improvements.

Analysis

The handshake timeout handling change (Err(elapsed) => Ok(())) makes timeouts non-fatal for the P2P connection manager, but the test is still hitting a different deadline - likely the test harness's gateway startup timeout.

The "channel closed" errors appear to be a symptom of a deeper coordination issue during multi-gateway initialization in resource-constrained CI environments. The changes in this PR:

  1. Prevent gateway crashes - Handshake timeouts no longer crash gateways
  2. Don't fix the underlying coordination issue - Gateways still fail to establish connections within the test deadline

Observations from Logs

ERROR freenet::operations::connect: Failed while attempting connection to gateway,
  gateway: v6MWKgqK4rkpqzJL, error: failed notifying, channel closed

This suggests channels are still closing during the initial connection phase, which indicates:

  • Gateways may be shutting down prematurely
  • Or there's a race condition in channel lifecycle management
  • Or the 10s timeout in establish_conn is being exceeded

Next Steps

Option 1: Increase the handshake timeout from 10s to something higher for CI
Option 2: Investigate why channels are closing - is it related to the rate limiter or connection handler?
Option 3: Add more detailed logging around channel lifecycle to understand the close sequence
Option 4: Check if the test needs sequential gateway startup rather than parallel

This is still a CI-specific issue (test passed locally for the other agent). The test may inherently be timing-sensitive and require a different approach for resource-constrained environments.

[AI-assisted debugging and comment]

@sanity sanity marked this pull request as draft October 30, 2025 18:14
@sanity
Copy link
Collaborator Author

sanity commented Nov 1, 2025

@claude Is this PR still relevant given other PRs merged over the past 3 days?

@claude
Copy link

claude bot commented Nov 1, 2025

Claude finished @sanity's task —— View job


PR Relevance Analysis

  • Check current branch commits and changes
  • Compare with origin/main to identify differences
  • Review recent merges to main
  • Analyze changed files for overlap
  • Provide recommendation on PR relevance

Answer: Yes, this PR is still very relevant

Here's the situation:

What's in this PR branch

This branch contains 4 commits that are NOT yet on main:

  1. 80a7a03 - fix(tests): stabilize multi-gateway integration
  2. 31ac9eb - test(tests): enable multi-gateway integration
  3. 4094f82 - fix(network): prevent handshake enqueue error from crashing node
  4. 4da57bd - test(ping): increase client response timeouts for integration

What's changed on main in the past 3 days

Main has moved forward by only 1 commit:

This is just a dependency update and has nothing to do with the multi-gateway test fixes.

Current Status

  • The changes in this PR have NOT been merged to main
  • The multi-gateway test fix is still needed
  • However, the CI test still failed on Oct 29 with the same "deadline has elapsed" error

Files Changed in This PR

  • apps/freenet-ping/app/src/ping_client.rs - Client response timeout changes
  • apps/freenet-ping/app/tests/common/mod.rs - Contract compilation with features
  • apps/freenet-ping/app/tests/run_app.rs - Test harness improvements
  • crates/core/src/node/network_bridge/p2p_protoc.rs - Handshake timeout handling

Recommendation

This PR should remain open, but you need to decide:

  1. If the fixes are working: Despite the changes, the CI test still failed. You may need to:

    • Investigate why the test is still timing out
    • Consider increasing timeouts further
    • Or close this PR if you determine a different approach is needed
  2. Merge conflict with main: You'll need to rebase or merge main into this branch to pick up the dependency update before merging

  3. Alternative: If these fixes aren't working, you might want to close this PR and pursue a different solution to the multi-gateway test instability

The PR is definitely still relevant (no other PR has addressed these issues), but it needs either more work or a decision on whether to pursue a different approach.


@iduartgomez
Copy link
Collaborator

@claude rebase on main and see if we wtill run into issues qith this test

@claude
Copy link

claude bot commented Nov 1, 2025

Claude finished @iduartgomez's task —— View job


Rebasing on main and testing

  • Check current git status and fetch latest changes
  • Analyze rebase requirements
  • Rebase branch on origin/main (requires permission)
  • Push rebased branch
  • Run multi-gateway integration test

Analysis

I've analyzed the current state of the branch and unfortunately I don't have permission to perform git rebase operations. However, I can provide you with the information you need:

Current Branch State:

  • The branch fix/multi-gateway-ci already includes all changes from main via merge commit c04a8ca
  • Main is at commit 538d0ef (the tracing-subscriber dependency update)
  • The branch has 6 commits total, with 4 unique commits not on main:
    1. 80a7a03 - fix(tests): stabilize multi-gateway integration
    2. 31ac9eb - test(tests): enable multi-gateway integration
    3. 4094f82 - fix(network): prevent handshake enqueue error from crashing node
    4. 4da57bd - test(ping): increase client response timeouts for integration
    5. c04a8ca - Merge branch 'main' into fix/multi-gateway-ci
    6. 4d6ca92 - fix: keep handshake hiccups from killing gateway

Files Changed:

  • apps/freenet-ping/app/src/ping_client.rs (+3/-3)
  • apps/freenet-ping/app/tests/common/mod.rs (+40/-3)
  • apps/freenet-ping/app/tests/run_app.rs (+11/-9)
  • crates/core/src/node/network_bridge/p2p_protoc.rs (+98/-19)

What You Can Do

To rebase this branch on main yourself, run:

git checkout fix/multi-gateway-ci
git rebase origin/main
git push --force-with-lease origin fix/multi-gateway-ci

This will create a cleaner linear history by replaying your commits on top of the latest main.

Alternative: The Branch is Already Up-to-Date

Since the branch already includes all changes from main (via the merge commit), rebasing is optional from a functionality standpoint. The main benefit would be a cleaner git history. If you're primarily concerned about whether the test passes, you could:

  1. Trigger CI on the current branch state, or
  2. Rebase locally as shown above and push

Would you like me to help with anything else regarding this PR?


@iduartgomez iduartgomez changed the title fix: stabilize multi-gateway integration test fix(test): convert multi-gateway test to use freenet_test macro with partial connectivity Nov 2, 2025
@iduartgomez
Copy link
Collaborator

Latest Update: Fixed Blocking Formula and Peer Indexing

The most recent commits fix a critical issue where the test was not functionally identical to the original:

Issues Found

  1. Wrong connectivity ratio: Test was using 0.5 instead of 0.3
  2. Simplified blocking formula: Using (i * j) % 100 instead of the original's better distribution
  3. Incorrect peer indices: Using absolute node indices (0-9 including gateways) instead of relative peer indices (0-6 for peers only)

Fixes Applied

  • Updated blocking formula to use ordered pairs: (a * 17 + b * 31 + a * b * 7) % 100
  • Fixed peer indexing to use relative indices (0-6) instead of absolute (0-9)
  • Changed connectivity ratio from 0.5 to 0.3
  • Fixed chrono version conflict

Test Status

✅ Test now passes consistently in ~129s with the same topology as the original manual test

The peer indexing fix was the most critical - without it, the blocking pattern was completely different from the original test because we were using indices 0-9 (all nodes) instead of 0-6 (peers only) in the hash formula.

@sanity
Copy link
Collaborator Author

sanity commented Nov 19, 2025

CI hasn't triggered automatically for the latest push. Requesting re-run to validate fixes. [AI-assisted - Claude]

@sanity sanity marked this pull request as ready for review November 19, 2025 20:36
@sanity
Copy link
Collaborator Author

sanity commented Nov 19, 2025

CI trigger test (marking ready didn't auto-trigger). @freenet/core-team [AI-assisted - Claude]

@sanity
Copy link
Collaborator Author

sanity commented Nov 19, 2025

Update: Agent pushed fixes for the test failures (commit 70b2c78 - 'fix(test): stabilize partial connectivity put flow'), but CI is not triggering automatically for this commit.

The fixes include:

  • Resilient wait_for_put_response with longer timeout
  • Increased bootstrap wait time in macros
  • Self-target message routing fixes in p2p_protoc.rs

Issue: PR is 2+ weeks old and CI hasn't run since Nov 3. Marking as 'ready for review' didn't trigger workflows. May need maintainer to manually trigger CI run or close/reopen PR.

[AI-assisted - Claude]

@sanity
Copy link
Collaborator Author

sanity commented Nov 19, 2025

Update: PR now shows merge conflicts. This likely occurred because the branch is 2+ weeks old and diverged significantly from main. Options:

  1. Rebase onto latest main
  2. Close and create fresh PR with the fixes

Agent completed the implementation work (commit 70b2c78 has all the fixes), but the branch needs to be synchronized with main.

[AI-assisted - Claude]

@sanity sanity force-pushed the fix/multi-gateway-ci branch from 70b2c78 to 2ceb7b0 Compare November 20, 2025 16:38
@sanity sanity force-pushed the fix/multi-gateway-ci branch from 2ceb7b0 to 7f00395 Compare November 21, 2025 00:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Multi-gateway partially connected network test fails in CI with channel closed errors

3 participants