Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: Da behavior membership tests #819

Open
wants to merge 70 commits into
base: master
Choose a base branch
from

Conversation

romanzac
Copy link
Contributor

@romanzac romanzac commented Oct 14, 2024

1. What does this PR implement?

This PR tests functionality related to DispersalExecutorBehaviour and DispersalValidatorBehaviour. Primary focus on correct decoding of types received from the network and sanity checks whether message is sent to (in)correct member.

Tests included:

  • test_validation_behaviour

2. Does the code have enough context to be clearly understood?

Context is defined by DispersalExecutorBehaviour and DispersalValidatorBehaviour which are subject to test.

3. Who are the specification authors and who is accountable for this PR?

@romanzac

4. Is the specification accurate and complete?

Yes

5. Does the implementation introduce changes in the specification?

No changes to specification are expected. Changes to implementation are expected, because when executors are included within addressbook, some messages won't reach validators for dispersal. #819 (comment)

Checklist

  • 1. Description added.
  • 2. Context and links to Specification document(s) added.
  • 3. Main contact(s) (developers and specification authors) added
  • 4. Implementation and Specification are 100% in sync including changes. This is critical.
  • 5. Link PR to a specific milestone.

Sorry, something went wrong.

@bacv
Copy link
Member

bacv commented Oct 29, 2024

Could you try running cargo clippy -p nomos-da-network-core --tests, there might be some good suggestions. I'll take a look at the tests itselves.

@romanzac romanzac requested a review from bacv October 30, 2024 06:27
@romanzac
Copy link
Contributor Author

Could you try running cargo clippy -p nomos-da-network-core --tests, there might be some good suggestions. I'll take a look at the tests itselves.

I've fixed on Clippy suggestions and stabilized executor swarms' exit.

Copy link
Member

@bacv bacv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is checking the correct parts, but the setup seems to have couple of memberships for different subnetworks. Preferably the setup would mimic how it'll be going to be in a real network - one membership list with all peers assigned to subnetworks, the list is the same between all participants.

}
}

pub fn executor_swarm(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we using these in other test modules? If not these could be made private.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made private at e7f90a6

use libp2p::{identity, quic, PeerId, Swarm};
use nomos_da_messages::common::Blob;
use std::collections::{HashMap, HashSet};
use std::sync::Arc;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArcWake and Arc are not used anymore, we can remove them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed at e7f90a6

>,
>,
messages_to_expect: usize,
mut terminator_rx: watch::Receiver<bool>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Receiver can expect unit type (). Watcher will signal all receivers even if the value changes from unit to unit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to unit type at e7f90a6

}

_ = terminator_rx.changed() => {
if *terminator_rx.borrow() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are not using the value, the if condition can be removed, the break needs to happen if terminator_rx changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if condition removed at e7f90a6

.chain(validator_1_config.iter().map(to_p2p_address)),
);

let subnet_0_membership = create_membership(ALL_INSTANCES / 2, 0, &subnet_0_ids);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit strange, ideally we would have one membership with all peers assigned to different subnetworks. Executors would use membership to connect to any peer that is in that subnetwork, including another executor.

Maybe the initialization of different kinds of swarms could be based on one membership, this would simplify the setup and be closer to what we expect in real network?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed at 8397a1f

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About a half of executors won't receive dispersal success message when address book contains all swarms. And the same amount never reaches validator for dispersal.

Visible from test run with RUST_LOG=debug:

2024-10-31T02:37:32.180833Z  WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator timeout reached
2024-10-31T02:37:32.181661Z  WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator timeout reached
2024-10-31T02:37:32.181748Z  WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator timeout reached
2024-10-31T02:37:32.181781Z  WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator timeout reached
2024-10-31T02:37:32.181829Z  WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator timeout reached
2024-10-31T02:37:32.181859Z  WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator timeout reached
2024-10-31T02:37:32.271488Z  WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator timeout reached
2024-10-31T02:37:32.271625Z  WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator timeout reached
2024-10-31T02:37:32.283855Z  WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator timeout reached
2024-10-31T02:37:32.283928Z  WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator timeout reached
2024-10-31T02:37:32.284586Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 20 messages from subnet 0
2024-10-31T02:37:32.284597Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 1

2024-10-31T02:37:32.284608Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 0
2024-10-31T02:37:32.284614Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 10 messages from subnet 1

2024-10-31T02:37:32.284621Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 0
2024-10-31T02:37:32.284627Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 1

2024-10-31T02:37:32.284633Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 0
2024-10-31T02:37:32.284638Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 1

2024-10-31T02:37:32.284645Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 10 messages from subnet 0
2024-10-31T02:37:32.284651Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 1

2024-10-31T02:37:32.284657Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 0
2024-10-31T02:37:32.284663Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 1

2024-10-31T02:37:32.284669Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 0
2024-10-31T02:37:32.284675Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 1

2024-10-31T02:37:32.284681Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 0
2024-10-31T02:37:32.284687Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 1

2024-10-31T02:37:32.284693Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 0
2024-10-31T02:37:32.284699Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 1

2024-10-31T02:37:32.284706Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 0
2024-10-31T02:37:32.284711Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 1

2024-10-31T02:37:32.284727Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor task received: 10 messages dispersal success
2024-10-31T02:37:32.284738Z  WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor terminated
2024-10-31T02:37:32.284794Z  WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor terminated
2024-10-31T02:37:32.284839Z  WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor terminated
2024-10-31T02:37:32.284891Z  WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor terminated
2024-10-31T02:37:32.284931Z  WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor terminated
2024-10-31T02:37:32.284970Z  WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor terminated
2024-10-31T02:37:32.285421Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor task received: 0 messages dispersal success
2024-10-31T02:37:32.285431Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor task received: 0 messages dispersal success
2024-10-31T02:37:32.285438Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor task received: 10 messages dispersal success
2024-10-31T02:37:32.285445Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor task received: 0 messages dispersal success
2024-10-31T02:37:32.285452Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor task received: 0 messages dispersal success
2024-10-31T02:37:32.285459Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor task received: 10 messages dispersal success
2024-10-31T02:37:32.285466Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor task received: 0 messages dispersal success
2024-10-31T02:37:32.285472Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor task received: 10 messages dispersal success
2024-10-31T02:37:32.285478Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor task received: 0 messages dispersal success

1121ffa

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted to validators only, as having executors within the address book is not supported yet. d4a20fb

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Executors only send stuff. In this case for dispersal, they will not receive anything, never. If one executor behaviour connects to other executor behaviour is a bug.

Copy link
Contributor Author

@romanzac romanzac Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Executors only send stuff. In this case for dispersal, they will not receive anything, never. If one executor behaviour connects to other executor behaviour is a bug.

Created #900

@romanzac romanzac requested a review from bacv October 31, 2024 07:45
@romanzac
Copy link
Contributor Author

romanzac commented Nov 1, 2024

Gentlemen, @danielSanchezQ @bacv would you have some ideas please, what could be a problem with my test code and self-hosted runner ? Not all executors would receive dispersal success message. I ran out of ideas how to improve futures: run_validator_swarm and run_executor_swarm.

}

// Terminate any remaining executors
terminator_tx.send(()).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romanzac @danielSanchezQ @bacv
In continuation to #902 , I am trying to identify what could be wrong with this PR. But the main thing is I can't do proper debugging as of now, so I can't exactly confirm it...

This statement is executed in the main thread which is wrong I guess because after spawning(it is async so it is not waiting) the for loop at line 528 it will directly come to this line..and it will tell all executors
to drop off... That is why I guess there is error coming...

Please check into this one...
Also I am not good at devops so I can't really tell what is wrong with the pipeline tests😅...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me add latest log from self-hosted
run3.log

Copy link
Contributor Author

@romanzac romanzac Nov 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bacv Would it be possible to reboot self-hosted machines just to run this workflow: Test Validator Behaviour ? This run to check if state of the machine has any impact.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issues like 902 usually happen because multiple factors, e.g. environment is not friendly + connections are not that robust.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest to:

  1. Look at the runner machine. Configuration differences to macos-latest. State of CPU, memory, networking.
  2. Play round libp2p networking layer. Use different transport protocol. Try any tuning params for QUICK.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This run to check if state of the machine has any impact.

This can be tested on gh linux runner, without requiring any changes to the self hosted. Let me double check the test, I feel that the state of a machine isn't significant factor in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found only one log message type is present in self-hosted and not in macos-latest:

2024-11-29T10:04:19.560125Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor event: ConnectionClosed { peer_id: PeerId("12D3KooWE7XHQBnZ57Bk9tfpiTHN18cpV1B96tf4mi6RsdBRYtoJ"), connection_id: ConnectionId(21), endpoint: Dialer { address: /ip4/127.0.0.1/udp/5209/quic-v1/p2p/12D3KooWE7XHQBnZ57Bk9tfpiTHN18cpV1B96tf4mi6RsdBRYtoJ, role_override: Dialer }, num_established: 0, cause: Some(IO(Custom { kind: Other, error: Connection(ConnectionError(ApplicationClosed(ApplicationClose { error_code: 0, reason: b"" }))) })) }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nux runner,

Is self-hosted a linux ? We have already one differential run on macos-latest - np.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants