chore: Da behavior membership tests #819
base: master
Conversation
Could you try running
- bump port numbers for QUIC to work around stuck processes
I've fixed the Clippy suggestions and stabilized the executor swarms' exit.
The test is checking the correct parts, but the setup seems to have a couple of memberships for different subnetworks. Preferably, the setup would mimic how it will be in a real network: one membership list with all peers assigned to subnetworks, with the same list shared between all participants.
pub fn executor_swarm(
Are we using these in other test modules? If not, these could be made private.
Made private at e7f90a6
use libp2p::{identity, quic, PeerId, Swarm};
use nomos_da_messages::common::Blob;
use std::collections::{HashMap, HashSet};
use std::sync::Arc;
ArcWake and Arc are not used anymore, we can remove them.
Removed at e7f90a6
>,
>,
messages_to_expect: usize,
mut terminator_rx: watch::Receiver<bool>,
Receiver can expect unit type (). Watcher will signal all receivers even if the value changes from unit to unit.
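The point about unit-to-unit signaling can be illustrated with a minimal std-only analogue of tokio's watch channel (the real channel lives in tokio::sync::watch; the UnitWatch type below is a hypothetical stand-in built for this sketch). Notification is driven by a version counter, not by comparing values, so sending () still wakes every waiting receiver:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Hypothetical std-only analogue of a `watch` channel carrying `()`.
// A version counter plus a condvar: `send` bumps the version, and
// receivers wake on any version change, regardless of the value.
#[derive(Clone)]
struct UnitWatch {
    inner: Arc<(Mutex<u64>, Condvar)>,
}

impl UnitWatch {
    fn new() -> Self {
        Self { inner: Arc::new((Mutex::new(0), Condvar::new())) }
    }

    // Like `watch::Sender::send(())`: bump the version and wake receivers.
    fn send(&self) {
        let (version, cvar) = &*self.inner;
        *version.lock().unwrap() += 1;
        cvar.notify_all();
    }

    // Like awaiting `watch::Receiver::changed()`: block until the version moves.
    fn changed(&self, seen: &mut u64) {
        let (version, cvar) = &*self.inner;
        let mut v = version.lock().unwrap();
        while *v == *seen {
            v = cvar.wait(v).unwrap();
        }
        *seen = *v;
    }
}

fn main() {
    let watch = UnitWatch::new();
    let mut handles = Vec::new();
    for _ in 0..4 {
        let rx = watch.clone();
        handles.push(thread::spawn(move || {
            let mut seen = 0;
            rx.changed(&mut seen); // returns once the sender signals
        }));
    }
    watch.send(); // a unit-to-unit "change" still wakes all receivers
    for h in handles {
        h.join().unwrap();
    }
    println!("all receivers woken");
}
```

The version check in the loop also means a receiver that starts waiting after the signal returns immediately, so the termination signal cannot be missed.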
Changed to unit type at e7f90a6
}

_ = terminator_rx.changed() => {
if *terminator_rx.borrow() {
Since we are not using the value, the if condition can be removed; the break needs to happen if terminator_rx changes.
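The simplified branch could look like this sketch (not compiled here; it assumes the surrounding tokio select! loop from the diff):

```rust
// Sketch: break whenever the watch channel reports a change;
// the value itself is never inspected.
_ = terminator_rx.changed() => {
    break;
}
```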
if condition removed at e7f90a6
.chain(validator_1_config.iter().map(to_p2p_address)),
);

let subnet_0_membership = create_membership(ALL_INSTANCES / 2, 0, &subnet_0_ids);
This seems a bit strange; ideally we would have one membership with all peers assigned to different subnetworks. Executors would use the membership to connect to any peer in that subnetwork, including another executor.
Maybe the initialization of the different kinds of swarms could be based on one membership? This would simplify the setup and be closer to what we expect in a real network.
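A single shared membership could be sketched roughly like this (the Peer/SubnetworkId type aliases and both helper functions are hypothetical stand-ins for the test's real create_membership and PeerId types, used only to show the shape of the data):

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-ins: u16 ids instead of libp2p PeerId.
type Peer = u16;
type SubnetworkId = u16;

// One shared membership: every peer is assigned to exactly one subnetwork.
// All participants hold the same map.
fn create_membership(peers: &[Peer], subnetworks: u16) -> HashMap<Peer, SubnetworkId> {
    peers
        .iter()
        .enumerate()
        .map(|(i, &p)| (p, (i as u16) % subnetworks))
        .collect()
}

// A swarm looks up the peers of its subnetwork from the shared map.
fn peers_in_subnet(
    membership: &HashMap<Peer, SubnetworkId>,
    subnet: SubnetworkId,
) -> HashSet<Peer> {
    membership
        .iter()
        .filter(|(_, &s)| s == subnet)
        .map(|(&p, _)| p)
        .collect()
}

fn main() {
    let peers: Vec<Peer> = (0..20).collect();
    let membership = create_membership(&peers, 2);
    // The same list is shared by validators and executors alike;
    // each swarm only filters it by its own subnetwork.
    assert_eq!(peers_in_subnet(&membership, 0).len(), 10);
    assert_eq!(peers_in_subnet(&membership, 1).len(), 10);
}
```

With one map as the single source of truth, there is no way for two participants to disagree about which subnetwork a peer belongs to.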
Fixed at 8397a1f
About half of the executors won't receive the dispersal success message when the address book contains all swarms, and the same number never reach a validator for dispersal.
Visible from a test run with RUST_LOG=debug:
2024-10-31T02:37:32.180833Z WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator timeout reached
2024-10-31T02:37:32.181661Z WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator timeout reached
2024-10-31T02:37:32.181748Z WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator timeout reached
2024-10-31T02:37:32.181781Z WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator timeout reached
2024-10-31T02:37:32.181829Z WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator timeout reached
2024-10-31T02:37:32.181859Z WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator timeout reached
2024-10-31T02:37:32.271488Z WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator timeout reached
2024-10-31T02:37:32.271625Z WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator timeout reached
2024-10-31T02:37:32.283855Z WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator timeout reached
2024-10-31T02:37:32.283928Z WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator timeout reached
2024-10-31T02:37:32.284586Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 20 messages from subnet 0
2024-10-31T02:37:32.284597Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 1
2024-10-31T02:37:32.284608Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 0
2024-10-31T02:37:32.284614Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 10 messages from subnet 1
2024-10-31T02:37:32.284621Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 0
2024-10-31T02:37:32.284627Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 1
2024-10-31T02:37:32.284633Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 0
2024-10-31T02:37:32.284638Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 1
2024-10-31T02:37:32.284645Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 10 messages from subnet 0
2024-10-31T02:37:32.284651Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 1
2024-10-31T02:37:32.284657Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 0
2024-10-31T02:37:32.284663Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 1
2024-10-31T02:37:32.284669Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 0
2024-10-31T02:37:32.284675Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 1
2024-10-31T02:37:32.284681Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 0
2024-10-31T02:37:32.284687Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 1
2024-10-31T02:37:32.284693Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 0
2024-10-31T02:37:32.284699Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 1
2024-10-31T02:37:32.284706Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 0
2024-10-31T02:37:32.284711Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Validator received 0 messages from subnet 1
2024-10-31T02:37:32.284727Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor task received: 10 messages dispersal success
2024-10-31T02:37:32.284738Z WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor terminated
2024-10-31T02:37:32.284794Z WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor terminated
2024-10-31T02:37:32.284839Z WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor terminated
2024-10-31T02:37:32.284891Z WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor terminated
2024-10-31T02:37:32.284931Z WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor terminated
2024-10-31T02:37:32.284970Z WARN nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor terminated
2024-10-31T02:37:32.285421Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor task received: 0 messages dispersal success
2024-10-31T02:37:32.285431Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor task received: 0 messages dispersal success
2024-10-31T02:37:32.285438Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor task received: 10 messages dispersal success
2024-10-31T02:37:32.285445Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor task received: 0 messages dispersal success
2024-10-31T02:37:32.285452Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor task received: 0 messages dispersal success
2024-10-31T02:37:32.285459Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor task received: 10 messages dispersal success
2024-10-31T02:37:32.285466Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor task received: 0 messages dispersal success
2024-10-31T02:37:32.285472Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor task received: 10 messages dispersal success
2024-10-31T02:37:32.285478Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor task received: 0 messages dispersal success
Reverted to validators only, as having executors within the address book is not supported yet. d4a20fb
Executors only send stuff. In this case, for dispersal, they will never receive anything. If one executor behaviour connects to another executor behaviour, that is a bug.
> Executors only send stuff. In this case, for dispersal, they will never receive anything. If one executor behaviour connects to another executor behaviour, that is a bug.

Created #900
- add comment
Gentlemen, @danielSanchezQ @bacv, would you have some ideas, please, on what could be the problem with my test code and the self-hosted runner? Not all executors receive the dispersal success message. I've run out of ideas on how to improve the futures run_validator_swarm and run_executor_swarm.
}

// Terminate any remaining executors
terminator_tx.send(()).unwrap();
@romanzac @danielSanchezQ @bacv
In continuation of #902, I am trying to identify what could be wrong with this PR. The main thing is that I can't do proper debugging as of now, so I can't exactly confirm it...
This statement is executed in the main thread, which I guess is wrong: after spawning the for loop at line 528 (it is async, so it does not wait), execution comes directly to this line and tells all executors to drop off... That, I guess, is why the error appears.
Please check into this one...
Also, I am not good at devops, so I can't really tell what is wrong with the pipeline tests 😅...
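The ordering issue described above can be shown with a small std-only sketch (the channel names and the produced-message count are hypothetical; the real code uses tokio tasks and a watch channel): the terminate signal must only be sent after the work being waited for has completed, not right after spawning.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // done_tx reports completion; term_tx carries the terminate signal.
    let (done_tx, done_rx) = mpsc::channel();
    let (term_tx, term_rx) = mpsc::channel::<()>();

    let worker = thread::spawn(move || {
        // Stands in for run_executor_swarm: do the work, report
        // completion, then wait for the terminate signal.
        let produced = 10;
        done_tx.send(produced).unwrap();
        term_rx.recv().unwrap();
        produced
    });

    // Wrong ordering would be to call term_tx.send(()) here, racing the
    // worker. Correct ordering: first wait for completion, then terminate.
    let produced = done_rx.recv().unwrap();
    term_tx.send(()).unwrap();

    assert_eq!(worker.join().unwrap(), produced);
    println!("terminated after {produced} messages");
}
```

If the send happens before the completion is observed, workers that have not yet finished are told to drop off early, which matches the symptom of executors missing their dispersal success messages.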
Let me add the latest log from the self-hosted runner: run3.log
@bacv Would it be possible to reboot the self-hosted machines, just to run this workflow: Test Validator Behaviour? This run would check whether the state of the machine has any impact.
Issues like #902 usually happen because of multiple factors, e.g. the environment is not friendly and the connections are not that robust.
I would suggest to:
- Look at the runner machine: configuration differences to macos-latest, state of CPU, memory, networking.
- Play around with the libp2p networking layer: use a different transport protocol, try any tuning params for QUIC.
> This run to check if state of the machine has any impact.

This can be tested on a GH Linux runner, without requiring any changes to the self-hosted one. Let me double-check the test; I feel that the state of the machine isn't a significant factor in this case.
I found only one log message type that is present on self-hosted and not on macos-latest:
2024-11-29T10:04:19.560125Z DEBUG nomos_da_network_core::protocols::dispersal::validator::behaviour::tests: Executor event: ConnectionClosed { peer_id: PeerId("12D3KooWE7XHQBnZ57Bk9tfpiTHN18cpV1B96tf4mi6RsdBRYtoJ"), connection_id: ConnectionId(21), endpoint: Dialer { address: /ip4/127.0.0.1/udp/5209/quic-v1/p2p/12D3KooWE7XHQBnZ57Bk9tfpiTHN18cpV1B96tf4mi6RsdBRYtoJ, role_override: Dialer }, num_established: 0, cause: Some(IO(Custom { kind: Other, error: Connection(ConnectionError(ApplicationClosed(ApplicationClose { error_code: 0, reason: b"" }))) })) }
> linux runner

Is the self-hosted runner a Linux machine? We already have one differential run on macos-latest - np.
1. What does this PR implement?
This PR tests functionality related to DispersalExecutorBehaviour and DispersalValidatorBehaviour. The primary focus is on correct decoding of the types received from the network, and on sanity checks of whether a message is sent to the (in)correct member.
Tests included:
2. Does the code have enough context to be clearly understood?
Context is defined by DispersalExecutorBehaviour and DispersalValidatorBehaviour, which are the subject of the tests.
3. Who are the specification authors and who is accountable for this PR?
@romanzac
4. Is the specification accurate and complete?
Yes
5. Does the implementation introduce changes in the specification?
No changes to the specification are expected. Changes to the implementation are expected, because when executors are included in the address book, some messages won't reach validators for dispersal. #819 (comment)