Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate tracing #6339

Open
wants to merge 192 commits into
base: unstable
Choose a base branch
from
Open

Conversation

ThreeHrSleep
Copy link
Member

@ThreeHrSleep ThreeHrSleep commented Sep 2, 2024

Issue Addressed

Tracing Integration

Proposed Changes

  • replace slog & log with tracing throughout the codebase
  • implement custom crit log
  • make relevant changes in the formatter
  • replace sloggers
  • re-write SSE logging components

cc: @macladson @eserilev

@ThreeHrSleep ThreeHrSleep marked this pull request as draft September 2, 2024 05:04
@chong-he chong-he added work-in-progress PR is a work-in-progress UX-and-logs labels Sep 3, 2024
@ThreeHrSleep ThreeHrSleep changed the title feat: tracing integration feat: Integrate tracing Sep 4, 2024
@ThreeHrSleep ThreeHrSleep force-pushed the tracing-integration branch 2 times, most recently from 23491cd to 4b63358 Compare September 9, 2024 11:07
Add retropgf funding (sigp#6324)

Migrate from `ethereum-types` to `alloy-primitives` (sigp#6078)

* Remove use of ethers_core::RlpStream

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into remove_use_of_ethers_core

* Remove old code

* Simplify keccak call

* Remove unused package

* Merge branch 'unstable' of https://github.com/ethDreamer/lighthouse into remove_use_of_ethers_core

* Merge branch 'unstable' into remove_use_of_ethers_core

* Run clippy

* Merge branch 'remove_use_of_ethers_core' of https://github.com/dospore/lighthouse into remove_use_of_ethers_core

* Check all cargo fmt

* migrate to alloy primitives init

* fix deps

* integrate alloy-primitives

* resolve dep issues

* more changes based on dep changes

* add TODOs

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into remove_use_of_ethers_core

* Revert lock

* Add BeaconBlocksByRange v3

* continue migration

* Revert "Add BeaconBlocksByRange v3"

This reverts commit e3ce7fc.

* impl hash256 extended trait

* revert some uneeded diffs

* merge conflict resolved

* fix subnet id rshift calc

* rename to FixedBytesExtended

* debugging

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into migrate-to-alloy-primitives

* fix failed test

* fixing more tests

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into remove_use_of_ethers_core

* introduce a shim to convert between the two u256 types

* move alloy to wrokspace

* align alloy versions

* update

* update web3signer test certs

* refactor

* resolve failing tests

* linting

* fix graffiti string test

* fmt

* fix ef test

* resolve merge conflicts

* remove udep and revert cert

* cargo patch

* cyclic dep

* fix build error

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into migrate-to-alloy-primitives

* resolve conflicts, update deps

* merge unstable

* fmt

* fix deps

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into migrate-to-alloy-primitives

* resolve merge conflicts

* resolve conflicts, make necessary changes

* Remove patch

* fmt

* remove file

* merge conflicts

* sneaking in a smol change

* bump versions

* Merge remote-tracking branch 'origin/unstable' into migrate-to-alloy-primitives

* Updates for peerDAS

* Update ethereum_hashing to prevent dupe

* updated alloy-consensus, removed TODOs

* cargo update

* endianess fix

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into migrate-to-alloy-primitives

* fmt

* fix merge

* fix test

* fixed_bytes crate

* minor fixes

* convert u256 to i64

* panic free mixin to_low_u64_le

* from_str_radix

* computbe_subnet api and ensuring we use big-endian

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into migrate-to-alloy-primitives

* fix test

* Simplify subnet_id test

* Simplify some more tests

* Add tests to fixed_bytes crate

* Merge branch 'unstable' into migrate-to-alloy-primitives

update libp2p to version 0.54 (sigp#6249)

* update libp2p to version 0.54.0

* address review

* Merge branch 'unstable' of github.com:sigp/lighthouse into update-libp2p

* Merge branch 'update-libp2p' of github.com:sigp/lighthouse into update-libp2p

Drop block data from BlockError and BlobError (sigp#5735)

* Drop block data from BlockError and BlobError

* Debug release tests

* Fix release tests

* Revert unnecessary changes to lighthouse_metrics

Update manual checkpoint sync content in Lighthouse book (sigp#6338)

* Update manual checkpiont sync

* Update faq

* Minor revision

* Minor revision

Light Client Bug Fix (sigp#6299)

* Light Client Bug Fix

Ignore Rust 1.82 warnings about void patterns (sigp#6357)

* Ignore Rust 1.82 warnings about void patterns

Enable `large_stack_frames` lint (sigp#6343)

* Enable `large_stack_frames` lint

Add blob count label to `DATA_COLUMN_SIDECAR_COMPUTATION` metric (sigp#6340)

* Add blob count label to `DATA_COLUMN_SIDECAR_COMPUTATION` metric, and move metrics into the compute function, recording only successful computation.

* Move `discard_timer_on_break` usage to caller site.

* Merge branch 'unstable' into compute-data-column-metric

* Merge branch 'unstable' into compute-data-column-metric

Metadata request ordering (sigp#6336)

* Send metadata request ordering

* Merge branch 'unstable' into metadata-order

Clarify validator monitor block log (sigp#6342)

* Clarify validator monitor block log

* Merge branch 'unstable' into clarify-block-log

Check known parent on rpc blob process (sigp#5893)

* Check known parent on rpc blob process

* fix test

* Merge branch 'unstable' of https://github.com/sigp/lighthouse into blob-unknown-parent

Remove beta tag from gossipsub 1.2 (sigp#6344)

* Remove the beta tag from gossipsub v1.2

* fix clippy

* Merge branch 'unstable' into remove-beta-tag

Fix lints for Rust 1.81 (sigp#6363)

* Fix lints for Rust 1.81

Use tikv-jemallocator instead of jemallocator (sigp#6354)

* Use tikv-jemallocator instead of jemallocator

* Merge branch 'unstable' into tikv-jemallocator

* Bump tikv-jemallocator and tikv-jemalloc-ctl

Improve `get_custody_columns` validation, caching and error handling (sigp#6308)

* Improve `get_custody_columns` validation, caching and error handling.

* Merge branch 'unstable' into get-custody-columns-error-handing

* Fix failing test and add more test.

* Fix failing test and add more test.

* Merge branch 'unstable' into get-custody-columns-error-handing

* Add unit test to make sure the default specs won't panic on the `compute_custody_requirement_subnets` function.

* Add condition when calling `compute_custody_subnets_from_metadata` and update logs.

* Validate `csc` when returning from enr. Remove `csc` computation on connection since we get them on metadata anyway.

* Add `peers_per_custody_subnet_count` to track peer csc and supernodes.

* Disconnect peers with invalid metadata and find other peers instead.

* Fix sampling tests.

* Merge branch 'unstable' into get-custody-columns-error-handing

* Merge branch 'unstable' into get-custody-columns-error-handing

Delete legacy payload reconstruction (sigp#6213)

* Delete legacy payload reconstruction

* Delete unneeded failing test

* Merge remote-tracking branch 'origin/unstable' into remove-more-ethers

* Merge remote-tracking branch 'origin/unstable' into remove-more-ethers

* Cleanups

simplify rpc codec logic (sigp#6304)

* simplify rpc codec logic

* Merge branch 'unstable' of github.com:sigp/lighthouse into simplify-rpc-codec

* Merge branch 'unstable' of github.com:sigp/lighthouse into simplify-rpc-codec

* Merge branch 'unstable' of github.com:sigp/lighthouse into simply-rpc-codec

* Merge branch 'unstable' into simplify-rpc-codec

* Merge branch 'unstable' into simplify-rpc-codec
Revert "Merge branch 'sigp:stable' into tracing-integration"

This reverts commit 668157c, reversing
changes made to ee25fbe.
@eserilev
Copy link
Collaborator

eserilev commented Oct 1, 2024

Tracing
image

Slog
image

@eserilev
Copy link
Collaborator

eserilev commented Oct 1, 2024

at the moment this is what a crit log looks like

its labeled as ERROR but with an error_type = crit. Im wondering if we can tweak the formatting here to display CRIT in purple instead of ERROR (as we have currently for our slog impl)

image

Copy link
Collaborator

@eserilev eserilev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work so far!

Most of my suggested changes revolve around making sure we stick to display formatting vs debug formatting.

I think we have some options to simplify the crit! macro a bit

For crit level errors we'll want to display the Crit log similarly to how they are currently displayed with slog. This will involve implementing a custom tracing layer that takes in those crit logs and formats them appropriately.

In situations where we use slog:o! I think we can replace those with a tracing span

Theres some instances where we are still importing slog, we should try removing those if possible

beacon_node/beacon_chain/src/builder.rs Outdated Show resolved Hide resolved
beacon_node/beacon_chain/src/light_client_server_cache.rs Outdated Show resolved Hide resolved
beacon_node/beacon_chain/src/migrate.rs Outdated Show resolved Hide resolved
beacon_node/beacon_chain/src/otb_verification_service.rs Outdated Show resolved Hide resolved
beacon_node/beacon_chain/src/state_advance_timer.rs Outdated Show resolved Hide resolved
validator_client/src/sync_committee_service.rs Outdated Show resolved Hide resolved
validator_client/src/sync_committee_service.rs Outdated Show resolved Hide resolved
validator_client/src/validator_store.rs Outdated Show resolved Hide resolved
validator_client/src/validator_store.rs Outdated Show resolved Hide resolved
validator_client/src/validator_store.rs Outdated Show resolved Hide resolved
Copy link

mergify bot commented Feb 5, 2025

This pull request has merge conflicts. Could you please resolve them @ThreeHrSleep? 🙏

@eserilev
Copy link
Collaborator

eserilev commented Feb 5, 2025

It seems that when I had introduced #[instrument] macro usage in a previous commit, I broke some things. Sorry for all the headaches!

  • I've tweaked the #[instrument] attribute macro to skip recording all arguments of a calling function/method via skip_all. We may have been recording complex/large data types would could have led to increased memory usage
  • I've tweaked the #[instrument] macro with parent = None. Previously, active spans with the same name would reference each other and stack in a "hierarchy of events" tree that tracing manages under the hood. With parent = None, spans with the same name wont reference each other anymore, and are instead treated as root spans. I think we may have had some deeply nested spans that may have potentially contributed to some of our OOM issues.

Testing locally things seem okay now! Going to deploy to one of our holesky nodes soon

@eserilev
Copy link
Collaborator

eserilev commented Feb 5, 2025

Some feedback from anton:

Ensure we stick to comma separated kv format

we used to have this format:

Feb 05 13:58:01.139 DEBG Obtained peer's metadata                new_seq_no: 415, peer_id: 16Uiu2HAmEMKzZFRN2u1ByvwCWqU6nn9rutg87asbPyofd7NsM4Kn, service: libp2p

decoded as folllows :

match => { "message" => "%{SYSLOGTIMESTAMP} %{WORD:loglevel} %{DATA:msg}(,| +) error: %{GREEDYDATA:error}\), %{GREEDYDATA:kv_pairs}" }
match => { "message" => "%{SYSLOGTIMESTAMP} %{WORD:loglevel} %{DATA:msg}(,| +) %{GREEDYDATA:kv_pairs}" }
match => { "message" => "%{SYSLOGTIMESTAMP} %{WORD:loglevel} %{DATA:msg}" }

ideally we should keep this log format (comma separated key values)

Fix one of the log formats

if we can mutate this one from:

Feb 05 14:06:06.552 DEBUG 974b737b797058e49fad2090a55aabe34b26148e      service="network"

to:

Feb 05 14:06:06.552 DEBUG new message recorded                                                     message="974b737b797058e49fad2090a55aabe34b26148e", service="network"

to keep consistency

Match current parsing rules

current parsing rules:

bootnode: 
       match => { "message" => "%{SYSLOGTIMESTAMP} %{WORD:loglevel} %{DATA:msg}(,| +|\.) %{GREEDYDATA:kv_pairs}" }
vc: 
       match => { "message" => "%{SYSLOGTIMESTAMP} %{WORD:loglevel} %{DATA:msg}(,| +) %{GREEDYDATA:kv_pairs}" }
bn:
       match => { "message" => "%{SYSLOGTIMESTAMP} %{WORD:loglevel} %{DATA:msg}(,| +) error: %{GREEDYDATA:error}\), %{GREEDYDATA:kv_pairs}" }
       match => { "message" => "%{SYSLOGTIMESTAMP} %{WORD:loglevel} %{DATA:msg}(,| +) %{GREEDYDATA:kv_pairs}" }
       match => { "message" => "%{SYSLOGTIMESTAMP} %{WORD:loglevel} %{DATA:msg}" }

ideally any and all of these should be correctly parsed with just the one

match => { "message" => "%{SYSLOGTIMESTAMP} %{WORD:loglevel} %{DATA:msg}(,| +) %{GREEDYDATA:kv_pairs}" }

@michaelsproul michaelsproul added v7.1.0 Post-Electra release and removed v7.0.0-beta.0 New release c. Q1 2025 labels Feb 6, 2025
@ThreeHrSleep
Copy link
Member Author

ThreeHrSleep commented Feb 6, 2025

Fix one of the log formats

if we can mutate this one from:

Feb 05 14:06:06.552 DEBUG 974b737b797058e49fad2090a55aabe34b26148e      service="network"

to:

Feb 05 14:06:06.552 DEBUG new message recorded                                                     message="974b737b797058e49fad2090a55aabe34b26148e", service="network"

to keep consistency

it looks like this type of logs were appearing because in some places (like this) the word message was being used as field name which interfered with the literal log message

this should (hopefully) fix it

@antondlr
Copy link
Member

antondlr commented Feb 8, 2025

we ran out of disk space on our tracing node because the file-based logs use different naming, and the limits are not respected:
we used to have one beacon.log for appending and beacon.log._n_(.gz) for rotating according to --logfile-max-number and --logfile-compress.
Now, all the logs are stored in one big beacon.$date.log file, rotated out every day but never (? maybe after max-number days?) cleaned, and crucially, never compressed.
Given the amount of logs one (debug) BN generates, this rapidly filled up our disk.

To minimise impact on other people's deployment, we should re-implement the old naming and rotation scheme.

Copy link

mergify bot commented Feb 11, 2025

This pull request has merge conflicts. Could you please resolve them @ThreeHrSleep? 🙏

@antondlr
Copy link
Member

thanks for this ! 🙏
two more nits in this regard:

  • rotation numbering is off, it just keeps going up where in the old naming scheme, there was always only a beacon.log.{1-n}.gz where n == --logfile-max-number
  • the discv5 and libp2p logs are still on the monolithical, non-rotated, one-per-day naming scheme

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backwards-incompat Backwards-incompatible API change ready-for-review The code is ready for review UX-and-logs v7.1.0 Post-Electra release
Projects
None yet
Development

Successfully merging this pull request may close these issues.