Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use SafetyData to avoid double-proposing instead of myLastProposedView #6921

Merged

Conversation

jordanschalm
Copy link
Member

@jordanschalm jordanschalm commented Jan 22, 2025

This PR modifies the logic preventing double-proposing to use a sentinel from BlockProducer instead of the EventHandler.myLastProposedView field.

This is in order to address a race condition in the current logic, observed on Benchnet, which arises due to the use of an additional state variable outside SafetyRules to track safety information.

Context

The EventHandler.myLastProposedView field was introduced in this PR: #6392. At a similar time, #6389 was opened. #6389 was closed by #6463.

Race Condition

The race condition observed on Benchet occurs when a node crashes in the middle of proposing a new block. While proposing, the node first builds and signs, then stores the block. While signing, we synchronously persist new SafetyData corresponding to the new proposal. If we crash in between signing and storing the block, then, when starting back up, Persister knows about the latest view, but Forks does not.

The EventHandler uses the EventHandler.myLastProposedView field and the result of querying Forks to avoid double-proposing. However, after startup in this race condition:

  • myLastProposedView is 0 (not initialized at startup)
  • Forks does not contain our proposal for the current view, because we never stored it in the database

The node then proceeds with producing a new proposal for the view. Upon being requested for a signature,SafetyRules is understandably horrified that we are attempting to double-propose, and returns an unexpected error which crashes the node.

Changes

  • Removes the myLastProposedView field
  • Modifies Persister to avoid reading from the database
    • Adds the constraint that safety/liveness data must be initialized prior to construction. (This is already true, and close to the existing constraint that the data must be initialized prior to first use.)
    • Caches all writes of safety/liveness data, so reads never hit the database and cannot return an error.
    • (This is not strictly necessary, but I originally did this to avoid introducing an extra database read on the consensus hot-path. Although this PR no longer adds a read on the hotpath, the change is implemented at this point.)
  • Replaces current logic to avoid double proposals with an additional sentinel in BlockProducer

@codecov-commenter
Copy link

codecov-commenter commented Jan 22, 2025

Codecov Report

Attention: Patch coverage is 22.80702% with 44 lines in your changes missing coverage. Please review.

Project coverage is 41.13%. Comparing base (b87ed60) to head (587bcf9).

Files with missing lines Patch % Lines
cmd/scaffold.go 0.00% 21 Missing ⚠️
cmd/consensus/main.go 0.00% 5 Missing ⚠️
cmd/util/cmd/read-hotstuff/cmd/get_liveness.go 0.00% 4 Missing ⚠️
cmd/util/cmd/read-hotstuff/cmd/get_safety.go 0.00% 4 Missing ⚠️
consensus/hotstuff/blockproducer/block_producer.go 55.55% 4 Missing ⚠️
engine/collection/epochmgr/factories/hotstuff.go 40.00% 2 Missing and 1 partial ⚠️
consensus/hotstuff/persister/persister.go 50.00% 2 Missing ⚠️
consensus/hotstuff/safetyrules/safety_rules.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@                  Coverage Diff                  @@
##           feature/efm-recovery    #6921   +/-   ##
=====================================================
  Coverage                 41.12%   41.13%           
=====================================================
  Files                      2123     2122    -1     
  Lines                    187163   187169    +6     
=====================================================
+ Hits                      76980    76991   +11     
- Misses                   103735   103739    +4     
+ Partials                   6448     6439    -9     
Flag Coverage Δ
unittests 41.13% <22.80%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jordanschalm jordanschalm changed the title Initialize last acknowledged view in EventHandler.Start Use SafetyData to avoid double-proposing instead of myLastProposedView Jan 23, 2025
@jordanschalm jordanschalm marked this pull request as ready for review January 24, 2025 21:27
@jordanschalm jordanschalm requested a review from a team as a code owner January 24, 2025 21:27
Copy link
Member

@AlexHentschel AlexHentschel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. Mostly stylistic and documentation suggestions. Only spotted two algorithmic aspects that I think should be revised:

  1. In a concurrent environment, I think the Persister does not properly guarantee data consistency (see this comment)
  2. Due to its cache, the Persister must be a Singleton in my opinion (see this comment), which I think the scaffold violates 👉 this comment

consensus/hotstuff/persister/persister.go Show resolved Hide resolved
consensus/hotstuff/persister/persister.go Outdated Show resolved Hide resolved
consensus/hotstuff/persister/persister.go Outdated Show resolved Hide resolved
consensus/hotstuff/persister/persister.go Outdated Show resolved Hide resolved
Copy link
Member

@AlexHentschel AlexHentschel Jan 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this implementation, I think the Persister does not properly guarantee data consistency

  • imagine a caller that:
    • instantiates a struct s of type SafetyData and writes something in it
    • then it calls PutSafetyData(s)
    • then it recycles s by writing more values into it
  • Similarly, imagine we have different concurrent calls of:
    • one go routine calls sa := GetLivenessData()
    • another routine calls PutSafetyData(sb)

In these scenarios, we would illegally mutate the safety data and have strange race conditions. While the current usage of the Persister does not do anything like this, I still think it can still a source of subtle bugs (especially critical for the SafetyRules) during future code extensions.

Edit: Yurii pointed out that SafetyRules actually re-uses the object s that it calls PutSafetyData(s) with (e.g. here)❗ Unfortunately, even for the current usage pattern, we must fix this.

Suggestion

  • Persister includes safetyData
  • On PutSafetyData(s), we copy the content of s into safetyData (happens automatically, since Persister.safetyData is a value type
  • GetLivenessData() would be something like:
    // GetSafetyData will retrieve last persisted safety data.
    // During normal operations, no errors are expected.
    func (p *Persister) GetSafetyData() *hotstuff.SafetyData {
    	var s hotstuff.SafetyData
    	p.mu.Lock()
    	s = p.safetyData
    	p.mu.Unlock()
    	return &s
    }
  • analogously for livenessData

And Alex as a perfectionist: defer has a marginal runtime overhead, because the compiler still puts the function call on the stack. So in cases where it is absolutely trivial (i.e. we can just write p.mu.Unlock() instead of defer p.mu.Unlock()), I prefer to skip the defer 👉 methods GetSafetyData and GetLivenessData

Copy link
Member

@durkmurder durkmurder Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very much on board with this suggestion and would like to use this approach since I am a big believer of using value data types for concurrency operations. The only think I would like to mention is that we can't simply copy hotstuff.SafetyData because it has a field LastTimeout *model.TimeoutObject and that field has:

// NewestQC is the newest QC (by view) known to the creator of this TimeoutObject
NewestQC *flow.QuorumCertificate
// LastViewTC is the timeout certificate for previous view if NewestQC.View != View - 1, else nil
LastViewTC *flow.TimeoutCertificate
// SigData is a BLS signature created by staking key signing View + NewestQC.View
// This signature is further aggregated in TimeoutCertificate.
SigData crypto.Signature

So a simple copy won't work, this needs to be a proper clone to ensure we are dealing with a full copy instead of a shallow one.

Same problem with LivenessData. Maybe we could use clone package for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussing, going to revert the Persister caching:

  • as part of the PR review, we removed the one instance where we read from Persister in the hotpath
  • safely updating the cache implementation is a non-trivial amount of work + complexity, which now has little benefit.

consensus/hotstuff/persister.go Outdated Show resolved Hide resolved
consensus/hotstuff/eventhandler/event_handler.go Outdated Show resolved Hide resolved
consensus/hotstuff/persister.go Show resolved Hide resolved
consensus/hotstuff/persister/persister.go Show resolved Hide resolved
cmd/scaffold.go Outdated Show resolved Hide resolved
@jordanschalm jordanschalm mentioned this pull request Jan 27, 2025
2 tasks
}

// PutSafetyData persists the last safety data.
// During normal operations, no errors are expected.
func (p *Persister) PutSafetyData(safetyData *hotstuff.SafetyData) error {
return operation.RetryOnConflict(p.db.Update, operation.UpdateSafetyData(p.chainID, safetyData))
p.mu.Lock()
Copy link
Member

@durkmurder durkmurder Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ I am very worried about this setup.
I see two problems here:
In previous implementation the database was used as synchronization primitive and it was performing it in safe way.

  1. In current implementation there is no need to couple the update of p.safetyData and write to the DB as part of one critical section. Especially this being called on hot path. We could at least make the write to the DB before entering the critical section.
  2. I am very much worried about storing safety data protected by mutex but storing a pointer instead of a value. If we take a look at previous implementation we will see that it actually deals with copies since we serialize data and store it in the database. Taking copies simplifies concurrency by a large margin and makes reasoning a lot simpler. In the current implementation it's less secure since there is no guarantee that after calling PutSafetyData or PutLivenessData caller won't modify the value behind the pointer by accident and leave the persister in very much invalid state with a bug which will be extremely hard to catch.

We are trying to build a reliable component for our hot-path so I would take extra measures to make sure it's as safe as possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the cache per our discussion

Copy link
Member

@durkmurder durkmurder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for resolving long standing tech dept, the implementation looks a lot simpler this way. Well done!

@@ -21,5 +21,9 @@ type HotstuffReader interface {

// NewHotstuffReader returns a new Persister, constrained to read-only operations.
func NewHotstuffReader(db *badger.DB, chainID flow.ChainID) HotstuffReader {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed this interface and realized that it would be cleaner to use the interface in the scaffold for the ping service here:

flow-go/cmd/scaffold.go

Lines 278 to 281 in 2f9cde3

persist, err := persister.New(node.DB, node.RootChainID)
if err != nil {
return nil, err
}

specifically because we don't want the ping service to be able to write the liveness data.

Not sure if that's possably overengineering:

  • move the HotstuffReader into the same package as the persister interface
  • let the constructor NewHotstuffReader error
  • panic in the cmd package in case of an error (instead of the constructor)

P.S. if you wanted to avoid the changes to in the cmd package, the Persister's constructor does not need to error anymore (after removing the cache), so you could revert the error checks alltogether ... in the end, I feel allowing the constructor to error (even if the current implementation never does) is probably a good thing.

cmd/scaffold.go Outdated
var hotstuffViewFunc func() (uint64, error)
// Setup consensus nodes to report their HotStuff view
if fnb.BaseConfig.NodeRole == flow.RoleConsensus.String() {
persist, err := persister.New(node.DB, node.RootChainID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we could then use

// HotstuffReader exposes only the read-only parts of the Hotstuff Persister component.
// This is used to read information about the HotStuff instance's current state from CLI tools.
// CAUTION: the write functions are hidden here, because it is NOT SAFE to use them outside
// the Hotstuff state machine.
type HotstuffReader interface {
// GetLivenessData retrieves the latest persisted liveness data.
GetLivenessData() (*hotstuff.LivenessData, error)
// GetSafetyData retrieves the latest persisted safety data.
GetSafetyData() (*hotstuff.SafetyData, error)
}

@@ -470,6 +417,11 @@ func (e *EventHandler) proposeForNewViewIfPrimary() error {
// EventHandler and instead need to put it into the EventLoop's inbound queue to support consensus committees of size 1.
flowProposal, err := e.blockProducer.MakeBlockProposal(curView, newestQC, lastViewTC)
if err != nil {
if model.IsNoVoteError(err) {
log.Info().Err(err).Msg("aborting block proposal to prevent equivocation (likely re-entered proposal logic due to crash)")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

@@ -47,11 +53,19 @@ func (p *Persister) GetLivenessData() (*hotstuff.LivenessData, error) {
// PutSafetyData persists the last safety data.
// During normal operations, no errors are expected.
func (p *Persister) PutSafetyData(safetyData *hotstuff.SafetyData) error {
return operation.RetryOnConflict(p.db.Update, operation.UpdateSafetyData(p.chainID, safetyData))
err := operation.RetryOnConflict(p.db.Update, operation.UpdateSafetyData(p.chainID, safetyData))
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove this and just do below

return err

}

// PutLivenessData persists the last liveness data.
// During normal operations, no errors are expected.
func (p *Persister) PutLivenessData(livenessData *hotstuff.LivenessData) error {
return operation.RetryOnConflict(p.db.Update, operation.UpdateLivenessData(p.chainID, livenessData))
err := operation.RetryOnConflict(p.db.Update, operation.UpdateLivenessData(p.chainID, livenessData))
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also here the error check is not necessary, just return err below

- replace HotStuffReader with hotstuff.PersisterReader
- use PersisterReader in read-only contexts
@jordanschalm jordanschalm merged commit 0ac15d0 into feature/efm-recovery Jan 28, 2025
56 checks passed
@jordanschalm jordanschalm deleted the jord/hotstuff-safety-data-persist-bug branch January 28, 2025 23:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants