-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use SafetyData
to avoid double-proposing instead of myLastProposedView
#6921
Use SafetyData
to avoid double-proposing instead of myLastProposedView
#6921
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## feature/efm-recovery #6921 +/- ##
=====================================================
Coverage 41.12% 41.13%
=====================================================
Files 2123 2122 -1
Lines 187163 187169 +6
=====================================================
+ Hits 76980 76991 +11
- Misses 103735 103739 +4
+ Partials 6448 6439 -9
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
EventHandler.Start
SafetyData
to avoid double-proposing instead of myLastProposedView
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work. Mostly stylistic and documentation suggestions. Only spotted two algorithmic aspects that I think should be revised:
- In a concurrent environment, I think the
Persister
does not properly guarantee data consistency (see this comment) - Due to its cache, the
Persister
must be a Singleton in my opinion (see this comment), which I think thescaffold
violates 👉 this comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❗
In this implementation, I think the Persister
does not properly guarantee data consistency
- imagine a caller that:
- instantiates a struct
s
of typeSafetyData
and writes something in it - then it calls
PutSafetyData(s)
- then it recycles
s
by writing more values into it
- instantiates a struct
- Similarly, imagine we have different concurrent calls of:
- one go routine calls
sa := GetLivenessData()
- another routine calls
PutSafetyData(sb)
- one go routine calls
In these scenarios, we would illegally mutate the safety data and have strange race conditions. While the current usage of the Persister
does not do anything like this, I still think it can still a source of subtle bugs (especially critical for the SafetyRules
) during future code extensions.
Edit: Yurii pointed out that SafetyRules
actually re-uses the object s
that it calls PutSafetyData(s)
with (e.g. here)❗ Unfortunately, even for the current usage pattern, we must fix this.
Suggestion
Persister
includessafetyData
- On
PutSafetyData(s)
, we copy the content ofs
intosafetyData
(happens automatically, sincePersister.safetyData
is a value type GetLivenessData()
would be something like:// GetSafetyData will retrieve last persisted safety data. // During normal operations, no errors are expected. func (p *Persister) GetSafetyData() *hotstuff.SafetyData { var s hotstuff.SafetyData p.mu.Lock() s = p.safetyData p.mu.Unlock() return &s }
- analogously for
livenessData
And Alex as a perfectionist: defer
has a marginal runtime overhead, because the compiler still puts the function call on the stack. So in cases where it is absolutely trivial (i.e. we can just write p.mu.Unlock()
instead of defer p.mu.Unlock()
), I prefer to skip the defer
👉 methods GetSafetyData
and GetLivenessData
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am very much on board with this suggestion and would like to use this approach since I am a big believer of using value data types for concurrency operations. The only think I would like to mention is that we can't simply copy hotstuff.SafetyData
because it has a field LastTimeout *model.TimeoutObject
and that field has:
// NewestQC is the newest QC (by view) known to the creator of this TimeoutObject
NewestQC *flow.QuorumCertificate
// LastViewTC is the timeout certificate for previous view if NewestQC.View != View - 1, else nil
LastViewTC *flow.TimeoutCertificate
// SigData is a BLS signature created by staking key signing View + NewestQC.View
// This signature is further aggregated in TimeoutCertificate.
SigData crypto.Signature
So a simple copy won't work, this needs to be a proper clone to ensure we are dealing with a full copy instead of a shallow one.
Same problem with LivenessData
. Maybe we could use clone
package for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussing, going to revert the Persister caching:
- as part of the PR review, we removed the one instance where we read from
Persister
in the hotpath - safely updating the cache implementation is a non-trivial amount of work + complexity, which now has little benefit.
} | ||
|
||
// PutSafetyData persists the last safety data. | ||
// During normal operations, no errors are expected. | ||
func (p *Persister) PutSafetyData(safetyData *hotstuff.SafetyData) error { | ||
return operation.RetryOnConflict(p.db.Update, operation.UpdateSafetyData(p.chainID, safetyData)) | ||
p.mu.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see two problems here:
In previous implementation the database was used as synchronization primitive and it was performing it in safe way.
- In current implementation there is no need to couple the update of
p.safetyData
and write to the DB as part of one critical section. Especially this being called on hot path. We could at least make the write to the DB before entering the critical section. - I am very much worried about storing safety data protected by mutex but storing a pointer instead of a value. If we take a look at previous implementation we will see that it actually deals with copies since we serialize data and store it in the database. Taking copies simplifies concurrency by a large margin and makes reasoning a lot simpler. In the current implementation it's less secure since there is no guarantee that after calling
PutSafetyData
orPutLivenessData
caller won't modify the value behind the pointer by accident and leave the persister in very much invalid state with a bug which will be extremely hard to catch.
We are trying to build a reliable component for our hot-path so I would take extra measures to make sure it's as safe as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the cache per our discussion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for resolving long standing tech dept, the implementation looks a lot simpler this way. Well done!
Co-authored-by: Alexander Hentschel <[email protected]>
…flow/flow-go into jord/hotstuff-safety-data-persist-bug
@@ -21,5 +21,9 @@ type HotstuffReader interface { | |||
|
|||
// NewHotstuffReader returns a new Persister, constrained to read-only operations. | |||
func NewHotstuffReader(db *badger.DB, chainID flow.ChainID) HotstuffReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed this interface and realized that it would be cleaner to use the interface in the scaffold for the ping service here:
Lines 278 to 281 in 2f9cde3
persist, err := persister.New(node.DB, node.RootChainID) | |
if err != nil { | |
return nil, err | |
} |
specifically because we don't want the ping service to be able to write the liveness data.
Not sure if that's possably overengineering:
- move the HotstuffReader into the same package as the persister interface
- let the constructor
NewHotstuffReader
error - panic in the
cmd
package in case of an error (instead of the constructor)
P.S. if you wanted to avoid the changes to in the cmd
package, the Persister's constructor does not need to error anymore (after removing the cache), so you could revert the error checks alltogether ... in the end, I feel allowing the constructor to error (even if the current implementation never does) is probably a good thing.
cmd/scaffold.go
Outdated
var hotstuffViewFunc func() (uint64, error) | ||
// Setup consensus nodes to report their HotStuff view | ||
if fnb.BaseConfig.NodeRole == flow.RoleConsensus.String() { | ||
persist, err := persister.New(node.DB, node.RootChainID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we could then use
flow-go/cmd/util/cmd/read-hotstuff/cmd/reader.go
Lines 11 to 20 in 78e76d8
// HotstuffReader exposes only the read-only parts of the Hotstuff Persister component. | |
// This is used to read information about the HotStuff instance's current state from CLI tools. | |
// CAUTION: the write functions are hidden here, because it is NOT SAFE to use them outside | |
// the Hotstuff state machine. | |
type HotstuffReader interface { | |
// GetLivenessData retrieves the latest persisted liveness data. | |
GetLivenessData() (*hotstuff.LivenessData, error) | |
// GetSafetyData retrieves the latest persisted safety data. | |
GetSafetyData() (*hotstuff.SafetyData, error) | |
} |
@@ -470,6 +417,11 @@ func (e *EventHandler) proposeForNewViewIfPrimary() error { | |||
// EventHandler and instead need to put it into the EventLoop's inbound queue to support consensus committees of size 1. | |||
flowProposal, err := e.blockProducer.MakeBlockProposal(curView, newestQC, lastViewTC) | |||
if err != nil { | |||
if model.IsNoVoteError(err) { | |||
log.Info().Err(err).Msg("aborting block proposal to prevent equivocation (likely re-entered proposal logic due to crash)") | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -47,11 +53,19 @@ func (p *Persister) GetLivenessData() (*hotstuff.LivenessData, error) { | |||
// PutSafetyData persists the last safety data. | |||
// During normal operations, no errors are expected. | |||
func (p *Persister) PutSafetyData(safetyData *hotstuff.SafetyData) error { | |||
return operation.RetryOnConflict(p.db.Update, operation.UpdateSafetyData(p.chainID, safetyData)) | |||
err := operation.RetryOnConflict(p.db.Update, operation.UpdateSafetyData(p.chainID, safetyData)) | |||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can remove this and just do below
return err
} | ||
|
||
// PutLivenessData persists the last liveness data. | ||
// During normal operations, no errors are expected. | ||
func (p *Persister) PutLivenessData(livenessData *hotstuff.LivenessData) error { | ||
return operation.RetryOnConflict(p.db.Update, operation.UpdateLivenessData(p.chainID, livenessData)) | ||
err := operation.RetryOnConflict(p.db.Update, operation.UpdateLivenessData(p.chainID, livenessData)) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also here the error check is not necessary, just return err below
- replace HotStuffReader with hotstuff.PersisterReader - use PersisterReader in read-only contexts
This PR modifies the logic preventing double-proposing to use a sentinel from
BlockProducer
instead of theEventHandler.myLastProposedView
field.This is in order to address a race condition in the current logic, observed on Benchnet, which arises due to the use of an additional state variable outside
SafetyRules
to track safety information.Context
The
EventHandler.myLastProposedView
field was introduced in this PR: #6392. At a similar time, #6389 was opened. #6389 was closed by #6463.Race Condition
The race condition observed on Benchet occurs when a node crashes in the middle of proposing a new block. While proposing, the node first builds and signs, then stores the block. While signing, we synchronously persist new SafetyData corresponding to the new proposal. If we crash in between signing and storing the block, then, when starting back up,
Persister
knows about the latest view, butForks
does not.The
EventHandler
uses theEventHandler.myLastProposedView
field and the result of queryingForks
to avoid double-proposing. However, after startup in this race condition:myLastProposedView
is 0 (not initialized at startup)Forks
does not contain our proposal for the current view, because we never stored it in the databaseThe node then proceeds with producing a new proposal for the view. Upon being requested for a signature,
SafetyRules
is understandably horrified that we are attempting to double-propose, and returns an unexpected error which crashes the node.Changes
myLastProposedView
fieldPersister
to avoid reading from the databaseBlockProducer