Fix off-by-one error in written offset callback #464
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Force-pushed from 61a3216 to dcc34f2.
    changelog.topic()
);
table.init(0);
table = cassandraClient.kvFactory()
@ableegoldman I had to partially revert #383 here. One of the tests, shouldRestoreUnflushedChangelog, used a Cassandra fault injector which did nothing when Mongo was used as the backend.
}

@ParameterizedTest
@EnumSource(KVSchema.class)
The fault injector doesn't actually work with KVSchema.FACT because it uses a different statement type. I think the off-by-one error was allowing the assertions to pass because we always reconsumed one extra record.
I haven't found a good way to target the update from flush when using the FACT table. It uses a DefaultBoundStatement which is used by several other queries prior to flush. Without the targeted fault, the task crashes before it reads any input.
We use two paths to find the committed offset sent through BatchFlusher.flushWriteBatch. One is the offset committed through the consumer, and the other is the producer's send callback. In the first case, the offset we forward is exclusive (one more than the last consumed record). In the second, it is inclusive (it matches the offset of the last written record).

Later, when we are restoring in ResponsiveRestoreConsumer, we seek to the last committed offset in the remote store. This is correct for the exclusive consumer-committed offset. However, for the inclusive produced offset, we would replay the last written record. This is probably not a big deal for most stores, but RS3 has strict offset validation. If we are replaying just a single record, we would send an expected written offset which exactly matches the (inclusive) offset of that record. This leads to an IllegalWalSegment error from RS3. You can see this chain of events in the logs captured from the window soak.
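To make the off-by-one concrete, here is a small illustrative sketch (the variable names and offsets are hypothetical, not taken from the Responsive code) showing how the two conventions diverge by one when the stored offset is used as the seek position on restore:

```java
public class OffsetConventionExample {
  public static void main(String[] args) {
    // Suppose the last record consumed and written had offset 41.
    long lastRecordOffset = 41L;

    // Consumer path: committed offsets are exclusive, i.e. the next offset to read.
    long consumerCommitted = lastRecordOffset + 1;   // 42

    // Producer path: the send callback reports the inclusive offset of the written record.
    long producerReported = lastRecordOffset;        // 41

    // On restore we seek to the stored offset, so reading resumes *at* that offset:
    //   seek(42) -> nothing is replayed           (exclusive, correct)
    //   seek(41) -> record 41 is replayed again   (inclusive, one extra record)
    System.out.println("seek(" + consumerCommitted + ") replays nothing");
    System.out.println("seek(" + producerReported + ") replays record " + lastRecordOffset);
  }
}
```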
The patch fixes the problem by incrementing the offsets written by the producer before passing them through. This ensures that the notifyCommit callback always receives an exclusive offset.
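A minimal sketch of what that looks like, assuming the written offset comes from a standard Kafka producer Callback; notifyCommit here is a hypothetical stand-in for the commit notification wired through BatchFlusher and may not match the real signature:

```java
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.RecordMetadata;

public class ExclusiveWrittenOffsetCallback {

  // Hypothetical stand-in for the commit notification used by the flusher.
  static void notifyCommit(String topic, int partition, long exclusiveOffset) {
    System.out.println(topic + "-" + partition + " committed up to " + exclusiveOffset);
  }

  // Producer send callback: RecordMetadata.offset() is the inclusive offset of the
  // written record, so add one before notifying. That way notifyCommit always
  // receives an exclusive offset, matching the consumer-committed convention.
  static final Callback WRITTEN_OFFSET_CALLBACK = (RecordMetadata metadata, Exception exception) -> {
    if (exception == null) {
      notifyCommit(metadata.topic(), metadata.partition(), metadata.offset() + 1);
    }
  };
}
```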