Skip to content

backfill: handle transaction retry for vector index backfill #144328

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 15, 2025

Conversation

mw5h
Copy link
Contributor

@mw5h mw5h commented Apr 11, 2025

Previously, when backfilling a vector index, a transaction retry would cause backfill failure because the vector index backfill writer would overwrite the values read by the backfill reader with what it wanted to push down to KV so as to avoid new allocations. This worked well so long as there were no transaction retries but would fail to re-encode the index entry on a retry because the writer lost access to the unquantized vector. The quantized vector cannot be reused on a transaction retry because fixups may cause the target partition to change.

This patch creates a scratch rowenc.IndexEntry in the backfill helper to store the input vector entry. Before attempting to write the entry, we copy the input IndexEntry to the scratch entry and use that to re-encode the vector, which is still modified in place to limit new allocations.

Additionally, this patch switches the writer from using CPut() to CPutAllowingIfNotExists() so that if the backfill job restarts, we don't see the partially written index and fail due to duplicate keys.

Informs: #143107
Release note: None

@mw5h mw5h requested review from andy-kimball, DrewKimball and a team April 11, 2025 22:37
@mw5h mw5h requested a review from a team as a code owner April 11, 2025 22:37
Copy link

blathers-crl bot commented Apr 11, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@DrewKimball DrewKimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andy-kimball)


pkg/sql/rowexec/indexbackfiller.go line 197 at r1 (raw file):

	}

	// Initialize the tmpEntry. This will store the input entry that we are encoding

nit: mention here that this allows us to preserve the initial "template" indexEntry across txn retries


pkg/sql/rowexec/indexbackfiller.go line 207 at r1 (raw file):

	tmpEntry.Value.RawBytes = tmpEntry.Value.RawBytes[:0]
	tmpEntry.Key = append(tmpEntry.Key, indexEntry.Key...)
	tmpEntry.Value.RawBytes = append(tmpEntry.Value.RawBytes, indexEntry.Value.RawBytes...)

super nit:

	tmpEntry.Key = append(tmpEntry.Key[:0], indexEntry.Key...)
	tmpEntry.Value.RawBytes = append(tmpEntry.Value.RawBytes[0:], indexEntry.Value.RawBytes...)

pkg/sql/backfill/backfill.go line 523 at r1 (raw file):

	outputEntry.Key = outputEntry.Key[:0]
	outputEntry.Key = append(outputEntry.Key, vih.indexPrefix...)

super nit: similar to below:

outputEntry.Key = append(outputEntry.Key[:0], vih.indexPrefix...)

@andy-kimball
Copy link
Contributor

Is it possible to add tests that detect and regress this issue?

@mw5h mw5h force-pushed the vec-backfill-fix branch 2 times, most recently from 80bfbcb to 4fc6cce Compare April 14, 2025 20:56
Copy link
Contributor

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball and @mw5h)

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly looks good, i just had some minor nits

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @DrewKimball and @mw5h)


pkg/sql/rowexec/indexbackfiller.go line 215 at r3 (raw file):

		}

		if ib.flowCtx.Cfg.TestingKnobs.VectorIndexBackfillTxnError != nil {

super nit: this testing hook could have a more general name, like RunDuringReencodeVectorIndexEntry, but i have no strong opinion on changing the name if you think we're unlikely to use this for any other purpose.


pkg/sql/backfill/backfill_test.go line 27 at r3 (raw file):

)

func TestMain(m *testing.M) {

nit: let's stick with the convention of keeping TestMain inside of a file called main_test.go

we also usually add a line like this to main_test.go, which updates a testing linter

//go:generate ../../util/leaktest/add-leaktest.sh *_test.go

pkg/sql/backfill/backfill_test.go line 64 at r3 (raw file):

					if !errorState.hasErrored {
						errorState.hasErrored = true
						return &testRetryError{}

in other tests that simulate retry errors, we use txn.GenerateForcedRetryableErr

return txn.GenerateForcedRetryableErr(ctx, "forcing a retry error")

could we use that here to avoid needing to create a mocked retry error? it would require passing in the kv.Txn object into the testing hook, which seems doable.

@mw5h mw5h force-pushed the vec-backfill-fix branch from 87492b9 to 8e36aab Compare April 15, 2025 17:58
@mw5h
Copy link
Contributor Author

mw5h commented Apr 15, 2025

RFAL

Previously, when backfilling a vector index, a transaction retry would
cause backfill failure because the vector index backfill writer would
overwrite the values read by the backfill reader with what it wanted to
push down to KV so as to avoid new allocations. This worked well so long
as there were no transaction retries but would fail to re-encode the
index entry on a retry because the writer lost access to the unquantized
vector. The quantized vector cannot be reused on a transaction retry
because fixups may cause the target partition to change.

This patch creates a scratch rowenc.IndexEntry in the backfill helper to
store the input vector entry. Before attempting to write the entry, we
copy the input IndexEntry to the scratch entry and use that to re-encode
the vector, which is still modified in place to limit new allocations.

Additionally, this patch switches the writer from using CPut() to
CPutAllowingIfNotExists() so that if the backfill job restarts, we don't
see the partially written index and fail due to duplicate keys.

Informs: cockroachdb#143107
Release note: None
@mw5h mw5h force-pushed the vec-backfill-fix branch from 8e36aab to 02738e6 Compare April 15, 2025 18:44
Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @DrewKimball and @mw5h)

@mw5h
Copy link
Contributor Author

mw5h commented Apr 15, 2025

bors r+

@craig
Copy link
Contributor

craig bot commented Apr 15, 2025

@craig craig bot merged commit 844a17b into cockroachdb:master Apr 15, 2025
23 checks passed
@mw5h mw5h added the backport-25.2.x Flags PRs that need to be backported to 25.2 label Apr 15, 2025
@mw5h
Copy link
Contributor Author

mw5h commented Apr 15, 2025

blathers backport 25.2

Copy link

blathers-crl bot commented Apr 15, 2025

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 02738e6 to blathers/backport-release-25.2-144328: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 25.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@mw5h
Copy link
Contributor Author

mw5h commented Apr 16, 2025

blathers backport 25.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants