release-25.2: backup: fix race condition in starting compaction job #150505

Open. Wants to merge 1 commit into base branch release-25.2.


@blathers-crl blathers-crl bot commented Jul 18, 2025

Backport 1/1 commits from #150169 on behalf of @kev-cao.


In #145930, scheduled compactions were blocked from running while another compaction job is running for the same schedule. However, a race condition can still leave a compaction job unable to find an incremental backup. Consider the following sequence:

  1. Compaction job A starts.
  2. A scheduled backup B completes and begins considering whether a compaction job should run. It fetches the current chain to its end time and finds that it should run a compaction.
  3. Compaction job A completes.
  4. B starts a transaction to create the compaction job. Because A has already completed, nothing blocks the job from being created.
  5. B creates a compaction job C whose start time is now skipped in the chain as a result of A's completion.
  6. When C is picked up by the job system, it resolves the backup chain again. The chain no longer contains C's start time, so C fails.
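
The interleaving above can be reproduced with a small deterministic model. This is only a sketch and not CockroachDB code: the `chain` type, the `compact` helper, the integer timestamps, and the three-boundary compaction window are all hypothetical and chosen purely for illustration.

```go
package main

import "fmt"

// chain is a hypothetical model of a backup chain: the ordered end
// times of the full backup and its incrementals.
type chain []int

// has reports whether the chain contains a backup boundary at time t.
func (c chain) has(t int) bool {
	for _, x := range c {
		if x == t {
			return true
		}
	}
	return false
}

// compact models a compaction over [start, end]: both boundaries must
// exist in the chain, and every boundary strictly between them is
// skipped afterwards.
func compact(c chain, start, end int) (chain, error) {
	if !c.has(start) || !c.has(end) {
		return nil, fmt.Errorf("no backup boundary at start %d or end %d", start, end)
	}
	out := chain{}
	for _, t := range c {
		if t <= start || t >= end {
			out = append(out, t)
		}
	}
	return out, nil
}

// raceScenario replays steps 1 to 6 in order and returns the error
// that compaction job C hits when it resolves the chain.
func raceScenario() error {
	stored := chain{0, 1, 2, 3, 4}

	// 1. Compaction job A starts; it will compact the range 1..4.
	aStart, aEnd := 1, 4

	// 2. Scheduled backup B completes at time 5, fetches the chain,
	//    and plans a compaction over its last three boundaries.
	stored = append(stored, 5)
	seen := append(chain{}, stored...)
	cStart, cEnd := seen[len(seen)-3], seen[len(seen)-1]

	// 3. A completes; boundaries 2 and 3 are now skipped.
	stored, _ = compact(stored, aStart, aEnd)

	// 4-5. B creates job C from its stale view; A is no longer
	//      running, so nothing blocks the creation.
	// 6. C resolves the chain again and cannot find its start time.
	_, err := compact(stored, cStart, cEnd)
	return err
}

func main() {
	fmt.Println("compaction C:", raceScenario())
}
```

The failure comes entirely from planning C against a chain snapshot taken before A finished, which is why the order of the check and the chain fetch matters.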

This is resolved by opening a transaction before fetching the backup chain and using that transaction to check for an already running compaction job ID.
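
A minimal sketch of that ordering follows, under stated assumptions: the `schedule` type, its `txn` helper (a mutex standing in for a database transaction), and `maybeStartCompaction` are hypothetical names, not the actual CockroachDB implementation. The sketch also assumes the chain read and the job creation happen inside the same transaction as the check, which the description above does not spell out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	errCompactionRunning = errors.New("a compaction job is already running for this schedule")
	errChainTooShort     = errors.New("backup chain is shorter than the compaction window")
)

// schedule is a hypothetical stand-in for the persisted schedule state.
type schedule struct {
	mu           sync.Mutex
	runningJobID int   // 0 means no compaction job is running
	nextJobID    int   // last job ID handed out
	chain        []int // ordered backup end times
}

// txn stands in for a database transaction: fn runs while the lock is
// held, so the check, the chain read, and the job creation cannot
// interleave with another writer.
func (s *schedule) txn(fn func() error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	return fn()
}

// maybeStartCompaction checks for a running compaction job first and
// only then reads the chain to pick the job's start and end times.
func (s *schedule) maybeStartCompaction(window int) (jobID, start, end int, err error) {
	err = s.txn(func() error {
		if s.runningJobID != 0 {
			return errCompactionRunning
		}
		if len(s.chain) < window {
			return errChainTooShort
		}
		start = s.chain[len(s.chain)-window]
		end = s.chain[len(s.chain)-1]
		s.nextJobID++
		s.runningJobID = s.nextJobID
		jobID = s.runningJobID
		return nil
	})
	return jobID, start, end, err
}

func main() {
	s := &schedule{chain: []int{0, 1, 2, 3, 4, 5}, runningJobID: 7, nextJobID: 7}

	// Job 7 is still running, so no new job is planned from the chain.
	_, _, _, err := s.maybeStartCompaction(3)
	fmt.Println("while job 7 runs:", err)

	// Once job 7 finishes, the next attempt reads the current chain.
	s.runningJobID = 0
	id, start, end, err := s.maybeStartCompaction(3)
	fmt.Println(id, start, end, err)
}
```

With the check placed ahead of the chain read, a scheduled backup that observes a running compaction never plans a job from a chain that is about to change.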

Fixes: #149867, #147264

Release note: None


Release justification:

@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-25.2-150169 branch from d3e38d9 to 00c750e July 18, 2025 17:15
@blathers-crl blathers-crl bot requested a review from a team as a code owner July 18, 2025 17:15
@blathers-crl blathers-crl bot added the blathers-backport and O-robot labels Jul 18, 2025
@blathers-crl blathers-crl bot requested review from kev-cao and removed request for a team July 18, 2025 17:15

blathers-crl bot commented Jul 18, 2025

Thanks for opening a backport.

Before merging, please confirm that the change does not break backwards compatibility and otherwise complies with the backport policy. Include a brief release justification in the PR description explaining why the backport is appropriate. All backports must be reviewed by the TL for the owning area. While the stricter LTS policy does not yet apply, please exercise judgment and consider gating non-critical changes behind a disabled-by-default feature flag when appropriate.

@blathers-crl blathers-crl bot requested review from jeffswenson and msbutler July 18, 2025 17:15
@blathers-crl blathers-crl bot added the backport label Jul 18, 2025

blathers-crl bot commented Jul 18, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity

This change is Reviewable

Labels
backport: Label PR's that are backports to older release branches
blathers-backport: This is a backport that Blathers created automatically.
O-robot: Originated from a bot.