@sjha4
Member

@sjha4 sjha4 commented Aug 21, 2025

What are the changes introduced in this pull request?

This pull request introduces a rake task to clean up duplicate Katello::ErratumPackage records. This cleanup is a crucial preparatory step before the upcoming bigint migration to ensure data integrity and prevent potential issues with unique constraints.

Considerations taken when implementing this change?

  1. Added this as a task so it can be used in the upgrade task.
  2. Considered running this as part of a migration, but decided against it because mixing data migrations into DB schema migrations is not ideal.
  3. Tested this with a large amount of data, with the following row counts and timings:
katello=# select count(1) from katello_erratum_packages;
 count
--------
 449694

katello=# select count(1) from katello_module_stream_erratum_packages;
 count
--------
 450307

real	20m35.800s
user	5m25.025s
sys	0m23.097s

What are the testing steps for this pull request?

To test this change, you can create dummy duplicate data and then run the cleanup task.

First, drop the unique index on katello_erratum_packages so that duplicate data can be created.
psql -d katello -U katello
DROP INDEX katello_erratum_packages_eid_nvrea_n_f;

Next, create dummy duplicate Katello::ErratumPackage records from existing data in the Rails console.

# .all.each loads the existing records once, so the copies saved below are not revisited.
Katello::ErratumPackage.all.each do |ep|
  # Persist six unsaved copies of each record as duplicates.
  6.times { ep.dup.save! }
end

Gather existing IDs for creating associations.

ep_ids = Katello::ModuleStreamErratumPackage.pluck(:erratum_package_id)
module_stream_ids = Katello::ModuleStream.pluck(:id)

Create duplicate references to the newly created packages.

Katello::ErratumPackage.where.not(id: ep_ids).each do |ep|
  # Pick a random existing module stream; sample avoids hardcoding the number of module streams.
  mep = Katello::ModuleStreamErratumPackage.new(module_stream_id: module_stream_ids.sample, erratum_package_id: ep.id)
  mep.save
end

Finally, run the cleanup task and verify that the duplicates are removed and the associations are correctly maintained.

bundle exec rails katello:cleanup_duplicate_erratum_packages

To verify that the duplicates were cleaned up, add back the unique index. It can only be created if no duplicates remain:

CREATE UNIQUE INDEX katello_erratum_packages_eid_nvrea_n_f ON katello_erratum_packages USING btree (erratum_id, nvrea, name, filename);

Summary by Sourcery

Add a rake task to clean up duplicate Katello::ErratumPackage records by consolidating duplicates and updating associated module stream references in preparation for the bigint migration.

New Features:

  • Introduce katello:cleanup_duplicate_erratum_packages rake task to remove duplicate erratum_packages

Enhancements:

  • Group erratum_packages by NVREA, erratum_id, name, and filename and keep only the earliest record
  • Repoint ModuleStreamErratumPackage associations to the kept record and delete conflicting and obsolete references
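
The grouping step described above can be sketched in plain Ruby, without a database. This is only an illustration, not the code from the PR: the ErratumPackageRow struct and the duplicate_id_mappings helper are hypothetical names, and "earliest record" is assumed to mean the lowest id in each group.

```ruby
# Hypothetical in-memory stand-in for a katello_erratum_packages row.
ErratumPackageRow = Struct.new(:id, :erratum_id, :nvrea, :name, :filename)

# Returns { duplicate_id => kept_id } for every row that shares its
# (erratum_id, nvrea, name, filename) key with another row.
# Assumption: the row with the lowest id in each group is the one to keep.
def duplicate_id_mappings(rows)
  rows
    .group_by { |r| [r.erratum_id, r.nvrea, r.name, r.filename] }
    .each_value.with_object({}) do |group, mappings|
      next if group.size < 2 # not a duplicate group

      kept_id = group.map(&:id).min
      group.each { |r| mappings[r.id] = kept_id unless r.id == kept_id }
    end
end

rows = [
  ErratumPackageRow.new(1, 10, "foo-1.0-1.el9.x86_64", "foo", "foo-1.0-1.el9.x86_64.rpm"),
  ErratumPackageRow.new(2, 10, "foo-1.0-1.el9.x86_64", "foo", "foo-1.0-1.el9.x86_64.rpm"),
  ErratumPackageRow.new(3, 11, "bar-2.0-1.el9.x86_64", "bar", "bar-2.0-1.el9.x86_64.rpm"),
  ErratumPackageRow.new(4, 10, "foo-1.0-1.el9.x86_64", "foo", "foo-1.0-1.el9.x86_64.rpm")
]

mappings = duplicate_id_mappings(rows)
```

With this sample data, rows 2 and 4 map to the kept row 1, and row 3 is left alone because it has no duplicates. A mapping of this shape is what the later steps would consume: the keys are the IDs to delete, and the values are the IDs that associations should be repointed to.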

@sourcery-ai
Contributor

sourcery-ai bot commented Aug 21, 2025

Reviewer's Guide

Introduces a new rake task to detect and remove duplicate Katello::ErratumPackage records, reassigning associated module stream links before deleting duplicates to ensure data integrity ahead of the bigint migration.

Sequence diagram for duplicate erratum package cleanup task

sequenceDiagram
  participant RakeTask as Rake Task
  participant User as User (anonymous_api_admin)
  participant EP as Katello::ErratumPackage
  participant MSEP as Katello::ModuleStreamErratumPackage

  RakeTask->>User: Set current user to anonymous_api_admin
  RakeTask->>EP: Find duplicate groups by nvrea, erratum_id, name, filename
  loop For each duplicate group
    RakeTask->>EP: Get all duplicate IDs
    RakeTask->>MSEP: For each duplicate, delete conflicting associations
    RakeTask->>MSEP: Update remaining associations to point to kept ID
  end
  RakeTask->>EP: Delete duplicate records

File-Level Changes

Change: Add cleanup_duplicate_erratum_packages rake task
Details:
  • Defined new task with description and dependencies
  • Set User.current to anonymous_api_admin
  • Invoked handle_duplicate_erratum_packages from the task
Files: lib/katello/tasks/repository.rake

Change: Implement duplicate cleanup logic in handle_duplicate_erratum_packages
Details:
  • Grouped ErratumPackage records by fields to identify duplicates
  • Built mappings of IDs to keep versus remove
  • Batched deletion of conflicting ModuleStreamErratumPackage records
  • Batched update of remaining module stream associations
  • Deleted duplicate ErratumPackage records
Files: lib/katello/tasks/repository.rake

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `lib/katello/tasks/repository.rake:123` </location>
<code_context>
+    update_mappings.each_slice(1000) do |batch|
+      batch.each do |old_id, new_id|
+        # Delete records where module_stream already has the target erratum_package
+        Katello::ModuleStreamErratumPackage
+          .where(erratum_package_id: old_id)
+          .where(
+            module_stream_id: Katello::ModuleStreamErratumPackage
+              .where(erratum_package_id: new_id)
+              .select(:module_stream_id)
+          )
+          .delete_all
+
+        # Update remaining records
</code_context>

<issue_to_address>
Deleting records before updating may cause data loss if not all relationships are covered.

Please ensure this deletion logic accounts for all relevant relationships and maintains referential integrity.
</issue_to_address>
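
To help weigh this comment, the delete-then-update order in the snippet can be modelled in plain Ruby. This is a rough sketch under stated assumptions, not the task's implementation: Link and repoint_links are hypothetical names, and each (module_stream_id, erratum_package_id) pair is assumed to be unique in the join table.

```ruby
require "set"

# Hypothetical in-memory stand-in for a katello_module_stream_erratum_packages row.
Link = Struct.new(:module_stream_id, :erratum_package_id)

# Repoints links from old_id to new_id, first dropping links whose module
# stream is already linked to new_id (they would become duplicate pairs).
def repoint_links(links, old_id, new_id)
  streams_with_target = links
    .select { |l| l.erratum_package_id == new_id }
    .map(&:module_stream_id)
    .to_set

  # Step 1: delete conflicting links (the module stream already has new_id).
  remaining = links.reject do |l|
    l.erratum_package_id == old_id && streams_with_target.include?(l.module_stream_id)
  end

  # Step 2: update the remaining links that still point at old_id.
  remaining.map do |l|
    l.erratum_package_id == old_id ? Link.new(l.module_stream_id, new_id) : l
  end
end

links = [Link.new(100, 1), Link.new(100, 2), Link.new(200, 2)]
result = repoint_links(links, 2, 1)
```

In this model, a link is deleted only when the same module stream is already linked to the kept package. Every module stream that referenced the duplicate therefore still references the kept record afterwards, and no pair appears twice. Whether the same holds for every relationship in the real schema is the question the comment raises.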


@sjha4 sjha4 force-pushed the dup_erratum_package_cleanup branch from 2e31672 to 570af3e Compare August 21, 2025 16:01
@ianballou
Member

ianballou commented Aug 25, 2025

We might want to alert the user that they need to reindex their DBs. Otherwise, if they clean up these dupes and proceed with the upgrade, the duplicates can get populated all over again.

Somewhat related: theforeman/foreman_maintain#1032

If we'll trigger this rake task from the pre-hooks in the installer or foreman_maintain, perhaps the warning could be reported then instead.

Essentially, they need to run runuser -u postgres -- reindexdb -a.

@sjha4
Member Author

sjha4 commented Aug 26, 2025

There is value in alerting users here in case they ever run this outside of an upgrade. The same warning will also appear during the upgrade when this task runs. I'll add a message here.

@sjha4 sjha4 force-pushed the dup_erratum_package_cleanup branch 2 times, most recently from 4450067 to 1ce8ba1 Compare August 26, 2025 14:37
@sjha4
Member Author

sjha4 commented Aug 26, 2025

@ianballou @pavanshekar This is ready for review.

def handle_duplicate_erratum_packages
# Alert users that they need to reindex their database to ensure the indexes are re-run and active.
puts "Please reindex your database to ensure indexes are rebuilt and active."
puts "This can be acheived by running `runuser -u postgres -- reindexdb -a`"
Contributor

Looks good! Just one small typo:

Suggested change
- puts "This can be acheived by running `runuser -u postgres -- reindexdb -a`"
+ puts "This can be achieved by running `runuser -u postgres -- reindexdb -a`"

Contributor

@pavanshekar pavanshekar left a comment

Acking... - works well! 👍

@sjha4 sjha4 force-pushed the dup_erratum_package_cleanup branch from 1ce8ba1 to d6e7dbf Compare August 27, 2025 19:19
@sjha4 sjha4 force-pushed the dup_erratum_package_cleanup branch from d6e7dbf to 60119a2 Compare August 27, 2025 19:45
@sjha4 sjha4 requested a review from ianballou August 27, 2025 20:23
Member

@ianballou ianballou left a comment

Looks good!

@sjha4 sjha4 merged commit dac6836 into Katello:master Aug 28, 2025
42 of 44 checks passed
pavanshekar pushed a commit to pavanshekar/katello that referenced this pull request Sep 9, 2025
pavanshekar pushed a commit that referenced this pull request Sep 16, 2025
3 participants