
Add searchable field column to handle full text search #8544

Open
wants to merge 9 commits into
base: master

Contributor

@Netacci Netacci commented Mar 4, 2025

This PR is an alternative approach for handling full-text search using SearchVector.
Changes:
Added a search_vector field to the Commit model:
search_vector = SearchVectorField(null=True, blank=True)

Queries now run against the already populated search_vector field instead of building the SearchVector dynamically at query time:

filtered_commits = (
    Commit.objects.filter(
        search_vector=SearchQuery(search_param, config="english"),
        push__repository=repository,
    )
    .values_list("push_id", flat=True)
    .order_by("-push__time")
    .distinct()[:200]
)

The search_vector is updated when a new commit is added

def save(self, *args, **kwargs):
    self.search_vector = SearchVector(
        self.revision, self.author, Substr(self.comments, 1, 100000), config="english"
    )
    super().save(*args, **kwargs)

This could also be done with Postgres triggers, but I didn't want to tamper with the database, and it's easier to debug if Django handles it.
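As a rough plain-Python analogy of the idea above (compute the searchable value once at save time, then match queries against the stored value), here is a minimal sketch. It is not Django or Postgres code: `CommitRow`, `tokenize`, and `search` are hypothetical names, and the tokenizer is a crude stand-in with no stemming or language config.

```python
import re


def tokenize(*fields: str) -> frozenset[str]:
    """Crude stand-in for a search vector: the set of lowercase word tokens."""
    return frozenset(re.findall(r"\w+", " ".join(fields).lower()))


class CommitRow:
    """Hypothetical in-memory stand-in for a commit row (assumption, not the real model)."""

    def __init__(self, revision: str, author: str, comments: str):
        self.revision = revision
        self.author = author
        self.comments = comments
        self.search_tokens: frozenset[str] = frozenset()

    def save(self) -> None:
        # Compute the searchable tokens once, at write time, capping the comment length.
        self.search_tokens = tokenize(self.revision, self.author, self.comments[:100000])


def search(rows: list[CommitRow], term: str) -> list[CommitRow]:
    """Match against the stored tokens; nothing is recomputed per query."""
    return [row for row in rows if term.lower() in row.search_tokens]


row = CommitRow("abc123", "dev@example.com", "Bug 1906541 - fix search")
row.save()
print(len(search([row], "1906541")))  # 1
```

The trade-off is the same as with the stored column: writes do a little more work so that reads only compare against a value that already exists.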

Endpoint: http://localhost:8000/api/project/try/push/?search=1906541
Result: Returns results matching the query across relevant fields such as bug_numbers, summary, author, and revisions.

When I print filtered_commits.query I get

SELECT DISTINCT "commit"."push_id", "push"."time" FROM "commit" INNER JOIN "push" ON ("commit"."push_id" = "push"."id") WHERE ("push"."repository_id" = 4 AND "commit"."search_vector" @@ (plainto_tsquery(english::regconfig, 1906541))) ORDER BY "push"."time" DESC LIMIT 200

and when I run EXPLAIN ANALYZE on the query, I get

 Limit  (cost=75.49..75.50 rows=1 width=12) (actual time=4.219..4.223 rows=1 loops=1)
   ->  Unique  (cost=75.49..75.50 rows=1 width=12) (actual time=4.218..4.221 rows=1 loops=1)
         ->  Sort  (cost=75.49..75.50 rows=1 width=12) (actual time=4.216..4.219 rows=1 loops=1)
               Sort Key: push."time" DESC, commit.push_id
               Sort Method: quicksort  Memory: 25kB
               ->  Hash Join  (cost=23.65..75.48 rows=1 width=12) (actual time=2.337..3.025 rows=1 loops=1)
                     Hash Cond: (commit.push_id = push.id)
                     ->  Bitmap Heap Scan on commit  (cost=16.10..67.89 rows=14 width=4) (actual time=0.171..0.852 rows=5 loops=1)
                           Recheck Cond: (search_vector @@ '''1906541'''::tsquery)
                           Heap Blocks: exact=5
                           ->  Bitmap Index Scan on search_vector_idx  (cost=0.00..16.10 rows=14 width=0) (actual time=0.111..0.112 rows=5 loops=1)
                                 Index Cond: (search_vector @@ '''1906541'''::tsquery)
                     ->  Hash  (cost=7.35..7.35 rows=16 width=12) (actual time=0.796..0.797 rows=16 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 9kB
                           ->  Seq Scan on push  (cost=0.00..7.35 rows=16 width=12) (actual time=0.540..0.575 rows=16 loops=1)
                                 Filter: (repository_id = 4)
                                 Rows Removed by Filter: 252
 Planning Time: 17.443 ms
 Execution Time: 8.398 ms
(19 rows)

Video of http://localhost:8000/api/project/try/push/?search=1906541 results on browser

17.01.2025_07.48.28_REC.mp4


    def save(self, *args, **kwargs):
        self.search_vector = SearchVector(
            self.revision, self.author, Substr(self.comments, 1, 100000), config="english"
Contributor

I have a couple of questions:

  1. Would it be possible to add the search_vector column to the push model instead of the commit model? This would save us a join.
  2. I believe we can reduce the value 100000, since we are really only interested in the first line. Actually, I wonder if we could just take the first line (that is, substr up to the first "\n")?
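For the second question, a minimal plain-Python sketch of "first line only" could look like this. The helper name `first_line` and the `max_chars` cap of 1000 are assumptions for illustration, not code from this PR; doing the same inside the database would need an equivalent SQL expression.

```python
def first_line(comments: str, max_chars: int = 1000) -> str:
    """Return the first line of a commit message, capped at max_chars characters."""
    # partition() returns the whole string as `head` when there is no newline.
    head, _, _ = comments.partition("\n")
    # Drop a trailing carriage return in case the message uses Windows line endings.
    return head.rstrip("\r")[:max_chars]


print(first_line("Bug 1906541 - Fix push search\n\nLonger description here"))
# prints "Bug 1906541 - Fix push search"
```

Keeping a length cap alongside the newline cut guards against messages whose first line is itself very long.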

Contributor Author

I think we would still need a join to access comments if we moved it up to the push model.

)
.filter(
search=SearchQuery(search_param, config="english"),
Commit.objects.filter(
Contributor

We can probably switch to the code that you tried before, which avoids the subquery.

Contributor Author

Sorry, what code did I try before? Do you mean the one without the search vector field?

Comment on lines 18 to 20
field=django.contrib.postgres.search.SearchVectorField(
blank=True, null=True
),
Contributor

Should we populate it for all existing commits?

Contributor Author

@Netacci Netacci Mar 4, 2025

Oh yes. I only did that locally by running:

from django.contrib.postgres.search import SearchVector
from treeherder.model.models import Commit

Commit.objects.update(
    search_vector=SearchVector('comments', 'author', 'revision')
)

I'll have to create a migration to update the existing data.

@@ -1,8 +1,7 @@
# Generated by Django 5.1.5 on 2025-02-27 18:06
Collaborator

The migration has already been applied, so a new migration is needed.

Contributor Author

I rolled back to the old migration 0037_bugjob..., then deleted my 0038_commit_search.. migration. When I ran makemigrations, another one was generated, so I renamed it to 0038_commit_search.. and applied it. Would that still be a problem?

Collaborator

A migration 0039 landed which depends on the old 0038. Please use a new migration 0040.
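To illustrate why a new file is requested rather than a regenerated 0038, here is a simplified model, assuming Django decides what to run by comparing migration names on disk with the names already recorded as applied. The function `pending` and all migration names are hypothetical.

```python
def pending(on_disk: list[str], applied: set[str]) -> list[str]:
    """Return migrations whose names are not yet recorded as applied, in file order."""
    return [name for name in on_disk if name not in applied]


# Hypothetical state of a database where 0037 to 0039 have already run.
applied = {"0037_example", "0038_example", "0039_example"}

# Regenerating 0038 under the same name changes its content but not its name,
# so the database sees nothing new to run.
edited_in_place = ["0037_example", "0038_example", "0039_example"]
print(pending(edited_in_place, applied))  # []

# A new migration on top of 0039 has an unseen name and is picked up.
with_new_migration = edited_in_place + ["0040_example"]
print(pending(with_new_migration, applied))  # ['0040_example']
```

Under this assumption, any database that already ran the old 0038 would silently skip the changed operations, which is why the change has to live in a new migration.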

Contributor Author

Okay got it

Contributor Author

I have created a new migration for this

@Netacci Netacci force-pushed the set-full-text-search-using-search-vector branch from 7cbf478 to 7114688 Compare March 6, 2025 10:48
@@ -0,0 +1,20 @@
# Generated by Django 5.1.5 on 2025-03-05 23:46
Collaborator

Please fold this migration into the 0040 one.

# Test search by comments
resp = client.get(
    reverse("push-list", kwargs={"project": test_repository.name}) + "?search=bug"
)
assert resp.status_code == 200

results = resp.json()["results"]

Collaborator

debug code to remove

Contributor Author

Yea missed that :)

class Migration(migrations.Migration):

replaces = [
("model", "0040_remove_commit_search_vector_idx_commit_search_vector_and_more"),
Collaborator

These two migrations don't exist (outside your local database if it hasn't been reset). replaces might even break the migration if the ones mentioned here are missing.

Contributor Author

Okay will take replaces out

# Test search by comments
resp = client.get(
    reverse("push-list", kwargs={"project": test_repository.name}) + "?search=bug"
)
assert resp.status_code == 200

results = resp.json()["results"]

print(list(Commit.objects.values("revision", "author", "comments", "search_vector")))
Collaborator

Substr for comments should still be necessary.

Contributor Author

I've added Substr to the migration

Contributor Author

Hi @Archaeopteryx, when you have the time, can you review the PR? I've made the recommended changes and rebased the branch.

@Netacci Netacci force-pushed the set-full-text-search-using-search-vector branch from 2f4baa3 to d08e40d Compare March 18, 2025 06:54
@Netacci Netacci requested a review from Archaeopteryx March 25, 2025 08:49
@@ -0,0 +1,39 @@
# Generated by Django 5.1.5 on 2025-03-06 14:49
Collaborator

This file needs a small rebase because another migration has landed.

@Netacci Netacci force-pushed the set-full-text-search-using-search-vector branch from d08e40d to 2c3e665 Compare March 25, 2025 09:59

3 participants