tensorchord/pg_tokenizer.rs

pg_tokenizer

A PostgreSQL extension that provides tokenizers for full-text search.

Quick Start

The official ghcr.io/tensorchord/vchord_bm25-postgres Docker image comes pre-configured with several complementary extensions:

  • pg_tokenizer - This extension
  • VectorChord-bm25 - Native BM25 Ranking Index
  • VectorChord - Scalable, high-performance, and disk-efficient vector similarity search
  • pgvector - Popular vector similarity search

Simply run the Docker container as shown below:

docker run \
  --name vectorchord-demo \
  -e POSTGRES_PASSWORD=mysecretpassword \
  -p 5432:5432 \
  -d ghcr.io/tensorchord/vchord_bm25-postgres:pg17-v0.2.0

Once the container is running, you can connect to the database using the psql command line tool. The default username is postgres, and the password is the POSTGRES_PASSWORD value passed to docker run (mysecretpassword in the command above). Here’s how to connect:

psql -h localhost -p 5432 -U postgres

After connecting, run the following SQL to enable the extension in the current database:

CREATE EXTENSION pg_tokenizer;
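
To confirm that the extension was registered, one option is to query the standard pg_extension system catalog. This is a minimal sketch that uses only stock PostgreSQL features; the version it reports depends on the image you are running.

```sql
-- Idempotent form: skips with a notice if the extension already exists.
CREATE EXTENSION IF NOT EXISTS pg_tokenizer;

-- Look up the installed extension and its version in the system catalog.
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'pg_tokenizer';
```

An empty result means the extension is not installed in the database you are connected to.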

Then, don’t forget to add tokenizer_catalog to your search_path:

ALTER SYSTEM SET search_path TO "$user", public, tokenizer_catalog;
SELECT pg_reload_conf();
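
ALTER SYSTEM changes the setting server-wide and requires superuser privileges. As a hedged alternative for quick experiments, the sketch below checks the value in effect and sets search_path for the current session only, using standard PostgreSQL commands.

```sql
-- Show the search_path in effect for the current session.
SHOW search_path;

-- Session-only alternative: applies until the connection is closed
-- and does not modify the server configuration.
SET search_path TO "$user", public, tokenizer_catalog;
```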

Example

SELECT create_tokenizer('tokenizer1', $$
model = "llmlingua2"
$$);

SELECT tokenize('PostgreSQL is a powerful, open-source object-relational database system. It has over 15 years of active development.', 'tokenizer1');
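
If you would rather not change search_path, the same call can probably be written with a schema-qualified name. This sketch assumes the extension's functions live in the tokenizer_catalog schema, which the search_path step above suggests but this README does not state explicitly, and it reuses the tokenizer1 tokenizer created in the example.

```sql
-- Assumption: tokenize() is defined in the tokenizer_catalog schema.
-- 'tokenizer1' is the tokenizer created earlier in this example.
SELECT tokenizer_catalog.tokenize(
    'PostgreSQL is a powerful, open-source object-relational database system.',
    'tokenizer1'
);
```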

More examples can be found in docs/03-examples.md.

Documentation