diff --git a/docs/embeddings/configuration/general.md b/docs/embeddings/configuration/general.md
index bdf1779f5..a41eca7a3 100644
--- a/docs/embeddings/configuration/general.md
+++ b/docs/embeddings/configuration/general.md
@@ -2,16 +2,37 @@
 
 General configuration options that don't fit elsewhere.
 
-## format
+## keyword
 ```yaml
-format: pickle|json
+keyword: boolean
 ```
 
-Sets the configuration storage format. Defaults to `pickle`.
+Enables sparse keyword indexing for this embeddings instance.
+
+## hybrid
+```yaml
+hybrid: boolean
+```
+
+Enables hybrid (sparse + dense) indexing for this embeddings instance.
+
+## indexes
+```yaml
+indexes: dict
+```
+
+Key-value pairs defining subindexes for this embeddings instance. Each key is the index name and each value is the full configuration for that subindex. This configuration can use any option available to a standard embeddings instance.
 
 ## autoid
 ```yaml
 format: int|uuid function
 ```
 
-Sets the auto id generation method. When this is not set, an autogenerated numeric sequence is used. This also supports [UUID generation functions](https://docs.python.org/3/library/uuid.html#uuid.uuid1). For example, setting this value to `uuid4` will generate random UUIDs. Setting this to `uuid5` will generate deterministic UUIDs for each input data row.
\ No newline at end of file
+Sets the auto id generation method. When this is not set, an autogenerated numeric sequence is used. This also supports [UUID generation functions](https://docs.python.org/3/library/uuid.html#uuid.uuid1). For example, setting this value to `uuid4` will generate random UUIDs. Setting this to `uuid5` will generate deterministic UUIDs for each input data row.
+
+## format
+```yaml
+format: pickle|json
+```
+
+Sets the configuration storage format. Defaults to `pickle`.
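Note on the general.md options: the fragment below is a rough sketch of how the documented `hybrid`, `autoid`, `format` and `indexes` settings might sit together in one configuration. It is not part of the commit. The subindex names `sparse` and `mixed` are hypothetical, and placing these keys at the top level of an embeddings YAML configuration is an assumption based on the option names documented above.

```yaml
# Sketch only: option names come from general.md, values are illustrative.

# Hybrid (sparse + dense) indexing for the main index
hybrid: true

# Generate random UUIDs instead of the default numeric sequence
autoid: uuid4

# Store the configuration as JSON rather than the default pickle
format: json

# Subindexes: each key is an index name, each value a full configuration.
# The names "sparse" and "mixed" are hypothetical.
indexes:
  sparse:
    keyword: true
  mixed:
    hybrid: true
```

Whether a top-level index and subindexes are useful together depends on the application; the point here is only that each subindex value accepts the same options as a standalone embeddings configuration.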
diff --git a/docs/embeddings/configuration/index.md b/docs/embeddings/configuration/index.md
index f62111759..5a6762a5f 100644
--- a/docs/embeddings/configuration/index.md
+++ b/docs/embeddings/configuration/index.md
@@ -42,6 +42,10 @@ General configuration that doesn't fit elsewhere.
 
 An accomplying graph index can be created with an embeddings database. This enables topic modeling, path traversal and more. [NetworkX](https://github.com/networkx/networkx) is the default graph index.
 
+## [Scoring](./scoring)
+
+Sparse keyword indexing and word vector term weighting.
+
 ## [Vectors](./vectors)
 
 Vector search is enabled by converting text and other binary data into embeddings vectors. These vectors are then stored in an ANN index. The vector model is optional and a default model is used when not provided.
diff --git a/docs/embeddings/configuration/scoring.md b/docs/embeddings/configuration/scoring.md
new file mode 100644
index 000000000..50bf5c7a4
--- /dev/null
+++ b/docs/embeddings/configuration/scoring.md
@@ -0,0 +1,30 @@
+# Scoring
+
+An embeddings instance can optionally have an associated scoring instance. This scoring instance can serve two purposes, depending on the settings.
+
+One use case is building sparse/keyword indexes. This occurs when the `terms` parameter is set to `True`.
+
+The other use case is word vector term weighting. This feature has been available since the initial version but isn't as common anymore.
+
+The following covers the available options.
+
+## method
+```yaml
+method: bm25|tfidf|sif
+```
+
+Sets the scoring method.
+
+## terms
+```yaml
+terms: boolean
+```
+
+Enables term frequency sparse arrays for a scoring instance. This is the backend for sparse keyword indexes.
+
+## normalize
+```yaml
+normalize: boolean
+```
+
+Enables normalized scoring (ranging from 0 to 1). When enabled, statistics from the index will be used to calculate normalized scores.
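Note on the scoring.md options: the fragment below sketches a scoring configuration that uses the three documented settings to back a sparse keyword index. It is illustrative rather than taken from the commit. Nesting the options under a top-level `scoring` key is an assumption, suggested by the `scoring` option that this change removes from vectors.md.

```yaml
# Sketch only: nesting under a top-level "scoring" key is an assumption.
scoring:
  method: bm25     # one of bm25, tfidf, sif
  terms: true      # build term frequency sparse arrays (backend for sparse keyword indexes)
  normalize: true  # scale scores to the 0 to 1 range using index statistics
```

Leaving `terms` unset would correspond to the other use case described in scoring.md, where the scoring method is used only for word vector term weighting.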
diff --git a/docs/embeddings/configuration/vectors.md b/docs/embeddings/configuration/vectors.md
index ac225bd32..1a4a77cb7 100644
--- a/docs/embeddings/configuration/vectors.md
+++ b/docs/embeddings/configuration/vectors.md
@@ -39,13 +39,6 @@ storevectors: boolean
 
 Enables copying of a vectors model set in path into the embeddings models output directory on save. This option enables a fully encapsulated index with no external file dependencies.
 
-#### scoring
-```yaml
-scoring: bm25|tfidf|sif
-```
-
-A scoring model builds weighted averages of word vectors for a given sentence. Supports BM25, TF-IDF and SIF (smooth inverse frequency) methods. If a scoring method is not provided, mean sentence embeddings are built.
-
 #### pca
 ```yaml
 pca: int
diff --git a/docs/examples.md b/docs/examples.md
index 2c76a3e95..c73db1dec 100644
--- a/docs/examples.md
+++ b/docs/examples.md
@@ -28,6 +28,7 @@ Build semantic/similarity/vector/neural search applications.
 | [Topic Modeling with BM25](https://github.com/neuml/txtai/blob/master/examples/39_Classic_Topic_Modeling_with_BM25.ipynb) | Topic modeling backed by a BM25 index | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/39_Classic_Topic_Modeling_with_BM25.ipynb) |
 | [Embeddings in the Cloud](https://github.com/neuml/txtai/blob/master/examples/43_Embeddings_in_the_Cloud.ipynb) | Load and use an embeddings index from the Hugging Face Hub | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/43_Embeddings_in_the_Cloud.ipynb) |
 | [Customize your own embeddings database](https://github.com/neuml/txtai/blob/master/examples/45_Customize_your_own_embeddings_database.ipynb) | Ways to combine vector indexes with relational databases | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/45_Customize_your_own_embeddings_database.ipynb) |
+| [What's new in txtai 6.0](https://github.com/neuml/txtai/blob/master/examples/46_Whats_new_in_txtai_6_0.ipynb) | Sparse, hybrid and subindexes for embeddings, LLM improvements | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtai/blob/master/examples/46_Whats_new_in_txtai_6_0.ipynb) |
 
 ## LLM
 
diff --git a/mkdocs.yml b/mkdocs.yml
index e6581e6e6..1fdc379d3 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -64,6 +64,7 @@ nav:
  - Database: embeddings/configuration/database.md
  - General: embeddings/configuration/general.md
  - Graph: embeddings/configuration/graph.md
+ - Scoring: embeddings/configuration/scoring.md
  - Vectors: embeddings/configuration/vectors.md
  - Index Guide: embeddings/indexing.md
  - Methods: embeddings/methods.md