diff --git a/supporting-blog-content/lexical-and-semantic-search-with-elasticsearch/updated-ecommerce_dense_sparse_project.ipynb b/supporting-blog-content/lexical-and-semantic-search-with-elasticsearch/updated-ecommerce_dense_sparse_project.ipynb new file mode 100644 index 00000000..6c2d9b55 --- /dev/null +++ b/supporting-blog-content/lexical-and-semantic-search-with-elasticsearch/updated-ecommerce_dense_sparse_project.ipynb @@ -0,0 +1,1106 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "r8OKk3QOGBXl", + "metadata": { + "id": "r8OKk3QOGBXl" + }, + "source": [ + "# **Lexical and Semantic Search with Elasticsearch**\n", + "\n", + "In the following examples, we will explore various approaches to retrieving information using Elasticsearch - focusing specifically on full text search, semantic search, and a hybrid combination of both.\n", + "\n", + "To accomplish this, this example demonstrates various search scenarios on a dataset generated to simulate e-commerce product information.\n", + "\n", + "This dataset contains over 2,500 products, each with a description. These products are categorized into 76 distinct product categories, with each category containing a varying number of products. \n", + "\n", + "Here is a sample of an object from the dataset:\n", + "\n", + "```json\n", + " {\n", + " \"product\": \"Samsung 49-inch Curved Gaming Monitor\",\n", + " \"description\": \"is a curved gaming monitor with a high refresh rate and AMD FreeSync technology.\",\n", + " \"category\": \"Monitors\"\n", + "}\n", + "\n", + "```\n", + "\n", + "We will consume the dataset from a JSON file into Elasticsearch using modern consumption patterns. 
We will then perform a series of search operations to demonstrate the different search strategies.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "6370f2e4", + "metadata": {}, + "source": [ + "## **🧰 Requirements**\n", + "\n", + "For this example, you will need:\n", + "\n", + "- Python 3.11 or later\n", + "- The Elastic Python client\n", + "- Elastic 9.0 deployment or later on either a local, cloud, or serverless environment\n", + "\n", + "\n", + "We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html). You can use a [free trial here](https://cloud.elastic.co/registration?onboarding_token=vectorsearch&utm_source=github&utm_content=elasticsearch-labs-notebook) to get started." + ] + }, + { + "cell_type": "markdown", + "id": "hmMWo2e-IkTB", + "metadata": { + "id": "hmMWo2e-IkTB" + }, + "source": [ + "## Setup Elasticsearch environment:\n", + "\n", + "To get started, we'll need to connect to our Elastic deployment using the Python client.\n", + "\n", + "Because we're using an Elastic Cloud deployment, we'll use the **Cloud Endpoint** and **Cloud API Key** to identify our deployment. 
These may be found within Kibana by following the instructions [here](https://www.elastic.co/docs/deploy-manage/api-keys/elastic-cloud-api-keys).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e8d24cd8-a437-4bd2-a1f0-93e535ccf8a9", + "metadata": { + "id": "e8d24cd8-a437-4bd2-a1f0-93e535ccf8a9" + }, + "outputs": [], + "source": [ + "%pip install elasticsearch pandas IPython -q" + ] + }, + { + "cell_type": "markdown", + "id": "38b734aa", + "metadata": {}, + "source": [ + "### Import the required packages\n", + "We will import the following packages:\n", + "- `Elasticsearch`: a client library for Elasticsearch actions\n", + "- `bulk`: a function to perform Elasticsearch actions in bulk\n", + "- `getpass`: a module for receiving Elasticsearch credentials via text prompt\n", + "- `json`: a module for reading and writing JSON data\n", + "- `pandas`, `display`, `Markdown`: for data visualization and markdown formatting\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eaf90bc8-647e-4ada-9aa9-5cb9e60762b7", + "metadata": { + "id": "eaf90bc8-647e-4ada-9aa9-5cb9e60762b7" + }, + "outputs": [], + "source": [ + "# import the Elasticsearch client and bulk function\n", + "from elasticsearch import Elasticsearch\n", + "from elasticsearch.helpers import bulk\n", + "\n", + "# import getpass module to handle Auth input\n", + "import getpass\n", + "\n", + "# import json module to read JSON file of products\n", + "import json # module for handling JSON data\n", + "\n", + "# display search results in a table\n", + "import pandas as pd\n", + "from IPython.display import display, Markdown" + ] + }, + { + "cell_type": "markdown", + "id": "ea1VkDBXJIQR", + "metadata": { + "id": "ea1VkDBXJIQR" + }, + "source": [ + "### 📚 Instantiating the Elasticsearch Client\n", + "\n", + "First we prompt the user for their Elastic Endpoint URL and Elastic API Key.\n", + "Then we create a `client` object that instantiates an instance of the 
`Elasticsearch` class.\n", + "Lastly, we verify that our client is connected to our Elasticsearch instance by calling `client.ping()`.\n", + "> 🔐 *NOTE: `getpass` enables us to securely prompt the user for credentials without echoing them to the terminal.*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6907a2bf-4927-428e-9ca8-9df3dd35a2cc", + "metadata": { + "id": "6907a2bf-4927-428e-9ca8-9df3dd35a2cc" + }, + "outputs": [], + "source": [ + "# endpoint for Elasticsearch instance\n", + "ELASTIC_ENDPOINT = getpass.getpass(\"Enter Elastic Endpoint: \")\n", + "\n", + "# Elastic API key for Elasticsearch\n", + "ELASTIC_API_KEY = getpass.getpass(\"Enter Elastic API Key: \")\n", + "\n", + "# create the Elasticsearch client instance\n", + "client = Elasticsearch(\n", + " hosts=[ELASTIC_ENDPOINT], api_key=ELASTIC_API_KEY, request_timeout=3600\n", + ")\n", + "\n", + "resp = client.ping()\n", + "print(f\"Connected to Elastic instance: {resp}\")" + ] + }, + { + "cell_type": "markdown", + "id": "BH-N6epTJarM", + "metadata": { + "id": "BH-N6epTJarM" + }, + "source": [ + "## Prepare our embedding model workflow\n", + "\n", + "Next we ensure our embedding models are available in Elasticsearch. We will use Elastic's provided `e5_multilingual_small` and `elser_V2` models to generate dense and sparse vector embeddings, respectively. Using these models out of the box will ensure they are up-to-date and ready for integration with Elasticsearch.\n", + "\n", + "Other models may be uploaded and deployed using [Eland](https://www.elastic.co/docs/reference/elasticsearch/clients/eland) or integrated using the [inference endpoint API](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-inference-put-azureopenai) to connect to third-party models." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7f6f3f5a-2b93-4a0c-93c8-c887ca80f687", + "metadata": { + "id": "7f6f3f5a-2b93-4a0c-93c8-c887ca80f687" + }, + "outputs": [], + "source": [ + "# Declare models and endpoint names predeployed by Elastic\n", + "elser_model = \".elser_model_2_linux-x86_64\"\n", + "elser_endpoint = \".elser-2-elasticsearch\"\n", + "\n", + "e5_model = \".multilingual-e5-small_linux-x86_64\"\n", + "e5_endpoint = \".multilingual-e5-small-elasticsearch\"\n", + "\n", + "# Define (model, endpoint) tuples to check\n", + "model_endpoint_pairs = [(elser_model, elser_endpoint), (e5_model, e5_endpoint)]\n", + "\n", + "# Fetch all loaded models and endpoints once\n", + "models = client.ml.get_trained_models()\n", + "model_ids = {model[\"model_id\"]: model for model in models[\"trained_model_configs\"]}\n", + "endpoints = client.inference.get()\n", + "endpoint_ids = {\n", + " endpoint[\"inference_id\"]: endpoint for endpoint in endpoints[\"endpoints\"]\n", + "}\n", + "\n", + "# Check each (model, endpoint) pair\n", + "for model_id, endpoint_id in model_endpoint_pairs:\n", + " print(f\"Checking Model: {model_id}\")\n", + " model = model_ids.get(model_id)\n", + " if model:\n", + " print(f\" Model ID: {model['model_id']}\")\n", + " print(f\" Description: {model.get('description', 'No description')}\")\n", + " print(f\" Version: {model.get('version', 'N/A')}\")\n", + " else:\n", + " print(\" Model not found or not loaded.\")\n", + " print(f\"Checking Endpoint: {endpoint_id}\")\n", + " endpoint = endpoint_ids.get(endpoint_id)\n", + " if endpoint:\n", + " print(f\" Inference Endpoint ID: {endpoint['inference_id']}\")\n", + " print(f\" Task Type: {endpoint['task_type']}\")\n", + " else:\n", + " print(\" Endpoint not found or not ready.\")\n", + " print(\"------\")" + ] + }, + { + "cell_type": "markdown", + "id": "80506477", + "metadata": {}, + "source": [ + "### Create an inference pipeline\n", + "This function will create an ingest 
pipeline with inference processors to use `ELSER` (sparse_vector) and `e5_multilingual_small` (dense_vector) to infer against data that will be ingested in the pipeline. This allows us to automatically generate embeddings for the product descriptions when they are indexed into Elasticsearch." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6739f55b-6983-4b48-9349-6e0111b313fe", + "metadata": { + "id": "6739f55b-6983-4b48-9349-6e0111b313fe" + }, + "outputs": [], + "source": [ + "index_pipeline = \"ecommerce-pipeline\"\n", + "resp = client.ingest.put_pipeline(\n", + " id=index_pipeline,\n", + " processors=[\n", + " {\n", + " \"inference\": {\n", + " \"model_id\": elser_endpoint, # inference endpoint ID\n", + " \"input_output\": [\n", + " {\n", + " \"input_field\": \"description\", # source field\n", + " \"output_field\": \"elser_description_vector\", # destination vector field\n", + " }\n", + " ],\n", + " }\n", + " },\n", + " {\n", + " \"inference\": {\n", + " \"model_id\": e5_endpoint, # inference endpoint ID\n", + " \"input_output\": [\n", + " {\n", + " \"input_field\": \"description\", # source field\n", + " \"output_field\": \"e5_description_vector\", # destination vector field\n", + " }\n", + " ],\n", + " \"inference_config\": {\"text_embedding\": {}},\n", + " }\n", + " },\n", + " ],\n", + ")\n", + "\n", + "print(f\"ecommerce-pipeline created: {resp['acknowledged']}\")" + ] + }, + { + "cell_type": "markdown", + "id": "QUQ1nCaiKIQr", + "metadata": { + "id": "QUQ1nCaiKIQr" + }, + "source": [ + "## Index documents\n", + "The `ecommerce-search` index we are creating will include fields to support dense and sparse vector storage and search. \n", + "\n", + "We define the `e5_description_vector` and the `elser_description_vector` fields to store the inference pipeline results. \n", + "\n", + "The field type in `e5_description_vector` is a `dense_vector`. 
The `.e5_multilingual_small` model has an embedding size of 384, so the dimension of the vector (dims) is set to 384. \n", + "\n", + "We also add an `elser_description_vector` field of type `sparse_vector` to store the output from our `.elser_model_2_linux-x86_64` model. No further configuration is needed for this field for our use case." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9b53b39e-d74e-4fa8-a364-e2c3caf37418", + "metadata": { + "id": "9b53b39e-d74e-4fa8-a364-e2c3caf37418" + }, + "outputs": [], + "source": [ + "# define the index name and mapping\n", + "commerce_index = \"ecommerce-search\"\n", + "mappings = {\n", + " \"properties\": {\n", + " \"product\": {\n", + " \"type\": \"text\",\n", + " },\n", + " \"description\": {\n", + " \"type\": \"text\",\n", + " },\n", + " \"category\": {\n", + " \"type\": \"text\",\n", + " },\n", + " \"elser_description_vector\": {\"type\": \"sparse_vector\"},\n", + " \"e5_description_vector\": {\n", + " \"type\": \"dense_vector\",\n", + " \"dims\": 384,\n", + " \"index\": \"true\",\n", + " \"similarity\": \"cosine\",\n", + " },\n", + " \"e5_semantic_description_vector\": {\n", + " \"type\": \"semantic_text\",\n", + " \"inference_id\": e5_endpoint,\n", + " },\n", + " \"elser_semantic_description_vector\": {\"type\": \"semantic_text\"},\n", + " }\n", + "}\n", + "\n", + "\n", + "if client.indices.exists(index=commerce_index):\n", + " client.indices.delete(index=commerce_index)\n", + "resp = client.indices.create(\n", + " index=commerce_index,\n", + " mappings=mappings,\n", + ")\n", + "\n", + "print(f\"Index {commerce_index} created: {resp['acknowledged']}\")" + ] + }, + { + "cell_type": "markdown", + "id": "88db9926", + "metadata": {}, + "source": [ + "### Attach Pipeline to Index\n", + "Let's connect our pipeline to the index. 
This updates the settings of our index to use the pipeline we previously defined as its default ingest pipeline.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c4830b74", + "metadata": {}, + "outputs": [], + "source": [ + "resp = client.indices.put_settings(\n", + " index=commerce_index,\n", + " body={\"default_pipeline\": index_pipeline},\n", + ")\n", + "print(f\"Pipeline set for {commerce_index}: {resp['acknowledged']}\")" + ] + }, + { + "cell_type": "markdown", + "id": "Vo-LKu8TOT5j", + "metadata": { + "id": "Vo-LKu8TOT5j" + }, + "source": [ + "### Load documents\n", + "\n", + "We load the contents of `products-ecommerce.json` into the `ecommerce-search` index. We will use the `bulk` helper function to efficiently index our documents en masse. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3cfdc3b7-7e4f-4111-997b-c333ac8938ba", + "metadata": { + "id": "3cfdc3b7-7e4f-4111-997b-c333ac8938ba" + }, + "outputs": [], + "source": [ + "# Load the dataset\n", + "with open(\"products-ecommerce.json\", \"r\") as f:\n", + " data_json = json.load(f)\n", + "\n", + "\n", + "# helper function to create bulk indexing body\n", + "def create_index_body(doc):\n", + " doc[\"elser_semantic_description_vector\"] = doc[\"description\"]\n", + " doc[\"e5_semantic_description_vector\"] = doc[\"description\"]\n", + "\n", + " return {\n", + " \"_index\": commerce_index,\n", + " \"_source\": doc,\n", + " }\n", + "\n", + "\n", + "# prepare the documents array payload\n", + "documents = [create_index_body(doc) for doc in data_json]\n", + "\n", + "# use bulk function to index\n", + "try:\n", + " print(\"Indexing documents...\")\n", + " resp = bulk(client, documents)\n", + " print(f\"Documents indexed successfully: {resp[0]}\")\n", + "except Exception as e:\n", + " print(f\"Error indexing documents: {e}\")" + ] + }, + { + "cell_type": "markdown", + "id": "-qUXNuOvPDsI", + "metadata": { + "id": "-qUXNuOvPDsI" + }, + "source": [ + "## Text 
Analysis\n", + 
"The classic way documents are ranked for relevance by Elasticsearch based on a text query uses the Lucene implementation of the [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) model, a **sparse model for lexical search**. This method follows the traditional approach for text search, looking for exact term matches.\n", + "\n", + "To make this search possible, Elasticsearch converts **text field** data into a searchable format by performing text analysis.\n", + "\n", + "**Text analysis** is performed by an [analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html), a set of rules to govern the process of extracting relevant tokens for searching. An analyzer must have exactly one [tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html). The tokenizer receives a stream of characters and breaks it up into individual tokens (usually individual words.) \n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "5f51e460", + "metadata": {}, + "source": [ + "### Standard Analyzer\n", + "In the example below we are using the default analyzer, the standard analyzer, which works well for most use cases as it provides English grammar based tokenization. Tokenization enables matching on individual terms, but each token is still matched literally." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "55b602d1-f1e4-4b70-9273-5fc701ac9039", + "metadata": { + "id": "55b602d1-f1e4-4b70-9273-5fc701ac9039" + }, + "outputs": [], + "source": [ + "# Define the text to be analyzed\n", + "text = \"Comfortable furniture for a large balcony\"\n", + "\n", + "# Define the analyze request\n", + "request_body = {\"analyzer\": \"standard\", \"text\": text} # Standard Analyzer\n", + "\n", + "# Perform the analyze request\n", + "resp = client.indices.analyze(\n", + " analyzer=request_body[\"analyzer\"], text=request_body[\"text\"]\n", + ")\n", + "\n", + "# Extract and display the analyzed tokens\n", + "standard_tokens = [token[\"token\"] for token in resp[\"tokens\"]]\n", + "print(\"Standard-analyzed Tokens:\", standard_tokens)" + ] + }, + { + "cell_type": "markdown", + "id": "fb75f526", + "metadata": {}, + "source": [ + "### Stop Analyzer\n", + "If you want to personalize your search experience you can choose a different built-in analyzer. For example, the stop analyzer breaks the text into tokens at any non-letter character and removes stop words." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3e3fdcff", + "metadata": {}, + "outputs": [], + "source": [ + "# Define the analyze request\n", + "request_body = {\"analyzer\": \"stop\", \"text\": text}\n", + "\n", + "# Perform the analyze request\n", + "response = client.indices.analyze(\n", + " analyzer=request_body[\"analyzer\"], text=request_body[\"text\"]\n", + ")\n", + "\n", + "# Extract and display the analyzed tokens\n", + "stop_tokens = [token[\"token\"] for token in response[\"tokens\"]]\n", + "print(\"Stop-analyzed Tokens:\", stop_tokens)" + ] + }, + { + "cell_type": "markdown", + "id": "aba7fad6", + "metadata": {}, + "source": [ + "### Custom Analyzer\n", + "When the built-in analyzers do not fulfill your needs, you can create a [custom analyzer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html),\n", + "which uses the appropriate combination of zero or more [character filters](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-charfilters.html), a [tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html) and zero or more [token filters](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html).\n", + "\n", + "In the example below, which combines a character filter, a tokenizer, and token filters, any HTML markup is removed by the `html_strip` character filter, and the text is lowercased by the [lowercase filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lowercase-tokenfilter.html) before the [ASCII folding token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-asciifolding-tokenfilter.html) converts accented characters to their ASCII equivalents.\n", + "\n", + "> Note: the `analyzer` parameter of the analyze call expects the name of an analyzer. To use a custom analyzer, define it in your index settings, then reference it by name in the analyze call. For this reason we will create a temporary index to store the analyzer." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d44f3e2b", + "metadata": {}, + "outputs": [], + "source": [ + "index_settings = {\n", + " \"settings\": {\n", + " \"analysis\": {\n", + " \"analyzer\": {\n", + " \"my_custom_analyzer\": {\n", + " \"type\": \"custom\",\n", + " \"tokenizer\": \"standard\",\n", + " \"char_filter\": [\"html_strip\"],\n", + " \"filter\": [\"lowercase\", \"asciifolding\"],\n", + " }\n", + " }\n", + " }\n", + " }\n", + "}\n", + "\n", + "custom_text = \"Čōmføřțǎble Fůrñíturę Fòr â ľarğe Bałcony\"\n", + "\n", + "# Create a temporary index with the custom analyzer\n", + "client.indices.create(index=\"temporary_index\", body=index_settings)\n", + "\n", + "# Perform the analyze request\n", + "resp = client.indices.analyze(\n", + " index=\"temporary_index\", analyzer=\"my_custom_analyzer\", text=custom_text\n", + ")\n", + "\n", + "# Extract and display the analyzed tokens\n", + "custom_tokens = [token[\"token\"] for token in resp[\"tokens\"]]\n", + "print(\"Custom Tokens:\", custom_tokens)\n", + "\n", + "# Delete the temporary index\n", + "client.indices.delete(index=\"temporary_index\")" + ] + }, + { + "cell_type": "markdown", + "id": "432620b6", + "metadata": {}, + "source": [ + "### Text Analysis Results\n", + "In the table below, we can observe that analyzers both included with Elasticsearch and custom made may be included with your search requests to improve the quality of your search results by reducing or refining the content being searched. Attention should be paid to your particular use case and the needs of your users." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9c5d11cb", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Standard Token Analyzer\")\n", + "print(f\"Before: \\n{text}\")\n", + "print(f\"After: \\n{standard_tokens}\")\n", + "print(\"===================\")\n", + "print(\"Stop Token Analyzer\")\n", + "print(f\"Before: \\n{text}\")\n", + "print(f\"After: \\n{stop_tokens}\")\n", + "print(\"===================\")\n", + "print(\"Custom Token Analyzer\")\n", + "print(f\"Before: \\n{custom_text}\")\n", + "print(f\"After: \\n{custom_tokens}\")" + ] + }, + { + "cell_type": "markdown", + "id": "db4f86e3", + "metadata": {}, + "source": [ + "## Search \n", + "The remainder of this notebook will cover the following search types:\n", + "\n", + "\n", + "- Lexical Search\n", + "- Semantic Search \n", + " - ELSER Semantic Search (Sparse Vector)\n", + " - E5 Semantic Search (Dense Vector)\n", + "- Hybrid Search\n" + ] + }, + { + "cell_type": "markdown", + "id": "8G8MKcUvP0zs", + "metadata": { + "id": "8G8MKcUvP0zs" + }, + "source": [ + "## Lexical Search\n", + "Our first search will be a straightforward BM25 text search within the description field. We are storing all of our results in a results_list for a final comparison at the end of the notebook. A convenience function to display the results is also defined." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f4984f6c-ceec-46a4-b64c-f749e6b1b04f", + "metadata": { + "id": "f4984f6c-ceec-46a4-b64c-f749e6b1b04f" + }, + "outputs": [], + "source": [ + "results_list = []\n", + "\n", + "\n", + "def print_search_results(search_results):\n", + " if not search_results:\n", + " print(\"No matches found\")\n", + " else:\n", + " for hit in search_results:\n", + " score = hit[\"_score\"]\n", + " product = hit[\"_source\"][\"product\"]\n", + " category = hit[\"_source\"][\"category\"]\n", + " description = hit[\"_source\"][\"description\"]\n", + " print(\n", + " f\"\\nScore: {score}\\nProduct: {product}\\nCategory: {category}\\nDescription: {description}\\n\"\n", + " )\n", + "\n", + "\n", + "# Regular BM25 (Lexical) Search\n", + "resp = client.search(\n", + " size=2,\n", + " index=\"ecommerce-search\",\n", + " query={\n", + " \"match\": {\n", + " \"description\": {\n", + " \"query\": \"Comfortable furniture for a large balcony\",\n", + " \"analyzer\": \"stop\",\n", + " }\n", + " }\n", + " },\n", + " source_excludes=[\"*_description_vector\"], # Exclude vector fields from response\n", + ")\n", + "\n", + "lexical_search_results = resp[\"hits\"][\"hits\"]\n", + "results_list.append({\"lexical_search\": lexical_search_results})\n", + "print_search_results(lexical_search_results)" + ] + }, + { + "cell_type": "markdown", + "id": "xiywcf_-P39a", + "metadata": { + "id": "xiywcf_-P39a" + }, + "source": [ + "## Semantic Search with Dense Vector" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "72187c9a-14c1-4084-a080-4e5c1e614f22", + "metadata": { + "id": "72187c9a-14c1-4084-a080-4e5c1e614f22" + }, + "outputs": [], + "source": [ + "# KNN\n", + "# TODO: Add Semantic_Text type?\n", + "response = client.search(\n", + " index=\"ecommerce-search\",\n", + " size=2,\n", + " knn={\n", + " \"field\": \"e5_description_vector\",\n", + " \"k\": 50, # Number of nearest neighbors to return as top hits.\n", + " 
\"num_candidates\": 500, # Number of nearest neighbor candidates to consider per shard. Increasing num_candidates tends to improve the accuracy of the final k results.\n", + " \"query_vector_builder\": { # Object indicating how to build a query_vector. kNN search enables you to perform semantic search by using a previously deployed text embedding model.\n", + " \"text_embedding\": {\n", + " \"model_id\": \".multilingual-e5-small-elasticsearch\", # Text embedding model id\n", + " \"model_text\": \"Comfortable furniture for a large balcony\", # Query\n", + " }\n", + " },\n", + " },\n", + " source_excludes=[\"*_description_vector\"], # Exclude vector fields from response\n", + ")\n", + "\n", + "dense_semantic_search_results = response[\"hits\"][\"hits\"]\n", + "results_list.append({\"dense_semantic_search\": dense_semantic_search_results})\n", + "print_search_results(dense_semantic_search_results)" + ] + }, + { + "cell_type": "markdown", + "id": "QlWFdngRQFbv", + "metadata": { + "id": "QlWFdngRQFbv" + }, + "source": [ + "## Semantic Search with Sparse Vector" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c5475e21", + "metadata": {}, + "outputs": [], + "source": [ + "# Elastic Learned Sparse Encoder - ELSER\n", + "\n", + "resp = client.search(\n", + " index=\"ecommerce-search\",\n", + " size=2,\n", + " query={\n", + " \"sparse_vector\": {\n", + " \"field\": \"elser_description_vector\",\n", + " \"inference_id\": \".elser-2-elasticsearch\",\n", + " \"query\": \"Comfortable furniture for a large balcony\",\n", + " }\n", + " },\n", + " source_excludes=[\"*_description_vector\"], # Exclude vector fields from response\n", + ")\n", + "\n", + "\n", + "sparse_semantic_search_results = resp[\"hits\"][\"hits\"]\n", + "results_list.append({\"sparse_semantic_search\": sparse_semantic_search_results})\n", + "print_search_results(sparse_semantic_search_results)" + ] + }, + { + "cell_type": "markdown", + "id": "3a2a5267", + "metadata": {}, + "source": [ + 
"## Semantic Search with `semantic_text` Type (ELSER)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4d2fb926", + "metadata": {}, + "outputs": [], + "source": [ + "# Elastic Learned Sparse Encoder - ELSER\n", + "\n", + "resp = client.search(\n", + " index=\"ecommerce-search\",\n", + " size=2,\n", + " query={\n", + " \"semantic\": {\n", + " \"field\": \"elser_semantic_description_vector\",\n", + " \"query\": \"Comfortable furniture for a large balcony\",\n", + " }\n", + " },\n", + " source_excludes=[\"*_description_vector\"], # Exclude vector fields from response\n", + ")\n", + "\n", + "elser_semantic_text_search_results = resp[\"hits\"][\"hits\"]\n", + "results_list.append({\"elser_semantic_text_search\": sparse_semantic_search_results})\n", + "print_search_results(elser_semantic_text_search_results)" + ] + }, + { + "cell_type": "markdown", + "id": "1df079f3", + "metadata": {}, + "source": [ + "## Semantic Search with `semantic_text` Type (e5)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2c0bf5fc-ab32-4f33-8f26-904ff10635a5", + "metadata": { + "id": "2c0bf5fc-ab32-4f33-8f26-904ff10635a5" + }, + "outputs": [], + "source": [ + "# Elastic Learned Sparse Encoder - ELSER\n", + "\n", + "resp = client.search(\n", + " index=\"ecommerce-search\",\n", + " size=2,\n", + " query={\n", + " \"semantic\": {\n", + " \"field\": \"e5_semantic_description_vector\",\n", + " \"query\": \"Comfortable furniture for a large balcony\",\n", + " }\n", + " },\n", + " source_excludes=[\"*_description_vector\"], # Exclude vector fields from response\n", + ")\n", + "\n", + "e5_semantic_text_search_results = resp[\"hits\"][\"hits\"]\n", + "results_list.append({\"e5_semantic_text_search\": e5_semantic_text_search_results})\n", + "print_search_results(e5_semantic_text_search_results)" + ] + }, + { + "cell_type": "markdown", + "id": "kz9deDBYQJxr", + "metadata": { + "id": "kz9deDBYQJxr" + }, + "source": [ + "## Hybrid Search - BM25 + Dense Vector 
linear combination" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f84aa16b-49c5-4abf-a049-d556c225542e", + "metadata": { + "id": "f84aa16b-49c5-4abf-a049-d556c225542e" + }, + "outputs": [], + "source": [ + "# BM25 + KNN (Linear Combination)\n", + "query = \"A dining table and comfortable chairs for a large balcony\"\n", + "resp = client.search(\n", + " index=\"ecommerce-search\",\n", + " size=2,\n", + " query={\n", + " \"bool\": {\n", + " \"should\": [\n", + " {\n", + " \"match\": {\n", + " \"description\": {\n", + " \"query\": query,\n", + " \"boost\": 1,\n", + " }\n", + " }\n", + " }\n", + " ]\n", + " }\n", + " },\n", + " knn={\n", + " \"field\": \"e5_description_vector\",\n", + " \"k\": 2,\n", + " \"num_candidates\": 20,\n", + " \"boost\": 1,\n", + " \"query_vector_builder\": {\n", + " \"text_embedding\": {\n", + " \"model_id\": \".multilingual-e5-small-elasticsearch\",\n", + " \"model_text\": query,\n", + " }\n", + " },\n", + " },\n", + " source_excludes=[\"*_description_vector\"], # Exclude vector fields from response\n", + ")\n", + "\n", + "dense_linear_search_results = resp[\"hits\"][\"hits\"]\n", + "results_list.append({\"dense_linear_search\": dense_linear_search_results})\n", + "print_search_results(dense_linear_search_results)" + ] + }, + { + "cell_type": "markdown", + "id": "cybkWjmpQV8g", + "metadata": { + "id": "cybkWjmpQV8g" + }, + "source": [ + "## Hybrid Search - BM25 + Dense Vector Reciprocal Rank Fusion (RRF)\n", + "\n", + "[Reciprocal rank fusion](https://www.elastic.co/docs/reference/elasticsearch/rest-apis/reciprocal-rank-fusion) (RRF) is a method for combining multiple result sets with different relevance indicators into a single result set. RRF requires no tuning, and the different relevance indicators do not have to be related to each other to achieve high-quality results." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "aa2e072d-37bb-43fd-a83f-e1cb55a24861", + "metadata": { + "id": "aa2e072d-37bb-43fd-a83f-e1cb55a24861" + }, + "outputs": [], + "source": [ + "# BM25 + KNN (RRF)\n", + "top_k = 2\n", + "resp = client.search(\n", + " index=\"ecommerce-search\",\n", + " retriever={\n", + " \"rrf\": {\n", + " \"retrievers\": [\n", + " {\n", + " \"standard\": {\n", + " \"query\": {\n", + " \"match\": {\n", + " \"description\": \"A dining table and comfortable chairs for a large balcony\"\n", + " }\n", + " }\n", + " }\n", + " },\n", + " {\n", + " \"knn\": {\n", + " \"field\": \"e5_description_vector\",\n", + " \"query_vector_builder\": {\n", + " \"text_embedding\": {\n", + " \"model_id\": e5_endpoint,\n", + " \"model_text\": \"A dining table and comfortable chairs for a large balcony\",\n", + " }\n", + " },\n", + " \"k\": 2,\n", + " \"num_candidates\": 20,\n", + " }\n", + " },\n", + " ],\n", + " \"rank_window_size\": 2,\n", + " \"rank_constant\": 20,\n", + " }\n", + " },\n", + " source_excludes=[\"*_description_vector\"], # Exclude vector fields from response\n", + ")\n", + "\n", + "dense_rrf_search_results = resp[\"hits\"][\"hits\"]\n", + "results_list.append({\"dense_rrf_search\": dense_rrf_search_results})\n", + "print_search_results(dense_rrf_search_results)" + ] + }, + { + "cell_type": "markdown", + "id": "LyKI2Z-XQbI6", + "metadata": { + "id": "LyKI2Z-XQbI6" + }, + "source": [ + "## Hybrid Search - BM25 + Sparse Vector linear combination" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bd842732-b20a-4c7a-b735-e1f558a9b922", + "metadata": { + "id": "bd842732-b20a-4c7a-b735-e1f558a9b922" + }, + "outputs": [], + "source": [ + "# BM25 + Elastic Learned Sparse Encoder (Linear Combination)\n", + "\n", + "resp = client.search(\n", + " index=\"ecommerce-search\",\n", + " size=2,\n", + " query={\n", + " \"bool\": {\n", + " \"should\": [\n", + " {\n", + " \"match\": {\n", + " \"description\": {\n", 
+ " \"query\": \"A dining table and comfortable chairs for a large balcony\",\n", + " \"boost\": 1, # You can adjust the boost value\n", + " }\n", + " }\n", + " },\n", + " {\n", + " \"sparse_vector\": {\n", + " \"field\": \"elser_description_vector\",\n", + " \"inference_id\": elser_endpoint,\n", + " \"query\": \"A dining table and comfortable chairs for a large balcony\",\n", + " }\n", + " },\n", + " ]\n", + " }\n", + " },\n", + " source_excludes=[\"*_description_vector\"], # Exclude vector fields from response\n", + ")\n", + "\n", + "sparse_linear_search_results = resp[\"hits\"][\"hits\"]\n", + "results_list.append({\"sparse_linear_search\": sparse_linear_search_results})\n", + "print_search_results(sparse_linear_search_results)" + ] + }, + { + "cell_type": "markdown", + "id": "e3d5e4e9", + "metadata": {}, + "source": [ + "## Hybrid Search - BM25 + Sparse Vector Reciprocal Rank Fusion (RRF)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "199c5c60", + "metadata": {}, + "outputs": [], + "source": [ + "# BM25 + ELSER (RRF)\n", + "top_k = 2\n", + "resp = client.search(\n", + " index=\"ecommerce-search\",\n", + " retriever={\n", + " \"rrf\": {\n", + " \"retrievers\": [\n", + " {\n", + " \"standard\": {\n", + " \"query\": {\n", + " \"match\": {\n", + " \"description\": \"A dining table and comfortable chairs for a large balcony\"\n", + " }\n", + " }\n", + " }\n", + " },\n", + " {\n", + " \"standard\": {\n", + " \"query\": {\n", + " \"sparse_vector\": {\n", + " \"field\": \"elser_description_vector\",\n", + " \"inference_id\": elser_endpoint,\n", + " \"query\": \"A dining table and comfortable chairs for a large balcony\",\n", + " }\n", + " }\n", + " }\n", + " },\n", + " ],\n", + " \"rank_window_size\": 2,\n", + " \"rank_constant\": 20,\n", + " }\n", + " },\n", + " source_excludes=[\"*_description_vector\"], # Exclude vector fields from response\n", + ")\n", + "\n", + "sparse_rrf_search_results = resp[\"hits\"][\"hits\"]\n", + 
"results_list.append({\"sparse_rrf_search\": sparse_rrf_search_results})\n", + "print_search_results(sparse_rrf_search_results)" + ] + }, + { + "cell_type": "markdown", + "id": "7b95f9b8", + "metadata": {}, + "source": [ + "TODO: \n", + "- Semantic Text / Query BUilder (ask Serena)\n", + "- Table of Results\n", + "- Conclusion\n", + "- Next steps\n", + "\n", + "\n", + "\n", + "## Compiled Results\n", + "Here are the results of the previous searches. We can see that all of the results return approximately the same the products." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1162a857", + "metadata": {}, + "outputs": [], + "source": [ + "# Flatten results for each search type, preserving insertion order\n", + "rows = []\n", + "for result in results_list:\n", + " search_type = list(result.keys())[0]\n", + " for doc in result[search_type]:\n", + " row = {\n", + " \"search_type\": search_type,\n", + " \"product\": doc[\"_source\"].get(\"product\"),\n", + " \"category\": doc[\"_source\"].get(\"category\"),\n", + " \"description\": doc[\"_source\"].get(\"description\"),\n", + " \"score\": doc.get(\"_score\"),\n", + " }\n", + " rows.append(row)\n", + "\n", + "# Create DataFrame without altering row order\n", + "df = pd.DataFrame(rows)\n", + "\n", + "# Get the unique search_types in order of appearance\n", + "ordered_search_types = []\n", + "for row in rows:\n", + " st = row[\"search_type\"]\n", + " if st not in ordered_search_types:\n", + " ordered_search_types.append(st)\n", + "\n", + "for search_type in ordered_search_types:\n", + " group = df[df[\"search_type\"] == search_type]\n", + " display(Markdown(f\"### {search_type.replace('_', ' ').title()}\"))\n", + " styled = (\n", + " group.drop(columns=\"search_type\")\n", + " .reset_index(drop=True)\n", + " .style.set_properties(\n", + " subset=[\"description\"],\n", + " **{\"white-space\": \"pre-wrap\", \"word-break\": \"break-word\"},\n", + " )\n", + " .hide(axis=\"index\") # For pandas 
>=1.4.0\n", + " )\n", + " display(styled)" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}