|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "## Redis Vector Library (RedisVL)\n", |
| 8 | + "\n", |
| 9 | + "RedisVL is a command line interface (CLI) and Python library for loading data into Redis and creating vector search indices on it. While ``redis-cli`` can perform similar actions, ``redisvl`` is aimed specifically at setting up vector similarity search (VSS) use cases.\n", |
| 10 | + "\n", |
| 11 | + "This notebook will walk through:\n", |
| 12 | + "1. Preparing a dataset with vectors.\n", |
| 13 | + "2. Writing a data schema for ``redisvl``\n", |
| 14 | + "3. Loading the data and creating a vector search index\n", |
| 15 | + "4. Combining vector search with tag, text, and numeric search\n", |
| 16 | + "5. Performing queries\n", |
| 17 | + "\n", |
| 18 | + "Before running this notebook, be sure to\n", |
| 19 | + "1. Have ``redisvl`` installed and have that environment active for this notebook.\n", |
| 20 | + "2. Have a running Redis instance with RediSearch > 2.4." |
| 21 | + ] |
| 22 | + }, |
| 23 | + { |
| 24 | + "cell_type": "markdown", |
| 25 | + "metadata": {}, |
| 26 | + "source": [ |
| 27 | + "### 1.1 Creating Vector Embeddings\n", |
| 28 | + "\n", |
| 29 | + "For this example, we will use the following overly simplified dataset:\n" |
| 30 | + ] |
| 31 | + }, |
| 32 | + { |
| 33 | + "cell_type": "code", |
| 34 | + "execution_count": 2, |
| 35 | + "metadata": {}, |
| 36 | + "outputs": [], |
| 37 | + "source": [ |
| 38 | + "import pandas as pd\n", |
| 39 | + "import numpy as np\n", |
| 40 | + "\n", |
| 41 | + "data = pd.DataFrame(\n", |
| 42 | + " {\n", |
| 43 | + " \"users\": [\"john\", \"mary\", \"joe\"],\n", |
| 44 | + " \"age\": [1, 2, 3],\n", |
| 45 | + " \"job\": [\"engineer\", \"doctor\", \"dentist\"],\n", |
| 46 | + " \"credit_score\": [\"high\", \"low\", \"medium\"]\n", |
| 47 | + " }\n", |
| 48 | + ")" |
| 49 | + ] |
| 50 | + }, |
| 51 | + { |
| 52 | + "cell_type": "markdown", |
| 53 | + "metadata": {}, |
| 54 | + "source": [ |
| 55 | + "This dataset will become 3 entries in Redis (hashes), each with 4 fields (users, age, job, credit_score).\n", |
| 56 | + "\n", |
| 57 | + "Now, we want to add vectors to represent each user. These are just dummy vectors to illustrate the point, but more complex vectors can be created and used as well. For more information on creating embeddings, see this [article](https://mlops.community/vector-similarity-search-from-basics-to-production/).\n", |
| 58 | + "\n", |
| 59 | + "Let's add a vector column named ``user_embedding`` to the above dataframe." |
| 60 | + ] |
| 61 | + }, |
| 62 | + { |
| 63 | + "cell_type": "code", |
| 64 | + "execution_count": 3, |
| 65 | + "metadata": {}, |
| 66 | + "outputs": [ |
| 67 | + { |
| 68 | + "data": { |
| 69 | + "text/html": [ |
| 70 | + "<div>\n", |
| 71 | + "<style scoped>\n", |
| 72 | + " .dataframe tbody tr th:only-of-type {\n", |
| 73 | + " vertical-align: middle;\n", |
| 74 | + " }\n", |
| 75 | + "\n", |
| 76 | + " .dataframe tbody tr th {\n", |
| 77 | + " vertical-align: top;\n", |
| 78 | + " }\n", |
| 79 | + "\n", |
| 80 | + " .dataframe thead th {\n", |
| 81 | + " text-align: right;\n", |
| 82 | + " }\n", |
| 83 | + "</style>\n", |
| 84 | + "<table border=\"1\" class=\"dataframe\">\n", |
| 85 | + " <thead>\n", |
| 86 | + " <tr style=\"text-align: right;\">\n", |
| 87 | + " <th></th>\n", |
| 88 | + " <th>users</th>\n", |
| 89 | + " <th>age</th>\n", |
| 90 | + " <th>job</th>\n", |
| 91 | + " <th>credit_score</th>\n", |
| 92 | + " <th>user_embedding</th>\n", |
| 93 | + " </tr>\n", |
| 94 | + " </thead>\n", |
| 95 | + " <tbody>\n", |
| 96 | + " <tr>\n", |
| 97 | + " <th>0</th>\n", |
| 98 | + " <td>john</td>\n", |
| 99 | + " <td>1</td>\n", |
| 100 | + " <td>engineer</td>\n", |
| 101 | + " <td>high</td>\n", |
| 102 | + " <td>b'\\xcd\\xcc\\xcc=\\xcd\\xcc\\xcc=\\x00\\x00\\x00?'</td>\n", |
| 103 | + " </tr>\n", |
| 104 | + " <tr>\n", |
| 105 | + " <th>1</th>\n", |
| 106 | + " <td>mary</td>\n", |
| 107 | + " <td>2</td>\n", |
| 108 | + " <td>doctor</td>\n", |
| 109 | + " <td>low</td>\n", |
| 110 | + " <td>b'\\xcd\\xcc\\xcc=\\xcd\\xcc\\xcc=\\x00\\x00\\x00?'</td>\n", |
| 111 | + " </tr>\n", |
| 112 | + " <tr>\n", |
| 113 | + " <th>2</th>\n", |
| 114 | + " <td>joe</td>\n", |
| 115 | + " <td>3</td>\n", |
| 116 | + " <td>dentist</td>\n", |
| 117 | + " <td>medium</td>\n", |
| 118 | + " <td>b'fff?fff?\\xcd\\xcc\\xcc='</td>\n", |
| 119 | + " </tr>\n", |
| 120 | + " </tbody>\n", |
| 121 | + "</table>\n", |
| 122 | + "</div>" |
| 123 | + ], |
| 124 | + "text/plain": [ |
| 125 | + " users age job credit_score \\\n", |
| 126 | + "0 john 1 engineer high \n", |
| 127 | + "1 mary 2 doctor low \n", |
| 128 | + "2 joe 3 dentist medium \n", |
| 129 | + "\n", |
| 130 | + " user_embedding \n", |
| 131 | + "0 b'\\xcd\\xcc\\xcc=\\xcd\\xcc\\xcc=\\x00\\x00\\x00?' \n", |
| 132 | + "1 b'\\xcd\\xcc\\xcc=\\xcd\\xcc\\xcc=\\x00\\x00\\x00?' \n", |
| 133 | + "2 b'fff?fff?\\xcd\\xcc\\xcc=' " |
| 134 | + ] |
| 135 | + }, |
| 136 | + "execution_count": 3, |
| 137 | + "metadata": {}, |
| 138 | + "output_type": "execute_result" |
| 139 | + } |
| 140 | + ], |
| 141 | + "source": [ |
| 142 | + "data[\"user_embedding\"] = [\n", |
| 143 | + " np.array([0.1, 0.1, 0.5], dtype=np.float32).tobytes(),\n", |
| 144 | + " np.array([0.1, 0.1, 0.5], dtype=np.float32).tobytes(),\n", |
| 145 | + " np.array([0.9, 0.9, 0.1], dtype=np.float32).tobytes(),\n", |
| 146 | + "]\n", |
| 147 | + "data" |
| 148 | + ] |
| 149 | + }, |
| 150 | + { |
| 151 | + "cell_type": "markdown", |
| 152 | + "metadata": {}, |
| 153 | + "source": [ |
| 154 | + "As seen above, the vectors themselves need to be turned into bytes before they can be loaded into Redis. With ``NumPy``, calling ``tobytes()`` on a ``float32`` array is all that is needed.\n", |
| 155 | + "\n", |
| 156 | + "Our dataset is now ready to be used with ``redisvl``." |
| 157 | + ] |
| 158 | + }, |
| 159 | + { |
| 160 | + "cell_type": "markdown", |
| 161 | + "metadata": {}, |
| 162 | + "source": [ |
| 163 | + "### 1.2 Writing data schema for ``redisvl``\n", |
| 164 | + "\n", |
| 165 | + "To keep ``redisvl`` flexible across many types of data, it uses a schema specified in either a Python dictionary or a YAML file. The schema has two main components:\n", |
| 166 | + "\n", |
| 167 | + "1. index specification\n", |
| 168 | + "2. field specification\n", |
| 169 | + "\n", |
| 170 | + "The index specification determines how data will be stored in Redis. This includes\n", |
| 171 | + "- ``name``: the name of the index\n", |
| 172 | + "- ``prefix``: key prefix for each loaded entry\n", |
| 173 | + "- ``key_field``: field within the dataset to use as unique keys\n", |
| 174 | + "\n", |
| 175 | + "The field specification determines which fields within the dataset will be available for queries. Fields are grouped by type, and each field name corresponds to the name of a **column** within the dataset. The value given for each column is a dictionary of arguments used to create that field in the index; these correspond directly to ``redis-py`` arguments.\n", |
| 176 | + "\n", |
| 177 | + "So for example, given the above dataset, the following schema can be used." |
| 178 | + ] |
| 179 | + }, |
| 180 | + { |
| 181 | + "cell_type": "code", |
| 182 | + "execution_count": 4, |
| 183 | + "metadata": {}, |
| 184 | + "outputs": [], |
| 185 | + "source": [ |
| 186 | + "schema = {\n", |
| 187 | + " \"index\": {\n", |
| 188 | + " \"name\": \"user_index\",\n", |
| 189 | + " \"prefix\": \"user:\",\n", |
| 190 | + " \"key_field\": \"users\",\n", |
| 191 | + " \"storage_type\": \"hash\",\n", |
| 192 | + " },\n", |
| 193 | + " \"fields\": {\n", |
| 194 | + " # each key is a field type\n", |
| 195 | + " # each value maps a column name in the dataframe to that field's options\n", |
| 196 | + " \"tag\": {\"credit_score\": {}},\n", |
| 197 | + " \"text\": {\"job\": {}},\n", |
| 198 | + " \"numeric\": {\"age\": {}},\n", |
| 199 | + " \"vector\": {\n", |
| 200 | + " \"user_embedding\": {\n", |
| 201 | + " \"dims\": 3,\n", |
| 202 | + " \"distance_metric\": \"cosine\",\n", |
| 203 | + " \"algorithm\": \"flat\",\n", |
| 204 | + " \"datatype\": \"float32\",\n", |
| 205 | + " }\n", |
| 206 | + " },\n", |
| 207 | + " },\n", |
| 208 | + "}" |
| 209 | + ] |
| 210 | + }, |
| 211 | + { |
| 212 | + "cell_type": "markdown", |
| 213 | + "metadata": {}, |
| 214 | + "source": [ |
| 215 | + "### 1.3 Creating a search index\n", |
| 216 | + "\n", |
| 217 | + "With the data and the index schema defined, we can now use ``redisvl`` as a library to create a search index as follows.\n", |
| 218 | + "\n", |
| 219 | + "Note that at this point the index has no entries. With Redis, this is fine: new entries that match the index definition (here, hashes whose keys start with the ``user:`` prefix) will automatically be indexed in the background." |
| 220 | + ] |
| 221 | + }, |
| 222 | + { |
| 223 | + "cell_type": "code", |
| 224 | + "execution_count": 5, |
| 225 | + "metadata": {}, |
| 226 | + "outputs": [], |
| 227 | + "source": [ |
| 228 | + "from redisvl.index import SearchIndex\n", |
| 229 | + "\n", |
| 230 | + "# construct a search index from the schema\n", |
| 231 | + "index = SearchIndex.from_dict(schema)\n", |
| 232 | + "\n", |
| 233 | + "# connect to local redis instance\n", |
| 234 | + "index.connect(\"localhost\", 6379)\n", |
| 235 | + "\n", |
| 236 | + "# create the index (no data yet)\n", |
| 237 | + "index.create()" |
| 238 | + ] |
| 239 | + }, |
| 240 | + { |
| 241 | + "cell_type": "markdown", |
| 242 | + "metadata": {}, |
| 243 | + "source": [ |
| 244 | + "### 1.4 Loading Data with PandasReader\n", |
| 245 | + "\n", |
| 246 | + "In this section, we will take the dataframe defined above and load it into our search index so that we can query it." |
| 247 | + ] |
| 248 | + }, |
| 249 | + { |
| 250 | + "cell_type": "code", |
| 251 | + "execution_count": 6, |
| 252 | + "metadata": {}, |
| 253 | + "outputs": [], |
| 254 | + "source": [ |
| 255 | + "from redisvl.readers import PandasReader\n", |
| 256 | + "\n", |
| 257 | + "# Initialize a reader for a pandas dataframe.\n", |
| 258 | + "reader = PandasReader(data)\n", |
| 259 | + "\n", |
| 260 | + "# load the data into Redis\n", |
| 261 | + "index.load(reader)" |
| 262 | + ] |
| 263 | + }, |
| 264 | + { |
| 265 | + "cell_type": "markdown", |
| 266 | + "metadata": {}, |
| 267 | + "source": [ |
| 268 | + "### 1.5 Running a Vector Search\n", |
| 269 | + "\n", |
| 270 | + "Next, we will run a vector query on our newly populated index. This example uses a simple vector to demonstrate how vector similarity works. Vectors in production will be much larger than 3 floats and often require machine learning models (e.g., Hugging Face sentence transformers) or an embeddings API (Cohere, OpenAI) to create." |
| 271 | + ] |
| 272 | + }, |
| 273 | + { |
| 274 | + "cell_type": "code", |
| 275 | + "execution_count": 8, |
| 276 | + "metadata": {}, |
| 277 | + "outputs": [], |
| 278 | + "source": [ |
| 279 | + "from redisvl.query import create_vector_query\n", |
| 280 | + "\n", |
| 281 | + "# create a vector query returning a number of results\n", |
| 282 | + "# with specific fields to return.\n", |
| 283 | + "query = create_vector_query(\n", |
| 284 | + " return_fields=[\"users\", \"age\", \"job\", \"credit_score\", \"vector_score\"],\n", |
| 285 | + " number_of_results=3,\n", |
| 286 | + " vector_param_name=\"vec_param\",\n", |
| 287 | + " vector_field_name=\"user_embedding\",\n", |
| 288 | + " sort=True\n", |
| 289 | + ")\n", |
| 290 | + "\n", |
| 291 | + "# establish a query vector to search against the data in Redis\n", |
| 292 | + "query_vector = np.array([0.1, 0.1, 0.5], dtype=np.float32).tobytes()\n", |
| 293 | + "\n", |
| 294 | + "# use the SearchIndex instance (or Redis client) to execute the query\n", |
| 295 | + "results = index.search(query, query_params={\"vec_param\": query_vector})" |
| 296 | + ] |
| 297 | + }, |
| 298 | + { |
| 299 | + "cell_type": "code", |
| 300 | + "execution_count": 9, |
| 301 | + "metadata": {}, |
| 302 | + "outputs": [ |
| 303 | + { |
| 304 | + "name": "stdout", |
| 305 | + "output_type": "stream", |
| 306 | + "text": [ |
| 307 | + "Score: 0\n", |
| 308 | + "Document {'id': 'user:john', 'payload': None, 'vector_score': '0', 'users': 'john', 'age': '1', 'job': 'engineer', 'credit_score': 'high'}\n", |
| 309 | + "Score: 0\n", |
| 310 | + "Document {'id': 'user:mary', 'payload': None, 'vector_score': '0', 'users': 'mary', 'age': '2', 'job': 'doctor', 'credit_score': 'low'}\n", |
| 311 | + "Score: 0.653301358223\n", |
| 312 | + "Document {'id': 'user:joe', 'payload': None, 'vector_score': '0.653301358223', 'users': 'joe', 'age': '3', 'job': 'dentist', 'credit_score': 'medium'}\n" |
| 313 | + ] |
| 314 | + } |
| 315 | + ], |
| 316 | + "source": [ |
| 317 | + "for doc in results.docs:\n", |
| 318 | + " print(\"Score:\", doc.vector_score)\n", |
| 319 | + " print(doc)\n" |
| 320 | + ] |
| 321 | + }, |
| 322 | + { |
| 323 | + "cell_type": "code", |
| 324 | + "execution_count": null, |
| 325 | + "metadata": {}, |
| 326 | + "outputs": [], |
| 327 | + "source": [] |
| 328 | + } |
| 329 | + ], |
| 330 | + "metadata": { |
| 331 | + "kernelspec": { |
| 332 | + "display_name": "Python 3.8.13 ('redisvl2')", |
| 333 | + "language": "python", |
| 334 | + "name": "python3" |
| 335 | + }, |
| 336 | + "language_info": { |
| 337 | + "codemirror_mode": { |
| 338 | + "name": "ipython", |
| 339 | + "version": 3 |
| 340 | + }, |
| 341 | + "file_extension": ".py", |
| 342 | + "mimetype": "text/x-python", |
| 343 | + "name": "python", |
| 344 | + "nbconvert_exporter": "python", |
| 345 | + "pygments_lexer": "ipython3", |
| 346 | + "version": "3.8.13" |
| 347 | + }, |
| 348 | + "orig_nbformat": 4, |
| 349 | + "vscode": { |
| 350 | + "interpreter": { |
| 351 | + "hash": "9b1e6e9c2967143209c2f955cb869d1d3234f92dc4787f49f155f3abbdfb1316" |
| 352 | + } |
| 353 | + } |
| 354 | + }, |
| 355 | + "nbformat": 4, |
| 356 | + "nbformat_minor": 2 |
| 357 | +} |
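
The byte encoding from section 1.1 and the scores printed in section 1.5 can be sanity-checked without a Redis instance. The following is a minimal sketch using only the Python standard library; the helper names (`to_float32_bytes`, `from_float32_bytes`, `cosine_distance`) are hypothetical and not part of ``redisvl``. It assumes that the `cosine` metric is reported as a distance (1 minus cosine similarity), which is consistent with the `vector_score` values in the notebook output, and that the byte layout is little-endian float32, as the `user_embedding` column shows.

```python
import math
import struct


def to_float32_bytes(values):
    """Pack floats as little-endian float32 (4 bytes per value)."""
    return struct.pack(f"<{len(values)}f", *values)


def from_float32_bytes(blob):
    """Unpack a float32 byte string back into a tuple of Python floats."""
    return struct.unpack(f"<{len(blob) // 4}f", blob)


def cosine_distance(a, b):
    """Return 1 - cosine similarity (assumed meaning of the 'cosine' metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)


# Same dummy vectors as the notebook's ``user_embedding`` column.
stored = {
    "john": to_float32_bytes([0.1, 0.1, 0.5]),
    "mary": to_float32_bytes([0.1, 0.1, 0.5]),
    "joe": to_float32_bytes([0.9, 0.9, 0.1]),
}

# The schema declares dims=3 and datatype=float32, so each blob is 3 * 4 bytes.
DIMS = 3
assert all(len(blob) == DIMS * 4 for blob in stored.values())

# Same query vector as in section 1.5.
query = from_float32_bytes(to_float32_bytes([0.1, 0.1, 0.5]))

scores = {
    user: cosine_distance(query, from_float32_bytes(blob))
    for user, blob in stored.items()
}

# Print users from closest to farthest, mirroring sort=True in the query.
for user, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{user}: {score:.6f}")
```

A length check like the one above is a cheap guard before loading: a vector whose byte length is not `dims * 4` does not match the `float32` field declared in the schema.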