diff --git a/notebooks/README.md b/notebooks/README.md
Lines changed: 5 additions & 2 deletions
diff --git a/notebooks/ingestion-and-chunking/_nbtest.teardown.ipynb b/notebooks/ingestion-and-chunking/_nbtest.teardown.ipynb
Lines changed: 87 additions & 0 deletions
diff --git a/notebooks/ingestion-and-chunking/json-chunking-ingest.ipynb b/notebooks/ingestion-and-chunking/json-chunking-ingest.ipynb
Lines changed: 237 additions & 0 deletions
@@ -9,13 +9,16 @@ Notebooks are organized into the following folders:
 
 - [`search`](./search/): Notebooks that demonstrate the fundamentals of Elasticsearch, like indexing embeddings, running lexical, semantic and _hybrid_ searches, and more.
 
-- [`enterprise-search`](./enterprise-search/): Notebooks that demonstrate use cases for working with and exporting from Elastic Enterprise Search, App Search, or Workplace Search.
+- [`ingestion-and-chunking`](./ingestion-and-chunking/): Notebooks that demonstrate how to ingest and chunk documents from PDF, HTML and JSON sources for indexing in Elasticsearch with ELSER.
 
 - [`generative-ai`](./generative-ai/): Notebooks that demonstrate various use cases for Elasticsearch as the retrieval engine and vector store for LLM-powered applications.
 
+- [`langchain`](./langchain/): Notebooks that demonstrate how to integrate Elastic with [LangChain](https://langchain-langchain.vercel.app/docs/get_started/introduction.html), a framework for developing applications powered by language models.
+
 - [`integrations`](./integrations/): Notebooks that demonstrate how to integrate popular services and projects with Elasticsearch:
+
   - [OpenAI](./integrations/openai)
   - [Hugging Face](./integrations/hugging-face)
   - [LlamaIndex](./integrations/llama-index)
 
-- [`langchain`](./langchain/): Notebooks that demonstrate how to integrate Elastic with [LangChain](https://langchain-langchain.vercel.app/docs/get_started/introduction.html), a framework for developing applications powered by language models.
+- [`enterprise-search`](./enterprise-search/): Notebooks that demonstrate use cases for working with and exporting from Elastic Enterprise Search, App Search, or Workplace Search.
@@ -0,0 +1,87 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "1422b7bb-bc8c-42bb-b070-53fce3cf6144",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from elasticsearch import Elasticsearch\n",
+    "from getpass import getpass\n",
+    "\n",
+    "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n",
+    "ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n",
+    "\n",
+    "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n",
+    "ELASTIC_API_KEY = getpass(\"Elastic API Key: \")\n",
+    "\n",
+    "# Create the client instance\n",
+    "client = Elasticsearch(\n",
+    "    # For local development\n",
+    "    # hosts=[\"http://localhost:9200\"]\n",
+    "    cloud_id=ELASTIC_CLOUD_ID,\n",
+    "    api_key=ELASTIC_API_KEY,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "e4a89367-d23a-4340-bc92-2dcabd18adcd",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "ObjectApiResponse({'acknowledged': True})"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "client.indices.delete(\n",
+    "    index=\"json_chunked_index,pdf_chunked_index,website_chunked_index\",\n",
+    "    ignore_unavailable=True,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "4ac37f1b-6122-49fe-a3b8-e8f2025a0961",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "try:\n",
+    "    client.ml.delete_trained_model(model_id=\".elser_model_2\", force=True)\n",
+    "except Exception:\n",
+    "    # The model may not exist; nothing to clean up in that case\n",
+    "    pass"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
@@ -0,0 +1,237 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# JSON Load, Extraction and Ingest with ELSER Example\n",
+    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/ingestion-and-chunking/json-chunking-ingest.ipynb)\n",
+    "\n",
+    "This notebook demonstrates how to load a JSON file, split its text into passages, and ingest them into Elasticsearch.\n",
+    "\n",
+    "In this example we will:\n",
+    "- load the JSON records using jq\n",
+    "- chunk the text with a LangChain document splitter\n",
+    "- ingest the chunks into Elasticsearch with the LangChain Elasticsearch vector store\n",
+    "\n",
+    "We will also set up your Elasticsearch cluster with the ELSER model, so that we can use it to embed the passages."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "zQlYpYkI46Ff",
+    "outputId": "83677846-8a6a-4b49-fde0-16d473778814"
+   },
+   "outputs": [],
+   "source": [
+    "!pip install -qU langchain_community langchain elasticsearch tiktoken langchain-elasticsearch jq"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "GCZR7-zK810e"
+   },
+   "source": [
+    "## Connecting to Elasticsearch"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "id": "DofNZ2w25nIr"
+   },
+   "outputs": [],
+   "source": [
+    "from elasticsearch import Elasticsearch\n",
+    "from getpass import getpass\n",
+    "\n",
+    "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id\n",
+    "ELASTIC_CLOUD_ID = getpass(\"Elastic Cloud ID: \")\n",
+    "\n",
+    "# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key\n",
+    "ELASTIC_API_KEY = getpass(\"Elastic API Key: \")\n",
+    "\n",
+    "client = Elasticsearch(\n",
+    "    # For local development\n",
+    "    # \"http://localhost:9200\",\n",
+    "    # basic_auth=(\"elastic\", \"changeme\")\n",
+    "    cloud_id=ELASTIC_CLOUD_ID,\n",
+    "    api_key=ELASTIC_API_KEY,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "zv6hKYWr8-Mg"
+   },
+   "source": [
+    "## Deploying ELSER"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "1U4ffD2K9BkJ"
+   },
+   "outputs": [],
+   "source": [
+    "import time\n",
+    "\n",
+    "model = \".elser_model_2\"\n",
+    "\n",
+    "try:\n",
+    "    client.ml.put_trained_model(model_id=model, input={\"field_names\": [\"text_field\"]})\n",
+    "except Exception:\n",
+    "    # The model may already exist; the status check below confirms it is ready\n",
+    "    pass\n",
+    "\n",
+    "while True:\n",
+    "    status = client.ml.get_trained_models(model_id=model, include=\"definition_status\")\n",
+    "\n",
+    "    if status[\"trained_model_configs\"][0][\"fully_defined\"]:\n",
+    "        print(model + \" is downloaded and ready to be deployed.\")\n",
+    "        break\n",
+    "    else:\n",
+    "        print(model + \" is downloading or not ready to be deployed.\")\n",
+    "    time.sleep(5)\n",
+    "\n",
+    "client.ml.start_trained_model_deployment(\n",
+    "    model_id=model, number_of_allocations=1, wait_for=\"starting\"\n",
+    ")\n",
+    "\n",
+    "while True:\n",
+    "    status = client.ml.get_trained_models_stats(\n",
+    "        model_id=model,\n",
+    "    )\n",
+    "    if status[\"trained_model_stats\"][0][\"deployment_stats\"][\"state\"] == \"started\":\n",
+    "        print(model + \" has been successfully deployed.\")\n",
+    "        break\n",
+    "    else:\n",
+    "        print(model + \" is currently being deployed.\")\n",
+    "    time.sleep(5)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "wqYXqJxn9JsA"
+   },
+   "source": [
+    "## Loading a JSON file and chunking it into docs\n",
+    "This will download the JSON dataset from the URL provided, and then chunk the `content` field of each record into passage docs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {
+    "id": "7bN32vunqIk2"
+   },
+   "outputs": [],
+   "source": [
+    "from langchain_community.document_loaders import JSONLoader\n",
+    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+    "from urllib.request import urlopen\n",
+    "import json\n",
+    "\n",
+    "# Change the URL to the desired dataset\n",
+    "url = \"https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/datasets/workplace-documents.json\"\n",
+    "\n",
+    "response = urlopen(url)\n",
+    "data = json.load(response)\n",
+    "\n",
+    "with open(\"temp.json\", \"w\") as json_file:\n",
+    "    json.dump(data, json_file)\n",
+    "\n",
+    "\n",
+    "# Metadata function to extract metadata from the record\n",
+    "def metadata_func(record: dict, metadata: dict) -> dict:\n",
+    "    metadata[\"name\"] = record.get(\"name\")\n",
+    "    metadata[\"summary\"] = record.get(\"summary\")\n",
+    "    metadata[\"url\"] = record.get(\"url\")\n",
+    "    metadata[\"category\"] = record.get(\"category\")\n",
+    "    metadata[\"updated_at\"] = record.get(\"updated_at\")\n",
+    "\n",
+    "    return metadata\n",
+    "\n",
+    "\n",
+    "# For more loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/\n",
+    "# And 3rd party loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/#third-party-loaders\n",
+    "loader = JSONLoader(\n",
+    "    file_path=\"temp.json\",\n",
+    "    jq_schema=\".[]\",\n",
+    "    content_key=\"content\",\n",
+    "    metadata_func=metadata_func,\n",
+    ")\n",
+    "\n",
+    "text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(\n",
+    "    chunk_size=512, chunk_overlap=256\n",
+    ")\n",
+    "docs = loader.load_and_split(text_splitter=text_splitter)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Ingesting the passages into Elasticsearch\n",
+    "This will ingest the passage docs into the Elasticsearch index, under the specified INDEX_NAME."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "0xtdeIJI9N9-"
+   },
+   "outputs": [],
+   "source": [
+    "from langchain_elasticsearch import ElasticsearchStore\n",
+    "\n",
+    "INDEX_NAME = \"json_chunked_index\"\n",
+    "\n",
+    "ElasticsearchStore.from_documents(\n",
+    "    docs,\n",
+    "    es_connection=client,\n",
+    "    index_name=INDEX_NAME,\n",
+    "    strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(model_id=model),\n",
+    "    bulk_kwargs={\n",
+    "        \"request_timeout\": 180,\n",
+    "    },\n",
+    ")"
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "include_colab_link": true,
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}