
Commit 7afdafb

[Playground] notebooks for ingestion suitable for playground (#232)
* notebooks for ingestion suitable for playground
* update copy
* update formatting
* updates
* Update notebooks/ingestion-and-chunking/website-chunking-ingest.ipynb
  Co-authored-by: Liam Thompson <[email protected]>
* updates
* update teardown script
* update test file
* updates to see if passes
* remove pypdf dep
* updates

---------

Co-authored-by: Liam Thompson <[email protected]>
1 parent 8f68be1 commit 7afdafb

File tree

5 files changed: +748 −2 lines changed


notebooks/README.md

+5 −2

@@ -9,13 +9,16 @@ Notebooks are organized into the following folders:
 
 - [`search`](./search/): Notebooks that demonstrate the fundamentals of Elasticsearch, like indexing embeddings, running lexical, semantic and _hybrid_ searches, and more.
 
-- [`enterprise-search`](./enterprise-search/): Notebooks that demonstrate use cases for working with and exporting from Elastic Enterprise Search, App Search, or Workplace Search.
+- [`doc-ingestion-and-chunking`](./ingestion-and-chunking/): Notebooks that demonstrate how to ingest and chunk documents for indexing in Elasticsearch from PDF, HTML and JSON with ELSER.
 
 - [`generative-ai`](./generative-ai/): Notebooks that demonstrate various use cases for Elasticsearch as the retrieval engine and vector store for LLM-powered applications.
 
+- [`langchain`](./langchain/): Notebooks that demonstrate how to integrate Elastic with [LangChain](https://langchain-langchain.vercel.app/docs/get_started/introduction.html), a framework for developing applications powered by language models.
+
 - [`integrations`](./integrations/): Notebooks that demonstrate how to integrate popular services and projects with Elasticsearch:
+
 - [OpenAI](./integrations/openai)
 - [Hugging Face](./integrations/hugging-face)
 - [LlamaIndex](./integrations/llama-index)
 
-- [`langchain`](./langchain/): Notebooks that demonstrate how to integrate Elastic with [LangChain](https://langchain-langchain.vercel.app/docs/get_started/introduction.html), a framework for developing applications powered by language models.
+- [`enterprise-search`](./enterprise-search/): Notebooks that demonstrate use cases for working with and exporting from Elastic Enterprise Search, App Search, or Workplace Search.
@@ -0,0 +1,87 @@
from elasticsearch import Elasticsearch
from getpass import getpass

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic Api Key: ")

# Create the client instance
client = Elasticsearch(
    # For local development
    # hosts=["http://localhost:9200"]
    cloud_id=ELASTIC_CLOUD_ID,
    api_key=ELASTIC_API_KEY,
)
client.indices.delete(
    index="json_chunked_index,pdf_chunked_index,website_chunked_index",
    ignore_unavailable=True,
)

Output: ObjectApiResponse({'acknowledged': True})
try:
    client.ml.delete_trained_model(model_id=".elser_model_2", force=True)
except:
    pass
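Not part of the committed notebook, but a quick way to confirm the teardown worked is to check that the indices and the ELSER model are really gone. A minimal sketch, reusing the `client` created above:

# Sketch only: verify the teardown above actually removed everything.
for index in ["json_chunked_index", "pdf_chunked_index", "website_chunked_index"]:
    # False once the index has been deleted
    print(index, bool(client.indices.exists(index=index)))

try:
    client.ml.get_trained_models(model_id=".elser_model_2")
    print(".elser_model_2 is still installed")
except Exception:
    # get_trained_models raises NotFoundError once the model is removed
    print(".elser_model_2 has been removed")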
(Notebook metadata: Python 3 (ipykernel) kernel, Python 3.10.3, nbformat 4.)
@@ -0,0 +1,237 @@
# JSON load, Extraction and Ingest with ELSER Example
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/elastic/elasticsearch-labs/blob/main/notebooks/ingestion/json-chunking-ingest.ipynb)

This workbook demonstrates how to load a JSON file, create passages, and ingest them into Elasticsearch.

In this example we will:
- load the JSON using jq
- chunk the text with the LangChain document splitter
- ingest into Elasticsearch with the LangChain Elasticsearch vector store

We will also set up your Elasticsearch cluster with the ELSER model, so we can use it to embed the passages.

!pip install -qU langchain_community langchain elasticsearch tiktoken langchain-elasticsearch jq
## Connecting to Elasticsearch

from elasticsearch import Elasticsearch
from getpass import getpass

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

# https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
ELASTIC_API_KEY = getpass("Elastic Api Key: ")

client = Elasticsearch(
    # For local development
    # "http://localhost:9200",
    # basic_auth=("elastic", "changeme")
    cloud_id=ELASTIC_CLOUD_ID,
    api_key=ELASTIC_API_KEY,
)
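A bad Cloud ID or API key otherwise only surfaces on the first real request, so a quick connectivity check (not in the original notebook) can be worth running right after creating the client:

# Sketch only: confirm the client can reach the cluster before going further.
# Raises an authentication or connection error if the credentials or Cloud ID
# are wrong; otherwise prints the cluster name and version.
print(client.info())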
## Deploying ELSER

import time

model = ".elser_model_2"

try:
    client.ml.put_trained_model(model_id=model, input={"field_names": ["text_field"]})
except:
    pass

while True:
    status = client.ml.get_trained_models(model_id=model, include="definition_status")

    if status["trained_model_configs"][0]["fully_defined"]:
        print(model + " is downloaded and ready to be deployed.")
        break
    else:
        print(model + " is downloading or not ready to be deployed.")
        time.sleep(5)

client.ml.start_trained_model_deployment(
    model_id=model, number_of_allocations=1, wait_for="starting"
)

while True:
    status = client.ml.get_trained_models_stats(
        model_id=model,
    )
    if status["trained_model_stats"][0]["deployment_stats"]["state"] == "started":
        print(model + " has been successfully deployed.")
        break
    else:
        print(model + " is currently being deployed.")
        time.sleep(5)
## Loading a JSON file, creating chunks into docs
This loads the JSON file from the URL provided and then chunks its content into passage docs.

from langchain_community.document_loaders import JSONLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from urllib.request import urlopen
import json

# Change the URL to the desired dataset
url = "https://raw.githubusercontent.com/elastic/elasticsearch-labs/main/datasets/workplace-documents.json"

response = urlopen(url)
data = json.load(response)

with open("temp.json", "w") as json_file:
    json.dump(data, json_file)


# Metadata function to extract metadata from the record
def metadata_func(record: dict, metadata: dict) -> dict:
    metadata["name"] = record.get("name")
    metadata["summary"] = record.get("summary")
    metadata["url"] = record.get("url")
    metadata["category"] = record.get("category")
    metadata["updated_at"] = record.get("updated_at")

    return metadata


# For more loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/
# And 3rd party loaders https://python.langchain.com/docs/modules/data_connection/document_loaders/#third-party-loaders
loader = JSONLoader(
    file_path="temp.json",
    jq_schema=".[]",
    content_key="content",
    metadata_func=metadata_func,
)

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512, chunk_overlap=256
)
docs = loader.load_and_split(text_splitter=text_splitter)
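Before ingesting, it can be worth eyeballing what the splitter produced. This short check is not part of the committed notebook; it just prints the passage count and one sample chunk with its metadata:

# Sketch only: inspect the chunked passages before indexing them.
print(f"{len(docs)} passages created")
print(docs[0].metadata)             # name, summary, url, category, updated_at
print(docs[0].page_content[:300])   # first 300 characters of the first passage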
## Ingesting the passages into Elasticsearch
This will ingest the passage docs into the Elasticsearch index under the specified INDEX_NAME.

from langchain_elasticsearch import ElasticsearchStore

INDEX_NAME = "json_chunked_index"

ElasticsearchStore.from_documents(
    docs,
    es_connection=client,
    index_name=INDEX_NAME,
    strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(model_id=model),
    bulk_kwargs={
        "request_timeout": 180,
    },
)
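The notebook stops after ingestion. As a follow-up, a semantic query against the new index could look roughly like this; the sketch assumes the ELSER deployment started above is still running, and the example question is only illustrative of the workplace-documents dataset:

# Sketch only: query the freshly ingested index with ELSER via the same store.
from langchain_elasticsearch import ElasticsearchStore

store = ElasticsearchStore(
    es_connection=client,
    index_name=INDEX_NAME,
    strategy=ElasticsearchStore.SparseVectorRetrievalStrategy(model_id=model),
)

results = store.similarity_search("How does the compensation work?", k=3)
for doc in results:
    print(doc.metadata.get("name"), "-", doc.page_content[:200])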
(Notebook metadata: Colab provenance, Python 3 kernel, Python 3.10.3, nbformat 4.)
