Commit 066b789 (parent: 468f392)

fixed linting issues

Signed-off-by: Daniele Martinoli <[email protected]>

1 file changed: +96 −37 lines

docs/rag/ilab-rag-retrieval.md

# Design Proposal - Embedding Ingestion Pipeline And RAG-Based Chat

**TODOs**

* Vector store authentication options.
* Document versioning and data update policies.
* Unify prompt management in InstructLab. See (`chat_template` [configuration][chat_template] and

**Version**: 0.1

**Options to Rebuild Excalidraw Diagrams**

* Using this [shareable link][shareable-excalidraw]
* Importing the scene from the exported [DSL](./images/rag-ingestion-and-chat.excalidraw)

## 1. Introduction

This document proposes enhancements to the `ilab` CLI to support workflows utilizing Retrieval-Augmented Generation
(RAG) artifacts within `InstructLab`. The proposed changes introduce new commands and options for the embedding ingestion
and RAG-based chat pipelines:

* A new `ilab data` sub-command to process customer documentation.
  * Either from knowledge taxonomy or from actual user documents.
* A new `ilab data` sub-command to generate and ingest embeddings from pre-processed documents into a configured vector store.
* An option to enhance the chat pipeline by using the stored embeddings to augment the context of conversations, improving relevance and accuracy.

### 1.1 User Experience Overview

The commands are tailored to support diverse user experiences, all enabling the use of RAG functionality to enrich chat sessions.

### 1.2 Model Training Path

This flow is designed for users who aim to train their own models and leverage the source documents that support knowledge submissions to enhance the chat context:

![model-training](./images/rag-model-training.png)

**Note**: documents are processed using the `instructlab-sdg` package and are defined using the docling v1 schema.

### 1.3 Taxonomy Path (no Training)

This flow is for users who have defined taxonomy knowledge but prefer not to train their own models. Instead, they aim to generate RAG artifacts from source documents to enhance the chat context:

![taxonomy-no-training](./images/rag-taxonomy-no-training.png)

**Note**: documents are processed using `docling.DocumentConverter` and are defined using the docling v2 schema.

### 1.4 Plug-and-Play RAG Path

This flow is designed for users who want to enhance their chat experience with pre-trained models by simply integrating the RAG functionality:

![plug-and-play](./images/rag-plug-and-play.png)

**Note**: documents are processed using `docling.DocumentConverter` and are defined using the docling v2 schema.

## 2. Proposed Pipelines

### 2.1 Working Assumption

This proposal aims to serve as a reference design to develop a Proof of Concept for RAG workflows, while
also laying the foundation for future implementations of state-of-the-art RAG artifacts tailored to specific use
cases.

#### Command Options

To maintain compatibility and simplicity, no new configurations will be introduced for new commands. Instead,
the settings will be defined using the following hierarchy (options higher in the list overriding those below):

* CLI flags (e.g., `--FLAG`).
* Environment variables following a consistent naming convention, such as `ILAB_<UPPERCASE_FLAG_NAME>`.
* Default values, for all the applicable use cases.

For example, the `--document-store-uri` option can be implemented using the `click` module like this:

```py
@click.option(
    "--document-store-uri",
    ...
)
```
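
The precedence chain above (flag over environment variable over default) can be sketched end to end with `click`'s built-in `envvar` support; the command, option name, environment variable, and default value below are illustrative, not the final `ilab` interface:

```python
import click

@click.command()
@click.option(
    "--document-store-uri",
    envvar="ILAB_DOCUMENT_STORE_URI",  # consulted only when the flag is absent
    default="./embeddings.db",         # fallback when neither flag nor env var is set
    show_default=True,
    help="URI of the vector store.",
)
def ingest(document_store_uri: str):
    """Sketch: echo the resolved setting to show the precedence chain."""
    click.echo(f"document store: {document_store_uri}")
```

Passing `--document-store-uri` overrides `ILAB_DOCUMENT_STORE_URI`, which in turn overrides the default.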

#### Local embedding models

The embedding model used to generate text embeddings must be downloaded locally before executing the pipeline.

For example, this command can be used to download the `sentence-transformers/all-minilm-l6-v2` model to the local models cache:

```bash
ilab model download -rp sentence-transformers/all-minilm-l6-v2
```

If the configured embedding model has not been cached, the command execution will fail. This behavior applies
consistently to all new and updated commands.
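
The fail-fast rule can be sketched as a small pre-flight check; the cache layout and the `ensure_model_cached` helper are assumptions for illustration, not the actual `ilab` internals:

```python
from pathlib import Path

def ensure_model_cached(models_dir: Path, model_name: str) -> Path:
    """Fail fast when the configured embedding model is not in the local cache."""
    model_path = models_dir / model_name
    if not model_path.exists():
        raise FileNotFoundError(
            f"embedding model {model_name!r} is not cached; "
            f"run `ilab model download -rp {model_name}` first"
        )
    return model_path
```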

### 2.2 Document Processing Pipeline

The proposal is to add a `process` sub-command to the `data` command group.

For the Taxonomy path (no Model Training):

```bash
ilab data process --output /path/to/processed/folder
```

For the Plug-and-Play RAG path:

```bash
ilab data process --input /path/to/docs/folder --output /path/to/processed/folder
```

#### Processing-Command Purpose

Applies the docling transformation to the customer documents.

* Original documents are located in the `/path/to/docs/folder` input folder or in the taxonomy knowledge branch.
  * In the latter case, the input documents are the knowledge documents retrieved from the installed taxonomy repository
    according to the [SDG diff strategy][sdg-diff-strategy], e.g. `the new or changed YAMLs using git diff, including untracked files`.
* Processed artifacts are stored under `/path/to/processed/folder`.

**Notes**:

* In alignment with the current SDG implementation, the `--input` folder will not be navigated recursively. Only files located at the root
  level of the specified folder will be considered. The same principle applies to all other options outlined below.
* To ensure consistency and avoid issues with document versioning or outdated artifacts, the destination folder will be cleared
  before execution. This ensures it contains only the artifacts generated from the most recent run.

The transformation is based on the latest version of the docling `DocumentConverter` (v2).
The alternative to adopt the `instructlab-sdg` modules (e.g. the initial step of the `ilab data generate` pipeline) has been
discarded because it generates documents according to the so-called legacy docling schema.
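
As a rough sketch of that transformation step (assuming the `docling` package with its v2 `DocumentConverter` is available; the folder layout and the `output_path` helper are illustrative, not the actual command implementation):

```python
import json
from pathlib import Path

def output_path(input_doc: Path, output_dir: Path) -> Path:
    # Illustrative naming: processed artifacts mirror the source file name.
    return output_dir / (input_doc.stem + ".json")

def process_folder(input_dir: Path, output_dir: Path) -> None:
    # Imported lazily so the sketch only needs docling when actually run.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    output_dir.mkdir(parents=True, exist_ok=True)
    # Deliberately non-recursive: only root-level files are considered,
    # matching the current SDG behavior described above.
    for doc in (p for p in sorted(input_dir.iterdir()) if p.is_file()):
        result = converter.convert(doc)
        target = output_path(doc, output_dir)
        target.write_text(json.dumps(result.document.export_to_dict()))
```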

#### Processing-Usage

The generated artifacts can later be used to generate and ingest the embeddings into a vector database.

### 2.3 Document Processing Pipeline Options

```bash
% ilab data process --help
Usage: ilab data process [OPTIONS]
Options:
  ...
```

| Description | Default | CLI Flag | Environment Variable |
| --- | --- | --- | --- |
| Name of the embedding model. | **TBD** | `--embedding-model` | `ILAB_EMBEDDING_MODEL_NAME` |

### 2.4 Embedding Ingestion Pipeline

The proposal is to add an `ingest` sub-command to the `data` command group.

For the Model Training path:

```bash
ilab data ingest
```

For the Taxonomy or Plug-and-Play RAG paths:

```bash
ilab data ingest --input path/to/processed/folder
```

#### Ingestion-Working Assumption

The documents at the specified path have already been processed using the `data process` command or an equivalent method
(see [Getting Started with Knowledge Contributions][ilab-knowledge]).

#### Ingestion-Command Purpose

Generate the embeddings from the pre-processed documents.

* In case of the Model Training path, the documents are located in the folder specified by the `generate.output_dir` configuration key
  (e.g. `_HOME_/.local/share/instructlab/datasets`).
  * In particular, only the latest folder whose name starts with `documents-` will be explored.
  * It must include a subfolder `docling-artifacts` with the actual JSON files.
* In case the */path/to/processed/folder* parameter is provided, it is used to look up the processed documents to ingest.
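
The dataset-folder lookup described above can be sketched as follows ("latest" is interpreted here as most recently modified, which is an assumption; the function name is illustrative):

```python
from pathlib import Path

def find_latest_artifacts_dir(datasets_dir: Path) -> Path:
    """Locate the docling artifacts of the most recent generation run."""
    candidates = [
        p for p in datasets_dir.iterdir()
        if p.is_dir() and p.name.startswith("documents-")
    ]
    if not candidates:
        raise FileNotFoundError(f"no documents-* folder under {datasets_dir}")
    # "Latest" by modification time; sorting by name would work too
    # if the folder suffix encodes a timestamp.
    latest = max(candidates, key=lambda p: p.stat().st_mtime)
    artifacts = latest / "docling-artifacts"
    if not artifacts.is_dir():
        raise FileNotFoundError(f"{latest} has no docling-artifacts subfolder")
    return artifacts
```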

**Notes**:

* To ensure consistency and avoid issues with document versioning or outdated embeddings, the ingested collection will be cleared before execution.
  This ensures it contains only the embeddings generated from the most recent run.

#### Ingestion-Why We Need It

To populate embedding vector stores with pre-processed information that can be used at chat inference time.

#### Ingestion-Supported Databases

The command may support various vector database types. A default configuration will align with the selected
InstructLab technology stack.

#### Ingestion-Usage

The generated embeddings can later be retrieved from a vector database and converted to text, enriching the
context for RAG-based chat pipelines.

### 2.5 Embedding Ingestion Pipeline Options

```bash
% ilab data ingest --help
Usage: ilab data ingest [OPTIONS]
Options:
  ...
```

| Description | Default | CLI Flag | Environment Variable |
| --- | --- | --- | --- |
| Name of the embedding model. | **TBD** | `--retriever-embedder-model-name` | `ILAB_EMBEDDER_MODEL_NAME` |

### 2.6 RAG Chat Pipeline Command

The proposal is to add a `chat.rag.enable` configuration (or the equivalent `--rag` flag) to the `model chat` command, like:

```bash
ilab model chat --rag
```

#### Command Purpose

This command enhances the existing `ilab model chat` functionality by integrating contextual information retrieved from user-provided documents,
enriching the conversational experience with relevant insights.

#### Revised chat pipeline

* Start with the user's input, `user_query`.
* Use the given `user_query` to retrieve relevant contextual information from the embedding database (semantic search).
* Append the retrieved context to the original LLM request.
* Send the context-augmented request to the LLM and return the response to the user.
#### Prompt Template
264+
225265
A default non-configurable template is used with parameters to specify the user query and the context, like:
266+
226267
```text
227268
Given the following information, answer the question.
228269
Context:
@@ -235,9 +276,11 @@ Answer:
235276
Future extensions should align prompt management with the existing InstructLab design.
236277

### 2.7 RAG Chat Commands

The `/r` command may be added to the `ilab model chat` command to dynamically toggle the execution of the RAG pipeline.

The current status could be displayed with an additional marker on the chat status bar, as in (top right corner):

```console
>>> /h [RAG][S][default]
╭───────────────────────────────────────────────────────────── system ──────────────────────────────────────────────────────────────╮
...
```

### 2.8 RAG Chat Options

As stated in [2.1 Working Assumptions](#21-working-assumption), we will introduce new configuration options for the specific `chat` command,
but we'll use flags and environment variables for the options that come from the embedding ingestion pipeline command.

| Configuration | Description | Default | CLI Flag | Environment Variable |
| --- | --- | --- | --- | --- |
| | Name of the embedding model. | **TBD** | `--retriever-embedder-model-name` | `ILAB_EMBEDDER_MODEL_NAME` |

Equivalent YAML document for the newly proposed options:

```yaml
chat:
  enable: false
  # ...
```
294339

295340
### 2.9 References
341+
296342
* [Haystack-DocumentSplitter](https://github.com/deepset-ai/haystack/blob/f0c3692cf2a86c69de8738d53af925500e8a5126/haystack/components/preprocessors/document_splitter.py#L55) is temporarily adopted with default settings until a splitter based on the [docling chunkers][chunkers] is integrated
297343
in Haystack.
298344
* [MilvusEmbeddingRetriever](https://github.com/milvus-io/milvus-haystack/blob/77b27de00c2f0278e28b434f4883853a959f5466/src/milvus_haystack/milvus_embedding_retriever.py#L18)
299345

300-
301346
### 2.10 Workflow Visualization
302347

303348
Embedding ingestion pipeline:
@@ -306,6 +351,7 @@ RAG-based Chat pipeline:
306351
![rag-chat](./images/rag-chat.png)
307352

### 2.11 Proposed Implementation Stack

> **ℹ️ Note:** This stack is still under review. The proposed list represents potential candidates based on the current state of discussions.

The following technologies form the foundation of the proposed solution:

* [Docling](https://github.com/DS4SD/docling): Document processing tool. For more details, refer to William’s blog, [Docling: The missing document processing companion for generative AI](https://www.redhat.com/en/blog/docling-missing-document-processing-companion-generative-ai).

## 3. Design Considerations

* As decided in [PR #165](https://github.com/instructlab/dev-docs/pull/165), functions related to RAG ingestion
  and retrieval are located in the dedicated folder `src/instructlab/rag`.
* The solution must minimize changes to existing modules by importing the required functions from the
  `instructlab.rag` package.
* The solution must adopt a pluggable design to facilitate seamless integration of additional components:
  * **Vector stores**: Support all selected implementations (e.g., Milvus).
  * **Embedding models**: Handle embedding models using the appropriate embedder implementation for the
    chosen framework (e.g., Haystack).
* Consider using factory functions to abstract implementations and enhance code flexibility.
* Optional dependencies for 3rd party integrations should be defined in `pyproject.toml` and documented for
  clarity. Users can install optional components with commands like:

  `pip install instructlab[milvus]`

  3rd party dependencies may also be grouped in files such as `requirements/milvus.txt`.
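
The factory-function idea can be sketched as a small registry keyed by vector db type; the registry, names, and the Milvus constructor arguments below are illustrative assumptions, not the actual `instructlab.rag` API:

```python
from typing import Any, Callable, Dict

# Registry of vector-store factories, keyed by db type (illustrative design).
_DOCUMENT_STORE_FACTORIES: Dict[str, Callable[..., Any]] = {}

def register_document_store(db_type: str) -> Callable:
    """Decorator registering a factory for one vector-store backend."""
    def wrap(factory: Callable[..., Any]) -> Callable[..., Any]:
        _DOCUMENT_STORE_FACTORIES[db_type] = factory
        return factory
    return wrap

def create_document_store(db_type: str, **kwargs: Any) -> Any:
    """Instantiate the configured backend, failing with a helpful hint."""
    try:
        factory = _DOCUMENT_STORE_FACTORIES[db_type]
    except KeyError:
        raise ValueError(
            f"unsupported vector db type {db_type!r}; did you install the "
            f"matching extra, e.g. `pip install instructlab[{db_type}]`?"
        ) from None
    return factory(**kwargs)

@register_document_store("milvus")
def _milvus_store(uri: str = "./embeddings.db", **kwargs: Any) -> Any:
    # Imported lazily so the optional dependency is only needed when selected.
    from milvus_haystack import MilvusDocumentStore
    return MilvusDocumentStore(connection_args={"uri": uri}, **kwargs)
```

Lazy imports inside each factory keep the optional 3rd party packages out of the default install path.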
## 4. Future Enhancements

### 4.1 Model Evaluation

**TODO** A separate ADR will be defined.

### 4.2 Advanced RAG retrieval steps

* [Ranking retriever's result][ranking]:

  ```bash
  ilab model chat --rag --ranking --ranking-top-k=5 --ranking-model=cross-encoder/ms-marco-MiniLM-L-12-v2
  ```

* [Query expansion][expansion]:

  ```bash
  ilab model chat --rag --query-expansion --query-expansion-prompt="$QUERY_EXPANSION_PROMPT" --query-expansion-num-of-queries=5
  ```

* Using a retrieval strategy:

  ```bash
  ilab model chat --rag --retrieval-strategy query-expansion --retrieval-strategy-options="prompt=$QUERY_EXPANSION_PROMPT;num_of_queries=5"
  ```

* ...

### 4.3 Containerized Indexing Service

Generate a containerized RAG artifact to expose a `/query` endpoint that can serve as an alternative source:

```bash
ilab data ingest --build-image --image-name=docker.io/user/my_rag_artifacts:1.0
```

Then serve it and use it in a chat session:

```bash
ilab serve --rag-embeddings --image-name=docker.io/user/my_rag_artifacts:1.0 --port 8123
ilab model chat --rag --retriever-type api --retriever-uri http://localhost:8123
```
