# Design Proposal - Embedding Ingestion Pipeline And RAG-Based Chat

**TODOs**

* Vector store authentication options.
* Document versioning and data update policies.
* Unify prompt management in InstructLab. See (`chat_template` [configuration][chat_template] and

**Version**: 0.1
**Options to Rebuild Excalidraw Diagrams**
* Using this [shareable link][shareable-excalidraw]
* Importing the scene from the exported [DSL](./images/rag-ingestion-and-chat.excalidraw)
## 1. Introduction
This document proposes enhancements to the `ilab` CLI to support workflows utilizing Retrieval-Augmented Generation
(RAG) artifacts within `InstructLab`. The proposed changes introduce new commands and options for the embedding ingestion
and RAG-based chat pipelines:
* A new `ilab data` sub-command to process customer documentation.
  * Either from knowledge taxonomy or from actual user documents.
* A new `ilab data` sub-command to generate and ingest embeddings from pre-processed documents into a configured vector store.
* An option to enhance the chat pipeline by using the stored embeddings to augment the context of conversations, improving relevance and accuracy.
### 1.1 User Experience Overview
The commands are tailored to support diverse user experiences, all enabling the use of RAG functionality to enrich chat sessions.
### 1.2 Model Training Path
This flow is designed for users who aim to train their own models and leverage the source documents that support knowledge submissions to enhance the chat context:

**Note**: Documents are processed using the `instructlab-sdg` package and are defined using the docling v1 schema.
### 1.3 Taxonomy Path (no Training)
This flow is for users who have defined taxonomy knowledge but prefer not to train their own models. Instead, they aim to generate RAG artifacts from source documents to enhance the chat context:

* In particular, only the latest folder whose name starts with `documents-` will be explored.
* It must include a subfolder `docling-artifacts` with the actual JSON files.
* In case the */path/to/processed/folder* parameter is provided, it is used to look up the processed documents to ingest.
**Notes**:
* To ensure consistency and avoid issues with document versioning or outdated embeddings, the ingested collection will be cleared before execution.
  This ensures it contains only the embeddings generated from the most recent run.
### Ingestion-Why We Need It
To populate embedding vector stores with pre-processed information that can be used at chat inference time.

#### Ingestion-Supported Databases

The command may support various vector database types. A default configuration will align with the selected
InstructLab technology stack.

#### Ingestion-Usage

The generated embeddings can later be retrieved from a vector database and converted to text, enriching the
context for RAG-based chat pipelines.
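
For illustration, here is a minimal sketch of the split-embed-write sequence behind such an ingestion step, assuming the Haystack 2.x API referenced in this proposal; the in-memory store and the embedding model name are placeholders, not `ilab` defaults:

```python
# Illustrative only: split, embed, and write pre-processed text with Haystack 2.x.
# The in-memory store and the model name are placeholders, not ilab defaults.
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentSplitter
from haystack.document_stores.in_memory import InMemoryDocumentStore

# 1. Wrap pre-processed (e.g., docling-converted) text as Haystack documents.
docs = [Document(content="Text extracted from a processed user document...")]

# 2. Split into chunks; the interim defaults are shown explicitly (see section 2.9).
splitter = DocumentSplitter(split_by="word", split_length=200, split_overlap=0)
chunks = splitter.run(documents=docs)["documents"]

# 3. Compute one embedding vector per chunk.
embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2"  # placeholder model
)
embedder.warm_up()
embedded = embedder.run(documents=chunks)["documents"]

# 4. Persist the embedded chunks in the configured vector store.
store = InMemoryDocumentStore()
store.write_documents(embedded)
```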
### 2.5 Embedding Ingestion Pipeline Options
```bash
% ilab data ingest --help
Usage: ilab data ingest [OPTIONS]
Options:
  ...
```

| Name of the embedding model. | **TBD** | `--retriever-embedder-model-name` | `ILAB_EMBEDDER_MODEL_NAME` |
### 2.6 RAG Chat Pipeline Command
The proposal is to add a `chat.rag.enable` configuration (or the equivalent `--rag` flag) to the `model chat` command, like:
```bash
ilab model chat --rag
```
#### Command Purpose
This command enhances the existing `ilab model chat` functionality by integrating contextual information retrieved from user-provided documents,
enriching the conversational experience with relevant insights.

#### Revised Chat Pipeline
* Start with the user's input, `user_query`.
* Use the given `user_query` to retrieve relevant contextual information from the embedding database (semantic search).
* Append the retrieved context to the original LLM request.
* Send the context-augmented request to the LLM and return the response to the user (a sketch of this loop follows).
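
A minimal, illustrative sketch of that loop; the `retriever` and `llm_client` objects and their methods are hypothetical placeholders for whatever the chosen stack provides, not the actual `ilab` implementation:

```python
# Illustrative sketch of the revised chat pipeline; `retriever` and
# `llm_client` are hypothetical placeholders, not the actual ilab API.

# Abridged from the default template in the next subsection; the exact
# placeholder names are assumptions.
PROMPT_TEMPLATE = """Given the following information, answer the question.
Context:
{context}
Question: {user_query}
Answer:"""


def rag_chat_turn(user_query: str, retriever, llm_client, top_k: int = 10) -> str:
    # 1. Semantic search: retrieve the documents closest to the user query.
    documents = retriever.retrieve(user_query, top_k=top_k)
    # 2. Append the retrieved context to the original LLM request.
    context = "\n".join(doc.content for doc in documents)
    prompt = PROMPT_TEMPLATE.format(context=context, user_query=user_query)
    # 3. Send the context-augmented request and return the response.
    return llm_client.generate(prompt)
```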
#### Prompt Template
A default, non-configurable template is used, with parameters to specify the user query and the context, like:
```text
Given the following information, answer the question.
Context:
...
Answer:
```

Future extensions should align prompt management with the existing InstructLab design.
236
277
237
278
### 2.7 RAG Chat Commands
The `/r` command may be added to the `ilab model chat` command to dynamically toggle the execution of the RAG pipeline.
The current status could be displayed with an additional marker on the chat status bar, as in (top right corner):
```console
>>> /h [RAG][S][default]
╭───────────────────────────────────────────────────────────── system ──────────────────────────────────────────────────────────────╮
...
```
### 2.8 RAG Chat Options
As we stated in [2.1 Working Assumptions](#21-working-assumption), we will introduce new configuration options for the specific `chat` command,
but we'll use flags and environment variables for the options that come from the embedding ingestion pipeline command.
| | Name of the embedding model. | **TBD** | `--retriever-embedder-model-name` | `ILAB_EMBEDDER_MODEL_NAME` |
Equivalent YAML document for the newly proposed options:
```yaml
chat:
  enable: false
  # ...
```
### 2.9 References
* [Haystack-DocumentSplitter](https://github.com/deepset-ai/haystack/blob/f0c3692cf2a86c69de8738d53af925500e8a5126/haystack/components/preprocessors/document_splitter.py#L55) is temporarily adopted with default settings until a splitter based on the [docling chunkers][chunkers] is integrated
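
For reference, a sketch of what "default settings" means for this interim splitter, per the Haystack 2.x signature linked above (subject to change once a docling-based chunker is integrated):

```python
# Interim chunking only: Haystack's DocumentSplitter with its default settings,
# i.e. split_by="word", split_length=200, split_overlap=0.
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

splitter = DocumentSplitter()
chunks = splitter.run(documents=[Document(content="...")])["documents"]
```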

> **ℹ️ Note:** This stack is still under review. The proposed list represents potential candidates based on the current state of discussions.
The following technologies form the foundation of the proposed solution:
* [Docling](https://github.com/DS4SD/docling): Document processing tool. For more details, refer to William’s blog, [Docling: The missing document processing companion for generative AI](https://www.redhat.com/en/blog/docling-missing-document-processing-companion-generative-ai).
## 3. Design Considerations
* As decided in [PR #165](https://github.com/instructlab/dev-docs/pull/165), functions related to RAG ingestion
  and retrieval are located in the dedicated folder `src/instructlab/rag`.
* The solution must minimize changes to existing modules by importing the required functions from the
  `instructlab.rag` package.
* The solution must adopt a pluggable design to facilitate seamless integration of additional components:
  * **Vector stores**: Support all selected implementations (e.g., Milvus).
  * **Embedding models**: Handle embedding models using the appropriate embedder implementation for the
    chosen framework (e.g., Haystack).
* Consider using factory functions to abstract implementations and enhance code flexibility (see the sketch after this list).
* Optional dependencies for 3rd party integrations should be defined in `pyproject.toml` and documented for
  clarity. Users can install optional components with commands like:
  `pip install instructlab[milvus]`

  3rd party dependencies may also be grouped in files such as `requirements/milvus.txt`.
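
As an illustration of the factory-function idea above, a minimal registry-based sketch (all names are hypothetical, not the final `instructlab.rag` API):

```python
# Hypothetical factory sketch; names are illustrative, not the final API.
from typing import Any, Callable, Dict


class InMemoryStore:
    """Stand-in for a real vector-store adapter (e.g., a Milvus-backed one)."""

    def __init__(self, **options: Any) -> None:
        self.options = options


# Optional 3rd-party adapters can register themselves here when their
# extra dependency (e.g., instructlab[milvus]) is installed.
_STORE_FACTORIES: Dict[str, Callable[..., Any]] = {
    "in_memory": InMemoryStore,
}


def create_document_store(kind: str, **options: Any) -> Any:
    """Return a vector-store adapter for the configured backend."""
    try:
        factory = _STORE_FACTORIES[kind]
    except KeyError:
        raise ValueError(f"unsupported vector store type: {kind!r}") from None
    return factory(**options)


store = create_document_store("in_memory", collection="ilab-docs")
```

This keeps callers decoupled from concrete implementations: adding a new backend means registering one more factory, with no changes at the call sites.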
## 4. Future Enhancements
### 4.1 Model Evaluation
**TODO** A separate ADR will be defined.
### 4.2 Advanced RAG Retrieval Steps
* [Ranking retriever's result][ranking]:
```bash
ilab model chat --rag --ranking --ranking-top-k=5 --ranking-model=cross-encoder/ms-marco-MiniLM-L-12-v2
```
* [Query expansion][expansion]:
```bash
ilab model chat --rag --query-expansion --query-expansion-prompt="$QUERY_EXPANSION_PROMPT" --query-expansion-num-of-queries=5
```

* Using a retrieval strategy:
```bash
ilab model chat --rag --retrieval-strategy query-expansion --retrieval-strategy-options="prompt=$QUERY_EXPANSION_PROMPT;num_of_queries=5"
```
* ...
### 4.3 Containerized Indexing Service
Generate a containerized RAG artifact to expose a `/query` endpoint that can serve as an alternative source:
```bash
ilab data ingest --build-image --image-name=docker.io/user/my_rag_artifacts:1.0
```
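
Once such an artifact exists, clients could query it over HTTP. A hypothetical example follows; the request/response contract and port mapping are assumptions, as the endpoint shape is still TBD:

```python
# Hypothetical client for the generated artifact's /query endpoint.
# The payload shape and the port mapping are illustrative assumptions.
import json
import urllib.request

payload = json.dumps({"query": "What is InstructLab?"}).encode("utf-8")
request = urllib.request.Request(
    "http://localhost:8080/query",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))
```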