
Worksheet Generator Feature PR #102

Status: Closed (wants to merge 17 commits)
13 changes: 13 additions & 0 deletions .gcloudignore
@@ -0,0 +1,13 @@
Dockerfile
.gitignore
contribution.md
diagram.png
LICENSE
load_env.sh
local-start.sh
README.md
.env
.pytest_cache/
.github/
app/__pycache__/
__pycache__/
4 changes: 2 additions & 2 deletions Dockerfile
@@ -3,7 +3,7 @@ FROM python:3.10.12

WORKDIR /code

COPY app/requirements.txt /code/requirements.txt
COPY requirements.txt /code/requirements.txt

RUN pip install --no-cache-dir -r /code/requirements.txt

@@ -16,4 +16,4 @@ COPY ./app /code/app

ENV PYTHONPATH=/code/app

CMD ["fastapi", "run", "app/main.py", "--port", "8000"]
CMD ["fastapi", "dev", "app/main.py", "--host=0.0.0.0", "--port=8000"]
101 changes: 52 additions & 49 deletions README.md
@@ -1,20 +1,21 @@
# Kai AI Platform

![Static Badge](https://img.shields.io/badge/v3.10.12-blue?logo=python&logoColor=yellow&labelColor=gray)
![Static Badge](https://img.shields.io/badge/Gemini%201.0-blue?logo=googlegemini&logoColor=blue&labelColor=gray)
![Static Badge](https://img.shields.io/badge/Vertex%20AI-blue?logo=googlecloud&logoColor=white&labelColor=gray)
![Static Badge](https://img.shields.io/badge/FastAPI-blue?logo=fastapi&logoColor=white&labelColor=gray)


## Table of Contents

- [Architecture](#architecture)
- [Folder Structure](#folder-structure)
- [Setup](#setup)
- [Local Development](#local-development)
- [Contributing](#contributing)
![Architectural Diagram](diagram.png)

## Folder Structure

```plaintext
backend/
├── app/ # Contains the main application code
@@ -41,14 +42,17 @@ backend/
├── Dockerfile # Dockerfile for containerizing the application
└── README.md # Documentation file
```

## Install all the necessary libraries:

### Navigate to the app directory

```bash
cd backend/app
```

### Create and activate Virtual Environment

```bash
python -m venv env
source env/bin/activate
@@ -57,7 +61,8 @@ source env/bin/activate
```bash
pip install -r requirements.txt
```

## To Run Locally and Test

## Prerequisites

@@ -66,88 +71,86 @@ pip install -r requirements.txt

## Steps for Authentication Setup

### Step 1: Create a Google Cloud Project

1. Navigate to the [Google Cloud Console](https://console.cloud.google.com/) and create a new project.

### Step 2: Enable the Google Cloud APIs

1. Enable the following APIs:
   - Vertex AI

### Step 3: Create a new AI Studio API Key

1. Navigate to the [AI Studio API Key page](https://aistudio.google.com/app/u/1/apikey) and create a new API key. This will connect with your Google Cloud Project.

### Step 4: Create a new .env and store the API Key

1. Create a new file called `.env` in the root of the project.
2. Copy the contents of the `.env.example` file into the `.env` file.
3. Replace the placeholder values with your API key and project ID.
4. Set the `ENV_TYPE` variable to `dev`.
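The resulting file might look like the following sketch (the key names here are assumptions; check `.env.example` for the exact keys your copy uses):

```bash
# Hypothetical .env sketch; confirm key names against .env.example
GOOGLE_API_KEY=your-ai-studio-api-key
PROJECT_ID=your-gcp-project-id
ENV_TYPE=dev
```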

### Step 5: Run the Application with the Local Shell Script

1. Run the following command to start the application:

```bash
./local-start.sh
```

# Docker Setup Guide

## Overview

This guide is designed to help contributors set up and run the backend service using Docker. Follow these steps to ensure that your development environment is configured correctly.

NOTE: if you choose to authenticate Google Cloud through the SDK and not with a local service account key, you must comment out `GOOGLE_APPLICATION_CREDENTIALS` in the Dockerfile.

## Prerequisites

Before you start, ensure you have the following installed:

- Docker
- Python


## Installation Instructions

### 1. Build the Docker Image

Navigate to the project's root directory and build the Docker image. Typically, this is done with the following command:

```Bash
docker build -t <image_name> .
```

### 2. Run the Docker Container

Run the Docker container using the following command:

```bash
docker run -p 8000:8000 <image_name>
```

This command starts a container that maps port 8000 of the container to port 8000 on the host. Add the `-d` flag to run it detached.

## Environment Variables

The Docker container uses several key environment variables:

- GOOGLE_APPLICATION_CREDENTIALS points to /app/local-auth.json.
- ENV_TYPE set to "dev" for development.
- PROJECT_ID specifies your Google Cloud project ID.
- It is possible to enable LangChain tracing by setting the following environment variables. More information can be found on LangSmith:
  `LANGCHAIN_TRACING_V2`
  `LANGCHAIN_ENDPOINT`
  `LANGCHAIN_API_KEY`
  `LANGCHAIN_PROJECT`
- Ensure these variables are correctly configured in a `.env` file, or pass them directly to `docker run`, for example:

```bash
docker run --env ENV_TYPE=dev --env PROJECT_ID=<your_project_id> -p 8000:8000 <image_name>
```

## Accessing the Application

You can access the backend by visiting:
```bash
http://localhost:8000/docs
```

After your container starts, you should see the FastAPI landing page, indicating that the application is running successfully.
2 changes: 1 addition & 1 deletion app.yaml
@@ -1,5 +1,5 @@
runtime: python310
entrypoint: fastapi run app/main.py --port $PORT
entrypoint: uvicorn app.main:app --host 0.0.0.0 --port $PORT
instance_class: F2
automatic_scaling:
min_instances: 1
4 changes: 4 additions & 0 deletions app/api/tools_config.json
@@ -6,5 +6,9 @@
"1": {
"path": "features.dynamo.core",
"metadata_file": "metadata.json"
},
"2": {
"path": "features.worksheet_generator.core",
"metadata_file": "metadata.json"
}
}
54 changes: 54 additions & 0 deletions app/features/worksheet_generator/README.md
@@ -0,0 +1,54 @@
# Worksheet Question Generator

This project provides a framework for generating quiz and worksheet questions with a language model, integrated through the LangChain framework. Users supply a topic, difficulty level, and hint to generate customized questions. To ensure quality and relevance, the system applies several validation mechanisms, including cosine similarity scoring, to minimize hallucinations and verify that the generated questions and answers are correct.

## Key Features

### 1. Dual-Prompt Setup for Optimal Performance
Through extensive fine-tuning and prompt engineering experiments, it was discovered that using two distinct prompts yields the best results. One prompt is designed for quiz question generation, and another is tailored for worksheet question creation. This dual-prompt approach optimizes question generation by better capturing the specific nuances of each question type, leading to higher-quality and more relevant outputs.

### 2. Worksheet and Quiz Question Generation
The core functionality of this tool is to generate questions based on the provided topic, level, hint, and question type (quiz or worksheet). The `WorksheetBuilder` class is responsible for this generation process. By invoking machine learning models, configured through LangChain's VertexAI, the system can generate customized questions with high accuracy.

### 3. Parameters
- **Topic**: The subject matter for the questions.
- **Level**: The difficulty level of the questions (e.g., beginner, intermediate, advanced).
- **Hint**: A hint to guide the style or focus of the questions (e.g., "single sentence answer questions" or "multiple choice questions").
- **q_type**: Specifies whether to generate quiz questions or worksheet questions.

### 4. Question Validation
The generated questions undergo several layers of validation to ensure quality, relevance, and correctness.

#### a. Format Validation
For **quiz questions**, validation checks for essential components such as the question, multiple answer choices, the correct answer, and an explanation. For **worksheet questions**, it ensures the presence of the question, the correct answer, and an explanation.
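A minimal sketch of this format check (the function and constant names are assumptions; the actual validator lives inside `WorksheetBuilder`):

```python
# Hypothetical sketch of the format validation described above.
# Required key names are taken from the README's description.
QUIZ_KEYS = {"question", "choices", "answer", "explanation"}
WORKSHEET_KEYS = {"question", "answer", "explanation"}

def has_required_keys(item: dict, q_type: str) -> bool:
    """Return True if a generated item contains every required field."""
    required = QUIZ_KEYS if q_type == "quiz" else WORKSHEET_KEYS
    return required.issubset(item)
```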

#### b. Cosine Similarity Validation
To further ensure the relevance of the question-answer pair:
- The system uses a `SentenceTransformer` model to calculate cosine similarity scores between:
- The question and its answer.
- The question and its explanation.
- These scores help validate whether the generated answer and explanation are semantically aligned with the question.

### 5. Correctness and Avoiding Hallucinations
To avoid irrelevant or incorrect outputs (hallucinations) from the language model, the system implements the following approach:
- **Cosine Similarity Score**: The system computes similarity scores between the question-answer pair and the question-explanation pair.
- **Maximum Similarity Score**: The higher score between these two pairs is chosen as a measure of content relevance.
- **Validation Threshold**: Only questions where the cosine similarity score exceeds a pre-set threshold (typically 0.6) are deemed valid and added to the final set of generated questions.

This validation pipeline ensures that the generated questions are not only syntactically correct but also semantically aligned with the topic, minimizing the risk of irrelevant or nonsensical questions.
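The max-score thresholding can be sketched in plain Python over precomputed embedding vectors (in the real pipeline these come from a `SentenceTransformer` model; the function names here are illustrative):

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def passes_relevance_check(q_vec, ans_vec, expl_vec, threshold=0.6):
    # Score question-answer and question-explanation pairs, then keep
    # the item only if the higher of the two scores clears the threshold.
    qa = cosine_similarity(q_vec, ans_vec)
    qe = cosine_similarity(q_vec, expl_vec)
    return max(qa, qe) >= threshold
```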

### 6. Logging and Error Handling
The system includes detailed logging and error handling mechanisms. Logs capture key stages such as question generation, validation, and any encountered errors. This makes debugging and system monitoring more efficient.

## How to Use

1. Instantiate the `WorksheetBuilder` class with the necessary parameters, including the topic, level, hint, and question type.
2. Use one of the two dedicated prompts based on your requirements:
- For quiz generation, call the `create_questions()` method.
- For worksheet generation, use the `create_worksheet_questions()` method.
3. The system will generate, validate, and return the questions as a list of dictionaries. Each dictionary contains a question, its answer, and an explanation.

## Example Usage
Collaborator comment:
Good README.md. You can try adding the request and response interfaces for enabling a better understanding of what the dev team can expect to interact with during the requests to the AI endpoint.

```python
executor([ToolFile(url="https://courses.edx.org/asset-v1:ColumbiaX+CSMM.101x+1T2017+type@asset+block@AI_edx_ml_5.1intro.pdf")],
         "machine learning", "Masters", "single sentence answer questions", 5, 5)
```
Empty file.
36 changes: 36 additions & 0 deletions app/features/worksheet_generator/core.py
@@ -0,0 +1,36 @@
# import sys
# import os
from features.quizzify.tools import RAGpipeline
from services.tool_registry import ToolFile
from services.logger import setup_logger
from features.worksheet_generator.tools import WorksheetBuilder
from api.error_utilities import LoaderError, ToolExecutorError
logger = setup_logger()


def executor(files: list[ToolFile], topic: str, level: str, hint: str, hint_num: int, num_questions: int, verbose=True):
# Collaborator comment: You can use a Pydantic schema for decoupling the args.

try:
if verbose: logger.debug(f"Files: {files}")
# Instantiate RAG pipeline with default values
pipeline = RAGpipeline(verbose=verbose)
# Collaborator comment: Change the name of this class, as it does not convey
# what it does for the Worksheet Generator.

pipeline.compile()
# Process the uploaded files
db = pipeline(files)

# Create and return the quiz questions
output = WorksheetBuilder(db, topic, level, hint, "quiz").create_questions(num_questions)
# Collaborator comment: You can instantiate your WorksheetBuilder class first
# instead, reusing the same variable to improve readability.

output.extend(WorksheetBuilder(db, topic, level, hint, "worksheet").create_worksheet_questions(hint_num))

# Try-Except blocks on custom defined exceptions to provide better logging
except LoaderError as e:
error_message = e
logger.error(f"Error in RAGPipeline -> {error_message}")
raise ToolExecutorError(error_message)

# These help differentiate user-input errors and internal errors. Use 4XX and 5XX status respectively.
except Exception as e:
error_message = f"Error in executor: {e}"
logger.error(error_message)
raise ValueError(error_message)

return output
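Following up on the Pydantic suggestion above, a request schema for these arguments might look like this sketch (the class name and defaults are assumptions based on the `executor` signature; files would still be handled separately):

```python
from pydantic import BaseModel

# Hypothetical schema decoupling the executor's scalar arguments.
class WorksheetRequest(BaseModel):
    topic: str
    level: str
    hint: str
    hint_num: int
    num_questions: int
    verbose: bool = True
```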
29 changes: 29 additions & 0 deletions app/features/worksheet_generator/metadata.json
@@ -0,0 +1,29 @@
{
"inputs": [
{
"label": "Topic",
"name": "topic",
"type": "text"
},
{
"label": "Level",
"name": "level",
"type": "text"
},
{
"label": "Hint",
"name": "hint",
"type": "text"
},
{
"label": "Number of Questions",
"name": "num_questions",
"type": "number"
},
{
"label": "Upload PDF files",
"name": "files",
"type": "file"
}
]
}
@@ -0,0 +1,16 @@
You are a subject matter expert on the topic:
{topic}

You have to generate {q_type} type questions on the topic based on academic qualification level of a {level} degree

Follow these instructions if you are creating a quiz question:
1. Generate a question based on the topic provided and context as key "question"
2. Provide 4 multiple choice answers to the question as a list of key-value pairs "choices"
3. Provide the correct answer for the question from the list of answers as key "answer"
4. Provide an explanation as to why the answer is correct as key "explanation"

You must respond as a JSON object:
{format_instructions}

Context:
{context}
Collaborator comment:
What's the difference between this file and the other worksheet-generator-quiz-prompt.txt? If they are different, please make sure to use different names for better readability.

@@ -0,0 +1,16 @@
You are a subject matter expert on the topic:
{topic}

You have to generate {q_type} type questions on the topic based on academic qualification level of a {level} degree

Follow these instructions if you are creating worksheet type questions:
1. Generate a question based on the topic provided, which should follow the constraint "{hint}" and should be value of key "question"
2. There should be no answer choices for these type of questions
3. Provide the correct answer to the question following the constraint {hint} as a string value of key "answer"
4. Provide an explanation as to why the answer is correct as key "explanation"

You must respond as a JSON object:
{format_instructions}

Context:
{context}