
Worksheet Generator Feature PR #102

Status: Closed (wants to merge 17 commits)
13 changes: 13 additions & 0 deletions .gcloudignore
@@ -0,0 +1,13 @@
Dockerfile
.gitignore
contribution.md
diagram.png
LICENSE
load_env.sh
local-start.sh
README.md
.env
.pytest_cache/
.github/
app/__pycache__/
__pycache__/
4 changes: 2 additions & 2 deletions Dockerfile
@@ -3,7 +3,7 @@ FROM python:3.10.12

WORKDIR /code

COPY app/requirements.txt /code/requirements.txt
COPY requirements.txt /code/requirements.txt

RUN pip install --no-cache-dir -r /code/requirements.txt

@@ -16,4 +16,4 @@ COPY ./app /code/app

ENV PYTHONPATH=/code/app

CMD ["fastapi", "run", "app/main.py", "--port", "8000"]
CMD ["fastapi", "dev", "app/main.py", "--host=0.0.0.0", "--port=8000"]
101 changes: 52 additions & 49 deletions README.md
@@ -1,20 +1,21 @@
# Kai AI Platform

![Static Badge](https://img.shields.io/badge/v3.10.12-blue?logo=python&logoColor=yellow&labelColor=gray)
![Static Badge](https://img.shields.io/badge/Gemini%201.0-blue?logo=googlegemini&logoColor=blue&labelColor=gray)
![Static Badge](https://img.shields.io/badge/Vertex%20AI-blue?logo=googlecloud&logoColor=white&labelColor=gray)
![Static Badge](https://img.shields.io/badge/FastAPI-blue?logo=fastapi&logoColor=white&labelColor=gray)


## Table of Contents

- [Architecture](#architecture)
- [Folder Structure](#folder-structure)
- [Setup](#setup)
- [Local Development](#local-development)
- [Contributing](#contributing)
![Architectural Diagram](diagram.png)

## Folder Structure

```plaintext
backend/
├── app/ # Contains the main application code
@@ -41,14 +42,17 @@ backend/
├── Dockerfile # Dockerfile for containerizing the application
└── README.md # Documentation file
```

## Install all the necessary libraries:

### Navigate to the app directory

```bash
cd backend/app
```

### Create and activate Virtual Environment

```bash
python -m venv env
source env/bin/activate
@@ -57,7 +61,8 @@ source env/bin/activate
```bash
pip install -r requirements.txt
```

## To Run Locally and Test

## Prerequisites

@@ -66,88 +71,86 @@ pip install -r requirements.txt

## Steps for Authentication Setup

### Step 1: Create a Google Cloud Project

1. Navigate to the [Google Cloud Console](https://console.cloud.google.com/) and create a new project.

### Step 2: Enable the Google Cloud APIs

1. Enable the following APIs:
   - Vertex AI

### Step 3: Create a new AI Studio API Key

1. Navigate to the [AI Studio API Key page](https://aistudio.google.com/app/u/1/apikey) and create a new API key. This will connect with your Google Cloud Project.

### Step 4: Create a new .env and store the API Key

1. Create a new file called `.env` in the root of the project.
2. Copy the contents of the `.env.example` file into the `.env` file.
3. Replace the placeholder values with your API key and project ID.
4. Set the `ENV_TYPE` variable to `dev`.
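The resulting file might look like the following sketch (the key names here are assumptions; check `.env.example` for the exact keys your copy uses):

```bash
# Hypothetical .env sketch; confirm key names against .env.example
GOOGLE_API_KEY=your-ai-studio-api-key
PROJECT_ID=your-gcp-project-id
ENV_TYPE=dev
```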

### Step 5: Run the Application with the Local Shell Script

1. Run the following command to start the application:

```bash
./local-start.sh
```

# Docker Setup Guide

## Overview

This guide is designed to help contributors set up and run the backend service using Docker. Follow these steps to ensure that your development environment is configured correctly.

NOTE: if you choose to authenticate Google Cloud through the SDK and not with a local service account key, you must comment out `GOOGLE_APPLICATION_CREDENTIALS` in the Dockerfile.

## Prerequisites

Before you start, ensure you have the following installed:

- Docker
- Python


## Installation Instructions

### 1. Build the Docker Image

Navigate to the project's root directory and build the Docker image. Typically, this is done with the following command:

```Bash
docker build -t <image_name> .
```

### 2. Run the Docker Container

Run the Docker container using the following command:

```bash
docker run -p 8000:8000 <image_name>
```

This command starts a container that maps port 8000 of the container to port 8000 on the host. Add the `-d` flag to run it detached.

## Environment Variables

The Docker container uses several key environment variables:

- GOOGLE_APPLICATION_CREDENTIALS points to /app/local-auth.json.
- ENV_TYPE set to "dev" for development.
- PROJECT_ID specifies your Google Cloud project ID.
- It is possible to enable LangChain tracing by setting the following environment variables. More information can be found on LangSmith:
  `LANGCHAIN_TRACING_V2`
  `LANGCHAIN_ENDPOINT`
  `LANGCHAIN_API_KEY`
  `LANGCHAIN_PROJECT`
- Ensure these variables are correctly configured in a `.env` file, or pass them directly to `docker run`, for example:

```bash
docker run --env ENV_TYPE=dev --env PROJECT_ID=<your_project_id> -p 8000:8000 <image_name>
```

## Accessing the Application

You can access the backend by visiting:
```bash
http://localhost:8000/docs
```

After your container starts, you should see the FastAPI landing page, indicating that the application is running successfully.
2 changes: 1 addition & 1 deletion app.yaml
@@ -1,5 +1,5 @@
runtime: python310
entrypoint: fastapi run app/main.py --port $PORT
entrypoint: uvicorn app.main:app --host 0.0.0.0 --port $PORT
instance_class: F2
automatic_scaling:
min_instances: 1
4 changes: 4 additions & 0 deletions app/api/tools_config.json
@@ -6,5 +6,9 @@
"1": {
"path": "features.dynamo.core",
"metadata_file": "metadata.json"
},
"2": {
"path": "features.worksheet_generator.core",
"metadata_file": "metadata.json"
}
}
54 changes: 54 additions & 0 deletions app/features/worksheet_generator/README.md
@@ -0,0 +1,54 @@
# Worksheet Question Generator

This project provides a framework for generating quiz and worksheet questions with a language model, integrated through the LangChain framework. Users supply a topic, difficulty level, and hint to generate customized questions. To ensure quality and relevance, the system applies several validation mechanisms, including cosine similarity scoring, to minimize hallucinations and verify that the generated questions and answers are correct.

## Key Features

### 1. Dual-Prompt Setup for Optimal Performance
Through extensive fine-tuning and prompt engineering experiments, it was discovered that using two distinct prompts yields the best results. One prompt is designed for quiz question generation, and another is tailored for worksheet question creation. This dual-prompt approach optimizes question generation by better capturing the specific nuances of each question type, leading to higher-quality and more relevant outputs.

### 2. Worksheet and Quiz Question Generation
The core functionality of this tool is to generate questions based on the provided topic, level, hint, and question type (quiz or worksheet). The `WorksheetBuilder` class is responsible for this generation process. By invoking machine learning models, configured through LangChain's VertexAI, the system can generate customized questions with high accuracy.

### 3. Parameters
- **Topic**: The subject matter for the questions.
- **Level**: The difficulty level of the questions (e.g., beginner, intermediate, advanced).
- **Hint**: A hint to guide the style or focus of the questions (e.g., "single sentence answer questions" or "multiple choice questions").
- **q_type**: Specifies whether to generate quiz questions or worksheet questions.

### 4. Question Validation
The generated questions undergo several layers of validation to ensure quality, relevance, and correctness.

#### a. Format Validation
For **quiz questions**, validation checks for essential components such as the question, multiple answer choices, the correct answer, and an explanation. For **worksheet questions**, it ensures the presence of the question, the correct answer, and an explanation.
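A minimal sketch of this format check (the function and constant names are assumptions; the actual validator lives inside `WorksheetBuilder`):

```python
# Hypothetical sketch of the format validation described above.
# Required key names are taken from the README's description.
QUIZ_KEYS = {"question", "choices", "answer", "explanation"}
WORKSHEET_KEYS = {"question", "answer", "explanation"}

def has_required_keys(item: dict, q_type: str) -> bool:
    """Return True if a generated item contains every required field."""
    required = QUIZ_KEYS if q_type == "quiz" else WORKSHEET_KEYS
    return required.issubset(item)
```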

#### b. Cosine Similarity Validation
To further ensure the relevance of the question-answer pair:
- The system uses a `SentenceTransformer` model to calculate cosine similarity scores between:
- The question and its answer.
- The question and its explanation.
- These scores help validate whether the generated answer and explanation are semantically aligned with the question.

### 5. Correctness and Avoiding Hallucinations
To avoid irrelevant or incorrect outputs (hallucinations) from the language model, the system implements the following approach:
- **Cosine Similarity Score**: The system computes similarity scores between the question-answer pair and the question-explanation pair.
- **Maximum Similarity Score**: The higher score between these two pairs is chosen as a measure of content relevance.
- **Validation Threshold**: Only questions where the cosine similarity score exceeds a pre-set threshold (typically 0.6) are deemed valid and added to the final set of generated questions.

This validation pipeline ensures that the generated questions are not only syntactically correct but also semantically aligned with the topic, minimizing the risk of irrelevant or nonsensical questions.
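The max-score thresholding can be sketched in plain Python over precomputed embedding vectors (in the real pipeline these come from a `SentenceTransformer` model; the function names here are illustrative):

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def passes_relevance_check(q_vec, ans_vec, expl_vec, threshold=0.6):
    # Score question-answer and question-explanation pairs, then keep
    # the item only if the higher of the two scores clears the threshold.
    qa = cosine_similarity(q_vec, ans_vec)
    qe = cosine_similarity(q_vec, expl_vec)
    return max(qa, qe) >= threshold
```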

### 6. Logging and Error Handling
The system includes detailed logging and error handling mechanisms. Logs capture key stages such as question generation, validation, and any encountered errors. This makes debugging and system monitoring more efficient.

## How to Use

1. Instantiate the `WorksheetBuilder` class with the necessary parameters, including the topic, level, hint, and question type.
2. Use one of the two dedicated prompts based on your requirements:
- For quiz generation, call the `create_questions()` method.
- For worksheet generation, use the `create_worksheet_questions()` method.
3. The system will generate, validate, and return the questions as a list of dictionaries. Each dictionary contains a question, its answer, and an explanation.

## Example Usage
Collaborator comment:
Good README.md. You can try adding the request and response interfaces for enabling a better understanding of what the dev team can expect to interact with during the requests to the AI endpoint.

```python
executor([ToolFile(url="https://courses.edx.org/asset-v1:ColumbiaX+CSMM.101x+1T2017+type@asset+block@AI_edx_ml_5.1intro.pdf")],
         "machine learning", "Masters", "single sentence answer questions", 5, 5)
```
Empty file.
36 changes: 36 additions & 0 deletions app/features/worksheet_generator/core.py
@@ -0,0 +1,36 @@
# import sys
# import os
from features.quizzify.tools import RAGpipeline
from services.tool_registry import ToolFile
from services.logger import setup_logger
from features.worksheet_generator.tools import WorksheetBuilder
from api.error_utilities import LoaderError, ToolExecutorError
logger = setup_logger()


def executor(files: list[ToolFile], topic: str, level: str, hint: str, hint_num: int, num_questions: int, verbose=True):
# Collaborator comment: You can use a Pydantic schema for decoupling the args.

try:
if verbose: logger.debug(f"Files: {files}")
# Instantiate RAG pipeline with default values
pipeline = RAGpipeline(verbose=verbose)
# Collaborator comment: Change the name of this class, as it does not convey
# what it does for the Worksheet Generator.

pipeline.compile()
# Process the uploaded files
db = pipeline(files)

# Create and return the quiz questions
output = WorksheetBuilder(db, topic, level, hint, "quiz").create_questions(num_questions)
# Collaborator comment: You can instantiate your WorksheetBuilder class first
# instead, reusing the same variable to improve readability.

output.extend(WorksheetBuilder(db, topic, level, hint, "worksheet").create_worksheet_questions(hint_num))

# Try-Except blocks on custom defined exceptions to provide better logging
except LoaderError as e:
error_message = e
logger.error(f"Error in RAGPipeline -> {error_message}")
raise ToolExecutorError(error_message)

# These help differentiate user-input errors and internal errors. Use 4XX and 5XX status respectively.
except Exception as e:
error_message = f"Error in executor: {e}"
logger.error(error_message)
raise ValueError(error_message)

return output
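Following up on the Pydantic suggestion above, a request schema for these arguments might look like this sketch (the class name and defaults are assumptions based on the `executor` signature; files would still be handled separately):

```python
from pydantic import BaseModel

# Hypothetical schema decoupling the executor's scalar arguments.
class WorksheetRequest(BaseModel):
    topic: str
    level: str
    hint: str
    hint_num: int
    num_questions: int
    verbose: bool = True
```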
29 changes: 29 additions & 0 deletions app/features/worksheet_generator/metadata.json
@@ -0,0 +1,29 @@
{
"inputs": [
{
"label": "Topic",
"name": "topic",
"type": "text"
},
{
"label": "Level",
"name": "level",
"type": "text"
},
{
"label": "Hint",
"name": "hint",
"type": "text"
},
{
"label": "Number of Questions",
"name": "num_questions",
"type": "number"
},
{
"label": "Upload PDF files",
"name": "files",
"type": "file"
}
]
}
@@ -0,0 +1,16 @@
You are a subject matter expert on the topic:
{topic}

You have to generate {q_type} type questions on the topic based on academic qualification level of a {level} degree

Follow these instructions if you are creating a quiz question:
1. Generate a question based on the topic provided and context as key "question"
2. Provide 4 multiple choice answers to the question as a list of key-value pairs "choices"
3. Provide the correct answer for the question from the list of answers as key "answer"
4. Provide an explanation as to why the answer is correct as key "explanation"

You must respond as a JSON object:
{format_instructions}

Context:
{context}
Collaborator comment:
What's the difference between this file and the other worksheet-generator-quiz-prompt.txt? If they are different, please make sure to use different names for better readability.

@@ -0,0 +1,16 @@
You are a subject matter expert on the topic:
{topic}

You have to generate {q_type} type questions on the topic based on academic qualification level of a {level} degree

Follow these instructions if you are creating worksheet type questions:
1. Generate a question based on the topic provided, which should follow the constraint "{hint}" and should be value of key "question"
2. There should be no answer choices for these type of questions
3. Provide the correct answer to the question following the constraint {hint} as a string value of key "answer"
4. Provide an explanation as to why the answer is correct as key "explanation"

You must respond as a JSON object:
{format_instructions}

Context:
{context}