Add lecture transcriptions webhook endpoint #204

Open · wants to merge 24 commits into base: main
Conversation

@sebastianloose (Contributor) commented Feb 4, 2025

Add Lecture Transcription Ingestion Pipeline

Summary by CodeRabbit

  • New Features

    • Expanded transcription processing to include video transcription ingestion.
    • Introduced structured components for managing transcription segments and complete transcriptions with robust validation.
    • Enhanced the ingestion pipeline to automatically chunk, summarize, and insert transcription data into our vector database.
    • Launched a new endpoint that enables full transcription ingestion with real-time status updates.
    • Added a new enumeration member for improved pipeline stage representation.
    • Implemented semantic text splitting capabilities for enhanced embedding model functionality.
  • Dependencies

    • Upgraded the AI module dependency to improve performance and compatibility.

coderabbitai bot (Contributor) commented Feb 4, 2025

Walkthrough

This pull request introduces a transcription ingestion feature. It adds a new enumeration member to the existing PipelineEnum, creates several new DTO classes for transcription data, and implements a transcription ingestion pipeline with chunking, summarization, and batch insertion into a vector database. Additionally, it introduces a new prompt generator function, a lecture transcription schema, webhook endpoints and worker functions to trigger the ingestion process, and a status callback class for pipeline status updates. The openai dependency version has also been updated.

Changes

File(s) — Change Summary

  • app/common/PipelineEnum.py — Added new enum member IRIS_VIDEO_TRANSCRIPTION_INGESTION to the PipelineEnum class.
  • app/domain/data/metrics/transcription_dto.py, app/domain/ingestion/transcription_ingestion/transcription_ingestion_pipeline_execution_dto.py — Added new DTO classes TranscriptionSegmentDTO, TranscriptionDTO, TranscriptionWebhookDTO, and TranscriptionIngestionPipelineExecutionDto for structured transcription ingestion data.
  • app/pipeline/prompts/transcription_ingestion_prompts.py, app/pipeline/transcription_ingestion_pipeline.py — Introduced transcription ingestion processing by adding the transcription_summary_prompt function and implementing the TranscriptionIngestionPipeline class with methods for chunking, summarizing, and batch database insertion.
  • app/vector_database/lecture_transcription_schema.py, app/web/routers/webhooks.py, app/web/status/transcription_ingestion_callback.py — Added a new lecture transcription schema with collection properties; introduced a webhook endpoint (/transcriptions/fullIngestion) with a worker function to trigger the ingestion pipeline; and added the TranscriptionIngestionStatus class for status updates.
  • requirements.txt — Updated the openai package version from 1.54.4 to 1.60.2.
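
For orientation, here is a minimal sketch of what the DTO shapes summarized above might look like; the field names, aliases, and defaults are assumptions inferred from this summary, not the actual code.

from typing import List, Optional
from pydantic import BaseModel, Field

# Assumed field names and aliases, for illustration only.
class TranscriptionSegmentDTO(BaseModel):
    start_time: float = Field(default=0.0, alias="startTime")
    end_time: float = Field(default=0.0, alias="endTime")
    text: str = ""
    slide_number: Optional[int] = Field(default=None, alias="slideNumber")

class TranscriptionDTO(BaseModel):
    language: str = "en"
    segments: List[TranscriptionSegmentDTO] = []

class TranscriptionWebhookDTO(BaseModel):
    transcription: TranscriptionDTO
    lecture_id: int = Field(alias="lectureId")
    lecture_unit_id: int = Field(alias="lectureUnitId")
    lecture_name: str = Field(default="", alias="lectureName")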

Sequence Diagram(s)

sequenceDiagram
    participant Client as Webhook Client
    participant Router as FastAPI Router
    participant Worker as Ingestion Pipeline Worker
    participant Pipeline as TranscriptionIngestionPipeline
    participant DB as Vector Database
    participant Status as Ingestion Status Callback

    Client->>Router: POST /transcriptions/fullIngestion with DTO
    Router->>Worker: run_transcription_ingestion_pipeline_worker(dto)
    Worker->>Pipeline: Initialize and execute pipeline
    Pipeline->>Pipeline: Chunk transcriptions & generate summaries
    Pipeline->>DB: Batch insert summarized chunks
    Pipeline->>Status: Update status during processing
    Status-->>Worker: Status update logged
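
The diagram above maps to a fairly standard FastAPI pattern. A minimal sketch of that flow follows, using hypothetical names for the worker and DTO; the real route wiring and signatures may differ.

from threading import Thread
from fastapi import APIRouter, status

router = APIRouter()

def run_transcription_ingestion_pipeline_worker(dto):
    # Build the pipeline, chunk and summarize the transcriptions, insert the
    # chunks into the vector database, and report progress via the status
    # callback (all omitted in this sketch).
    ...

@router.post("/transcriptions/fullIngestion", status_code=status.HTTP_202_ACCEPTED)
def transcription_ingestion_webhook(dto: dict):
    # Hand the DTO to a background thread and return 202 immediately.
    Thread(target=run_transcription_ingestion_pipeline_worker, args=(dto,)).start()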

Possibly related PRs

  • Track token usage of iris requests #165: also adds new enum members to the PipelineEnum class, the same class this PR extends.
  • Enable IRIS to use FAQs #187: also adds a new member to the PipelineEnum class.

Suggested reviewers

  • Hialus
  • yassinsws
  • bassner
  • MichaelOwenDyer
  • kaancayli

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 6

🧹 Nitpick comments (9)
app/pipeline/transcription_ingestion_pipeline.py (5)

11-11: Consider using standard Python logging over asyncio.log.logger.

While importing logger from asyncio.log may work, it's more common and flexible to use the built-in logging module for consistency and configurability across the codebase.


38-39: Avoid using a global lock; encapsulate it in the pipeline class instead.

Relying on a global batch_insert_lock can introduce confusion and potential tight coupling between modules. It's typically cleaner to store locks as part of a class or instance variable, promoting clarity and reducing the likelihood of unexpected concurrency behavior.
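
A minimal sketch of that suggestion, with illustrative names rather than the actual module layout: the lock lives on the class, so callers never touch module-level state.

from threading import Lock

class TranscriptionIngestionPipeline:
    # Class-level lock: still shared by every pipeline instance, but scoped to
    # the class instead of floating in module scope.
    _batch_insert_lock = Lock()

    def batch_insert(self, chunks):
        with self._batch_insert_lock:
            ...  # embed each chunk and add it to the batch (omitted)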


95-96: Combine multiple context managers into a single with statement.

Per the linter hint (SIM117), using a single with statement is more readable and avoids deeply nested blocks.

Below is a suggested diff:

-with batch_insert_lock:
-    with self.collection.batch.rate_limit(requests_per_minute=600) as batch:
+with batch_insert_lock, self.collection.batch.rate_limit(requests_per_minute=600) as batch:
     try:
         for chunk in chunks:
             embed_chunk = self.llm_embedding.embed(
                 chunk[LectureTranscriptionSchema.SEGMENT_TEXT.value]
             )
             batch.add_object(properties=chunk, vector=embed_chunk)
🧰 Tools
🪛 Ruff (0.8.2)

95-96: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)


153-155: Expose chunk_size as a configurable parameter.

The hardcoded chunk_size=1024 might be too large or too small depending on the lecture length and average text size. Making it configurable helps adapt the chunking logic for different use cases.
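
One way this could look, sketched with assumed names: accept chunk_size in the constructor and keep the current value as the default.

class TranscriptionIngestionPipeline:
    DEFAULT_CHUNK_SIZE = 1024  # current hardcoded value becomes the default

    def __init__(self, chunk_size: int = DEFAULT_CHUNK_SIZE):
        self.chunk_size = chunk_size

    def chunk_transcriptions(self, transcriptions):
        # pass self.chunk_size (and any overlap setting) to the splitter here
        ...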


220-250: Consider parallel or batched summarization to improve performance.

The current implementation processes and summarizes chunks sequentially. If the system experiences performance bottlenecks on large transcripts, parallelizing chunk summaries or batching them could reduce overall processing time.
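
A sketch of one possible approach using a thread pool; summarize_chunk is a hypothetical per-chunk helper, and the dictionary key is an assumption, not taken from the schema.

from concurrent.futures import ThreadPoolExecutor

def summarize_chunks_parallel(chunks, summarize_chunk, max_workers=4):
    # Summarize chunks concurrently while preserving the input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        summaries = list(pool.map(summarize_chunk, chunks))
    return [
        {**chunk, "segment_summary": summary}  # key name is an assumption
        for chunk, summary in zip(chunks, summaries)
    ]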

app/domain/data/metrics/transcription_dto.py (1)

6-11: Consider adding field constraints and documentation.

The DTOs would benefit from:

  1. Time constraints (start_time ≤ end_time)
  2. Field descriptions
  3. Class-level docstrings explaining their purpose

Also applies to: 14-16, 19-25
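
For point 1, a hedged sketch of how such a constraint could be expressed with a Pydantic v2 model_validator (a v1 root_validator would be the rough equivalent); field names follow the assumed DTO shape, not necessarily the real one.

from pydantic import BaseModel, model_validator

class TranscriptionSegmentDTO(BaseModel):
    """One transcription segment with its time window."""
    start_time: float
    end_time: float
    text: str

    @model_validator(mode="after")
    def check_time_order(self):
        # Enforce start_time <= end_time at validation time.
        if self.start_time > self.end_time:
            raise ValueError("start_time must be <= end_time")
        return self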

app/vector_database/lecture_transcription_schema.py (1)

28-106: Consider adding error handling and logging.

The initialization function would benefit from:

  1. Error handling for collection creation failures
  2. Logging for successful creation/retrieval
  3. Validation of client parameter

Example implementation:

def init_lecture_transcription_schema(client: WeaviateClient) -> Collection:
    if not client:
        raise ValueError("WeaviateClient is required")
    
    collection_name = LectureTranscriptionSchema.COLLECTION_NAME.value
    try:
        if client.collections.exists(collection_name):
            logger.info(f"Collection {collection_name} already exists")
            return client.collections.get(collection_name)

        collection = client.collections.create(...)  # existing creation code
        logger.info(f"Successfully created collection {collection_name}")
        return collection
    except Exception as e:
        logger.error(f"Failed to initialize collection {collection_name}: {str(e)}")
        raise
app/web/status/transcription_ingestion_callback.py (1)

17-52: Add input validation and enhance logging.

Consider adding:

  1. URL validation for base_url
  2. Logging for stage transitions
  3. Type hints for the return type

Example improvements:

from urllib.parse import urlparse
from typing import List, Optional

def __init__(
    self,
    run_id: str,
    base_url: str,
    initial_stages: Optional[List[StageDTO]] = None,
    lecture_id: Optional[int] = None,
) -> None:
    if not base_url:
        raise ValueError("base_url is required")
    if not urlparse(base_url).scheme:
        raise ValueError("Invalid base_url format")
        
    logger.info(f"Initializing TranscriptionIngestionStatus for run_id: {run_id}")
    # ... rest of the implementation
app/web/routers/webhooks.py (1)

181-193: Remove debug print and add explicit return.

The endpoint has a few minor issues:

  1. Contains a debug print statement.
  2. Missing explicit return statement for consistency with other endpoints.

Apply this diff to fix the issues:

 @router.post(
     "/transcriptions/fullIngestion",
     status_code=status.HTTP_202_ACCEPTED,
     dependencies=[Depends(TokenValidator())],
 )
 def transcription_ingestion_webhook(dto: TranscriptionIngestionPipelineExecutionDto):
     """
     Webhook endpoint to trigger the lecture transcription ingestion pipeline
     """
-    print(f"transcription ingestion got DTO {dto}")
     thread = Thread(target=run_transcription_ingestion_pipeline_worker, args=(dto,))
     thread.start()
+    return
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c4337c6 and 036e31a.

📒 Files selected for processing (9)
  • app/common/PipelineEnum.py (1 hunks)
  • app/domain/data/metrics/transcription_dto.py (1 hunks)
  • app/domain/ingestion/transcription_ingestion/transcription_ingestion_pipeline_execution_dto.py (1 hunks)
  • app/pipeline/prompts/transcription_ingestion_prompts.py (1 hunks)
  • app/pipeline/transcription_ingestion_pipeline.py (1 hunks)
  • app/vector_database/lecture_transcription_schema.py (1 hunks)
  • app/web/routers/webhooks.py (3 hunks)
  • app/web/status/transcription_ingestion_callback.py (1 hunks)
  • requirements.txt (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • requirements.txt
🧰 Additional context used
🪛 GitHub Actions: Run linters
app/pipeline/prompts/transcription_ingestion_prompts.py

[error] 3-3: line too long (226 > 120 characters)

🪛 Ruff (0.8.2)
app/pipeline/transcription_ingestion_pipeline.py

95-96: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build and Push Docker Image
🔇 Additional comments (4)
app/pipeline/transcription_ingestion_pipeline.py (1)

161-185: Verify correctness of offset-based time calculations.

This section uses chained reductions to compute offsets, then correlates them with segment start/end times. Confirm it reliably handles edge cases (e.g., partial overlaps at chunk boundaries, extremely large segments).

Would you like a verification script that inspects the length of each chunk and checks whether the start/end positions align with the segment boundaries?
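
As a rough illustration only, such a check might look like the following; the dictionary keys are assumptions, not the actual chunk schema.

def check_chunk_alignment(chunks, total_text_length):
    # Spot-check that chunk offsets stay within the transcript and that each
    # chunk's time window is ordered. Keys are assumed, not the real schema.
    offset = 0
    for chunk in chunks:
        assert chunk["start_time"] <= chunk["end_time"], "times out of order"
        offset += len(chunk["segment_text"])
    assert offset <= total_text_length, "chunks cover more text than the transcript"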

app/domain/ingestion/transcription_ingestion/transcription_ingestion_pipeline_execution_dto.py (1)

1-17: No major issues found; class design appears solid.

All fields are well-typed, and the alias usage for initial_stages is consistent with Pydantic patterns.

app/common/PipelineEnum.py (1)

21-21: LGTM!

The new pipeline enum value follows the established naming convention and is appropriately placed.

app/web/routers/webhooks.py (1)

16-16: LGTM!

The new imports are well-organized and follow the existing pattern in the file.

Also applies to: 21-23, 26-26

Resolved review threads (outdated):
  • app/domain/data/metrics/transcription_dto.py (2)
  • app/vector_database/lecture_transcription_schema.py
  • app/web/status/transcription_ingestion_callback.py
  • app/web/routers/webhooks.py
@ls1intum deleted a comment from coderabbitai bot Feb 4, 2025
coderabbitai bot previously approved these changes Feb 4, 2025
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (5)
app/pipeline/transcription_ingestion_pipeline.py (5)

11-11: Consider using a more appropriate logging module.

The asyncio.log import seems out of place as the code is not using asyncio. Consider using Python's standard logging module instead.

-from asyncio.log import logger
+import logging
+
+logger = logging.getLogger(__name__)

41-45: Add class-level documentation.

Consider adding a docstring to describe the purpose of the class and its attributes.

 class TranscriptionIngestionPipeline(Pipeline):
+    """Handles ingestion of lecture transcriptions into a vector database.
+    
+    Attributes:
+        llm (IrisLangchainChatModel): Language model for text processing
+        pipeline (Runnable): Processing pipeline for LLM outputs
+        prompt (ChatPromptTemplate): Template for generating prompts
+    """
     llm: IrisLangchainChatModel
     pipeline: Runnable
     prompt: ChatPromptTemplate

93-110: Simplify nested context managers and improve error handling.

The batch insertion method can be simplified and its error handling improved.

     def batch_insert(self, chunks):
         global batch_insert_lock
-        with batch_insert_lock:
-            with self.collection.batch.rate_limit(requests_per_minute=600) as batch:
+        with batch_insert_lock, self.collection.batch.rate_limit(requests_per_minute=600) as batch:
                 try:
                     for chunk in chunks:
                         embed_chunk = self.llm_embedding.embed(
                             chunk[LectureTranscriptionSchema.SEGMENT_TEXT.value]
                         )
                         batch.add_object(properties=chunk, vector=embed_chunk)
                 except Exception as e:
+                    error_msg = f"Error embedding lecture transcription chunk: {str(e)}"
                     logger.error(f"Error embedding lecture transcription chunk: {e}")
                     self.callback.error(
-                        f"Failed to ingest lecture transcriptions into the database: {e}",
+                        error_msg,
                         exception=e,
                         tokens=self.tokens,
                     )
+                    raise  # Re-raise to ensure the pipeline stops on error
🧰 Tools
🪛 Ruff (0.8.2)

95-96: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)


111-205: Consider extracting complex offset calculations into helper methods.

The chunk_transcriptions method contains complex offset calculations that could be made more readable by extracting them into helper methods.

+    def _calculate_slide_chunk_offset(self, slide_chunks: Dict, index: int) -> int:
+        """Calculate the offset for a slide chunk based on previous chunks."""
+        return reduce(
+            lambda acc, txt: acc + len(txt.replace(self.CHUNK_SEPARATOR_CHAR, "")),
+            map(
+                lambda seg: seg[LectureTranscriptionSchema.SEGMENT_TEXT.value],
+                list(slide_chunks.values())[:index],
+            ),
+            0,
+        )

+    def _calculate_semantic_chunk_offset(self, semantic_chunks: List[str], index: int) -> int:
+        """Calculate the offset for a semantic chunk based on previous chunks."""
+        return reduce(
+            lambda acc, txt: acc + len(txt.replace(self.CHUNK_SEPARATOR_CHAR, "")),
+            semantic_chunks[:index],
+            0,
+        )

Also, consider adding a constant for the chunk size threshold:

+    CHUNK_SIZE_THRESHOLD = 1200  # Characters
+
     def chunk_transcriptions(self, transcriptions: List[TranscriptionWebhookDTO]) -> List[Dict[str, Any]]:
-        CHUNK_SEPARATOR_CHAR = "\x1F"
+        self.CHUNK_SEPARATOR_CHAR = "\x1F"
         chunks = []

220-249: Improve error handling in summarize_chunks method.

The error handling in summarize_chunks could be more informative.

     def summarize_chunks(self, chunks):
         chunks_with_summaries = []
-        for chunk in chunks:
+        for i, chunk in enumerate(chunks, 1):
             self.prompt = ChatPromptTemplate.from_messages(
                 [
                     (
                         "system",
                         transcription_summary_prompt(
                             chunk[LectureTranscriptionSchema.LECTURE_NAME.value],
                             chunk[LectureTranscriptionSchema.SEGMENT_TEXT.value],
                         ),
                     ),
                 ]
             )
             prompt_val = self.prompt.format_messages()
             self.prompt = ChatPromptTemplate.from_messages(prompt_val)
             try:
                 response = (self.prompt | self.pipeline).invoke({})
                 self._append_tokens(
                     self.llm.tokens, PipelineEnum.IRIS_VIDEO_TRANSCRIPTION_INGESTION
                 )
                 chunks_with_summaries.append(
                     {
                         **chunk,
                         LectureTranscriptionSchema.SEGMENT_SUMMARY.value: response,
                     }
                 )
             except Exception as e:
-                raise e
+                error_msg = f"Failed to summarize chunk {i}/{len(chunks)}: {str(e)}"
+                logger.error(error_msg)
+                raise Exception(error_msg) from e
         return chunks_with_summaries
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 036e31a and b1db0bc.

📒 Files selected for processing (4)
  • app/domain/data/metrics/transcription_dto.py (1 hunks)
  • app/pipeline/transcription_ingestion_pipeline.py (1 hunks)
  • app/vector_database/lecture_transcription_schema.py (1 hunks)
  • app/web/status/transcription_ingestion_callback.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • app/web/status/transcription_ingestion_callback.py
  • app/domain/data/metrics/transcription_dto.py
  • app/vector_database/lecture_transcription_schema.py
🧰 Additional context used
🪛 Ruff (0.8.2)
app/pipeline/transcription_ingestion_pipeline.py

76-76: Local variable chunks is assigned to but never used

Remove assignment to unused variable chunks

(F841)


95-96: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)

🪛 GitHub Actions: Run linters
app/pipeline/transcription_ingestion_pipeline.py

[warning] 76-76: local variable 'chunks' is assigned to but never used

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build and Push Docker Image
🔇 Additional comments (1)
app/pipeline/transcription_ingestion_pipeline.py (1)

73-92: ⚠️ Potential issue

Uncomment pipeline steps or add TODO comments.

The pipeline execution has commented out critical steps (summarization and batch insertion). Either:

  1. Uncomment these steps if they are ready for use, or
  2. Add TODO comments explaining why they are commented out and when they will be implemented.

Additionally, the chunks variable is assigned but never used due to the commented code.

     def __call__(self) -> None:
         try:
             self.callback.in_progress("Chunking transcriptions")
-            chunks = self.chunk_transcriptions(self.dto.transcriptions)
+            # TODO: Uncomment and test the following pipeline steps
+            # Current status: Implementation in progress, expected completion by next sprint
+            # chunks = self.chunk_transcriptions(self.dto.transcriptions)
 
             self.callback.in_progress("Summarizing transcriptions")
             # chunks = self.summarize_chunks(chunks)
🧰 Tools
🪛 Ruff (0.8.2)

76-76: Local variable chunks is assigned to but never used

Remove assignment to unused variable chunks

(F841)

🪛 GitHub Actions: Run linters

[warning] 76-76: local variable 'chunks' is assigned to but never used

coderabbitai bot previously approved these changes Feb 4, 2025
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
app/pipeline/transcription_ingestion_pipeline.py (3)

95-96: Combine nested with statements.

Multiple context managers can be combined into a single with statement for better readability.

Apply this diff to combine the with statements:

-        with batch_insert_lock:
-            with self.collection.batch.rate_limit(requests_per_minute=600) as batch:
+        with batch_insert_lock, self.collection.batch.rate_limit(requests_per_minute=600) as batch:
🧰 Tools
🪛 Ruff (0.8.2)

95-96: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)


146-146: Define chunk size constants.

Magic numbers (1200 and 1024) for chunk sizes should be defined as constants at the class level for better maintainability and consistency.

Add these constants at the class level:

+    MAX_CHUNK_SIZE = 1200
+    SEMANTIC_CHUNK_SIZE = 1024
+
     def chunk_transcriptions(
         self, transcriptions: List[TranscriptionWebhookDTO]
     ) -> List[Dict[str, Any]]:

Then update the references:

-                if len(segment[LectureTranscriptionSchema.SEGMENT_TEXT.value]) < 1200:
+                if len(segment[LectureTranscriptionSchema.SEGMENT_TEXT.value]) < self.MAX_CHUNK_SIZE:

-                    chunk_size=1024, chunk_overlap=0
+                    chunk_size=self.SEMANTIC_CHUNK_SIZE, chunk_overlap=0

Also applies to: 154-154


206-219: Document edge case handling in get_transcription_segment_of_char_position.

The method silently returns the last segment when the character position exceeds the total length. This behavior should be documented.

Add docstring to explain the behavior:

     @staticmethod
     def get_transcription_segment_of_char_position(
         char_position: int, segments: List[TranscriptionSegmentDTO]
     ):
+        """
+        Find the transcription segment containing the given character position.
+        
+        Args:
+            char_position: The character position to look up
+            segments: List of transcription segments
+        
+        Returns:
+            The segment containing the character position, or the last segment if the
+            position exceeds the total length of all segments.
+        """
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b1db0bc and ffd9c43.

📒 Files selected for processing (4)
  • app/domain/ingestion/transcription_ingestion/transcription_ingestion_pipeline_execution_dto.py (1 hunks)
  • app/pipeline/transcription_ingestion_pipeline.py (1 hunks)
  • app/web/routers/webhooks.py (3 hunks)
  • app/web/status/transcription_ingestion_callback.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • app/domain/ingestion/transcription_ingestion/transcription_ingestion_pipeline_execution_dto.py
  • app/web/status/transcription_ingestion_callback.py
🧰 Additional context used
🪛 Ruff (0.8.2)
app/pipeline/transcription_ingestion_pipeline.py

95-96: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build and Push Docker Image
🔇 Additional comments (1)
app/web/routers/webhooks.py (1)

101-101: ⚠️ Potential issue

Fix incorrect error message.

The error message incorrectly states "Error while deleting lectures" for a transcription ingestion operation.

Apply this diff to fix the error message:

-            logger.error(f"Error while deleting lectures: {e}")
+            logger.error(f"Error in transcription ingestion pipeline: {e}")

Likely invalid or redundant comment.

Resolved review thread (outdated): app/web/routers/webhooks.py
@isabellagessl changed the title Feature/transcriptions webhook endpoint → Add lecture transcriptions webhook endpoint Feb 6, 2025
@cremertim (Contributor) left a comment

Great work so far; some minor comments.

Resolved review threads (outdated):
  • app/pipeline/prompts/transcription_ingestion_prompts.py
  • app/pipeline/transcription_ingestion_pipeline.py
@alexjoham (Member) left a comment

Changes look good so far; I added some change suggestions.

Resolved review threads (outdated):
  • app/domain/data/metrics/transcription_dto.py
  • app/vector_database/lecture_transcription_schema.py (2)
  • app/web/routers/webhooks.py
coderabbitai bot previously approved these changes Feb 11, 2025
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
app/web/routers/webhooks.py (1)

101-101: ⚠️ Potential issue

Fix incorrect error message.

The error message "Error while deleting lectures" is incorrect for the transcription ingestion pipeline.

Apply this diff to fix the error message:

-            logger.error(f"Error while deleting lectures: {e}")
+            logger.error(f"Error in transcription ingestion pipeline: {e}")
🧹 Nitpick comments (2)
app/pipeline/transcription_ingestion_pipeline.py (2)

96-97: Optimize with statements.

Use a single with statement with multiple contexts for better readability.

Apply this diff to combine the with statements:

-        with batch_insert_lock:
-            with self.collection.batch.rate_limit(requests_per_minute=600) as batch:
+        with batch_insert_lock, self.collection.batch.rate_limit(requests_per_minute=600) as batch:
🧰 Tools
🪛 Ruff (0.8.2)

96-97: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)


147-147: Extract magic number into a named constant.

The chunk size threshold of 1200 should be a named constant at the class level for better maintainability.

Apply this diff to extract the magic number:

+    MAX_CHUNK_SIZE = 1200
+
     def chunk_transcriptions(
         self, transcriptions: List[TranscriptionWebhookDTO]
     ) -> List[Dict[str, Any]]:
         CHUNK_SEPARATOR_CHAR = "\x1F"
         chunks = []

         for transcription in transcriptions:
             slide_chunks = {}
             for segment in transcription.transcription.segments:
                 slide_key = f"{transcription.lecture_id}_{transcription.lecture_unit_id}_{segment.slide_number}"

-                if len(segment[LectureTranscriptionSchema.SEGMENT_TEXT.value]) < 1200:
+                if len(segment[LectureTranscriptionSchema.SEGMENT_TEXT.value]) < self.MAX_CHUNK_SIZE:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ffd9c43 and 216cdf7.

📒 Files selected for processing (5)
  • app/domain/data/metrics/transcription_dto.py (1 hunks)
  • app/pipeline/prompts/transcription_ingestion_prompts.py (1 hunks)
  • app/pipeline/transcription_ingestion_pipeline.py (1 hunks)
  • app/vector_database/lecture_transcription_schema.py (1 hunks)
  • app/web/routers/webhooks.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • app/pipeline/prompts/transcription_ingestion_prompts.py
  • app/vector_database/lecture_transcription_schema.py
  • app/domain/data/metrics/transcription_dto.py
🧰 Additional context used
🪛 Ruff (0.8.2)
app/pipeline/transcription_ingestion_pipeline.py

1-1: threading imported but unused

Remove unused import: threading

(F401)


96-97: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)

🪛 GitHub Actions: Run linters
app/pipeline/transcription_ingestion_pipeline.py

[warning] 1-1: 'threading' imported but unused

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build and Push Docker Image
🔇 Additional comments (3)
app/web/routers/webhooks.py (2)

194-194: Replace print statement with proper logging.

Using print statements for debugging is not recommended in production code. Use the logger instead.

Apply this diff to use proper logging:

-    logger.info(f"transcription ingestion got DTO {dto}")
+    logger.info(f"Received transcription ingestion request with DTO: {dto}")

185-197: LGTM! Consistent implementation with other webhook endpoints.

The new transcription ingestion webhook endpoint follows the established pattern:

  • Uses proper authentication via TokenValidator
  • Runs the worker in a separate thread
  • Returns 202 Accepted status code
app/pipeline/transcription_ingestion_pipeline.py (1)

39-39: LGTM! Good reuse of existing lock.

Using the shared batch_update_lock from the FAQ ingestion pipeline ensures consistent concurrency control across all ingestion pipelines.

Resolved review thread (outdated): app/pipeline/transcription_ingestion_pipeline.py
@ls1intum deleted a comment from coderabbitai bot Feb 11, 2025
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
app/pipeline/transcription_ingestion_pipeline.py (1)

223-235: 🛠️ Refactor suggestion

Optimize prompt template creation.

Creating a new prompt template for each chunk is inefficient. This issue was previously identified and should be addressed.

     def summarize_chunks(self, chunks):
+        self.prompt = ChatPromptTemplate.from_messages(
+            [
+                (
+                    "system",
+                    lambda chunk: transcription_summary_prompt(
+                        chunk[LectureTranscriptionSchema.LECTURE_NAME.value],
+                        chunk[LectureTranscriptionSchema.SEGMENT_TEXT.value],
+                    ),
+                ),
+            ]
+        )
         chunks_with_summaries = []
         for chunk in chunks:
-            self.prompt = ChatPromptTemplate.from_messages(
-                [
-                    (
-                        "system",
-                        transcription_summary_prompt(
-                            chunk[LectureTranscriptionSchema.LECTURE_NAME.value],
-                            chunk[LectureTranscriptionSchema.SEGMENT_TEXT.value],
-                        ),
-                    ),
-                ]
-            )
-            prompt_val = self.prompt.format_messages()
-            self.prompt = ChatPromptTemplate.from_messages(prompt_val)
+            prompt_val = self.prompt.format_messages(chunk=chunk)
🧹 Nitpick comments (5)
app/pipeline/transcription_ingestion_pipeline.py (5)

57-57: Consider making the embedding model configurable.

The embedding model is hardcoded to "embedding-small". For better flexibility and future-proofing, consider making this configurable through the constructor or environment variables.

-        self.llm_embedding = BasicRequestHandler("embedding-small")
+        self.llm_embedding = BasicRequestHandler(
+            self.config.get("EMBEDDING_MODEL", "embedding-small")
+        )

95-96: Combine nested with statements for better readability.

Multiple with statements can be combined using a comma.

-        with batch_insert_lock:
-            with self.collection.batch.rate_limit(requests_per_minute=600) as batch:
+        with batch_insert_lock, self.collection.batch.rate_limit(requests_per_minute=600) as batch:
🧰 Tools
🪛 Ruff (0.8.2)

95-96: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)


96-96: Consider making the rate limit configurable.

The rate limit of 600 requests per minute is hardcoded. Consider making this configurable to allow for different environments and requirements.

-            with self.collection.batch.rate_limit(requests_per_minute=600) as batch:
+            with self.collection.batch.rate_limit(
+                requests_per_minute=self.config.get("BATCH_REQUESTS_PER_MINUTE", 600)
+            ) as batch:
🧰 Tools
🪛 Ruff (0.8.2)

95-96: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)


146-146: Extract magic number into a named constant.

The chunk size threshold of 1200 should be extracted into a named constant for better maintainability.

+    MAX_CHUNK_SIZE = 1200
+
     def chunk_transcriptions(
         self, transcriptions: List[TranscriptionWebhookDTO]
     ) -> List[Dict[str, Any]]:
         # ...
-                if len(segment[LectureTranscriptionSchema.SEGMENT_TEXT.value]) < 1200:
+                if len(segment[LectureTranscriptionSchema.SEGMENT_TEXT.value]) < self.MAX_CHUNK_SIZE:

162-172: Add documentation for offset calculations.

The offset calculations are complex and would benefit from documentation explaining their purpose and logic.

                 for j, chunk in enumerate(semantic_chunks):
+                    # Calculate the total length of text in previous slides
+                    # This is used to maintain correct character position mapping
+                    # when splitting text into semantic chunks
                     offset_slide_chunk = reduce(
                         lambda acc, txt: acc
                         + len(txt.replace(CHUNK_SEPARATOR_CHAR, "")),
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 216cdf7 and 5bd5e0d.

📒 Files selected for processing (2)
  • app/pipeline/prompts/transcription_ingestion_prompts.py (1 hunks)
  • app/pipeline/transcription_ingestion_pipeline.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • app/pipeline/prompts/transcription_ingestion_prompts.py
🧰 Additional context used
🪛 Ruff (0.8.2)
app/pipeline/transcription_ingestion_pipeline.py

95-96: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build and Push Docker Image
🔇 Additional comments (2)
app/pipeline/transcription_ingestion_pipeline.py (2)

38-38: LGTM! Good job reusing the existing lock.

The reuse of batch_update_lock from the FAQ ingestion pipeline ensures consistent locking behavior across different ingestion pipelines, preventing potential concurrency issues.


73-92: LGTM! Well-structured pipeline execution with proper error handling.

The implementation includes:

  • Clear progress updates through callback
  • Comprehensive error handling with logging
  • Token tracking for billing/monitoring

coderabbitai bot previously approved these changes Feb 11, 2025
…ls1intum/Pyris into feature/transcriptions-webhook-endpoint
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (6)
app/pipeline/transcription_ingestion_pipeline.py (3)

95-112: Combine nested with statements for better readability.

The nested with statements can be combined into a single statement.

Apply this diff to improve readability:

-        with batch_insert_lock:
-            with self.collection.batch.rate_limit(requests_per_minute=600) as batch:
+        with batch_insert_lock, self.collection.batch.rate_limit(requests_per_minute=600) as batch:
🧰 Tools
🪛 Ruff (0.8.2)

97-98: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)


175-175: Rename unused loop variable.

The loop variable j is not used within the loop body.

Apply this diff to improve code clarity:

-                for j, chunk in enumerate(semantic_chunks):
+                for _j, chunk in enumerate(semantic_chunks):
🧰 Tools
🪛 Ruff (0.8.2)

175-175: Loop control variable j not used within loop body

Rename unused j to _j

(B007)


199-214: Add docstrings to helper methods.

The helper methods would benefit from docstrings explaining their purpose, parameters, and return values.

Consider adding docstrings like this:

     @staticmethod
     def get_transcription_segment_of_char_position(
         char_position: int, segments: List[TranscriptionSegmentDTO]
     ):
+        """
+        Find the transcription segment containing the given character position.
+        
+        Args:
+            char_position: The character position to look up
+            segments: List of transcription segments to search
+            
+        Returns:
+            The segment containing the character position, or the last segment if not found
+        """

Also applies to: 215-225

app/llm/external/openai_embeddings.py (2)

3-3: Minor formatting change recognized.
The added blank line doesn't affect functionality. Feel free to keep it for readability or remove if unnecessary.


48-65: Consider adding a docstring and handling errors in semantic splitting.
While the implementation is straightforward, you may want to add a docstring describing the parameters and return value. Additionally, consider how this method should behave if SemanticChunker fails or if text is excessively large or empty.
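
A hedged sketch of what such a docstring and fallback behavior could look like, written here as a standalone function; the real method signature in openai_embeddings.py may differ.

from typing import List
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

def split_text_semantically(
    embeddings: OpenAIEmbeddings,
    text: str,
    breakpoint_threshold_type: str = "percentile",
) -> List[str]:
    """Split text into semantically coherent chunks.

    Returns an empty list for empty input; if the chunker fails, falls back to
    returning the original text as a single chunk.
    """
    if not text:
        return []
    try:
        chunker = SemanticChunker(
            embeddings, breakpoint_threshold_type=breakpoint_threshold_type
        )
        return chunker.split_text(text)
    except Exception:
        return [text]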

app/llm/request_handler/basic_request_handler.py (1)

49-63: Suggest adding docstring and potential error handling.
Documenting the new split_text_semantically method would help clarify its purpose, parameters, and return type. In addition, consider how to handle unexpected values (e.g., invalid breakpoint_threshold_type) or empty text inputs.
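
A sketch of a documented delegating method along those lines; llm_manager, model_id, and the downstream split_text_semantically call are assumptions about the surrounding class, not confirmed API.

from typing import List, Literal

def split_text_semantically(
    self,
    text: str,
    breakpoint_threshold_type: Literal[
        "percentile", "standard_deviation", "interquartile", "gradient"
    ] = "percentile",
) -> List[str]:
    """Split text semantically via the configured embedding model.

    Rejects empty input early instead of passing it downstream.
    """
    if not text.strip():
        raise ValueError("text must be non-empty")
    llm = self.llm_manager.get_llm_by_id(self.model_id)  # assumed lookup
    return llm.split_text_semantically(text, breakpoint_threshold_type)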

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5bd5e0d and 7458423.

📒 Files selected for processing (4)
  • app/llm/external/openai_embeddings.py (4 hunks)
  • app/llm/request_handler/basic_request_handler.py (2 hunks)
  • app/pipeline/transcription_ingestion_pipeline.py (1 hunks)
  • requirements.txt (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • requirements.txt
🧰 Additional context used
🪛 Ruff (0.8.2)
app/pipeline/transcription_ingestion_pipeline.py

97-98: Use a single with statement with multiple contexts instead of nested with statements

(SIM117)


175-175: Loop control variable j not used within loop body

Rename unused j to _j

(B007)

🔇 Additional comments (9)
app/pipeline/transcription_ingestion_pipeline.py (3)

37-37: LGTM! Good reuse of the FAQ ingestion lock.

Reusing the batch_update_lock from the FAQ ingestion pipeline ensures consistent locking behavior across different ingestion pipelines, preventing potential concurrency issues.


47-73: LGTM! Well-structured initialization with appropriate LLM configuration.

The initialization is thorough with:

  • Proper vector database setup
  • Well-configured LLM with appropriate requirements (GPT-4.5, 16K context, privacy compliance)
  • Suitable completion arguments (temperature=0 for deterministic output)

74-94: LGTM! Robust pipeline execution with comprehensive error handling.

The execution flow is well-structured with:

  • Clear progress reporting via callbacks
  • Proper error handling with detailed error messages
  • Logical progression of operations (chunk -> summarize -> insert)
app/llm/external/openai_embeddings.py (5)

4-4: Verify compatibility of langchain_experimental.text_splitter.
Ensure that the installed langchain_experimental version supports SemanticChunker and that it behaves as expected, especially in a production environment.


11-11: New imports look good.
These updated imports align with the switch to OpenAIEmbeddings and AzureOpenAIEmbeddings. No immediate issues detected.


20-20: Clear type annotation for _client.
Declaring _client as OpenAIEmbeddings makes the code more readable and consistent with the new approach.


71-71: Client initialization is consistent with updated type.
The updated _client instantiation matches the revised embedding model approach. Looks good.


84-84: Azure embeddings initialization aligns with the new structure.
Everything matches the shift to Azure-specific embeddings. No issues here.

app/llm/request_handler/basic_request_handler.py (1)

1-1: Good addition of Literal for typed parameters.
This helps enforce type safety and improves readability of parameter usage.

@cremertim (Contributor) left a comment

Code lgtm
