Add lecture transcriptions webhook endpoint #204
base: main
Conversation
…ls1intum/Pyris into feature/transcriptions-webhook-endpoint
Walkthrough
This pull request introduces a transcription ingestion feature. It adds a new enumeration member to the existing PipelineEnum.
Sequence Diagram(s)
sequenceDiagram
participant Client as Webhook Client
participant Router as FastAPI Router
participant Worker as Ingestion Pipeline Worker
participant Pipeline as TranscriptionIngestionPipeline
participant DB as Vector Database
participant Status as Ingestion Status Callback
Client->>Router: POST /transcriptions/fullIngestion with DTO
Router->>Worker: run_transcription_ingestion_pipeline_worker(dto)
Worker->>Pipeline: Initialize and execute pipeline
Pipeline->>Pipeline: Chunk transcriptions & generate summaries
Pipeline->>DB: Batch insert summarized chunks
Pipeline->>Status: Update status during processing
Status-->>Worker: Status update logged
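For orientation, a minimal client-side sketch of the flow above. The host, URL prefix, auth header, and payload fields are placeholders; the real request body is defined by TranscriptionIngestionPipelineExecutionDto.

import requests

# Hypothetical values throughout: host, route prefix, token header, and payload
# keys are illustrative only; the actual DTO shape lives in
# transcription_ingestion_pipeline_execution_dto.py.
response = requests.post(
    "https://pyris.example.org/api/v1/webhooks/transcriptions/fullIngestion",
    headers={"Authorization": "<api token>"},
    json={"transcriptions": [], "initialStages": []},
    timeout=30,
)
# The endpoint hands the DTO to a background worker thread and immediately
# returns 202 Accepted; ingestion progress is reported via status callbacks.
assert response.status_code == 202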
Actionable comments posted: 6
🧹 Nitpick comments (9)
app/pipeline/transcription_ingestion_pipeline.py (5)
11-11: Consider using standard Python logging over asyncio.log.logger.
While importing logger from asyncio.log may work, it's more common and flexible to use the built-in logging module for consistency and configurability across the codebase.
38-39: Avoid using a global lock; encapsulate it in the pipeline class instead.
Relying on a global batch_insert_lock can introduce confusion and potential tight coupling between modules. It's typically cleaner to store locks as part of a class or instance variable, promoting clarity and reducing the likelihood of unexpected concurrency behavior.
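A minimal sketch of the suggested encapsulation, assuming the lock only needs to be shared between instances of this pipeline; the names and the surrounding class body are illustrative.

from threading import Lock


class TranscriptionIngestionPipeline:
    # Lock owned by the pipeline class instead of living as a module-level global.
    _batch_insert_lock: Lock = Lock()

    def batch_insert(self, chunks):
        with self._batch_insert_lock:
            # rate-limited batch insert of embedded chunks would go here
            pass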
95-96: Combine multiple context managers into a single with statement.
Per the linter hint (SIM117), using a single with statement is more readable and avoids deeply nested blocks. Below is a suggested diff:
-with batch_insert_lock:
-    with self.collection.batch.rate_limit(requests_per_minute=600) as batch:
+with batch_insert_lock, self.collection.batch.rate_limit(requests_per_minute=600) as batch:
     try:
         for chunk in chunks:
             embed_chunk = self.llm_embedding.embed(
                 chunk[LectureTranscriptionSchema.SEGMENT_TEXT.value]
             )
             batch.add_object(properties=chunk, vector=embed_chunk)
🧰 Tools
🪛 Ruff (0.8.2)
95-96: Use a single with statement with multiple contexts instead of nested with statements (SIM117)
153-155: Expose chunk_size as a configurable parameter.
The hardcoded chunk_size=1024 might be too large or too small depending on the lecture length and average text size. Making it configurable helps adapt the chunking logic for different use cases.
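A minimal sketch of how the parameter could be surfaced; the constructor signature and attribute names are illustrative, not the pipeline's actual interface.

class TranscriptionIngestionPipeline:
    DEFAULT_CHUNK_SIZE = 1024

    def __init__(self, chunk_size: int = DEFAULT_CHUNK_SIZE, chunk_overlap: int = 0):
        # Callers can tune chunking per course or lecture length instead of
        # relying on the hardcoded 1024.
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap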
220-250: Consider parallel or batched summarization to improve performance.
The current implementation processes and summarizes chunks sequentially. If the system experiences performance bottlenecks on large transcripts, parallelizing chunk summaries or batching them could reduce overall processing time.
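One possible shape for this, assuming a hypothetical per-chunk helper summarize_chunk() that wraps the existing prompt and LLM call; the summary key name is illustrative.

from concurrent.futures import ThreadPoolExecutor


def summarize_chunks_parallel(chunks, summarize_chunk, max_workers=4):
    """Summarize chunks concurrently while preserving their original order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        summaries = list(executor.map(summarize_chunk, chunks))
    return [
        {**chunk, "segment_summary": summary}  # key name is a placeholder
        for chunk, summary in zip(chunks, summaries)
    ]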
app/domain/data/metrics/transcription_dto.py (1)
6-11: Consider adding field constraints and documentation.
The DTOs would benefit from:
- Time constraints (start_time ≤ end_time)
- Field descriptions
- Class-level docstrings explaining their purpose
Also applies to: 14-16, 19-25
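A minimal sketch of what such constraints could look like with Pydantic v2; the field names mirror the ones discussed in this review (start_time, end_time, text, slide_number), but the exact DTO layout is the author's.

from pydantic import BaseModel, Field, model_validator


class TranscriptionSegmentDTO(BaseModel):
    """A single transcription segment with its time range and slide context."""

    start_time: float = Field(..., ge=0, description="Segment start in seconds")
    end_time: float = Field(..., ge=0, description="Segment end in seconds")
    text: str = Field("", description="Transcribed text of the segment")
    slide_number: int = Field(0, ge=0, description="Slide shown during the segment")

    @model_validator(mode="after")
    def check_time_order(self):
        # Enforces the start_time <= end_time constraint suggested above.
        if self.start_time > self.end_time:
            raise ValueError("start_time must not exceed end_time")
        return self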
app/vector_database/lecture_transcription_schema.py (1)
28-106: Consider adding error handling and logging.
The initialization function would benefit from:
- Error handling for collection creation failures
- Logging for successful creation/retrieval
- Validation of client parameter
Example implementation:
def init_lecture_transcription_schema(client: WeaviateClient) -> Collection:
    if not client:
        raise ValueError("WeaviateClient is required")
    collection_name = LectureTranscriptionSchema.COLLECTION_NAME.value
    try:
        if client.collections.exists(collection_name):
            logger.info(f"Collection {collection_name} already exists")
            return client.collections.get(collection_name)
        collection = client.collections.create(...)  # existing creation code
        logger.info(f"Successfully created collection {collection_name}")
        return collection
    except Exception as e:
        logger.error(f"Failed to initialize collection {collection_name}: {str(e)}")
        raise
app/web/status/transcription_ingestion_callback.py (1)
17-52: Add input validation and enhance logging.
Consider adding:
- URL validation for base_url
- Logging for stage transitions
- Type hints for the return type
Example improvements:
from urllib.parse import urlparse
from typing import List, Optional

def __init__(
    self,
    run_id: str,
    base_url: str,
    initial_stages: Optional[List[StageDTO]] = None,
    lecture_id: Optional[int] = None,
) -> None:
    if not base_url:
        raise ValueError("base_url is required")
    if not urlparse(base_url).scheme:
        raise ValueError("Invalid base_url format")
    logger.info(f"Initializing TranscriptionIngestionStatus for run_id: {run_id}")
    # ... rest of the implementation
app/web/routers/webhooks.py (1)
181-193: Remove debug print and add explicit return.
The endpoint has a few minor issues:
- Contains a debug print statement.
- Missing explicit return statement for consistency with other endpoints.
Apply this diff to fix the issues:
 @router.post(
     "/transcriptions/fullIngestion",
     status_code=status.HTTP_202_ACCEPTED,
     dependencies=[Depends(TokenValidator())],
 )
 def transcription_ingestion_webhook(dto: TranscriptionIngestionPipelineExecutionDto):
     """
     Webhook endpoint to trigger the lecture transcription ingestion pipeline
     """
-    print(f"transcription ingestion got DTO {dto}")
     thread = Thread(target=run_transcription_ingestion_pipeline_worker, args=(dto,))
     thread.start()
+    return
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
app/common/PipelineEnum.py (1 hunks)
app/domain/data/metrics/transcription_dto.py (1 hunks)
app/domain/ingestion/transcription_ingestion/transcription_ingestion_pipeline_execution_dto.py (1 hunks)
app/pipeline/prompts/transcription_ingestion_prompts.py (1 hunks)
app/pipeline/transcription_ingestion_pipeline.py (1 hunks)
app/vector_database/lecture_transcription_schema.py (1 hunks)
app/web/routers/webhooks.py (3 hunks)
app/web/status/transcription_ingestion_callback.py (1 hunks)
requirements.txt (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- requirements.txt
🧰 Additional context used
🪛 GitHub Actions: Run linters
app/pipeline/prompts/transcription_ingestion_prompts.py
[error] 3-3: line too long (226 > 120 characters)
🪛 Ruff (0.8.2)
app/pipeline/transcription_ingestion_pipeline.py
95-96: Use a single with statement with multiple contexts instead of nested with statements (SIM117)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Push Docker Image
🔇 Additional comments (4)
app/pipeline/transcription_ingestion_pipeline.py (1)
161-185: Verify correctness of offset-based time calculations.
This section uses chained reductions to compute offsets, then correlates them with segment start/end times. Confirm it reliably handles edge cases (e.g., partial overlaps at chunk boundaries, extremely large segments).
Would you like a verification script that inspects the length of each chunk and checks whether the start/end positions align with the segment boundaries?
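If such a script is wanted, a rough sketch could look like the following; the chunk key names and segment attributes are placeholders for whatever chunk_transcriptions() actually emits.

def verify_chunk_time_bounds(chunks, segments):
    """Check that every chunk's start/end times stay within the lecture's time range."""
    lecture_start = min(seg.start_time for seg in segments)
    lecture_end = max(seg.end_time for seg in segments)
    for i, chunk in enumerate(chunks):
        start = chunk["segment_start_time"]  # placeholder key name
        end = chunk["segment_end_time"]      # placeholder key name
        assert lecture_start <= start <= end <= lecture_end, (
            f"chunk {i} has times outside the lecture range: {start}-{end}"
        )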
app/domain/ingestion/transcription_ingestion/transcription_ingestion_pipeline_execution_dto.py (1)
1-17: No major issues found; class design appears solid.
All fields are well-typed, and the alias usage for initial_stages is consistent with Pydantic patterns.
app/common/PipelineEnum.py (1)
21-21: LGTM!
The new pipeline enum value follows the established naming convention and is appropriately placed.
app/web/routers/webhooks.py (1)
16-16: LGTM!
The new imports are well-organized and follow the existing pattern in the file.
Also applies to: 21-23, 26-26
Actionable comments posted: 0
🧹 Nitpick comments (5)
app/pipeline/transcription_ingestion_pipeline.py (5)
11-11: Consider using a more appropriate logging module.
The asyncio.log import seems out of place as the code is not using asyncio. Consider using Python's standard logging module instead.
-from asyncio.log import logger
+import logging
+
+logger = logging.getLogger(__name__)
41-45: Add class-level documentation.
Consider adding a docstring to describe the purpose of the class and its attributes.
 class TranscriptionIngestionPipeline(Pipeline):
+    """Handles ingestion of lecture transcriptions into a vector database.
+
+    Attributes:
+        llm (IrisLangchainChatModel): Language model for text processing
+        pipeline (Runnable): Processing pipeline for LLM outputs
+        prompt (ChatPromptTemplate): Template for generating prompts
+    """
     llm: IrisLangchainChatModel
     pipeline: Runnable
     prompt: ChatPromptTemplate
93-110: Simplify nested context managers and improve error handling.
The batch insertion method can be simplified and its error handling improved.
 def batch_insert(self, chunks):
     global batch_insert_lock
-    with batch_insert_lock:
-        with self.collection.batch.rate_limit(requests_per_minute=600) as batch:
+    with batch_insert_lock, self.collection.batch.rate_limit(requests_per_minute=600) as batch:
         try:
             for chunk in chunks:
                 embed_chunk = self.llm_embedding.embed(
                     chunk[LectureTranscriptionSchema.SEGMENT_TEXT.value]
                 )
                 batch.add_object(properties=chunk, vector=embed_chunk)
         except Exception as e:
+            error_msg = f"Error embedding lecture transcription chunk: {str(e)}"
             logger.error(f"Error embedding lecture transcription chunk: {e}")
             self.callback.error(
-                f"Failed to ingest lecture transcriptions into the database: {e}",
+                error_msg,
                 exception=e,
                 tokens=self.tokens,
             )
+            raise  # Re-raise to ensure the pipeline stops on error
🧰 Tools
🪛 Ruff (0.8.2)
95-96: Use a single with statement with multiple contexts instead of nested with statements (SIM117)
111-205: Consider extracting complex offset calculations into helper methods.
The chunk_transcriptions method contains complex offset calculations that could be made more readable by extracting them into helper methods.
+    def _calculate_slide_chunk_offset(self, slide_chunks: Dict, index: int) -> int:
+        """Calculate the offset for a slide chunk based on previous chunks."""
+        return reduce(
+            lambda acc, txt: acc + len(txt.replace(self.CHUNK_SEPARATOR_CHAR, "")),
+            map(
+                lambda seg: seg[LectureTranscriptionSchema.SEGMENT_TEXT.value],
+                list(slide_chunks.values())[:index],
+            ),
+            0,
+        )
+
+    def _calculate_semantic_chunk_offset(self, semantic_chunks: List[str], index: int) -> int:
+        """Calculate the offset for a semantic chunk based on previous chunks."""
+        return reduce(
+            lambda acc, txt: acc + len(txt.replace(self.CHUNK_SEPARATOR_CHAR, "")),
+            semantic_chunks[:index],
+            0,
+        )
Also, consider adding a constant for the chunk size threshold:
+    CHUNK_SIZE_THRESHOLD = 1200  # Characters
+
     def chunk_transcriptions(self, transcriptions: List[TranscriptionWebhookDTO]) -> List[Dict[str, Any]]:
-        CHUNK_SEPARATOR_CHAR = "\x1F"
+        self.CHUNK_SEPARATOR_CHAR = "\x1F"
         chunks = []
220-249: Improve error handling in summarize_chunks method.
The error handling in summarize_chunks could be more informative.
 def summarize_chunks(self, chunks):
     chunks_with_summaries = []
-    for chunk in chunks:
+    for i, chunk in enumerate(chunks, 1):
         self.prompt = ChatPromptTemplate.from_messages(
             [
                 (
                     "system",
                     transcription_summary_prompt(
                         chunk[LectureTranscriptionSchema.LECTURE_NAME.value],
                         chunk[LectureTranscriptionSchema.SEGMENT_TEXT.value],
                     ),
                 ),
             ]
         )
         prompt_val = self.prompt.format_messages()
         self.prompt = ChatPromptTemplate.from_messages(prompt_val)
         try:
             response = (self.prompt | self.pipeline).invoke({})
             self._append_tokens(
                 self.llm.tokens, PipelineEnum.IRIS_VIDEO_TRANSCRIPTION_INGESTION
             )
             chunks_with_summaries.append(
                 {
                     **chunk,
                     LectureTranscriptionSchema.SEGMENT_SUMMARY.value: response,
                 }
             )
         except Exception as e:
-            raise e
+            error_msg = f"Failed to summarize chunk {i}/{len(chunks)}: {str(e)}"
+            logger.error(error_msg)
+            raise Exception(error_msg) from e
     return chunks_with_summaries
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
app/domain/data/metrics/transcription_dto.py (1 hunks)
app/pipeline/transcription_ingestion_pipeline.py (1 hunks)
app/vector_database/lecture_transcription_schema.py (1 hunks)
app/web/status/transcription_ingestion_callback.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- app/web/status/transcription_ingestion_callback.py
- app/domain/data/metrics/transcription_dto.py
- app/vector_database/lecture_transcription_schema.py
🧰 Additional context used
🪛 Ruff (0.8.2)
app/pipeline/transcription_ingestion_pipeline.py
76-76: Local variable chunks is assigned to but never used. Remove assignment to unused variable chunks (F841)
95-96: Use a single with statement with multiple contexts instead of nested with statements (SIM117)
🪛 GitHub Actions: Run linters
app/pipeline/transcription_ingestion_pipeline.py
[warning] 76-76: local variable 'chunks' is assigned to but never used
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Push Docker Image
🔇 Additional comments (1)
app/pipeline/transcription_ingestion_pipeline.py (1)
73-92: ⚠️ Potential issue: Uncomment pipeline steps or add TODO comments.
The pipeline execution has commented out critical steps (summarization and batch insertion). Either:
- Uncomment these steps if they are ready for use, or
- Add TODO comments explaining why they are commented out and when they will be implemented.
Additionally, the chunks variable is assigned but never used due to the commented code.
 def __call__(self) -> None:
     try:
         self.callback.in_progress("Chunking transcriptions")
-        chunks = self.chunk_transcriptions(self.dto.transcriptions)
+        # TODO: Uncomment and test the following pipeline steps
+        # Current status: Implementation in progress, expected completion by next sprint
+        # chunks = self.chunk_transcriptions(self.dto.transcriptions)
         self.callback.in_progress("Summarizing transcriptions")
         # chunks = self.summarize_chunks(chunks)
🧰 Tools
🪛 Ruff (0.8.2)
76-76: Local variable chunks is assigned to but never used. Remove assignment to unused variable chunks (F841)
🪛 GitHub Actions: Run linters
[warning] 76-76: local variable 'chunks' is assigned to but never used
Actionable comments posted: 2
🧹 Nitpick comments (3)
app/pipeline/transcription_ingestion_pipeline.py (3)
95-96: Combine nested with statements.
Multiple context managers can be combined into a single with statement for better readability.
Apply this diff to combine the with statements:
-    with batch_insert_lock:
-        with self.collection.batch.rate_limit(requests_per_minute=600) as batch:
+    with batch_insert_lock, self.collection.batch.rate_limit(requests_per_minute=600) as batch:
🧰 Tools
🪛 Ruff (0.8.2)
95-96: Use a single with statement with multiple contexts instead of nested with statements (SIM117)
146-146: Define chunk size constants.
Magic numbers (1200 and 1024) for chunk sizes should be defined as constants at the class level for better maintainability and consistency.
Add these constants at the class level:
+    MAX_CHUNK_SIZE = 1200
+    SEMANTIC_CHUNK_SIZE = 1024
+
     def chunk_transcriptions(
         self, transcriptions: List[TranscriptionWebhookDTO]
     ) -> List[Dict[str, Any]]:
Then update the references:
-    if len(segment[LectureTranscriptionSchema.SEGMENT_TEXT.value]) < 1200:
+    if len(segment[LectureTranscriptionSchema.SEGMENT_TEXT.value]) < self.MAX_CHUNK_SIZE:
-    chunk_size=1024, chunk_overlap=0
+    chunk_size=self.SEMANTIC_CHUNK_SIZE, chunk_overlap=0
Also applies to: 154-154
206-219: Document edge case handling in get_transcription_segment_of_char_position.
The method silently returns the last segment when the character position exceeds the total length. This behavior should be documented.
Add docstring to explain the behavior:
 @staticmethod
 def get_transcription_segment_of_char_position(
     char_position: int, segments: List[TranscriptionSegmentDTO]
 ):
+    """
+    Find the transcription segment containing the given character position.
+
+    Args:
+        char_position: The character position to look up
+        segments: List of transcription segments
+
+    Returns:
+        The segment containing the character position, or the last segment if the
+        position exceeds the total length of all segments.
+    """
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
app/domain/ingestion/transcription_ingestion/transcription_ingestion_pipeline_execution_dto.py (1 hunks)
app/pipeline/transcription_ingestion_pipeline.py (1 hunks)
app/web/routers/webhooks.py (3 hunks)
app/web/status/transcription_ingestion_callback.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- app/domain/ingestion/transcription_ingestion/transcription_ingestion_pipeline_execution_dto.py
- app/web/status/transcription_ingestion_callback.py
🧰 Additional context used
🪛 Ruff (0.8.2)
app/pipeline/transcription_ingestion_pipeline.py
95-96: Use a single with statement with multiple contexts instead of nested with statements (SIM117)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Push Docker Image
🔇 Additional comments (1)
app/web/routers/webhooks.py (1)
101-101: ⚠️ Potential issue: Fix incorrect error message.
The error message incorrectly states "Error while deleting lectures" for a transcription ingestion operation.
Apply this diff to fix the error message:
-    logger.error(f"Error while deleting lectures: {e}")
+    logger.error(f"Error in transcription ingestion pipeline: {e}")
Likely invalid or redundant comment.
Great work so far, some minor comments
Changes look good so far, I added some change suggestions.
Actionable comments posted: 2
♻️ Duplicate comments (1)
app/web/routers/webhooks.py (1)
101-101: ⚠️ Potential issue: Fix incorrect error message.
The error message "Error while deleting lectures" is incorrect for the transcription ingestion pipeline.
Apply this diff to fix the error message:
-    logger.error(f"Error while deleting lectures: {e}")
+    logger.error(f"Error in transcription ingestion pipeline: {e}")
🧹 Nitpick comments (2)
app/pipeline/transcription_ingestion_pipeline.py (2)
96-97: Optimize with statements.
Use a single with statement with multiple contexts for better readability.
Apply this diff to combine the with statements:
-    with batch_insert_lock:
-        with self.collection.batch.rate_limit(requests_per_minute=600) as batch:
+    with batch_insert_lock, self.collection.batch.rate_limit(requests_per_minute=600) as batch:
🧰 Tools
🪛 Ruff (0.8.2)
96-97: Use a single with statement with multiple contexts instead of nested with statements (SIM117)
147-147: Extract magic number into a named constant.
The chunk size threshold of 1200 should be a named constant at the class level for better maintainability.
Apply this diff to extract the magic number:
+    MAX_CHUNK_SIZE = 1200
+
     def chunk_transcriptions(
         self, transcriptions: List[TranscriptionWebhookDTO]
     ) -> List[Dict[str, Any]]:
         CHUNK_SEPARATOR_CHAR = "\x1F"
         chunks = []
         for transcription in transcriptions:
             slide_chunks = {}
             for segment in transcription.transcription.segments:
                 slide_key = f"{transcription.lecture_id}_{transcription.lecture_unit_id}_{segment.slide_number}"
-                if len(segment[LectureTranscriptionSchema.SEGMENT_TEXT.value]) < 1200:
+                if len(segment[LectureTranscriptionSchema.SEGMENT_TEXT.value]) < self.MAX_CHUNK_SIZE:
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
app/domain/data/metrics/transcription_dto.py (1 hunks)
app/pipeline/prompts/transcription_ingestion_prompts.py (1 hunks)
app/pipeline/transcription_ingestion_pipeline.py (1 hunks)
app/vector_database/lecture_transcription_schema.py (1 hunks)
app/web/routers/webhooks.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- app/pipeline/prompts/transcription_ingestion_prompts.py
- app/vector_database/lecture_transcription_schema.py
- app/domain/data/metrics/transcription_dto.py
🧰 Additional context used
🪛 Ruff (0.8.2)
app/pipeline/transcription_ingestion_pipeline.py
1-1: threading imported but unused. Remove unused import: threading (F401)
96-97: Use a single with statement with multiple contexts instead of nested with statements (SIM117)
🪛 GitHub Actions: Run linters
app/pipeline/transcription_ingestion_pipeline.py
[warning] 1-1: 'threading' imported but unused
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Push Docker Image
🔇 Additional comments (3)
app/web/routers/webhooks.py (2)
194-194: Replace print statement with proper logging.
Using print statements for debugging is not recommended in production code. Use the logger instead.
Apply this diff to use proper logging:
-    logger.info(f"transcription ingestion got DTO {dto}")
+    logger.info(f"Received transcription ingestion request with DTO: {dto}")
185-197: LGTM! Consistent implementation with other webhook endpoints.
The new transcription ingestion webhook endpoint follows the established pattern:
- Uses proper authentication via TokenValidator
- Runs the worker in a separate thread
- Returns 202 Accepted status code
app/pipeline/transcription_ingestion_pipeline.py (1)
39-39: LGTM! Good reuse of existing lock.
Using the shared batch_update_lock from the FAQ ingestion pipeline ensures consistent concurrency control across all ingestion pipelines.
Actionable comments posted: 0
♻️ Duplicate comments (1)
app/pipeline/transcription_ingestion_pipeline.py (1)
223-235: 🛠️ Refactor suggestion: Optimize prompt template creation.
Creating a new prompt template for each chunk is inefficient. This issue was previously identified and should be addressed.
 def summarize_chunks(self, chunks):
+    self.prompt = ChatPromptTemplate.from_messages(
+        [
+            (
+                "system",
+                lambda chunk: transcription_summary_prompt(
+                    chunk[LectureTranscriptionSchema.LECTURE_NAME.value],
+                    chunk[LectureTranscriptionSchema.SEGMENT_TEXT.value],
+                ),
+            ),
+        ]
+    )
     chunks_with_summaries = []
     for chunk in chunks:
-        self.prompt = ChatPromptTemplate.from_messages(
-            [
-                (
-                    "system",
-                    transcription_summary_prompt(
-                        chunk[LectureTranscriptionSchema.LECTURE_NAME.value],
-                        chunk[LectureTranscriptionSchema.SEGMENT_TEXT.value],
-                    ),
-                ),
-            ]
-        )
-        prompt_val = self.prompt.format_messages()
-        self.prompt = ChatPromptTemplate.from_messages(prompt_val)
+        prompt_val = self.prompt.format_messages(chunk=chunk)
🧹 Nitpick comments (5)
app/pipeline/transcription_ingestion_pipeline.py (5)
57-57: Consider making the embedding model configurable.
The embedding model is hardcoded to "embedding-small". For better flexibility and future-proofing, consider making this configurable through the constructor or environment variables.
-    self.llm_embedding = BasicRequestHandler("embedding-small")
+    self.llm_embedding = BasicRequestHandler(
+        self.config.get("EMBEDDING_MODEL", "embedding-small")
+    )
95-96: Combine nested with statements for better readability.
Multiple with statements can be combined using a comma.
-    with batch_insert_lock:
-        with self.collection.batch.rate_limit(requests_per_minute=600) as batch:
+    with batch_insert_lock, self.collection.batch.rate_limit(requests_per_minute=600) as batch:
🧰 Tools
🪛 Ruff (0.8.2)
95-96: Use a single with statement with multiple contexts instead of nested with statements (SIM117)
96-96: Consider making the rate limit configurable.
The rate limit of 600 requests per minute is hardcoded. Consider making this configurable to allow for different environments and requirements.
-    with self.collection.batch.rate_limit(requests_per_minute=600) as batch:
+    with self.collection.batch.rate_limit(
+        requests_per_minute=self.config.get("BATCH_REQUESTS_PER_MINUTE", 600)
+    ) as batch:
🧰 Tools
🪛 Ruff (0.8.2)
95-96: Use a single with statement with multiple contexts instead of nested with statements (SIM117)
146-146: Extract magic number into a named constant.
The chunk size threshold of 1200 should be extracted into a named constant for better maintainability.
+    MAX_CHUNK_SIZE = 1200
+
     def chunk_transcriptions(
         self, transcriptions: List[TranscriptionWebhookDTO]
     ) -> List[Dict[str, Any]]:
         # ...
-        if len(segment[LectureTranscriptionSchema.SEGMENT_TEXT.value]) < 1200:
+        if len(segment[LectureTranscriptionSchema.SEGMENT_TEXT.value]) < self.MAX_CHUNK_SIZE:
162-172: Add documentation for offset calculations.
The offset calculations are complex and would benefit from documentation explaining their purpose and logic.
 for j, chunk in enumerate(semantic_chunks):
+    # Calculate the total length of text in previous slides
+    # This is used to maintain correct character position mapping
+    # when splitting text into semantic chunks
     offset_slide_chunk = reduce(
         lambda acc, txt: acc + len(txt.replace(CHUNK_SEPARATOR_CHAR, "")),
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
app/pipeline/prompts/transcription_ingestion_prompts.py (1 hunks)
app/pipeline/transcription_ingestion_pipeline.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- app/pipeline/prompts/transcription_ingestion_prompts.py
🧰 Additional context used
🪛 Ruff (0.8.2)
app/pipeline/transcription_ingestion_pipeline.py
95-96: Use a single with statement with multiple contexts instead of nested with statements (SIM117)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build and Push Docker Image
🔇 Additional comments (2)
app/pipeline/transcription_ingestion_pipeline.py (2)
38-38: LGTM! Good job reusing the existing lock.
The reuse of batch_update_lock from the FAQ ingestion pipeline ensures consistent locking behavior across different ingestion pipelines, preventing potential concurrency issues.
73-92: LGTM! Well-structured pipeline execution with proper error handling.
The implementation includes:
- Clear progress updates through callback
- Comprehensive error handling with logging
- Token tracking for billing/monitoring
…ls1intum/Pyris into feature/transcriptions-webhook-endpoint
Actionable comments posted: 0
🧹 Nitpick comments (6)
app/pipeline/transcription_ingestion_pipeline.py (3)
95-112: Combine nested with statements for better readability.
The nested with statements can be combined into a single statement. Apply this diff to improve readability:
-    with batch_insert_lock:
-        with self.collection.batch.rate_limit(requests_per_minute=600) as batch:
+    with batch_insert_lock, self.collection.batch.rate_limit(requests_per_minute=600) as batch:
🧰 Tools
🪛 Ruff (0.8.2)
97-98: Use a single with statement with multiple contexts instead of nested with statements (SIM117)
175-175: Rename unused loop variable.
The loop variable j is not used within the loop body. Apply this diff to improve code clarity:
-    for j, chunk in enumerate(semantic_chunks):
+    for _j, chunk in enumerate(semantic_chunks):
🧰 Tools
🪛 Ruff (0.8.2)
175-175: Loop control variable j not used within loop body. Rename unused j to _j (B007)
199-214: Add docstrings to helper methods.
The helper methods would benefit from docstrings explaining their purpose, parameters, and return values.
Consider adding docstrings like this:
 @staticmethod
 def get_transcription_segment_of_char_position(
     char_position: int, segments: List[TranscriptionSegmentDTO]
 ):
+    """
+    Find the transcription segment containing the given character position.
+
+    Args:
+        char_position: The character position to look up
+        segments: List of transcription segments to search
+
+    Returns:
+        The segment containing the character position, or the last segment if not found
+    """
Also applies to: 215-225
app/llm/external/openai_embeddings.py (2)
3-3: Minor formatting change recognized.
The added blank line doesn't affect functionality. Feel free to keep it for readability or remove if unnecessary.
48-65: Consider adding a docstring and handling errors in semantic splitting.
While the implementation is straightforward, you may want to add a docstring describing the parameters and return value. Additionally, consider how this method should behave if SemanticChunker fails or if text is excessively large or empty.
app/llm/request_handler/basic_request_handler.py (1)
49-63: Suggest adding docstring and potential error handling.
Documenting the new split_text_semantically method would help clarify its purpose, parameters, and return type. In addition, consider how to handle unexpected values (e.g., an invalid breakpoint_threshold_type) or empty text inputs.
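A sketch of what a documented, guarded version could look like, covering both this comment and the openai_embeddings.py one above. The parameter defaults and the standalone-function shape are assumptions (the real code lives as methods on the embedding model and request handler); the accepted threshold types come from langchain_experimental's SemanticChunker.

from typing import List, Literal

from langchain_experimental.text_splitter import SemanticChunker

BreakpointThresholdType = Literal[
    "percentile", "standard_deviation", "interquartile", "gradient"
]


def split_text_semantically(
    embeddings,
    text: str,
    breakpoint_threshold_type: BreakpointThresholdType = "percentile",
    breakpoint_threshold_amount: float = 95.0,
) -> List[str]:
    """Split text into semantically coherent chunks.

    Args:
        embeddings: A LangChain-compatible embeddings object used by the chunker.
        text: Input text; empty or whitespace-only input yields an empty list.
        breakpoint_threshold_type: Strategy SemanticChunker uses to place breakpoints.
        breakpoint_threshold_amount: Threshold value for the chosen strategy.

    Returns:
        A list of chunk strings (possibly empty).
    """
    if not text or not text.strip():
        return []  # guard against empty input instead of calling the chunker
    chunker = SemanticChunker(
        embeddings,
        breakpoint_threshold_type=breakpoint_threshold_type,
        breakpoint_threshold_amount=breakpoint_threshold_amount,
    )
    return chunker.split_text(text)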
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
app/llm/external/openai_embeddings.py (4 hunks)
app/llm/request_handler/basic_request_handler.py (2 hunks)
app/pipeline/transcription_ingestion_pipeline.py (1 hunks)
requirements.txt (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- requirements.txt
🧰 Additional context used
🪛 Ruff (0.8.2)
app/pipeline/transcription_ingestion_pipeline.py
97-98: Use a single with statement with multiple contexts instead of nested with statements (SIM117)
175-175: Loop control variable j not used within loop body. Rename unused j to _j (B007)
🔇 Additional comments (9)
app/pipeline/transcription_ingestion_pipeline.py (3)
37-37: LGTM! Good reuse of the FAQ ingestion lock.
Reusing the batch_update_lock from the FAQ ingestion pipeline ensures consistent locking behavior across different ingestion pipelines, preventing potential concurrency issues.
47-73: LGTM! Well-structured initialization with appropriate LLM configuration.
The initialization is thorough with:
- Proper vector database setup
- Well-configured LLM with appropriate requirements (GPT-4.5, 16K context, privacy compliance)
- Suitable completion arguments (temperature=0 for deterministic output)
74-94: LGTM! Robust pipeline execution with comprehensive error handling.
The execution flow is well-structured with:
- Clear progress reporting via callbacks
- Proper error handling with detailed error messages
- Logical progression of operations (chunk -> summarize -> insert)
app/llm/external/openai_embeddings.py (5)
4-4: Verify compatibility of langchain_experimental.text_splitter.
Ensure that the installed langchain_experimental version supports SemanticChunker and that it behaves as expected, especially in a production environment.
11-11: New imports look good.
These updated imports align with the switch to OpenAIEmbeddings and AzureOpenAIEmbeddings. No immediate issues detected.
20-20: Clear type annotation for _client.
Declaring _client as OpenAIEmbeddings makes the code more readable and consistent with the new approach.
71-71: Client initialization is consistent with updated type.
The updated _client instantiation matches the revised embedding model approach. Looks good.
84-84: Azure embeddings initialization aligns with the new structure.
Everything matches the shift to Azure-specific embeddings. No issues here.
app/llm/request_handler/basic_request_handler.py (1)
1-1: Good addition of Literal for typed parameters.
This helps enforce type safety and improves readability of parameter usage.
Code lgtm
Add Lecture Transcription Ingestion Pipeline
Summary by CodeRabbit
New Features
Dependencies