adding my repo "langchain-text-chunker" #144
base: main
Conversation
Hi, I wanted to contribute to this valuable collection with my own repo. Built with Gradio, it lets users effortlessly upload various document types, extract their raw text content, and then apply a diverse set of LangChain text splitting (chunking) methods. Hope you find it useful.
Walkthrough: New files have been added to the repository.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant GradioUI
    participant AppLogic
    participant LangChainSplitter
    User->>GradioUI: Upload file & set chunking parameters
    GradioUI->>AppLogic: Pass file and parameters
    AppLogic->>AppLogic: Extract text from file
    AppLogic->>LangChainSplitter: Apply chunking methods (various)
    LangChainSplitter-->>AppLogic: Return chunked text and metadata
    AppLogic->>GradioUI: Return raw text, chunks, and code examples
    GradioUI->>User: Display results in UI tabs
```
Actionable comments posted: 3
🧹 Nitpick comments (4)
langchain-text-chunker/requirements.txt (1)

1-7: Consider pinning all dependency versions for better reproducibility. While only gradio is version-pinned, unpinned dependencies can lead to compatibility issues and make builds non-reproducible across different environments. Consider pinning major versions for all dependencies:

```diff
 gradio==5.33.2
-langchain
-langchain_community
-pypdf
-python-docx
-nbformat
-unstructured
+langchain>=0.1.0,<1.0.0
+langchain_community>=0.0.1,<1.0.0
+pypdf>=3.0.0,<4.0.0
+python-docx>=0.8.11,<2.0.0
+nbformat>=5.7.0,<6.0.0
+unstructured>=0.10.0,<1.0.0
```

langchain-text-chunker/app.py (1)
123-126: Apply suggested code simplification. Static analysis correctly identified that this can be simplified with a ternary operator:

```diff
-    if isinstance(separator, list):
-        separator_str = "".join(separator)
-    else:
-        separator_str = separator
+    separator_str = "".join(separator) if isinstance(separator, list) else separator
```

langchain-text-chunker/README.md (2)
13-34: Fix markdown list indentation for consistency. The nested list items have inconsistent indentation, which affects readability and markdown parsing.

```diff
 * **Multi-Document Type Support**: Seamlessly process text from a wide range of document formats, including:
-    * PDF (`.pdf`)
-    * Microsoft Word (`.docx`)
-    * Plain Text (`.txt`)
-    * HTML (`.html`)
-    * CSS (`.css`)
-    * Python Code (`.py`)
-    * Jupyter Notebooks (`.ipynb`)
-    * CSV (`.csv`)
+  * PDF (`.pdf`)
+  * Microsoft Word (`.docx`)
+  * Plain Text (`.txt`)
+  * HTML (`.html`)
+  * CSS (`.css`)
+  * Python Code (`.py`)
+  * Jupyter Notebooks (`.ipynb`)
+  * CSV (`.csv`)
 * **Diverse Chunking Strategies**: Explore and compare the output of various LangChain text splitters:
-    * **Recursive Character Text Splitter**: Ideal for general-purpose text, attempting to split on a list of characters in order.
-    * **Character Text Splitter**: Splits text based on a single, user-defined separator.
-    * **Markdown Text Splitter**: Specifically designed to understand and preserve the structure of Markdown documents.
-    * **Python Code Text Splitter**: Optimized for splitting Python source code while maintaining syntactical integrity.
-    * **JavaScript Code Text Splitter**: Utilizes language-specific rules to chunk JavaScript code effectively.
+  * **Recursive Character Text Splitter**: Ideal for general-purpose text, attempting to split on a list of characters in order.
+  * **Character Text Splitter**: Splits text based on a single, user-defined separator.
+  * **Markdown Text Splitter**: Specifically designed to understand and preserve the structure of Markdown documents.
+  * **Python Code Text Splitter**: Optimized for splitting Python source code while maintaining syntactical integrity.
+  * **JavaScript Code Text Splitter**: Utilizes language-specific rules to chunk JavaScript code effectively.
```
101-101: Fix typo in screenshot description.

```diff
-*Chunking results of Recursice Chunking Method.*
+*Chunking results of Recursive Chunking Method.*
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
langchain-text-chunker/assets/1.JPG is excluded by !**/*.jpg
langchain-text-chunker/assets/2.JPG is excluded by !**/*.jpg
📒 Files selected for processing (4)
langchain-text-chunker/LICENSE (1 hunks)
langchain-text-chunker/README.md (1 hunks)
langchain-text-chunker/app.py (1 hunks)
langchain-text-chunker/requirements.txt (1 hunks)
🧰 Additional context used
🪛 LanguageTool
langchain-text-chunker/README.md
[uncategorized] ~30-~30: Loose punctuation mark. Context: "...nerated chunks. * Chunk Overlap: Specify the number of characters that o..." (UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~31-~31: Loose punctuation mark. Context: "... * Character Splitter Separator: Choose custom separators for the Charac..." (UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~32-~32: Loose punctuation mark. Context: "...unking method. * Keep Separator: Control whether the separator is includ..." (UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~33-~33: Loose punctuation mark. Context: "... * Add Start Index to Metadata: Option to include the starting characte..." (UNLIKELY_OPENING_PUNCTUATION)

[uncategorized] ~34-~34: Loose punctuation mark. Context: "...ts metadata. * Strip Whitespace: Automatically remove leading/trailing w..." (UNLIKELY_OPENING_PUNCTUATION)

[style] ~37-~37: Consider a different adjective to strengthen your wording. Context: "...o experiment with text chunking without deep programming knowledge. Installati..." (DEEP_PROFOUND)
🪛 markdownlint-cli2 (0.17.2)
langchain-text-chunker/README.md
Lines 14-21, 23-27, 29-34: Unordered list indentation. Expected: 2; Actual: 4 (MD007, ul-indent)
🪛 Ruff (0.11.9)
langchain-text-chunker/app.py
123-126: Use ternary operator `separator_str = "".join(separator) if isinstance(separator, list) else separator` instead of an if-else block (SIM108)
🪛 Pylint (3.3.7)
langchain-text-chunker/app.py
[refactor] 12-12: Too many local variables (16/15)
(R0914)
[refactor] 12-12: Too many branches (15/12)
(R0912)
[refactor] 12-12: Too many statements (56/50)
(R0915)
[refactor] 82-82: Too many arguments (6/5)
(R0913)
[refactor] 82-82: Too many positional arguments (6/5)
(R0917)
[refactor] 119-119: Too many arguments (7/5)
(R0913)
[refactor] 119-119: Too many positional arguments (7/5)
(R0917)
[refactor] 164-164: Too many arguments (6/5)
(R0913)
[refactor] 164-164: Too many positional arguments (6/5)
(R0917)
[refactor] 199-199: Too many arguments (6/5)
(R0913)
[refactor] 199-199: Too many positional arguments (6/5)
(R0917)
[refactor] 236-236: Too many arguments (6/5)
(R0913)
[refactor] 236-236: Too many positional arguments (6/5)
(R0917)
[refactor] 273-273: Too many arguments (7/5)
(R0913)
[refactor] 273-273: Too many positional arguments (7/5)
(R0917)
[refactor] 273-273: Too many local variables (19/15)
(R0914)
🔇 Additional comments (5)
langchain-text-chunker/LICENSE (1)
1-21: Standard MIT License properly formatted. The license file is correctly structured with appropriate copyright attribution.
langchain-text-chunker/app.py (3)
1-9: Import organization looks good. The imports are well organized, with clear separation between standard libraries, third-party packages, and LangChain modules.
311-416: Well-structured Gradio interface with good UX design. The interface is thoughtfully designed with:
- Clear parameter controls in an accordion
- Tabbed output for different chunking methods
- Helpful tooltips and descriptions
- Copy functionality for extracted text
3-3: Consider security implications of PyPDF2. PyPDF2 has known security vulnerabilities and is no longer actively maintained. Consider migrating to a more secure alternative.
What are the current security status and recommended alternatives to PyPDF2?
langchain-text-chunker/README.md (1)
1-118: Comprehensive and well-structured documentation. The README provides excellent coverage of features, installation, usage, and contribution guidelines. The documentation is clear and user-friendly.
```python
def chunk_recursive(text, chunk_size, chunk_overlap, keep_separator, add_start_index, strip_whitespace):
    if not text:
        return [], ""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        keep_separator=keep_separator,
        add_start_index=add_start_index,
        strip_whitespace=strip_whitespace,
    )
    chunks = text_splitter.create_documents([text])
    formatted_chunks = []
    for chunk in chunks:
        if isinstance(chunk, Document):
            formatted_chunks.append({"content": chunk.page_content, "metadata": chunk.metadata})
        else:
            formatted_chunks.append({"content": str(chunk), "metadata": {}})

    code_example = f"""
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_content = \"\"\"{text[:50]}...\"\"\"  # Truncated for example

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size={chunk_size},
    chunk_overlap={chunk_overlap},
    length_function=len,
    keep_separator={keep_separator},
    add_start_index={add_start_index},
    strip_whitespace={strip_whitespace},
)
chunks = text_splitter.create_documents([text_content])
# Access chunks: chunks[0].page_content, chunks[0].metadata
"""
    return formatted_chunks, code_example
```
🛠️ Refactor suggestion
Reduce function complexity by extracting common code patterns.
The chunking functions share significant code duplication in formatting chunks and generating code examples.
Extract common functionality:
+def format_chunks_and_code(chunks, text, splitter_class, splitter_params):
+ """Common function to format chunks and generate code examples."""
+ formatted_chunks = []
+ for chunk in chunks:
+ if isinstance(chunk, Document):
+ formatted_chunks.append({"content": chunk.page_content, "metadata": chunk.metadata})
+ else:
+ formatted_chunks.append({"content": str(chunk), "metadata": {}})
+
+ # Generate code example
+ params_str = ",\n ".join([f"{k}={v}" for k, v in splitter_params.items()])
+ code_example = f"""
+from langchain.text_splitter import {splitter_class}
+
+text_content = \"\"\"{text[:50]}...\"\"\" # Truncated for example
+
+text_splitter = {splitter_class}(
+ {params_str}
+)
+chunks = text_splitter.create_documents([text_content])
+# Access chunks: chunks[0].page_content, chunks[0].metadata
+"""
+ return formatted_chunks, code_example
+
def chunk_recursive(text, chunk_size, chunk_overlap, keep_separator, add_start_index, strip_whitespace):
if not text:
return [], ""
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
length_function=len,
keep_separator=keep_separator,
add_start_index=add_start_index,
strip_whitespace=strip_whitespace,
)
chunks = text_splitter.create_documents([text])
- formatted_chunks = []
- for chunk in chunks:
- if isinstance(chunk, Document):
- formatted_chunks.append({"content": chunk.page_content, "metadata": chunk.metadata})
- else:
- formatted_chunks.append({"content": str(chunk), "metadata": {}})
-
- code_example = f"""
-from langchain.text_splitter import RecursiveCharacterTextSplitter
-
-text_content = \"\"\"{text[:50]}...\"\"\" # Truncated for example
-
-text_splitter = RecursiveCharacterTextSplitter(
- chunk_size={chunk_size},
- chunk_overlap={chunk_overlap},
- length_function=len,
- keep_separator={keep_separator},
- add_start_index={add_start_index},
- strip_whitespace={strip_whitespace},
-)
-chunks = text_splitter.create_documents([text_content])
-# Access chunks: chunks[0].page_content, chunks[0].metadata
-"""
- return formatted_chunks, code_example
+
+ params = {
+ "chunk_size": chunk_size,
+ "chunk_overlap": chunk_overlap,
+ "length_function": "len",
+ "keep_separator": keep_separator,
+ "add_start_index": add_start_index,
+ "strip_whitespace": strip_whitespace
+ }
+ return format_chunks_and_code(chunks, text, "RecursiveCharacterTextSplitter", params)
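The proposed helper can be exercised in isolation with plain strings and a stub `Document` class (so no LangChain install is needed); this sketch re-states the reviewer's `format_chunks_and_code` with a simplified code-example template to keep it self-contained:

```python
# Stub standing in for langchain's Document, just for the demo.
class Document:
    def __init__(self, page_content, metadata=None):
        self.page_content = page_content
        self.metadata = metadata or {}


def format_chunks_and_code(chunks, text, splitter_class, splitter_params):
    """Format chunks into dicts and generate a code-example string."""
    formatted_chunks = []
    for chunk in chunks:
        if isinstance(chunk, Document):
            formatted_chunks.append({"content": chunk.page_content, "metadata": chunk.metadata})
        else:
            # Non-Document chunks fall back to str() with empty metadata.
            formatted_chunks.append({"content": str(chunk), "metadata": {}})

    params_str = ",\n    ".join(f"{k}={v}" for k, v in splitter_params.items())
    code_example = (
        f"from langchain.text_splitter import {splitter_class}\n"
        f"text_splitter = {splitter_class}(\n    {params_str}\n)"
    )
    return formatted_chunks, code_example


chunks, code = format_chunks_and_code(
    [Document("hello", {"start_index": 0}), "plain"],
    "hello plain",
    "RecursiveCharacterTextSplitter",
    {"chunk_size": 100, "chunk_overlap": 0},
)
assert chunks[0] == {"content": "hello", "metadata": {"start_index": 0}}
assert chunks[1] == {"content": "plain", "metadata": {}}
assert "chunk_size=100" in code
```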
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def format_chunks_and_code(chunks, text, splitter_class, splitter_params):
    """Common function to format chunks and generate code examples."""
    formatted_chunks = []
    for chunk in chunks:
        if isinstance(chunk, Document):
            formatted_chunks.append({"content": chunk.page_content, "metadata": chunk.metadata})
        else:
            formatted_chunks.append({"content": str(chunk), "metadata": {}})

    # Generate code example
    params_str = ",\n    ".join([f"{k}={v}" for k, v in splitter_params.items()])
    code_example = f"""
from langchain.text_splitter import {splitter_class}

text_content = \"\"\"{text[:50]}...\"\"\"  # Truncated for example

text_splitter = {splitter_class}(
    {params_str}
)
chunks = text_splitter.create_documents([text_content])
# Access chunks: chunks[0].page_content, chunks[0].metadata
"""
    return formatted_chunks, code_example


def chunk_recursive(text, chunk_size, chunk_overlap, keep_separator, add_start_index, strip_whitespace):
    if not text:
        return [], ""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        keep_separator=keep_separator,
        add_start_index=add_start_index,
        strip_whitespace=strip_whitespace,
    )
    chunks = text_splitter.create_documents([text])
    params = {
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
        "length_function": "len",
        "keep_separator": keep_separator,
        "add_start_index": add_start_index,
        "strip_whitespace": strip_whitespace,
    }
    return format_chunks_and_code(chunks, text, "RecursiveCharacterTextSplitter", params)
```
🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 82-82: Too many arguments (6/5)
(R0913)
[refactor] 82-82: Too many positional arguments (6/5)
(R0917)
🤖 Prompt for AI Agents
In langchain-text-chunker/app.py between lines 82 and 117, the chunk_recursive
function duplicates code for formatting chunks and generating code examples. To
fix this, extract the repeated logic for formatting chunk objects into a
separate helper function that takes chunks and returns formatted_chunks.
Similarly, create a utility function to generate the code example string based
on input parameters. Then, update chunk_recursive to call these helper
functions, reducing code duplication and improving maintainability.
```python
def main_interface(uploaded_file, chunk_size, chunk_overlap, separator, keep_separator, add_start_index, strip_whitespace):
    if uploaded_file is None:
        return "", "", [], [], [], [], [], "", "", "", "", "", "", "", "", "", "", ""

    # Ensure chunk_size and chunk_overlap are integers
    chunk_size = int(chunk_size)
    chunk_overlap = int(chunk_overlap)

    raw_text, display_content = process_uploaded_file(uploaded_file)

    recursive_chunks, recursive_code = chunk_recursive(raw_text, chunk_size, chunk_overlap, keep_separator, add_start_index, strip_whitespace)
    character_chunks, character_code = chunk_character(raw_text, chunk_size, chunk_overlap, separator, keep_separator, add_start_index, strip_whitespace)
    markdown_chunks, markdown_code = chunk_markdown(raw_text, chunk_size, chunk_overlap, keep_separator, add_start_index, strip_whitespace)
    python_chunks, python_code = chunk_python_code(raw_text, chunk_size, chunk_overlap, keep_separator, add_start_index, strip_whitespace)
    javascript_chunks, javascript_code = chunk_javascript_code(raw_text, chunk_size, chunk_overlap, keep_separator, add_start_index, strip_whitespace)

    return (
        display_content,
        raw_text,
        recursive_chunks,
        character_chunks,
        markdown_chunks,
        python_chunks,
        javascript_chunks,
        f"Number of chunks: {len(recursive_chunks)}",
        f"Number of chunks: {len(character_chunks)}",
        f"Number of chunks: {len(markdown_chunks)}",
        f"Number of chunks: {len(python_chunks)}",
        f"Number of chunks: {len(javascript_chunks)}",
        recursive_code,
        character_code,
        markdown_code,
        python_code,
        javascript_code,
    )
```
🛠️ Refactor suggestion
Simplify the main interface function return statement.
The function returns 18 values, which makes it hard to maintain and understand. Consider using a data structure.
+from dataclasses import dataclass
+from typing import List, Dict, Any
+
+@dataclass
+class ChunkingResult:
+ display_content: str
+ raw_text: str
+ recursive_chunks: List[Dict[str, Any]]
+ character_chunks: List[Dict[str, Any]]
+ markdown_chunks: List[Dict[str, Any]]
+ python_chunks: List[Dict[str, Any]]
+ javascript_chunks: List[Dict[str, Any]]
+ recursive_count: str
+ character_count: str
+ markdown_count: str
+ python_count: str
+ javascript_count: str
+ recursive_code: str
+ character_code: str
+ markdown_code: str
+ python_code: str
+ javascript_code: str
def main_interface(uploaded_file, chunk_size, chunk_overlap, separator, keep_separator, add_start_index, strip_whitespace):
if uploaded_file is None:
- return "", "", [], [], [], [], [], "", "", "", "", "", "", "", "", "", "", ""
+ return ChunkingResult("", "", [], [], [], [], [], "", "", "", "", "", "", "", "", "", "", "")
# ... existing processing code ...
- return (
- display_content,
- raw_text,
- recursive_chunks,
- character_chunks,
- markdown_chunks,
- python_chunks,
- javascript_chunks,
- f"Number of chunks: {len(recursive_chunks)}",
- f"Number of chunks: {len(character_chunks)}",
- f"Number of chunks: {len(markdown_chunks)}",
- f"Number of chunks: {len(python_chunks)}",
- f"Number of chunks: {len(javascript_chunks)}",
- recursive_code,
- character_code,
- markdown_code,
- python_code,
- javascript_code
- )
+ return ChunkingResult(
+ display_content=display_content,
+ raw_text=raw_text,
+ recursive_chunks=recursive_chunks,
+ character_chunks=character_chunks,
+ markdown_chunks=markdown_chunks,
+ python_chunks=python_chunks,
+ javascript_chunks=javascript_chunks,
+ recursive_count=f"Number of chunks: {len(recursive_chunks)}",
+ character_count=f"Number of chunks: {len(character_chunks)}",
+ markdown_count=f"Number of chunks: {len(markdown_chunks)}",
+ python_count=f"Number of chunks: {len(python_chunks)}",
+ javascript_count=f"Number of chunks: {len(javascript_chunks)}",
+ recursive_code=recursive_code,
+ character_code=character_code,
+ markdown_code=markdown_code,
+ python_code=python_code,
+ javascript_code=javascript_code
+ )
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
from dataclasses import dataclass
from typing import List, Dict, Any


@dataclass
class ChunkingResult:
    display_content: str
    raw_text: str
    recursive_chunks: List[Dict[str, Any]]
    character_chunks: List[Dict[str, Any]]
    markdown_chunks: List[Dict[str, Any]]
    python_chunks: List[Dict[str, Any]]
    javascript_chunks: List[Dict[str, Any]]
    recursive_count: str
    character_count: str
    markdown_count: str
    python_count: str
    javascript_count: str
    recursive_code: str
    character_code: str
    markdown_code: str
    python_code: str
    javascript_code: str


def main_interface(uploaded_file, chunk_size, chunk_overlap, separator, keep_separator, add_start_index, strip_whitespace):
    if uploaded_file is None:
        return ChunkingResult(
            "", "", [], [], [], [], [],
            "", "", "", "", "",
            "", "", "", "", "",
        )

    # Ensure chunk_size and chunk_overlap are integers
    chunk_size = int(chunk_size)
    chunk_overlap = int(chunk_overlap)

    raw_text, display_content = process_uploaded_file(uploaded_file)

    recursive_chunks, recursive_code = chunk_recursive(
        raw_text, chunk_size, chunk_overlap,
        keep_separator, add_start_index, strip_whitespace,
    )
    character_chunks, character_code = chunk_character(
        raw_text, chunk_size, chunk_overlap, separator,
        keep_separator, add_start_index, strip_whitespace,
    )
    markdown_chunks, markdown_code = chunk_markdown(
        raw_text, chunk_size, chunk_overlap,
        keep_separator, add_start_index, strip_whitespace,
    )
    python_chunks, python_code = chunk_python_code(
        raw_text, chunk_size, chunk_overlap,
        keep_separator, add_start_index, strip_whitespace,
    )
    javascript_chunks, javascript_code = chunk_javascript_code(
        raw_text, chunk_size, chunk_overlap,
        keep_separator, add_start_index, strip_whitespace,
    )

    return ChunkingResult(
        display_content=display_content,
        raw_text=raw_text,
        recursive_chunks=recursive_chunks,
        character_chunks=character_chunks,
        markdown_chunks=markdown_chunks,
        python_chunks=python_chunks,
        javascript_chunks=javascript_chunks,
        recursive_count=f"Number of chunks: {len(recursive_chunks)}",
        character_count=f"Number of chunks: {len(character_chunks)}",
        markdown_count=f"Number of chunks: {len(markdown_chunks)}",
        python_count=f"Number of chunks: {len(python_chunks)}",
        javascript_count=f"Number of chunks: {len(javascript_chunks)}",
        recursive_code=recursive_code,
        character_code=character_code,
        markdown_code=markdown_code,
        python_code=python_code,
        javascript_code=javascript_code,
    )
```
🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 273-273: Too many arguments (7/5)
(R0913)
[refactor] 273-273: Too many positional arguments (7/5)
(R0917)
[refactor] 273-273: Too many local variables (19/15)
(R0914)
🤖 Prompt for AI Agents
In langchain-text-chunker/app.py around lines 273 to 307, the main_interface
function returns 18 separate values, making it difficult to maintain and
understand. Refactor the return statement to return a single dictionary or a
custom data class containing all these values as named fields. This will
simplify the interface and improve code readability and maintainability.
````python
def process_uploaded_file(uploaded_file):
    text = ""
    display_content = ""
    file_extension = uploaded_file.name.split(".")[-1]

    if file_extension == "pdf":
        try:
            # Gradio's uploaded_file.name provides the path to the temporary file
            pdf = PdfReader(uploaded_file.name)
            for page in pdf.pages:
                page_text = page.extract_text()
                text += page_text + "\n"
                display_content += page_text + "\n"
        except Exception as e:
            display_content = f"Error reading PDF file: {e}"
            text = ""

    elif file_extension == "docx":
        try:
            docx_loader = Docx2txtLoader(uploaded_file.name)
            documents = docx_loader.load()
            text = "\n".join([doc.page_content for doc in documents])
            display_content = text
        except Exception as e:
            display_content = f"Error reading DOCX file: {e}"
            text = ""

    elif file_extension in ["html", "css", "py", "txt"]:
        try:
            with open(uploaded_file.name, "r", encoding="utf-8") as f:
                file_content = f.read()
            display_content = file_content  # Display as plain text in Textbox
            text = file_content
        except Exception as e:
            display_content = f"Error reading {file_extension.upper()} file: {e}"
            text = ""

    elif file_extension == "ipynb":
        try:
            # nbformat.read can take a file path
            nb_content = nbformat.read(uploaded_file.name, as_version=4)
            nb_filtered = [cell for cell in nb_content["cells"] if cell["cell_type"] in ["code", "markdown"]]

            for cell in nb_filtered:
                if cell["cell_type"] == "code":
                    display_content += f"```python\n{cell['source']}\n```\n"
                    text += cell["source"] + "\n"
                elif cell["cell_type"] == "markdown":
                    display_content += f"{cell['source']}\n"
                    text += cell["source"] + "\n"
        except Exception as e:
            display_content = f"Error reading IPYNB file: {e}"
            text = ""

    elif file_extension == "csv":
        try:
            loader = CSVLoader(file_path=uploaded_file.name, encoding="utf-8", csv_args={'delimiter': ','})
            documents = loader.load()
            text = "\n".join([doc.page_content for doc in documents])
            display_content = text  # For CSV, display the concatenated text
        except Exception as e:
            display_content = f"Error reading CSV file: {e}"
            text = ""
    else:
        display_content = "Unsupported file type."
        text = ""

    return text, display_content
````
🛠️ Refactor suggestion
Improve error handling and security in the file processing function.
The function handles multiple file types but has several areas for improvement:
- Security Risk: The function directly opens files without validation, which could be exploited with malicious file paths.
- Error Handling: Generic exception handling masks specific error types.
- Code Complexity: The function is too long and handles too many responsibilities.
Apply these security and structure improvements:
```diff
 def process_uploaded_file(uploaded_file):
+    """Process uploaded file and extract text content safely."""
+    if not uploaded_file or not hasattr(uploaded_file, 'name'):
+        return "", "No file provided"
+
     text = ""
     display_content = ""
-    file_extension = uploaded_file.name.split(".")[-1]
+    file_extension = uploaded_file.name.split(".")[-1].lower()
+
+    # Validate file extension
+    allowed_extensions = {"pdf", "docx", "html", "css", "py", "txt", "ipynb", "csv"}
+    if file_extension not in allowed_extensions:
+        return "", f"Unsupported file type: {file_extension}"
     if file_extension == "pdf":
         try:
-            # Gradio's uploaded_file.name provides the path to the temporary file
-            pdf = PdfReader(uploaded_file.name)
+            pdf = PdfReader(uploaded_file.name)
             for page in pdf.pages:
                 page_text = page.extract_text()
                 text += page_text + "\n"
                 display_content += page_text + "\n"
-        except Exception as e:
+        except (FileNotFoundError, PermissionError) as e:
+            display_content = f"File access error: {e}"
+            text = ""
+        except Exception as e:
             display_content = f"Error reading PDF file: {e}"
             text = ""
```
Consider splitting this function into smaller, specialized functions for each file type.
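One way to act on that suggestion — a sketch only, with illustrative handler names and a dispatch table that are not part of the PR — is to register one small handler per extension and keep the error handling in a single place:

```python
from pathlib import Path

def read_plain_text(path: str) -> tuple[str, str]:
    # Handler for txt/py/html/css: the raw text doubles as display content.
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()
    return content, content

# Map each supported extension to its specialized handler.
# PDF, DOCX, IPYNB, and CSV handlers would be registered the same way.
HANDLERS = {
    "txt": read_plain_text,
    "py": read_plain_text,
    "html": read_plain_text,
    "css": read_plain_text,
}

def process_uploaded_path(path: str) -> tuple[str, str]:
    ext = Path(path).suffix.lstrip(".").lower()
    handler = HANDLERS.get(ext)
    if handler is None:
        return "", f"Unsupported file type: {ext}"
    try:
        return handler(path)
    except (FileNotFoundError, PermissionError) as e:
        return "", f"File access error: {e}"
    except (UnicodeDecodeError, ValueError) as e:
        return "", f"Error reading {ext.upper()} file: {e}"
```

This keeps each branch short enough to satisfy Pylint's branch/statement limits and makes adding a new file type a one-line registration.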
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
````python
def process_uploaded_file(uploaded_file):
    text = ""
    display_content = ""
    file_extension = uploaded_file.name.split(".")[-1]
    if file_extension == "pdf":
        try:
            # Gradio's uploaded_file.name provides the path to the temporary file
            pdf = PdfReader(uploaded_file.name)
            for page in pdf.pages:
                page_text = page.extract_text()
                text += page_text + "\n"
                display_content += page_text + "\n"
        except Exception as e:
            display_content = f"Error reading PDF file: {e}"
            text = ""
    elif file_extension == "docx":
        try:
            docx_loader = Docx2txtLoader(uploaded_file.name)
            documents = docx_loader.load()
            text = "\n".join([doc.page_content for doc in documents])
            display_content = text
        except Exception as e:
            display_content = f"Error reading DOCX file: {e}"
            text = ""
    elif file_extension in ["html", "css", "py", "txt"]:
        try:
            with open(uploaded_file.name, "r", encoding="utf-8") as f:
                file_content = f.read()
            display_content = file_content  # Display as plain text in Textbox
            text = file_content
        except Exception as e:
            display_content = f"Error reading {file_extension.upper()} file: {e}"
            text = ""
    elif file_extension == "ipynb":
        try:
            # nbformat.read can take a file path
            nb_content = nbformat.read(uploaded_file.name, as_version=4)
            nb_filtered = [cell for cell in nb_content["cells"] if cell["cell_type"] in ["code", "markdown"]]
            for cell in nb_filtered:
                if cell["cell_type"] == "code":
                    display_content += f"```python\n{cell['source']}\n```\n"
                    text += cell["source"] + "\n"
                elif cell["cell_type"] == "markdown":
                    display_content += f"{cell['source']}\n"
                    text += cell["source"] + "\n"
        except Exception as e:
            display_content = f"Error reading IPYNB file: {e}"
            text = ""
    elif file_extension == "csv":
        try:
            loader = CSVLoader(file_path=uploaded_file.name, encoding="utf-8", csv_args={'delimiter': ','})
            documents = loader.load()
            text = "\n".join([doc.page_content for doc in documents])
            display_content = text  # For CSV, display the concatenated text
        except Exception as e:
            display_content = f"Error reading CSV file: {e}"
            text = ""
    else:
        display_content = "Unsupported file type."
        text = ""
    return text, display_content
````
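The `.lower()` change in the revision matters because `uploaded_file.name.split(".")[-1]` is case-sensitive, so an upload named `report.PDF` falls through to the unsupported branch above. A minimal, standalone sketch of case-insensitive allowlist checking (the allowlist mirrors the one in the suggested fix; `classify_upload` is an illustrative name, not from the PR):

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {"pdf", "docx", "html", "css", "py", "txt", "ipynb", "csv"}

def classify_upload(filename: str):
    """Return the normalized extension if supported, else None."""
    # Path.suffix takes only the last suffix ("archive.tar.gz" -> ".gz")
    # and returns "" for names with no extension at all.
    ext = Path(filename).suffix.lstrip(".").lower()
    return ext if ext in ALLOWED_EXTENSIONS else None
```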
````python
def process_uploaded_file(uploaded_file):
    """Process uploaded file and extract text content safely."""
    if not uploaded_file or not hasattr(uploaded_file, "name"):
        return "", "No file provided"
    text = ""
    display_content = ""
    file_extension = uploaded_file.name.split(".")[-1].lower()
    # Validate file extension
    allowed_extensions = {"pdf", "docx", "html", "css", "py", "txt", "ipynb", "csv"}
    if file_extension not in allowed_extensions:
        return "", f"Unsupported file type: {file_extension}"
    if file_extension == "pdf":
        try:
            pdf = PdfReader(uploaded_file.name)
            for page in pdf.pages:
                page_text = page.extract_text()
                text += page_text + "\n"
                display_content += page_text + "\n"
        except (FileNotFoundError, PermissionError) as e:
            display_content = f"File access error: {e}"
            text = ""
        except Exception as e:
            display_content = f"Error reading PDF file: {e}"
            text = ""
    elif file_extension == "docx":
        try:
            docx_loader = Docx2txtLoader(uploaded_file.name)
            documents = docx_loader.load()
            text = "\n".join([doc.page_content for doc in documents])
            display_content = text
        except Exception as e:
            display_content = f"Error reading DOCX file: {e}"
            text = ""
    elif file_extension in ["html", "css", "py", "txt"]:
        try:
            with open(uploaded_file.name, "r", encoding="utf-8") as f:
                file_content = f.read()
            display_content = file_content  # Display as plain text in Textbox
            text = file_content
        except Exception as e:
            display_content = f"Error reading {file_extension.upper()} file: {e}"
            text = ""
    elif file_extension == "ipynb":
        try:
            # nbformat.read can take a file path
            nb_content = nbformat.read(uploaded_file.name, as_version=4)
            nb_filtered = [
                cell for cell in nb_content["cells"]
                if cell["cell_type"] in ["code", "markdown"]
            ]
            for cell in nb_filtered:
                if cell["cell_type"] == "code":
                    display_content += f"```python\n{cell['source']}\n```\n"
                    text += cell["source"] + "\n"
                elif cell["cell_type"] == "markdown":
                    display_content += f"{cell['source']}\n"
                    text += cell["source"] + "\n"
        except Exception as e:
            display_content = f"Error reading IPYNB file: {e}"
            text = ""
    elif file_extension == "csv":
        try:
            loader = CSVLoader(file_path=uploaded_file.name, encoding="utf-8", csv_args={'delimiter': ','})
            documents = loader.load()
            text = "\n".join([doc.page_content for doc in documents])
            display_content = text  # For CSV, display the concatenated text
        except Exception as e:
            display_content = f"Error reading CSV file: {e}"
            text = ""
    return text, display_content
````
🧰 Tools
🪛 Pylint (3.3.7)
[refactor] 12-12: Too many local variables (16/15)
(R0914)
[refactor] 12-12: Too many branches (15/12)
(R0912)
[refactor] 12-12: Too many statements (56/50)
(R0915)
🤖 Prompt for AI Agents
In langchain-text-chunker/app.py from lines 12 to 79, the process_uploaded_file
function is too long, mixes responsibilities, and uses generic exception
handling with potential security risks from unvalidated file paths. Refactor by
splitting the function into smaller, specialized functions for each file type
(e.g., process_pdf_file, process_docx_file, etc.). Add validation to ensure
uploaded_file.name is safe and sanitized before opening. Replace generic except
blocks with specific exception types to improve error handling clarity.
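The prompt's "ensure uploaded_file.name is safe and sanitized" step could look like the sketch below. The allowed-directory constraint is an assumption here (Gradio writes uploads into its own temporary directory), not something the PR implements, and `resolve_safe_path` is an illustrative name:

```python
from pathlib import Path

def resolve_safe_path(candidate: str, allowed_dir: str) -> Path:
    """Resolve candidate and require it to live under allowed_dir."""
    base = Path(allowed_dir).resolve()
    path = Path(candidate).resolve()  # collapses ".." segments and symlinks
    if not path.is_relative_to(base):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"Path escapes allowed directory: {candidate}")
    return path
```

A caller would pass the Gradio temp directory as `allowed_dir` and only hand the returned path to the file readers, so a crafted filename like `../../etc/passwd` raises instead of being opened.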
Hi,
I wanted to contribute to this valuable collection with my own repo, langchain-text-chunker. Built with Gradio, it lets users upload various document types, extract their raw text content, and then apply a diverse set of LangChain text splitting (chunking) methods.
Hope you find it useful.
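For readers unfamiliar with what the LangChain splitters in the repo do, here is a dependency-free toy sketch of the core idea behind recursive character splitting — try the coarsest separator first, then fall back to finer ones. Real projects should use `RecursiveCharacterTextSplitter` from LangChain; this version only illustrates the behaviour and ignores chunk overlap:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters,
    preferring to break on the earliest separator that works."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in (p for p in text.split(sep) if p):
        if len(piece) > chunk_size:
            # Piece is still too big: recurse with the finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, rest))
        elif len(current) + len(sep) + len(piece) <= chunk_size:
            current = piece if not current else current + sep + piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Calling `recursive_split(document_text, 500)` yields chunks that respect paragraph boundaries where possible, which is the same intuition the app's chunking tabs expose with configurable sizes.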
Summary by CodeRabbit
New Features
Documentation
Chores