[Question]: PDFReader how to specify chunk size ? #8399
Replies: 5 comments
-
🤖 Hello,

Thank you for your question. Currently, the PDFReader's load_data method does not take a chunk size parameter. Here is the method signature for reference:

def load_data(
    self, file: Path, extra_info: Optional[Dict] = None
) -> List[Document]:

In the method body, the PDF file is opened and read in its entirety without any chunking:

with open(file, "rb") as fp:
    # Create a PDF object
    pdf = pypdf.PdfReader(fp)

Therefore, there is no provision for specifying a chunk size when loading data from a PDF file in the current implementation of the PDFReader.

If you need to process the PDF file in chunks, you might need to implement this functionality yourself, for example by modifying the load_data method.

I hope this answers your question. If you have any further questions, please feel free to ask.

Best regards,
-
@xrtechspot There is not. Right now it just creates one document object per page (this helps with citations, since the page number goes into the metadata).
-
You can specify the chunk size when you actually convert the documents into nodes.
-
How do I do that?
-
Usually, just set the chunk size in the service context
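
For anyone landing here later, this is a rough, library-free sketch of what "setting the chunk size" amounts to once the per-page documents are converted into nodes. It is not the library's implementation: `PageDocument`, `split_into_chunks`, and the `page_label` metadata key are hypothetical names used only for illustration, and it counts whitespace-separated words rather than real tokens. In releases from around the time of this thread, I believe the actual setting was passed as `ServiceContext.from_defaults(chunk_size=...)`, but check the documentation for your installed version.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PageDocument:
    """Hypothetical stand-in for the per-page document a loader returns."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def split_into_chunks(
    docs: List[PageDocument], chunk_size: int = 1024, chunk_overlap: int = 20
) -> List[PageDocument]:
    """Split each page into word-based chunks that inherit the page metadata."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[PageDocument] = []
    for doc in docs:
        words = doc.text.split()
        for start in range(0, len(words), step):
            window = words[start:start + chunk_size]
            # Copy the metadata so each chunk still knows its source page.
            chunks.append(PageDocument(" ".join(window), dict(doc.metadata)))
            if start + chunk_size >= len(words):
                break  # this window already reached the end of the page
    return chunks


# Assumed input: one document per page, page number kept in the metadata.
pages = [
    PageDocument("one two three four five six seven", {"page_label": "1"}),
    PageDocument("eight nine ten", {"page_label": "2"}),
]
nodes = split_into_chunks(pages, chunk_size=4, chunk_overlap=1)
for node in nodes:
    print(node.metadata["page_label"], node.text)
```

The point of chunking after loading, rather than inside the loader, is that every chunk keeps the page metadata of the document it came from, so citations still work.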
-
Question
Is there a way to specify the chunk size to use in the loader?
from pathlib import Path
from llama_index import download_loader

PDFReader = download_loader("PDFReader")
loader = PDFReader()
chunks = loader.load_data(file=Path(file_path))
Thank you