[Question]: PDFReader how to specify chunk size ? #8399

xrtechspot · 2023-09-01T14:19:00Z

Question Validation

I have searched both the documentation and discord for an answer.

Question

Is there a way to specify the chunk size to use in the loader?

from pathlib import Path
from llama_index import download_loader

PDFReader = download_loader("PDFReader")
loader = PDFReader()
chunks = loader.load_data(file=Path(file_path))

Thank you

@dosu-bot · 2023-09-01T14:25:46Z

🤖

Hello,

Thank you for your question. Currently, the PDFReader class in the LlamaIndex codebase does not have a parameter to specify the chunk size when loading data from a PDF file. The load_data method in the PDFReader class only takes a file parameter of type Path and an extra_info parameter of type Optional[Dict].

Here is the method signature for reference:

def load_data(
    self, file: Path, extra_info: Optional[Dict] = None
) -> List[Document]:

In the method body, the PDF file is opened with pypdf, and no chunk size is applied at this stage:

with open(file, "rb") as fp:
    # Create a PDF object
    pdf = pypdf.PdfReader(fp)

Therefore, there is no provision for specifying a chunk size when loading data from a PDF file in the current implementation of the PDFReader class.

If you need to process the PDF file in chunks, you might need to implement this functionality yourself. You could, for example, modify the load_data method to take an additional parameter for the chunk size and then read the PDF file in chunks of that size.

I hope this answers your question. If you have any further questions, please feel free to ask.

Best regards,
Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

logan-markewich (Maintainer) · 2023-09-01T15:50:59Z

@xrtechspot There is not. Right now it just creates one Document object per page (this helps with citations, since the page number goes into the metadata).

logan-markewich (Maintainer) · 2023-09-01T15:51:16Z

You can specify the chunk size when you actually convert the documents into nodes

xrtechspot (Author) · 2023-09-01T15:57:15Z

> You can specify the chunk size when you actually convert the documents into nodes

How do I do that?

logan-markewich (Maintainer) · 2023-09-01T17:23:39Z

Usually, just set the chunk size in the service context:

from llama_index import ServiceContext, VectorStoreIndex

# chunk_size is applied when the documents are split into nodes
service_context = ServiceContext.from_defaults(chunk_size=512)

# documents: the list returned by loader.load_data(...)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
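
For anyone wondering what "chunking at the node stage" means in practice, here is a rough, library-free sketch of the idea: each per-page document is cut into pieces of at most chunk_size, and every piece keeps the page's metadata so citations still point at the right page. This is only an illustration, not LlamaIndex's actual splitter. PageDoc and split_into_chunks are hypothetical names, the "page_label" key is an assumption about what the reader stores, and the sketch counts characters, whereas the library's chunk_size is, as far as I understand, measured in tokens.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PageDoc:
    """Hypothetical stand-in for one per-page document (not a LlamaIndex class)."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def split_into_chunks(docs: List[PageDoc], chunk_size: int, overlap: int = 0) -> List[PageDoc]:
    """Cut each page into character windows of at most chunk_size, copying its metadata."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[PageDoc] = []
    for doc in docs:
        for start in range(0, len(doc.text), step):
            piece = doc.text[start:start + chunk_size]
            chunks.append(PageDoc(text=piece, metadata=dict(doc.metadata)))
            if start + chunk_size >= len(doc.text):
                break  # this window already reached the end of the page
    return chunks


# Two fake "pages" standing in for what a PDF loader might return.
pages = [
    PageDoc("a" * 1200, {"page_label": "1"}),
    PageDoc("b" * 300, {"page_label": "2"}),
]
chunks = split_into_chunks(pages, chunk_size=512)
for chunk in chunks:
    print(chunk.metadata["page_label"], len(chunk.text))
```

Because chunks never cross a page boundary here, a short page simply yields one small chunk; the page metadata is copied rather than shared so later edits to one chunk do not leak into another.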
