
Too long documents will not get processed #127

Open
Fumblesneeze opened this issue Nov 22, 2024 · 4 comments

Comments

@Fumblesneeze

There are two documents that don't get processed for me: one with 12 pages and another with 24. I think they exceed the context length of the LLM.

These are the logs:

Nov 22 18:04:55.114 INFO Application started, version: 0.0.0
Nov 22 18:04:55.265 INFO Fetching custom fields from paperless at http://webserver:8000
Nov 22 18:04:55.298 INFO Fields: [Field { id: 1, name: "tagged", data_type: "boolean" }]
Nov 22 18:04:55.298 INFO Retrieve Documents from paperless at: http://webserver:8000, with query: NOT tagged=true
Nov 22 18:04:55.659 INFO Successfully retrieved 2 Documents
Nov 22 18:04:55.659 INFO Generate Response with LLM model
Nov 22 18:05:07.457 ERRO No JSON object found in the response!
Nov 22 18:05:07.457 INFO Fetching custom fields from paperless at http://webserver:8000
Nov 22 18:05:07.492 INFO tags: []
Nov 22 18:05:07.492 INFO Fetching custom fields from paperless at http://webserver:8000
Nov 22 18:05:07.527 INFO tags: []
Nov 22 18:05:10.037 ERRO No JSON object found in the response!
Nov 22 18:05:10.037 INFO Fetching custom fields from paperless at http://webserver:8000
Nov 22 18:05:10.073 INFO document_types: [(...)]
Nov 22 18:05:16.668 ERRO No JSON object found in the response!
Nov 22 18:05:16.668 INFO Fetching custom fields from paperless at http://webserver:8000
Nov 22 18:05:16.702 INFO correspondents: []
Nov 22 18:05:22.905 ERRO No JSON object found in the response!
Nov 22 18:05:22.905 INFO Generate Response with LLM model
Nov 22 18:05:29.430 ERRO No JSON object found in the response!
Nov 22 18:05:29.430 INFO Fetching custom fields from paperless at http://webserver:8000
Nov 22 18:05:29.464 INFO tags: []
Nov 22 18:05:29.464 INFO Fetching custom fields from paperless at http://webserver:8000
Nov 22 18:05:29.501 INFO tags: []
Nov 22 18:05:38.668 ERRO No JSON object found in the response!
Nov 22 18:05:38.668 INFO Fetching custom fields from paperless at http://webserver:8000
Nov 22 18:05:38.704 INFO document_types: [(...)]
Nov 22 18:05:46.970 ERRO No JSON object found in the response!
Nov 22 18:05:46.970 INFO Fetching custom fields from paperless at http://webserver:8000
Nov 22 18:05:47.004 INFO correspondents: []
Nov 22 18:05:53.309 ERRO No JSON object found in the response!
@B-urb
Owner

B-urb commented Nov 27, 2024

Which model are you using? If it is a limitation of the LLM, I would have to implement some sort of split processing to obtain data for those documents.
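
For reference, a minimal sketch of what such split processing could look like, assuming the document text is available as a plain string; the function name and chunking by character count are just illustrative, not part of the project:

```rust
/// Split a document's text into chunks of at most `max_chars` characters,
/// so that each chunk stays within the model's context window.
fn split_into_chunks(content: &str, max_chars: usize) -> Vec<String> {
    content
        .chars()
        .collect::<Vec<char>>()
        .chunks(max_chars)
        .map(|chunk| chunk.iter().collect())
        .collect()
}
```

Each chunk could then be sent through the existing prompt separately and the results merged, e.g. by taking the union of suggested tags.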

@Fumblesneeze
Author

I tried llama3:7b, qwen2.5:14b and llama2:13b

@Fumblesneeze
Author

I think for now it would suffice to cut off the content sent to the LLM after a certain length, since the most important information in a document is usually at the beginning. That should be enough to categorize and tag it properly, so a simple env variable to limit the content length would probably do.
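
A minimal sketch of that idea, assuming a hypothetical MAX_CONTENT_LENGTH environment variable and an assumed default of 8000 characters (neither exists in the project yet):

```rust
use std::env;

/// Truncate the document content before it is sent to the LLM.
/// MAX_CONTENT_LENGTH and the 8000-character default are assumptions,
/// not existing options of this project.
fn truncate_content(content: &str) -> String {
    let max_chars: usize = env::var("MAX_CONTENT_LENGTH")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(8_000);
    // Iterating over chars() avoids slicing in the middle of a UTF-8 character.
    content.chars().take(max_chars).collect()
}
```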

@B-urb
Owner

B-urb commented Dec 10, 2024

I will address this issue soon and cap the length.
