
Too long documents will not get processed #127

Open
Fumblesneeze opened this issue Nov 22, 2024 · 4 comments

Comments

@Fumblesneeze

There are two documents that don't get processed for me: one with 12 pages and another with 24. I think they exceed the context length of the LLM.

These are the logs:

Nov 22 18:04:55.114 INFO Application started, version: 0.0.0
Nov 22 18:04:55.265 INFO Fetching custom fields from paperless at http://webserver:8000
Nov 22 18:04:55.298 INFO Fields: [Field { id: 1, name: "tagged", data_type: "boolean" }]
Nov 22 18:04:55.298 INFO Retrieve Documents from paperless at: http://webserver:8000, with query: NOT tagged=true
Nov 22 18:04:55.659 INFO Successfully retrieved 2 Documents
Nov 22 18:04:55.659 INFO Generate Response with LLM model
Nov 22 18:05:07.457 ERRO No JSON object found in the response!
Nov 22 18:05:07.457 INFO Fetching custom fields from paperless at http://webserver:8000
Nov 22 18:05:07.492 INFO tags: []
Nov 22 18:05:07.492 INFO Fetching custom fields from paperless at http://webserver:8000
Nov 22 18:05:07.527 INFO tags: []
Nov 22 18:05:10.037 ERRO No JSON object found in the response!
Nov 22 18:05:10.037 INFO Fetching custom fields from paperless at http://webserver:8000
Nov 22 18:05:10.073 INFO document_types: [(...)]
Nov 22 18:05:16.668 ERRO No JSON object found in the response!
Nov 22 18:05:16.668 INFO Fetching custom fields from paperless at http://webserver:8000
Nov 22 18:05:16.702 INFO correspondents: []
Nov 22 18:05:22.905 ERRO No JSON object found in the response!
Nov 22 18:05:22.905 INFO Generate Response with LLM model
Nov 22 18:05:29.430 ERRO No JSON object found in the response!
Nov 22 18:05:29.430 INFO Fetching custom fields from paperless at http://webserver:8000
Nov 22 18:05:29.464 INFO tags: []
Nov 22 18:05:29.464 INFO Fetching custom fields from paperless at http://webserver:8000
Nov 22 18:05:29.501 INFO tags: []
Nov 22 18:05:38.668 ERRO No JSON object found in the response!
Nov 22 18:05:38.668 INFO Fetching custom fields from paperless at http://webserver:8000
Nov 22 18:05:38.704 INFO document_types: [(...)]
Nov 22 18:05:46.970 ERRO No JSON object found in the response!
Nov 22 18:05:46.970 INFO Fetching custom fields from paperless at http://webserver:8000
Nov 22 18:05:47.004 INFO correspondents: []
Nov 22 18:05:53.309 ERRO No JSON object found in the response!
@B-urb
Owner

B-urb commented Nov 27, 2024

Which model are you using? If it is a limitation of the LLM, I would have to implement some sort of split processing to obtain data for those documents.
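
For reference, a minimal sketch of what such split processing could look like, assuming the document text is available as a plain string; the function name and chunking by character count are just illustrative, not part of the project:

```rust
/// Split a document's text into chunks of at most `max_chars` characters,
/// so that each chunk stays within the model's context window.
fn split_into_chunks(content: &str, max_chars: usize) -> Vec<String> {
    content
        .chars()
        .collect::<Vec<char>>()
        .chunks(max_chars)
        .map(|chunk| chunk.iter().collect())
        .collect()
}
```

Each chunk could then be sent through the existing prompt separately and the results merged, e.g. by taking the union of suggested tags.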

@Fumblesneeze
Author

I tried llama3:7b, qwen2.5:14b and llama2:13b

@Fumblesneeze
Author

I think for now it would suffice to cut off the content sent to the LLM after a certain length, since the most important information in a document is usually at the beginning. That should be enough to categorize and tag it properly, so a simple env variable to limit the content length would probably do.
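
A minimal sketch of that idea, assuming a hypothetical MAX_CONTENT_LENGTH environment variable and an assumed default of 8000 characters (neither exists in the project yet):

```rust
use std::env;

/// Truncate the document content before it is sent to the LLM.
/// MAX_CONTENT_LENGTH and the 8000-character default are assumptions,
/// not existing options of this project.
fn truncate_content(content: &str) -> String {
    let max_chars: usize = env::var("MAX_CONTENT_LENGTH")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(8_000);
    // Iterating over chars() avoids slicing in the middle of a UTF-8 character.
    content.chars().take(max_chars).collect()
}
```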

@B-urb
Owner

B-urb commented Dec 10, 2024

I will address this issue soon and cap the length.
