Use a manifest file for indexing but paper-qa still uses LLM during indexing. #861
Comments
To index documents in PaperQA using only the title (or the name of the PDF file) without relying on LLMs for metadata retrieval, you can create a manifest file with just the file_location and title columns. Here's how you can adjust your manifest file creation:
import csv
import os

def make_manifest_file(path, name='Manifest.csv', delimiter=','):
    with open(name, 'w', newline='') as manifest:
        writer = csv.writer(manifest, delimiter=delimiter)
        fields = ["file_location", "title"]
        writer.writerow(fields)
        # Walk the directory tree and use each file name as its title.
        for currentpath, folders, files in os.walk(path):
            for filename in files:
                try:
                    writer.writerow([os.path.join(currentpath, filename), filename])
                except Exception as e:
                    print(e)
By following these steps, you should be able to index documents using only the title without needing additional metadata like DOI, year, or authors [1][2]. To continue talking to Dosu, mention @dosu.
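The manifest function above walks every file under the directory, so it would also list non-PDF files, including the manifest itself when it is saved inside the paper directory. A minimal sketch of a PDF-only variant follows. The name make_pdf_manifest is hypothetical, and whether paper-qa expects file_location to be absolute or relative to the paper directory is not settled in this thread, so the path format here is an assumption to check against the README.

```python
import csv
from pathlib import Path


def make_pdf_manifest(paper_dir, manifest_path, delimiter=","):
    """Write a manifest with one row per PDF under paper_dir.

    The title is the file name without its extension. Returns the row count.
    Hypothetical helper: the path format in file_location is an assumption.
    """
    paper_dir = Path(paper_dir)
    # Collect rows before opening the manifest so the manifest is never listed.
    rows = [
        {"file_location": str(candidate), "title": candidate.stem}
        for candidate in sorted(paper_dir.rglob("*"))
        if candidate.is_file() and candidate.suffix.lower() == ".pdf"
    ]
    with open(manifest_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["file_location", "title"], delimiter=delimiter
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Matching on the lowercased suffix keeps files named like paper.PDF, which a case-sensitive glob would skip on Linux.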
It doesn't work. The script returns this error while indexing:
And it still uses the LLM when performing the indexing task... Any solution?
To resolve the "TypeError: keywords must be strings" error, ensure that your manifest file is correctly formatted with the expected column names and data types. The manifest file should have the following columns:
import csv
import os

def make_manifest_file(path, name='Manifest.csv', delimiter=','):
    with open(name, 'w', newline='') as manifest:
        writer = csv.writer(manifest, delimiter=delimiter)
        fields = ["file_location", "doi", "title", "key", "docname", "citation"]
        writer.writerow(fields)
        for currentpath, folders, files in os.walk(path):
            for filename in files:
                try:
                    file_location = os.path.join(currentpath, filename)
                    # Empty DOI; the file name fills title, key, and docname.
                    writer.writerow([file_location, "", filename, filename, filename, "_"])
                except Exception as e:
                    print(e)
from paperqa import Settings
from paperqa.agents import build_index
from paperqa.settings import AgentSettings, IndexSettings, ParsingSettings

settings = Settings(
    agent=AgentSettings(
        index=IndexSettings(
            manifest_file="Manifest.csv",
            # other settings...
        )
    ),
    parsing=ParsingSettings(
        use_doc_details=False,
        # other settings...
    ),
)
build_index(settings=settings)

By ensuring the manifest file is correctly formatted and setting
where can I find |
I cannot solve this problem... I am trying to index without using the LLM because the LLM hangs on some PDF files (I guess it is a problem with the PDFs themselves). Setting Unfortunately, I can't get an error message that would help me understand the problem (I set verbose=2). Setting Any solution?
I half-solved the problem. Indexing still uses the LLM, I don't know why... But the LLM no longer hangs while setting I guess my local LLM was overloaded...
Hi, I am battling with the same question (using a manifest file for indexing). I followed @dosu's recommendation, but I got the following error when calling
Complete code below:

import os
from paperqa import Settings
from paperqa.agents import build_index
from paperqa.settings import AgentSettings, IndexSettings, ParsingSettings

os.environ['OPENAI_API_KEY'] = "ollama"
model = "ollama/llama3.2"
paper_directory = "data/subsample14"
manifest_file = "data/subsample14/manifest.csv"

local_llm_config = {
    "model_list": [
        {
            "model_name": model,
            "litellm_params": dict(
                model=model,
                api_base="http://localhost:11434",
            ),
        }
    ]
}

agent = AgentSettings(
    agent_llm=model,
    agent_llm_config=local_llm_config,
    index=IndexSettings(
        manifest_file=manifest_file,
    ),
)

settings = Settings(
    temperature=0.5,
    llm=model,
    llm_config=local_llm_config,
    summary_llm=model,
    summary_llm_config=local_llm_config,
    embedding="ollama/mxbai-embed-large",
    agent=agent,
    paper_directory=paper_directory,
    parsing=ParsingSettings(
        use_doc_details=False,
    ),
    #verbosity=3
)
settings.answer.answer_max_sources = 3
settings.answer.evidence_k = 5

# index building
build_index(settings=settings)
To address the issues you're encountering:
By ensuring these configurations and adjustments, you should be able to index documents without unnecessary LLM usage and avoid event loop issues.
Hi @Snikch63200! Also, about this line in your error:

    records = [DocDetails(**r) for r in csv.DictReader(content.splitlines())]

We don't have that line anymore, I think it could be useful to install the latest Please let me know if this helps.
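One way a line of that shape can raise "keywords must be strings" is a data row with more fields than the header: csv.DictReader stores the surplus under the key None, and unpacking that dict with ** fails. The sketch below reproduces this with a hypothetical stand-in function (accept_keywords, not the real DocDetails) and adds a hypothetical checker, find_malformed_rows. The traceback is not shown in this thread, so treat this as a possible cause rather than a confirmed diagnosis.

```python
import csv


def find_malformed_rows(manifest_text, delimiter=","):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(manifest_text.splitlines(), delimiter=delimiter)
    header = next(reader)
    return [
        (reader.line_num, len(row))
        for row in reader
        if row and len(row) != len(header)
    ]


def accept_keywords(**fields):
    """Hypothetical stand-in for a constructor called as DocDetails(**row)."""
    return fields


# Two header columns, but three fields in the data row.
bad_manifest = "file_location,title\npaper.pdf,Some title,unexpected extra field\n"

for row in csv.DictReader(bad_manifest.splitlines()):
    try:
        accept_keywords(**row)
    except TypeError as error:
        # The surplus field sits under the key None, which is not a valid keyword.
        print(f"unpacking failed: {error}")

print(find_malformed_rows(bad_manifest))
```

Running the checker on a manifest before indexing shows which lines need fixing, for example a title containing an unquoted delimiter.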
Hi @JoaquinPolonuer, thanks for your answer. I've updated to 5.17.0. I created the manifest as you recommended, but it still uses the LLM while indexing... Here's my complete code:
I guess the problem is related to the manifest formatting... Help welcome... Best regards.
Hey @Snikch63200!
You can also try to debug this by putting breakpoints in the functions
How can I import
To import the test_get_directory_index_w_no_citations function:

    from tests.test_agents import test_get_directory_index_w_no_citations

This function is defined as an asynchronous test function using the
It returns the following error: No module named 'test.test_agents'
You can only get this from a local Also, I think Joaquin was more just saying to use
Hello,
I created a manifest file as follows, before indexing:
The tutorial here (https://github.com/Future-House/paper-qa?tab=readme-ov-file#manifest-files) explains that creating a manifest file avoids LLM usage for information like DOI retrieval, but indexing still uses the LLM...
My problem is that the DOI, year, authors, etc. cannot be retrieved from the docs, and I don't need them anyway.
I'm just looking for a simple way to add docs with only a title (in fact, the name of the PDF file) to an index to perform agentic queries.
Is it possible?
Best regards.