How do I manually index papers? #873
Replies: 1 comment 4 replies
-
Hey @Smogglethorp-T-Hendlesworthington, I'm here to help you with any bugs, questions, or contributions you have! I'm a bot that can assist you while you're waiting for a human maintainer. To manually index papers and save their embeddings in a way that allows PaperQA to handle failures gracefully, you can use the from paperqa.docs import Docs
from pathlib import Path
async def index_papers(docs: Docs, paper_paths: list[Path], embedding_model):
for paper_path in paper_paths:
try:
await docs.aadd(
path=paper_path,
embedding_model=embedding_model
)
print(f"Successfully indexed: {paper_path}")
except Exception as e:
print(f"Failed to index {paper_path}: {e}")
# Example usage
# docs = Docs()
# paper_paths = [Path("paper1.pdf"), Path("paper2.pdf"), ...]
# embedding_model = ... # Your embedding model instance
# await index_papers(docs, paper_paths, embedding_model) This code snippet demonstrates how to iterate over a list of paper paths, attempting to add each one to the To continue talking to Dosu, mention @dosu. Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Bug Report | Other |
Beta Was this translation helpful? Give feedback.
-
Context: I've got several thousand papers I'd like to index and ask questions of, and would like to build some kind of indexing pipeline such that, if one paper fails, the whole indexing process does not.
This means I've got to manually index papers and add the result to the whole index. I can create and add papers to
Docs
, but this doesn't seem like quite the correct answer; it takes increasingly long to add a new paper to a largeDocs
object.Is there a way to manually index papers in this way and save their embedding so that when I ask a question at any time, PaperQA can reference my large index?
Beta Was this translation helpful? Give feedback.
All reactions