Replies: 1 comment
Sounds like an interesting idea in general. Since prototypes of this can be generated quickly with an agent, I suggest just trying out the options instead of overthinking it. Once you find something that kinda works, stick to the stack you picked and iteratively improve it. That's the hard part, as doing it properly will require you to set up evals.
Hi all. I am contemplating creating a custom document-search skill for my pi agent, but before I start I wanted to check with you guys whether anybody knows of good alternative solutions, or just what you think about this from your experience. Any feedback is highly welcome!
Problem:
I often find that the combination of "grep" and "read" leads to excessive context-window expansion, both for codebase investigation and for searching a folder full of documents. The agent typically reads a lot of documents, and it reads each one in full even when only parts of it are relevant, massively inflating input token usage.
Idea:
With this experience in mind, I have looked into utilising CLI-based indexed search options. The idea is to install one or more of them and then create a "doc-search" skill that instructs the agent how to search different document types (code, company reports, webpages, etc.) by combining the bash tool with the appropriate CLI-based indexed search utility. Initially I considered building a complex search tool that handled a lot of cases, but inspired by the Bitter Lesson, I think it's better to avoid too much scaffolding and instead just tell the agent how to do it via prompting. By designing it as a skill, I can also instruct the agent to always start by reading only parts of files, e.g. from line number X to Y, and only read the whole file if absolutely needed (see the sketch below). Hopefully, this can decrease input token usage quite a bit.
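To make this concrete, here is a rough sketch of what the skill's instructions could look like. Everything in it is a hypothetical placeholder (the `./docs` path, the `doc-index.mjs` script, the output format), not an existing pi convention:

```markdown
# doc-search skill (hypothetical sketch)

When asked about the documents under ./docs:
1. Query the prebuilt index first: `node doc-index.mjs search "<terms>"`.
2. Hits are printed as `path:startLine-endLine`. Read only that span,
   e.g. `sed -n '120,180p' docs/report.md`, not the whole file.
3. Only read an entire file if the snippet clearly lacks context.
4. Fall back to `grep -rn` only if the index returns nothing.
```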
Concrete alternatives:
One option has stood out as most attractive to me so far:
MiniSearch is a tiny JS search engine that can be used via Node. For smaller projects or repos, it could offer lightweight, inverted-index-based full-text search. It should be easy to wrap in a script that loads files from a directory (incl. subfolders) into memory, builds the index, and performs queries, all callable from bash for the agent (a sketch follows below). I would probably need to use the persistence option via serialization, and maybe run the indexing as a hook (via bash) every time the agent starts up, or even more often (every agent turn) if needed. Another alternative would be Lunr.js, which seems similar and, to my understanding, includes stemming and language support. The creator of MiniSearch, Luca Ongaro, provides a comparison in one of his blog posts.
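To illustrate, here is a minimal sketch of such a wrapper. The file names, paragraph-chunking heuristic, and CLI shape are my own assumptions; the MiniSearch calls themselves (`new MiniSearch`, `addAll`, `search`, `loadJSON`, JSON serialization) are the library's documented API:

```js
// doc-index.mjs -- sketch: index .md/.txt files under a folder with MiniSearch,
// chunked by blank-line-separated paragraphs so hits map to line ranges,
// persisted as JSON so the index survives between agent turns.
// Usage:  node doc-index.mjs index ./docs
//         node doc-index.mjs search "quarterly revenue"
import fs from 'node:fs';
import path from 'node:path';
import MiniSearch from 'minisearch';

const INDEX_FILE = '.doc-search-index.json'; // arbitrary persistence location
const options = { fields: ['text'], storeFields: ['file', 'startLine', 'endLine'] };

// Split one file into paragraph chunks, remembering 1-based line ranges so the
// agent can later read only lines startLine..endLine instead of the whole file.
function chunksFrom(file) {
  const lines = fs.readFileSync(file, 'utf8').split('\n');
  const chunks = [];
  let start = 0;
  for (let i = 0; i <= lines.length; i++) {
    if (i === lines.length || lines[i].trim() === '') {
      const text = lines.slice(start, i).join('\n').trim();
      if (text) chunks.push({ id: `${file}:${start + 1}`, file, startLine: start + 1, endLine: i, text });
      start = i + 1;
    }
  }
  return chunks;
}

const [cmd, arg] = process.argv.slice(2);
if (cmd === 'index') {
  // fs.readdirSync with { recursive: true } needs Node >= 18.17.
  const docs = fs.readdirSync(arg, { recursive: true })
    .filter((f) => /\.(md|txt)$/.test(f))
    .flatMap((f) => chunksFrom(path.join(arg, f)));
  const mini = new MiniSearch(options);
  mini.addAll(docs);
  fs.writeFileSync(INDEX_FILE, JSON.stringify(mini)); // MiniSearch implements toJSON
  console.log(`indexed ${docs.length} chunks`);
} else if (cmd === 'search') {
  const mini = MiniSearch.loadJSON(fs.readFileSync(INDEX_FILE, 'utf8'), options);
  for (const r of mini.search(arg, { prefix: true }).slice(0, 10)) {
    console.log(`${r.file}:${r.startLine}-${r.endLine}\tscore=${r.score.toFixed(2)}`);
  }
}
```

Running the `index` subcommand from a startup hook, as suggested above, should keep the serialized index fresh enough for most document folders.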
Then there are two options that seem more rigorous, but maybe also a bit more complex (at least for my use cases):
Recoll (Xapian-based) for basically everything: inverted indexing for fast queries, headless CLI operation, can be set up with cron for regular indexing, handles text files natively and supports PDFs via external tools, has relevance ranking, is lightweight, can be queried with keywords/natural-ish language, etc. It also seems quite low-maintenance, i.e. no hefty setup is needed and reindexing can run fully automatically via cron (minimal example below).
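For reference, the day-to-day loop can stay this small, using only Recoll's two standard commands, `recollindex` and `recollq` (this assumes `topdirs` in `~/.recoll/recoll.conf` points at the document folder; the cron schedule is arbitrary):

```bash
# Build or update the index (reads topdirs from ~/.recoll/recoll.conf)
recollindex

# Optional cron entry for hourly reindexing:
# 0 * * * * recollindex

# Headless query from the agent's bash tool; Recoll's query
# language accepts plain keywords as well as phrases
recollq 'revenue growth 2023'
```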
Tantivy could maybe excel at codebase search/investigation, because such search normally requires handling symbols, exact matches, and often large volumes of small text files where performance and "grep-like" precision are vital. However, this one seems to need more setup and config, and requires input in a specific JSON-lines format, which means you need preprocessing scripts to convert directories of .txt or .md files into that format before indexing (sketch below).
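That preprocessing is small, though. A sketch, assuming an index schema with `path` and `body` text fields (the field names must match whatever schema you declare when creating the tantivy index):

```js
// to-ndjson.mjs -- sketch: flatten a directory of .md/.txt files into the
// JSON-lines format tantivy-cli ingests, one document per line.
// Usage: node to-ndjson.mjs ./docs > docs.ndjson
import fs from 'node:fs';
import path from 'node:path';

const root = process.argv[2];
for (const f of fs.readdirSync(root, { recursive: true })) { // Node >= 18.17
  if (!/\.(md|txt)$/.test(f)) continue;
  const full = path.join(root, f);
  // Field names here must match the tantivy index schema.
  console.log(JSON.stringify({ path: full, body: fs.readFileSync(full, 'utf8') }));
}
```

If I read the tantivy-cli README correctly, the output can then be piped straight into `tantivy index -i <index-dir>` and queried with `tantivy search -i <index-dir> -q "..."`.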