[llm unify 2/n] Implement llm_map(_elements) and move extract_entity to it. #1126
Conversation
Commits:
- …kwargs
- …prompts (to do things like adhere to token limits)
- …ive implementation of extract entity bc I don't want to deal with llm_filter just yet
- … docs say because azure disagrees and they'd better both accept 'system'
afaict the ITs that are failing are mostly due to …
There's a lot here. For the most part it looks good, I think. Couple of things to think about:
- Is there a reasonable path to replace the generate_old calls at some point, do you think? I think you are doing the right thing doing the replacement incrementally, but I have a suspicion that we are going to live with this for a while.
- Anything else we need to test to get confidence here? I guess most of our demos don't use sycamore directly anymore. Does AddDoc in DocStore still work? What about the DocPrep generated scripts?
```
@@ -113,6 +113,19 @@ def set(self, **kwargs) -> "SycamorePrompt":
            new.__dict__[k] = v
        return new

    def is_done(self, s: str) -> bool:
```
Trying to understand this (and maybe it will become clearer as I get through the rest of the files). Is the idea that the sequence will be an iterator or something and you invoke this between calls? Is it used anywhere as anything other than True? I don't see an implementation in ElementListIterPrompt.
Yep, that's the idea. In extract entities with a tokenizer we set this to `s != "None"`, which is the condition it used to use. I'm not sure that the prompt is the right place for this piece of logic to live, though I guess the prompt is also what decides whether to make a sequence of RenderedPrompts, so maybe?
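A minimal sketch of what that override might look like, assuming the `is_done` hook from the diff above; the subclass name is hypothetical and only the `s != "None"` condition comes from this discussion:

```python
class TokenLimitedEntityPrompt(SycamorePrompt):  # hypothetical subclass
    def is_done(self, s: str) -> bool:
        # Stop iterating through the sequence of rendered prompts once
        # the LLM returns something other than the "None" sentinel.
        return s != "None"
```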
```python
UNKNOWN = 0
SYNC = 1
ASYNC = 2
BATCH = 3
```
Is this something done by the LLM, or are we batching by combining multiple things into a prompt before sending?
```python
class LLMMap(MapBatch):
    """The LLMMap transform renders each Document in a docset into
```
Not as part of this PR, but I wonder if eventually it would make sense to have an option to attach the raw LLM request/response to the document as well for debugging.
```python
from sycamore.data import Document, Element


def _infer_prompts(
```
I confess I'm having a hard time following all of the prompt sequence stuff. I understand the basic motivation for the token stuff, but I can't help but wonder if there is a cleaner way to do it.
One thought I had, which I think is beyond the scope of this, is some sort of conditional branching/looping logic in a sycamore pipeline. Like:

```
docset.do()
      .llm_map(<extract entity prompt>)
      .map(if not json set back to 'None')
      .while(lambda d: d.properties.get('entity', 'None') == 'None')
      .continue_processing()
```
I'm sure that creates all sorts of theoretic problems.
But then prompts can go back to being a single prompt and we can increment a counter on the document object and stuff.
I guess we can do the to-render-object counter thing and keep prompts to single-render in a for loop in llm_map, which is probably cleaner. I'll try it and see.
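A rough sketch of that approach, for concreteness — every name here (`render_document`, the `_llm_tries` property, `max_tries`) is an assumption, not the PR's actual code:

```python
def map_one_doc(doc, prompt, llm, max_tries: int = 3):
    # Render one prompt at a time, track attempts on the document,
    # and stop as soon as the prompt's is_done predicate is satisfied.
    for attempt in range(max_tries):
        rendered = prompt.render_document(doc)      # assumed rendering API
        response = llm.generate(prompt=rendered)
        doc.properties["_llm_tries"] = attempt + 1  # counter on the document
        if prompt.is_done(response):
            doc.properties["entity"] = response
            break
    return doc
```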
```
@@ -465,6 +467,7 @@ def extract_document_structure(self, structure: DocumentStructure, **kwargs):
        document_structure = ExtractDocumentStructure(self.plan, structure=structure, **kwargs)
        return DocSet(self.context, document_structure)

    @deprecated(version="0.1.31", reason="Use llm_map instead")
```
Is the plan also to deprecate extract_properties?
the plan is to deprecate just about everything on my sprint tasks
Also, have you gone through the integ test failures to see if they are related? I did a cursory glance and a few did look llm related.
The goal is for generate_old to be deleted entirely. git grep claims there are 28 uses (10 of them are unit test mocks). Most of the rest are in the path of the replacements I have planned. There are 3 more in sycamore query that I'm hoping should be relatively simple to switch over, and the constant warnings should annoy vinayak into doing it (or I can if it isn't done by the time I try to remove generate_old).
Yep. Of the 7 failing ITs I think 4 are directly related to these changes, with generate_old not quite handling images and response_formats correctly. My plan was to let those break (which I guess includes breaking the transforms) and then fix them when I dealt with their associated transforms. Maybe they need pytest skip marks then? Or maybe I need to actually fix them. Who's using the graph stuff anyways? SummarizeImages was the next thing I was gonna tackle, though.
The docprep ITs passed (except for the pinecone one that gets rate-limited and dies). I can try to figure out how to test random things in AddDoc - probably should know that anyway. But overall I feel like passing the sycamore test suite (barring the 4 guys above) is probably good. Should make sure that AddDoc doesn't hit the bad paths in the failing ITs though.
update: was able to run Add a doc with docstore using this
I think your plan seems reasonable. You'll have to keep an eye on things when you push this as it potentially has a big blast radius, but I don't think I see any reason to wait.
This is hard to break up bc I'm changing APIs of things that already exist and have tests...
- `llm.generate` now takes a RenderedPrompt instead of a dict; the old dict-based behavior is moved to `llm.generate_old`, and it's marked as deprecated.
- Adds `docset.llm_map` and `docset.llm_map_elements`.
- Adds an `EntityExtractor.as_llm_map` method that converts all of the functionality of `OpenAIEntityExtractor` into prompts and hooks for llm_map. `docset.extract_entity()` now uses that to use LLMMap instead of ExtractEntity.
- Also changes the prompt rendering APIs to produce either a single RenderedPrompt or a sequence of them, in order to support a mode of extract entity that uses a tokenizer to make token-limited chunks to try to extract from.
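For concreteness, a hedged usage sketch of the new surface; only `llm.generate`, `RenderedPrompt`, `docset.llm_map`, and `docset.llm_map_elements` come from the description above — the rendering call and keyword arguments are assumptions:

```python
# New-style call: llm.generate takes a RenderedPrompt rather than a dict.
rendered = my_prompt.render_document(doc)  # assumed rendering entry point
response = llm.generate(prompt=rendered)

# Docset-level transforms added by this PR (argument names assumed):
docset = docset.llm_map(my_prompt, output_field="entity", llm=llm)
docset = docset.llm_map_elements(my_prompt, output_field="entity", llm=llm)
```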