
[llm unify 2/n] Implement llm_map(_elements) and move extract_entity to it. #1126

Merged
merged 49 commits into main on Jan 30, 2025

Conversation

@HenryL27 (Collaborator) commented Jan 24, 2025

This is hard to break up because I'm changing APIs of things that already exist and have tests...

  • llm.generate now takes a RenderedPrompt instead of a dict
  • the old API exists as llm.generate_old, and it's marked as deprecated
  • most operations have been switched to call llm.generate_old
  • implement docset.llm_map and docset.llm_map_elements
  • implement an EntityExtractor.as_llm_map method that converts all of the functionality of OpenAIEntityExtractor into prompts and hooks for llm_map
  • docset.extract_entity() now uses that, so it runs LLMMap instead of ExtractEntity

Also changes the prompt rendering APIs to produce either a single RenderedPrompt or a sequence of them, in order to support a mode of extract_entity that uses a tokenizer to make token-limited chunks to extract from.
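To make the shape of the new API concrete, here's a rough usage sketch. The import paths, constructor arguments, and keyword names below are my best guesses from this description, not the exact final signatures:

```python
# Hypothetical sketch of the new flow; names and signatures are approximate.
from sycamore.llms import OpenAI, OpenAIModels                        # assumed import path
from sycamore.llms.prompts import RenderedMessage, RenderedPrompt     # assumed import path
from sycamore.transforms.extract_entity import OpenAIEntityExtractor  # assumed import path

llm = OpenAI(OpenAIModels.GPT_4O)

# llm.generate now takes a RenderedPrompt instead of a dict of prompt_kwargs.
prompt = RenderedPrompt(
    messages=[RenderedMessage(role="user", content="What is this document's title?")]
)
answer = llm.generate(prompt=prompt)

# The old dict-based entry point survives (deprecated) as llm.generate_old:
# answer = llm.generate_old(prompt_kwargs={"prompt": "..."}, llm_kwargs={})

# extract_entity now builds an LLMMap via EntityExtractor.as_llm_map under the hood,
# so this line is unchanged for callers but runs through the new transform:
# docset = docset.extract_entity(entity_extractor=OpenAIEntityExtractor("title", llm=llm))

# docset.llm_map(...) / docset.llm_map_elements(...) apply a SycamorePrompt to every
# document / element and write the response to an output field (keyword names assumed).
```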

…ive implementation of extract entity bc I don't want to deal with llm_filter just yet

@HenryL27 marked this pull request as ready for review January 28, 2025 21:58
@HenryL27 changed the title from "[llm unify 2/n] Implement llm_map(_elements) and move ops to it." to "[llm unify 2/n] Implement llm_map(_elements) and move extract_entity to it." on Jan 28, 2025
… docs say because azure disagrees and they'd better both accept 'system'

@HenryL27 (Collaborator Author)

As far as I can tell, the ITs that are failing are mostly due to llm.generate_old() not quite converting prompt_kwargs to a RenderedPrompt correctly. I could spend a lot of time debugging that, but given that my goal is to delete generate_old entirely, I'm tempted to just leave the tests broken and fix them when I translate the transforms they're testing to LLMMap.

@bsowell (Contributor) left a comment

There's a lot here. For the most part it looks good, I think. A couple of things to think about:

  • Is there a reasonable path to replace the generate_old calls at some point, do you think? I think you are doing the right thing by doing the replacement incrementally, but I have a suspicion that we are going to live with this for a while.

  • Anything else we need to test to get confidence here? I guess most of our demos don't use sycamore directly anymore. Does AddDoc in DocStore still work? What about the DocPrep-generated scripts?

@@ -113,6 +113,19 @@ def set(self, **kwargs) -> "SycamorePrompt":
            new.__dict__[k] = v
        return new

    def is_done(self, s: str) -> bool:
Contributor:

Trying to understand this (and maybe it will become clearer as I get through the rest of the files). Is the idea that the sequence will be an iterator or something and you invoke this between calls? Is it used anywhere as anything other than True? I don't see an implementation in ElementListIterPrompt.

Collaborator Author:

Yep, that's the idea. In extract_entity with a tokenizer we set this to s != "None", which is the condition it used to use. I'm not sure the prompt is the right place for this piece of logic to live, though I guess the prompt is also what decides whether to make a sequence of RenderedPrompts, so maybe?
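Concretely, the usage I have in mind is roughly this (the driver loop here is illustrative, not the real LLMMap internals):

```python
def infer_until_done(llm, prompt, rendered_prompts):
    """Illustrative driver: run each token-limited RenderedPrompt until is_done() says stop."""
    response = "None"
    for rendered in rendered_prompts:  # e.g. the sequence the prompt rendered for one Document
        response = llm.generate(prompt=rendered)
        # For extract_entity with a tokenizer, is_done(s) is effectively `s != "None"`:
        # stop as soon as some chunk actually produced an entity.
        if prompt.is_done(response):
            break
    return response
```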

lib/sycamore/sycamore/llms/prompts/default_prompts.py (outdated, resolved)
lib/sycamore/sycamore/llms/llms.py (outdated, resolved)
UNKNOWN = 0
SYNC = 1
ASYNC = 2
BATCH = 3
Contributor:

Is this something done by the LLM, or are we batching by combining multiple things into a prompt before sending?

Collaborator Author:

Something done by the LLM - OpenAI and Anthropic have batch APIs now: Anthropic / OpenAI. I think similar things exist in Bedrock/Azure as well, but I'm less sure.
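So roughly, the enum above reads like this (the LLMMode name and the comments are my gloss; only the member values appear in the diff):

```python
from enum import Enum

class LLMMode(Enum):  # class name is an assumption; the values are from the diff above
    UNKNOWN = 0
    SYNC = 1    # one blocking request per prompt
    ASYNC = 2   # concurrent requests via the provider's async client
    BATCH = 3   # the provider's batch API (OpenAI/Anthropic batches), not prompt concatenation
```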



class LLMMap(MapBatch):
"""The LLMMap transform renders each Document in a docset into
Contributor:

Not as part of this PR, but I wonder if eventually it would make sense to have an option to attach the raw LLM request/response to the document as well for debugging.

lib/sycamore/sycamore/transforms/base_llm.py (outdated, resolved)
from sycamore.data import Document, Element


def _infer_prompts(
Contributor:

I confess I'm having a hard time following all of the prompt sequence stuff. I understand the basic motivation for the token stuff, but I can't help but wonder if there is a cleaner way to do it.

Collaborator Author:

One thought I had which I think is beyond the scope of this is some sort of conditional branching/looping logic in a sycamore pipeline. Like

docset.do()
    .llm_map(<extract entity prompt>)
    .map(if not json set back to 'None')
    .while(lambda d: d.properties.get('entity', 'None') == 'None')
    .continue_processing()

I'm sure that creates all sorts of theoretical problems.

But then prompts can go back to being a single prompt and we can increment a counter on the document object and stuff.

I guess we can do the to-render-object counter thing and keep prompts to single-render in a for loop in llm_map, which is probably cleaner. I'll try it and see.
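Something like this is what I mean by the single-render loop, purely illustrative (the chunk_index field and render_document call are made-up names; only set() and is_done() come from the diff):

```python
def llm_map_one(llm, prompt, doc, max_chunks: int = 10):
    """Hypothetical: render one token-limited chunk at a time and keep a counter,
    instead of having the prompt produce a whole sequence up front."""
    for i in range(max_chunks):
        # Ask the prompt to render only the i-th window of elements for this document.
        rendered = prompt.set(chunk_index=i).render_document(doc)
        response = llm.generate(prompt=rendered)
        if prompt.is_done(response):  # e.g. the model said something other than "None"
            doc.properties["entity"] = response
            break
    return doc
```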

lib/sycamore/sycamore/transforms/llm_filter.py (outdated, resolved)
@@ -465,6 +467,7 @@ def extract_document_structure(self, structure: DocumentStructure, **kwargs):
        document_structure = ExtractDocumentStructure(self.plan, structure=structure, **kwargs)
        return DocSet(self.context, document_structure)

    @deprecated(version="0.1.31", reason="Use llm_map instead")
Contributor:

Is the plan also to deprecate extract_properties?

Collaborator Author:

The plan is to deprecate just about everything on my sprint tasks.

@bsowell (Contributor) commented Jan 29, 2025

Also, have you gone through the integ test failures to see if they are related? I did a cursory glance and a few did look LLM-related.

@HenryL27 (Collaborator Author) commented Jan 29, 2025

Is there a reasonable path to replace the generate_old calls at some point, do you think?

The goal is for generate_old to be deleted entirely. git grep claims there are 28 uses (10 of them are unit test mocks). Most of the rest of them are in the path of the replacements I have planned. There are 3 more in sycamore query that I'm hoping should be relatively simple to switch over, and the constant warnings should annoy Vinayak into doing it (or I can if it isn't done by the time I try to remove generate_old).

Also, have you gone through the integ test failures to see if they are a related?

Yep. Of the 7 failing ITs I think 4 are directly related to these changes, with generate_old not quite handling images and response_formats correctly. My plan was to let those break (which I guess includes breaking the transforms) and then fix them when I dealt with their associated transforms. Maybe they need pytest skip marks then? Or maybe I need to actually fix them. Who's using the graph stuff anyway? SummarizeImages was the next thing I was going to tackle, though.

Anything else we need to test to get confidence here?

The docprep ITs passed (except for the pinecone one, which gets rate-limited and dies). I can try to figure out how to test random things in AddDoc - I probably should know that anyway. But overall I feel like passing the sycamore test suite (barring the 4 failures above) is probably good. Should make sure that AddDoc doesn't hit the bad paths in the failing ITs though.

@HenryL27 (Collaborator Author):

Update: I was able to run Add a doc with docstore using this.

@bsowell (Contributor) left a comment

I think your plan seems reasonable. You'll have to keep an eye on things when you push this as it potentially has a big blast radius, but I don't think I see any reason to wait.

@HenryL27 merged commit 73f59c0 into main Jan 30, 2025
13 of 15 checks passed