
[llm unify 2/n] Implement llm_map(_elements) and move extract_entity to it. #1126

Merged
merged 49 commits into main on Jan 30, 2025

Conversation

@HenryL27 (Collaborator) commented Jan 24, 2025

This is hard to break up because I'm changing APIs of things that already exist and have tests...

  • llm.generate now takes a RenderedPrompt instead of a dict
  • the old API exists as llm.generate_old, and it's marked as deprecated
  • most operations have been switched to call llm.generate_old
  • implement docset.llm_map and docset.llm_map_elements
  • implement an EntityExtractor.as_llm_map method that converts all of the functionality of OpenAIEntityExtractor into prompts and hooks for llm_map
  • docset.extract_entity() now uses that, so it runs LLMMap instead of ExtractEntity

Also changes the prompt rendering APIs to produce either a single RenderedPrompt or a sequence of them, in order to support a mode of extract_entity that uses a tokenizer to make token-limited chunks to extract from.
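To make the shape of the new API concrete, here's a rough usage sketch. The import paths, constructor arguments, and keyword names below are my best guesses from this description, not the exact final signatures:

```python
# Hypothetical sketch of the new flow; names and signatures are approximate.
from sycamore.llms import OpenAI, OpenAIModels                        # assumed import path
from sycamore.llms.prompts import RenderedMessage, RenderedPrompt     # assumed import path
from sycamore.transforms.extract_entity import OpenAIEntityExtractor  # assumed import path

llm = OpenAI(OpenAIModels.GPT_4O)

# llm.generate now takes a RenderedPrompt instead of a dict of prompt_kwargs.
prompt = RenderedPrompt(
    messages=[RenderedMessage(role="user", content="What is this document's title?")]
)
answer = llm.generate(prompt=prompt)

# The old dict-based entry point survives (deprecated) as llm.generate_old:
# answer = llm.generate_old(prompt_kwargs={"prompt": "..."}, llm_kwargs={})

# extract_entity now builds an LLMMap via EntityExtractor.as_llm_map under the hood,
# so this line is unchanged for callers but runs through the new transform:
# docset = docset.extract_entity(entity_extractor=OpenAIEntityExtractor("title", llm=llm))

# docset.llm_map(...) / docset.llm_map_elements(...) apply a SycamorePrompt to every
# document / element and write the response to an output field (keyword names assumed).
```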

…ive implementation of extract entity bc I don't want to deal with llm_filter just yet

@HenryL27 marked this pull request as ready for review January 28, 2025 21:58
@HenryL27 changed the title from "[llm unify 2/n] Implement llm_map(_elements) and move ops to it." to "[llm unify 2/n] Implement llm_map(_elements) and move extract_entity to it." on Jan 28, 2025
… docs say because azure disagrees and they'd better both accept 'system'

@HenryL27 (Collaborator Author)

As far as I can tell, the ITs that are failing are mostly due to llm.generate_old() not quite converting prompt_kwargs to a RenderedPrompt correctly. I could spend a lot of time debugging that, but given that my goal is to delete generate_old entirely, I'm tempted to just leave the tests broken and fix them when I translate the transforms they're testing to LLMMap.

@bsowell (Contributor) left a comment

There's a lot here. For the most part it looks good, I think. A couple of things to think about:

  • Is there a reasonable path to replace the generate_old calls at some point, do you think? I think you are doing the right thing by doing the replacement incrementally, but I have a suspicion that we are going to live with this for a while.

  • Anything else we need to test to get confidence here? I guess most of our demos don't use sycamore directly anymore. Does AddDoc in DocStore still work? What about the DocPrep-generated scripts?

@@ -113,6 +113,19 @@ def set(self, **kwargs) -> "SycamorePrompt":
            new.__dict__[k] = v
        return new

    def is_done(self, s: str) -> bool:
Contributor:

Trying to understand this (and maybe it will become clearer as I get through the rest of the files). Is the idea that the sequence will be an iterator or something and you invoke this between calls? Is it used anywhere as anything other than True? I don't see an implementation in ElementListIterPrompt.

Collaborator Author:

Yep, that's the idea. In extract_entity with a tokenizer we set this to s != "None", which is the condition it used to use. I'm not sure the prompt is the right place for this piece of logic to live, though I guess the prompt is also what decides whether to make a sequence of RenderedPrompts, so maybe?
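Concretely, the usage I have in mind is roughly this (the driver loop here is illustrative, not the real LLMMap internals):

```python
def infer_until_done(llm, prompt, rendered_prompts):
    """Illustrative driver: run each token-limited RenderedPrompt until is_done() says stop."""
    response = "None"
    for rendered in rendered_prompts:  # e.g. the sequence the prompt rendered for one Document
        response = llm.generate(prompt=rendered)
        # For extract_entity with a tokenizer, is_done(s) is effectively `s != "None"`:
        # stop as soon as some chunk actually produced an entity.
        if prompt.is_done(response):
            break
    return response
```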

lib/sycamore/sycamore/llms/prompts/default_prompts.py (outdated, resolved)
lib/sycamore/sycamore/llms/llms.py (outdated, resolved)
UNKNOWN = 0
SYNC = 1
ASYNC = 2
BATCH = 3
Contributor:

Is this something done by the LLM, or are we batching by combining multiple things into a prompt before sending?

Collaborator Author:

Something done by the LLM - OpenAI and Anthropic have batch APIs now: Anthropic / OpenAI. I think similar things exist in Bedrock/Azure as well, but I'm less sure.
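So roughly, the enum above reads like this (the LLMMode name and the comments are my gloss; only the member values appear in the diff):

```python
from enum import Enum

class LLMMode(Enum):  # class name is an assumption; the values are from the diff above
    UNKNOWN = 0
    SYNC = 1    # one blocking request per prompt
    ASYNC = 2   # concurrent requests via the provider's async client
    BATCH = 3   # the provider's batch API (OpenAI/Anthropic batches), not prompt concatenation
```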



class LLMMap(MapBatch):
"""The LLMMap transform renders each Document in a docset into
Contributor:

Not as part of this PR, but I wonder if eventually it would make sense to have an option to attach the raw LLM request/response to the document as well for debugging.

lib/sycamore/sycamore/transforms/base_llm.py (outdated, resolved)
from sycamore.data import Document, Element


def _infer_prompts(
Contributor:

I confess I'm having a hard time following all of the prompt sequence stuff. I understand the basic motivation for the token stuff, but I can't help but wonder if there is a cleaner way to do it.

Collaborator Author:

One thought I had which I think is beyond the scope of this is some sort of conditional branching/looping logic in a sycamore pipeline. Like

docset.do()
    .llm_map(<extract entity prompt>)
    .map(if not json set back to 'None')
    .while(lambda d: d.properties.get('entity', 'None') == 'None')
    .continue_processing()

I'm sure that creates all sorts of theoretical problems.

But then prompts can go back to being a single prompt and we can increment a counter on the document object and stuff.

I guess we can do the to-render-object counter thing and keep prompts to single-render in a for loop in llm_map, which is probably cleaner. I'll try it and see.
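Something like this is what I mean by the single-render loop, purely illustrative (the chunk_index field and render_document call are made-up names; only set() and is_done() come from the diff):

```python
def llm_map_one(llm, prompt, doc, max_chunks: int = 10):
    """Hypothetical: render one token-limited chunk at a time and keep a counter,
    instead of having the prompt produce a whole sequence up front."""
    for i in range(max_chunks):
        # Ask the prompt to render only the i-th window of elements for this document.
        rendered = prompt.set(chunk_index=i).render_document(doc)
        response = llm.generate(prompt=rendered)
        if prompt.is_done(response):  # e.g. the model said something other than "None"
            doc.properties["entity"] = response
            break
    return doc
```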

lib/sycamore/sycamore/transforms/llm_filter.py (outdated, resolved)
@@ -465,6 +467,7 @@ def extract_document_structure(self, structure: DocumentStructure, **kwargs):
        document_structure = ExtractDocumentStructure(self.plan, structure=structure, **kwargs)
        return DocSet(self.context, document_structure)

    @deprecated(version="0.1.31", reason="Use llm_map instead")
Contributor:

Is the plan also to deprecate extract_properties?

Collaborator Author:

The plan is to deprecate just about everything on my sprint tasks.

@bsowell (Contributor) commented Jan 29, 2025

Also, have you gone through the integ test failures to see if they are related? I did a cursory glance and a few did look LLM-related.

@HenryL27 (Collaborator Author) commented Jan 29, 2025

Is there a reasonable path to replace the generate_old calls at some point, do you think?

The goal is for generate_old to be deleted entirely. git grep claims there are 28 uses (10 of them are unit test mocks). Most of the rest of them are in the path of the replacements I have planned. There are 3 more in sycamore query that I'm hoping should be relatively simple to switch over, and the constant warnings should annoy Vinayak into doing it (or I can if it isn't done by the time I try to remove generate_old).

Also, have you gone through the integ test failures to see if they are a related?

Yep. Of the 7 failing ITs I think 4 are directly related to these changes, with generate_old not quite handling images and response_formats correctly. My plan was to let those break (which I guess includes breaking the transforms) and then fix them when I dealt with their associated transforms. Maybe they need pytest skip marks then? Or maybe I need to actually fix them. Who's using the graph stuff anyway? SummarizeImages was the next thing I was going to tackle, though.

Anything else we need to test to get confidence here?

The docprep ITs passed (except for the pinecone one, which gets rate-limited and dies). I can try to figure out how to test random things in AddDoc - I probably should know that anyway. But overall I feel like passing the sycamore test suite (barring the 4 failures above) is probably good. Should make sure that AddDoc doesn't hit the bad paths in the failing ITs though.

@HenryL27 (Collaborator Author):

Update: I was able to run Add a doc with docstore using this.

@bsowell (Contributor) left a comment

I think your plan seems reasonable. You'll have to keep an eye on things when you push this as it potentially has a big blast radius, but I don't think I see any reason to wait.

@HenryL27 merged commit 73f59c0 into main Jan 30, 2025
13 of 15 checks passed