
[llm unify 7/n] Summarize #1192

Open · wants to merge 42 commits into main

Conversation

HenryL27 (Collaborator) commented Feb 21, 2025

Again, general lack of confidence in this.

Turns summarize document from an iterative folding strategy into a hierarchical strategy.
Uses math + jinja to generate a summary every k elements for the next k elements (then repeat with k^2 and a stride of k, etc., until k^n > n_elements).

Integrates into summarize_data by slightly changing how that reduce happens. I had to make a separate (similar) prompt for that. Most of the jinja logic should probably be factored out as fragments.

I may have broken something in luna, but all the unittests passed, so idk.

Also not sure about some of the names.
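The hierarchical batching described above can be sketched roughly like this (a minimal sketch, not the PR's actual math + jinja implementation; `summarize_batch` is a hypothetical stand-in for the LLM call):

```python
def hierarchical_summarize(texts, k, summarize_batch):
    """Repeatedly summarize batches of k items until one summary remains.

    Level 0 summarizes elements [0:k], [k:2k], ...; each subsequent level
    summarizes k summaries from the level below, so after ceil(log_k(n))
    rounds a single summary covers all n elements.
    """
    level = list(texts)
    while len(level) > 1:
        level = [
            summarize_batch(level[i : i + k])
            for i in range(0, len(level), k)
        ]
    return level[0]
```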

Signed-off-by: Henry Lindeman <[email protected]>
@HenryL27 HenryL27 requested a review from baitsguy February 21, 2025 01:21
{% endif %}
"""
),
user=J_GET_ELEMENT_TEXT_MACRO
Collaborator:

Can we force the question to always be present by having a default question of "What is the summary of this information?"

This feels overly complex, and our experience with FINRA was that complexity leads to weird prompts that don't quite do what you want.

context: Optional[Context] = None,
docset_summarizer: Optional[Type[Summarizer]] = None,
summarizer_kwargs: dict[str, Any] = {},
Collaborator:

Why do we need a class + kwargs rather than passing in an object?

Collaborator Author:

duh. fixed

summaries_as_text=summaries_as_text,
)

# If data is not DocSets, text is this list here
Collaborator:

Can we force data to always be docsets? If it somehow isn't convert it to a DocSet?

Collaborator Author:

According to vinayak, if it's not docsets it's a single scalar (the output of a Count or Math operator). You could, I guess, wrap it in a Document and wrap that in a DocSet, but that seems like hunting ducks with a bazooka. Also, the data will look very different, so you probably can't use the same prompting anyway.

Collaborator:

Ok. That seems like part of the longstanding issue that not all luna operators are docset -> docset.
I think we should check that all the items are strings or numbers, i.e. all(isinstance(r, (str, int, float)) for r in result_data), and fail if they aren't.
We should also improve the documentation.

# LuNA pipelines can return list of integers, strings, or floats, depending on the pipeline. While this should eventually be fixed, we handle it here by summarizing the information differently in that case.
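A minimal sketch of the suggested guard (assuming `result_data` holds the scalar pipeline output; the function name and error message are illustrative, not the PR's code):

```python
def check_scalar_results(result_data):
    """Fail fast unless every item is a plain string or number."""
    if not all(isinstance(r, (str, int, float)) for r in result_data):
        bad = [type(r).__name__ for r in result_data
               if not isinstance(r, (str, int, float))]
        raise ValueError(f"Expected str/int/float results, got: {bad}")
    return result_data
```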



def _setup_docset_summarizer(summarizer_cls: Type[Summarizer], **kwargs) -> Summarizer:
if summarizer_cls is LLMElementTextSummarizer:
Collaborator:

This if class thing can't be the right way to do this.

Collaborator Author:

it's not.

class DocumentSummarizer(Summarizer):
class HeirarchicalDocumentSummarizer(Summarizer):
"""
Summarizes a document by constructing a hierarchical tree of batches of elements,
Collaborator:

I'm not sure we want this construction; I was expecting something that grouped until it ran out of context window. Do we have a reason to believe that a multi-stage summarize is better than a single stage one?

return comptransform


class MaxTokensHeirarchicalDocumentSummarizer(Summarizer):
Collaborator:

Big document: spread evenly across context windows, or packed with a tail?
Multiple documents: split only at document boundaries, or packed?
Split at properties or not?
How to split properties if they exceed the context window?
Spread documents evenly or not, e.g. with 10 docs => 5,4,1 or 4,3,3?

Collaborator:

and across LLMs.


summaries = Summarize(self.plan, summarizer=summarizer, **kwargs)
return DocSet(self.context, summaries)
map = summarizer.as_llm_map(self.plan, **kwargs)
Collaborator:

From an API standpoint, it feels a little weird for the summarizer to know how to turn itself into an llm map (the API seems inverted). I don't know if there's a natural way to move that logic out and have summarize have a function that's called by llm_map. It looks like that would involve changing LLMMapElements, so I think this is a future improvement.

@@ -61,133 +73,324 @@ def __init__(self, llm: LLM, element_operator: Optional[Callable[[Element], bool
self._llm = llm
self._element_operator = element_operator

def summarize(self, document: Document) -> Document:
elements = []
def as_llm_map(self, child: Optional[Node], **kwargs) -> Node:
Collaborator:

Simplify to this?

filter = self._element_operator or (lambda e: True)
LLMMapElements(child, TextSummarizerJinjaPrompt, output_field="summary", llm=self._llm, filter=filter)

def summarize(self, document: Document) -> Document:
map = self.as_llm_map(None)
Collaborator:

This is another example of the API hierarchy being inverted. summarize is running an internal function in order to execute itself.

@@ -172,6 +172,15 @@ class _TextSummarizerGuidancePrompt(SimplePrompt):
""",
)

TextSummarizerJinjaPrompt = JinjaElementPrompt(
system="You are a helpful text summarizer.",
user="""Write a summary of the following. Use only the information provided.
Collaborator:

Do you want your deindent thing here?

vars = self.get_const_vars()
doc.properties["summary"] = doc.elements[0].properties[vars["intermediate_summary_key"]]
for e in doc.elements:
for v in vars:
Collaborator:

e.properties.pop(v, None)

if e2.properties.get(vars["skip_me_key"], False):
continue
this_batch.append(j)
tks = self.prompt.render_element(elt, doc).token_count(self.tokenizer)
Collaborator:

If this is rendering only a single element, I would expect total_tks += self.prompt.render_element(elt, doc).token_count(self.tokenizer) to make the max_tokens check work.
If it's rendering the entire doc, it's O(n^2).

Collaborator Author:

yep, turns out that O(n^2) with gpt-mini-sized context windows is unbearably slow. fixing
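A rough sketch of the incremental fix (hypothetical names, not the PR's code; assumes each element can be rendered and token-counted independently, so counts accumulate instead of re-rendering the whole batch each time):

```python
def pack_elements_by_tokens(elements, render_tokens, max_tokens):
    """Greedily pack elements into batches whose token totals stay under max_tokens.

    render_tokens(elt) returns the token count for one rendered element, so
    each element is counted exactly once: O(n) overall, instead of the O(n^2)
    cost of re-rendering the entire accumulated batch after every append.
    """
    batches, current, total = [], [], 0
    for elt in elements:
        tks = render_tokens(elt)
        if current and total + tks > max_tokens:
            batches.append(current)
            current, total = [], 0
        current.append(elt)
        total += tks
    if current:
        batches.append(current)
    return batches
```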


def as_llm_map(self, child: Optional[Node], **kwargs) -> Node:
vars = self.get_const_vars()
if self.fields is not None:
Collaborator:

self.prompt = self.prompt.fork(ignore_None=True, fields=self.fields, question=self.question, data_description=self.data_description)

last = llm_round
cleanup = Map(child=last, f=self.cleanup)
nodes.append(cleanup)
ct = CompositeTransform(child, []) # type: ignore
Collaborator:

This works but looks weird. I'd do return CompositeTransform(child, nodes=nodes) and fix the constructor.

fields.pop()
return doc
doc.properties[vars["numel_key"]] += 1
this = self.prompt.render_document(doc)
Collaborator:

This algorithm also feels O(n^2). Let's chat.

HenryL27 added 14 commits March 5, 2025 09:17

…-one-node-in-one-document slightly less hacky