
[llm unify 7/n] Summarize #1192

Open · wants to merge 42 commits into main

Conversation

HenryL27 (Collaborator) commented Feb 21, 2025

Again, general lack of confidence in this.

Turns summarize document from an iterative folding strategy into a hierarchical strategy.
Uses math + jinja to generate a summary every k elements for the next k elements (then repeat with k^2 and a stride of k, etc., until k^n > n_elements).

Integrates into summarize_data by slightly changing how that reduce happens. I had to make a separate (similar) prompt for that. Most of the jinja logic should probably be factored out as fragments.

I may have broken something in luna, but all the unittests passed, so idk.

Also not sure about some of the names.
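The hierarchical batching described above can be sketched roughly like this (a minimal sketch, not the PR's actual math + jinja implementation; `summarize_batch` is a hypothetical stand-in for the LLM call):

```python
def hierarchical_summarize(texts, k, summarize_batch):
    """Repeatedly summarize batches of k items until one summary remains.

    Level 0 summarizes elements [0:k], [k:2k], ...; each subsequent level
    summarizes k summaries from the level below, so after ceil(log_k(n))
    rounds a single summary covers all n elements.
    """
    level = list(texts)
    while len(level) > 1:
        level = [
            summarize_batch(level[i : i + k])
            for i in range(0, len(level), k)
        ]
    return level[0]
```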

Signed-off-by: Henry Lindeman <[email protected]>
@HenryL27 HenryL27 requested a review from baitsguy February 21, 2025 01:21
{% endif %}
"""
),
user=J_GET_ELEMENT_TEXT_MACRO
Collaborator:

Can we force the question to always be present by having a default question of "What is the summary of this information?"

This feels overly complex, and our experience with FINRA was that complexity leads to weird prompts that don't quite do what you want.

context: Optional[Context] = None,
docset_summarizer: Optional[Type[Summarizer]] = None,
summarizer_kwargs: dict[str, Any] = {},
Collaborator:

Why do we need a class + kwargs rather than passing in an object?

Collaborator Author:

duh. fixed

summaries_as_text=summaries_as_text,
)

# If data is not DocSets, text is this list here
Collaborator:

Can we force data to always be docsets? If it somehow isn't convert it to a DocSet?

Collaborator Author:

According to vinayak, if it's not docsets it's a single scalar (the output of a Count or Math operator). You could, I guess, wrap it in a Document and wrap that in a DocSet, but that seems like hunting ducks with a bazooka. Also, the data will look very different, so you probably can't use the same prompting anyway.

Collaborator:

Ok. That seems like part of the longstanding issue that not all luna operators are docset -> docset.
I think we should check that all the items are strings or numbers, i.e. all(isinstance(r, (str, int, float)) for r in result_data), and fail if they aren't.
We should also improve the documentation.

# LuNA pipelines can return list of integers, strings, or floats, depending on the pipeline. While this should eventually be fixed, we handle it here by summarizing the information differently in that case.
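A minimal sketch of the suggested guard (assuming `result_data` holds the scalar pipeline output; the function name and error message are illustrative, not the PR's code):

```python
def check_scalar_results(result_data):
    """Fail fast unless every item is a plain string or number."""
    if not all(isinstance(r, (str, int, float)) for r in result_data):
        bad = [type(r).__name__ for r in result_data
               if not isinstance(r, (str, int, float))]
        raise ValueError(f"Expected str/int/float results, got: {bad}")
    return result_data
```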



def _setup_docset_summarizer(summarizer_cls: Type[Summarizer], **kwargs) -> Summarizer:
if summarizer_cls is LLMElementTextSummarizer:
Collaborator:

This if class thing can't be the right way to do this.

Collaborator Author:

it's not.

class DocumentSummarizer(Summarizer):
class HeirarchicalDocumentSummarizer(Summarizer):
"""
Summarizes a document by constructing a hierarchical tree of batches of elements,
Collaborator:

I'm not sure we want this construction; I was expecting something that grouped until it ran out of context window. Do we have a reason to believe that a multi-stage summarize is better than a single stage one?

return comptransform


class MaxTokensHeirarchicalDocumentSummarizer(Summarizer):
Collaborator:

Big document: spread evenly across context windows, or packed with a tail?
Multiple documents: split only at document boundaries, or packed?
Split at properties or not?
How to split properties if they exceed the context window?
Spread documents evenly or not, e.g. with 10 docs => 5,4,1 or 4,3,3?

Collaborator:

and across LLMs.


summaries = Summarize(self.plan, summarizer=summarizer, **kwargs)
return DocSet(self.context, summaries)
map = summarizer.as_llm_map(self.plan, **kwargs)
Collaborator:

From an API standpoint, it feels a little weird for the summarizer to know how to turn itself into an llm map (the API seems inverted). I don't know if there's a natural way to move that logic out and have summarize have a function that's called by llm_map. It looks like that would involve changing LLMMapElements, so I think this is a future improvement.

@@ -61,133 +73,324 @@ def __init__(self, llm: LLM, element_operator: Optional[Callable[[Element], bool
self._llm = llm
self._element_operator = element_operator

def summarize(self, document: Document) -> Document:
elements = []
def as_llm_map(self, child: Optional[Node], **kwargs) -> Node:
Collaborator:

Simplify to this?

filter = self._element_operator or (lambda e: True)
LLMMapElements(child, TextSummarizerJinjaPrompt, output_field="summary", llm=self._llm, filter=filter)

def summarize(self, document: Document) -> Document:
map = self.as_llm_map(None)
Collaborator:

This is another example of the API hierarchy being inverted. summarize is running an internal function in order to execute itself.

@@ -172,6 +172,15 @@ class _TextSummarizerGuidancePrompt(SimplePrompt):
""",
)

TextSummarizerJinjaPrompt = JinjaElementPrompt(
system="You are a helpful text summarizer.",
user="""Write a summary of the following. Use only the information provided.
Collaborator:

Do you want your deindent thing here?

vars = self.get_const_vars()
doc.properties["summary"] = doc.elements[0].properties[vars["intermediate_summary_key"]]
for e in doc.elements:
for v in vars:
Collaborator:

e.properties.pop(v, None)

if e2.properties.get(vars["skip_me_key"], False):
continue
this_batch.append(j)
tks = self.prompt.render_element(elt, doc).token_count(self.tokenizer)
Collaborator:

If this is rendering only a single element, I would expect total_tks += self.prompt.render_element(elt, doc).token_count(self.tokenizer) to make the max_tokens check work.
If it's rendering the entire doc, it's O(n^2).

Collaborator Author:

yep, turns out that O(n^2) with gpt-mini-sized context windows is unbearably slow. fixing
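A rough sketch of the incremental fix (hypothetical names, not the PR's code; assumes each element can be rendered and token-counted independently, so counts accumulate instead of re-rendering the whole batch each time):

```python
def pack_elements_by_tokens(elements, render_tokens, max_tokens):
    """Greedily pack elements into batches whose token totals stay under max_tokens.

    render_tokens(elt) returns the token count for one rendered element, so
    each element is counted exactly once: O(n) overall, instead of the O(n^2)
    cost of re-rendering the entire accumulated batch after every append.
    """
    batches, current, total = [], [], 0
    for elt in elements:
        tks = render_tokens(elt)
        if current and total + tks > max_tokens:
            batches.append(current)
            current, total = [], 0
        current.append(elt)
        total += tks
    if current:
        batches.append(current)
    return batches
```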


def as_llm_map(self, child: Optional[Node], **kwargs) -> Node:
vars = self.get_const_vars()
if self.fields is not None:
Collaborator:

self.prompt = self.prompt.fork(ignore_None=True, fields=self.fields, question=self.question, data_description=self.data_description)

last = llm_round
cleanup = Map(child=last, f=self.cleanup)
nodes.append(cleanup)
ct = CompositeTransform(child, []) # type: ignore
Collaborator:

This works but looks weird. I'd do return CompositeTransform(child, nodes=nodes) and fix the constructor.

fields.pop()
return doc
doc.properties[vars["numel_key"]] += 1
this = self.prompt.render_document(doc)
Collaborator:

This algorithm also feels O(n^2). Let's chat.

HenryL27 added 14 commits March 5, 2025 09:17

…-one-node-in-one-document slightly less hacky