[llm unify 7/n] Summarize #1192
base: main
Conversation
Signed-off-by: Henry Lindeman <[email protected]>
{% endif %}
"""
),
user=J_GET_ELEMENT_TEXT_MACRO
Can we force the question to always be present by having a default question of "What is the summary of this information?"
This feels overly complex, and experience with FINRA was that complexity leads to weird prompts that don't quite do what you want.
context: Optional[Context] = None,
docset_summarizer: Optional[Type[Summarizer]] = None,
summarizer_kwargs: dict[str, Any] = {},
Why do we need a class + kwargs rather than passing in an object?
duh. fixed
summaries_as_text=summaries_as_text,
)
# If data is not DocSets, text is this list here |
Can we force data to always be DocSets? If it somehow isn't, convert it to a DocSet?
According to vinayak, if it's not docsets it's a single scalar (the output of a Count or Math operator). You could, I guess, wrap it in a Document and wrap that in a DocSet, but that seems like hunting ducks with a bazooka. Also, the data will look very different, so you probably can't use the same prompting anyway.
Ok. That seems like part of the longstanding issue that not all luna operators are DocSet -> DocSet.
I think we should do a check that all the items are strings or numbers, i.e. all(isinstance(r, (str, int, float)) for r in result_data), and fail if that doesn't hold.
We should also improve the documentation.
# LuNA pipelines can return lists of integers, strings, or floats, depending on the pipeline.
# This should eventually be fixed; for now we handle it by summarizing the information differently in that case.
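The check suggested in the review could be sketched as follows (a hypothetical helper; the name and error message are illustrative):

```python
def validate_scalar_results(result_data: list) -> list:
    # Fail fast if the non-DocSet result contains anything other than
    # str/int/float scalars (e.g. the output of a Count or Math operator).
    bad = [type(r).__name__ for r in result_data if not isinstance(r, (str, int, float))]
    if bad:
        raise TypeError(f"Expected only str/int/float results, got: {bad}")
    return result_data
```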
def _setup_docset_summarizer(summarizer_cls: Type[Summarizer], **kwargs) -> Summarizer:
    if summarizer_cls is LLMElementTextSummarizer:
This if-class dispatch can't be the right way to do this.
it's not.
class DocumentSummarizer(Summarizer):
class HeirarchicalDocumentSummarizer(Summarizer):
    """
    Summarizes a document by constructing a hierarchical tree of batches of elements,
I'm not sure we want this construction; I was expecting something that grouped until it ran out of context window. Do we have a reason to believe that a multi-stage summarize is better than a single stage one?
return comptransform

class MaxTokensHeirarchicalDocumentSummarizer(Summarizer): |
Big document: spread evenly across context windows, or packed with a tail?
Multiple documents: split only at document boundaries, or packed?
Split at properties or not?
How do we split properties if they exceed the context window?
Spread documents evenly or not, e.g. with 10 docs => 5,4,1 or 4,3,3?
and across LLMs.
…zer class and all the ingredients needed to instantiate it duh
summaries = Summarize(self.plan, summarizer=summarizer, **kwargs)
return DocSet(self.context, summaries)
map = summarizer.as_llm_map(self.plan, **kwargs)
From an API standpoint, it feels a little weird for the summarizer to know how to turn itself into an llm map (the API seems inverted). I don't know if there's a natural way to move that logic out and have summarize expose a function that's called by llm_map. It looks like that would involve changing LLMMapElements, so I think this is a future improvement.
@@ -61,133 +73,324 @@ def __init__(self, llm: LLM, element_operator: Optional[Callable[[Element], bool
    self._llm = llm
    self._element_operator = element_operator

def summarize(self, document: Document) -> Document:
    elements = []
def as_llm_map(self, child: Optional[Node], **kwargs) -> Node:
Simplify to this?
filter = self._element_operator or (lambda e: True)
LLMMapElements(child, TextSummarizerJinjaPrompt, output_field="summary", llm=self._llm, filter=filter)
def summarize(self, document: Document) -> Document:
    map = self.as_llm_map(None)
This is another example of the API hierarchy being inverted. summarize is running an internal function in order to execute itself.
@@ -172,6 +172,15 @@ class _TextSummarizerGuidancePrompt(SimplePrompt):
    """,
)

TextSummarizerJinjaPrompt = JinjaElementPrompt(
    system="You are a helpful text summarizer.",
    user="""Write a summary of the following. Use only the information provided.
Do you want your deindent thing here?
vars = self.get_const_vars()
doc.properties["summary"] = doc.elements[0].properties[vars["intermediate_summary_key"]]
for e in doc.elements:
    for v in vars:
e.properties.pop(v, None)
if e2.properties.get(vars["skip_me_key"], False):
    continue
this_batch.append(j)
tks = self.prompt.render_element(elt, doc).token_count(self.tokenizer)
If this is rendering only a single element, I would expect total_tks += self.prompt.render_element(elt, doc).token_count(self.tokenizer) to make the max_tokens check work.
If it's rendering the entire doc, it's O(n^2).
yep, turns out that n^2 for gpt-mini sized context windows is unbearably slow. fixing
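One way the batching can be made linear, sketched under the assumption that each element's prompt rendering can be token-counted independently: count each element once and keep a running total, instead of re-rendering the accumulated prompt for every element.

```python
def batch_by_tokens(texts, count_tokens, max_tokens):
    # Count each item exactly once (O(n)) and cut a new batch when the
    # running total would exceed the budget, rather than re-rendering
    # the whole accumulated prompt per element (O(n^2)).
    batches, current, total = [], [], 0
    for text in texts:
        tks = count_tokens(text)
        if current and total + tks > max_tokens:
            batches.append(current)
            current, total = [], 0
        current.append(text)
        total += tks
    if current:
        batches.append(current)
    return batches
```

`count_tokens` stands in for whatever tokenizer call the real code uses; an item larger than `max_tokens` still gets its own batch rather than being dropped.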
def as_llm_map(self, child: Optional[Node], **kwargs) -> Node:
    vars = self.get_const_vars()
    if self.fields is not None:
self.prompt = self.prompt.fork(ignore_None=True, fields=self.fields, question=self.question, data_description=self.data_description)
last = llm_round
cleanup = Map(child=last, f=self.cleanup)
nodes.append(cleanup)
ct = CompositeTransform(child, [])  # type: ignore
This works but looks weird. I'd do return CompositeTransform(child, nodes=nodes) and fix the constructor.
fields.pop()
return doc
doc.properties[vars["numel_key"]] += 1
this = self.prompt.render_document(doc)
This algorithm also feels N^2. Let's chat.
…-one-node-in-one-document slightly less hacky
Again, general lack of confidence in this.
Turns document summarization from an iterative folding strategy into a hierarchical strategy.
Uses math + jinja to generate a summary every k elements covering the next k elements (then repeats with k^2 and a stride of k, and so on until k^n > n_elements).
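The level arithmetic described above can be sketched as follows (illustrative helpers, not the PR's actual code):

```python
def num_summary_levels(n_elements: int, k: int) -> int:
    # Number of hierarchy levels needed until one batch of size k**levels
    # covers every element, i.e. the smallest n with k**n >= n_elements.
    levels, cover = 0, 1
    while cover < n_elements:
        cover *= k
        levels += 1
    return levels

def batch_starts(n_elements: int, k: int, level: int) -> list[int]:
    # Start indices of the k**level-sized batches at a given level.
    stride = k ** level
    return list(range(0, n_elements, stride))
```

For example, with k=3 and 10 elements, level 1 batches start at elements 0, 3, 6, and 9, and three levels suffice since 3^3 = 27 > 10.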
Integrates into summarize_data by slightly changing how that reduce happens. I had to make a separate (similar) prompt for that; we should probably factor most of the jinja logic out as fragments.
I probably broke something in luna, but all the unit tests passed, so idk.
Also not sure about some of the names.