Fix/summarizer #16

Merged
merged 61 commits into from
Jun 11, 2025
61 commits
695d9e1
api change
MelvinKl May 9, 2025
cae32ed
api change
MelvinKl May 9, 2025
1a7b9d7
switch to one uploader for all types
MelvinKl May 9, 2025
7e4a9d0
extractor mostly working
MelvinKl May 13, 2025
b32d7c3
it works
MelvinKl May 14, 2025
2e591c3
wip
MelvinKl May 15, 2025
cf8b892
wip
MelvinKl May 15, 2025
8ee912c
wip
MelvinKl May 15, 2025
4062537
wip
MelvinKl May 15, 2025
96b6d10
wip
MelvinKl May 15, 2025
f10aa41
wip
MelvinKl May 15, 2025
96e53e7
fix
MelvinKl May 16, 2025
a1f8fee
black
MelvinKl May 16, 2025
54f3c32
linting
MelvinKl May 16, 2025
0aa4d92
wip
MelvinKl May 16, 2025
9f99eeb
name change
MelvinKl May 16, 2025
82d27d1
lint
MelvinKl May 19, 2025
c752478
reset poetry.lock
MelvinKl May 19, 2025
ee8f3c7
fix tests
MelvinKl May 20, 2025
ef8dd20
update doc for admin api
MelvinKl May 20, 2025
a86f76c
black
MelvinKl May 20, 2025
acde7e5
extractor comments
MelvinKl May 20, 2025
5e6ca9b
fix: minor bugs
a-klos May 23, 2025
c5c537b
refactor: remove unused utility modules and tests
a-klos May 23, 2025
0133c00
docs: enhance module docstrings and method descriptions across the ad…
a-klos May 23, 2025
4bfd3f1
working sample
a-klos May 28, 2025
c07d939
refactor: improve threading model in DefaultSourceUploader and update…
a-klos May 28, 2025
a46b4fd
feat: add timeout parameter to file and source upload methods and enh…
a-klos May 28, 2025
e7599d1
feat: implement UploaderBase class and enhance document deletion logi…
a-klos Jun 2, 2025
7f1df26
refactor: add TODO for implementing timeout in thread handling for fi…
a-klos Jun 2, 2025
8a6d4f1
refactor: remove unused asyncio import from default_file_uploader_tes…
a-klos Jun 2, 2025
5af5c76
refactor: remove unused thread management documentation and example f…
a-klos Jun 2, 2025
fa2f928
chore: update poetry.lock and pyproject.toml for dependency version c…
a-klos Jun 2, 2025
8dc7990
chore: Update README.md
a-klos Jun 2, 2025
57788eb
fix: correct spelling of 'arbitrary' in README and update query param…
a-klos Jun 2, 2025
f5cf59b
Merge branch 'onapitorulethemall' of github.com:stackitcloud/rag-core…
a-klos Jun 2, 2025
d942cf7
Update README.md
a-klos Jun 2, 2025
21d10d9
refactor: remove unused import and enhance query parameter descriptio…
a-klos Jun 2, 2025
baaad61
Merge branch 'onapitorulethemall' of github.com:stackitcloud/rag-core…
a-klos Jun 2, 2025
bc503f2
refactor: remove timeout parameter from DefaultFileUploader and delet…
a-klos Jun 2, 2025
a5523fb
refactor: remove unused thread_diagnostics.py file
a-klos Jun 2, 2025
5dafd3e
feat: add SourceUploaderSettings for configurable timeout and refacto…
a-klos Jun 2, 2025
4ab029b
refactor: remove unused import of Optional in default_file_uploader.py
a-klos Jun 2, 2025
7f53875
feat: add SourceUploaderSettings to DependencyContainer and update up…
a-klos Jun 2, 2025
97bdb25
docs: update README to clarify upload behavior and default timeout co…
a-klos Jun 2, 2025
a6209da
feat: implement SitemapExtractor and SitemapLangchainDocument2Informa…
a-klos Jun 3, 2025
4465591
feat: add SitemapExtractor and SitemapLangchainDocument2InformationPi…
a-klos Jun 3, 2025
0cbd5e3
feat: add fake-useragent dependency and enhance SitemapExtractor to h…
a-klos Jun 3, 2025
e8e55fc
chore: resolve merge conflicts
a-klos Jun 3, 2025
387e465
feat: enhance SitemapExtractor to support JSON header templates and i…
a-klos Jun 4, 2025
93534ca
feat: add comprehensive test suite for SitemapExtractor class
a-klos Jun 4, 2025
3d13637
feat: enhance DependencyContainer and SitemapExtractor with custom pa…
a-klos Jun 5, 2025
4674a67
feat: enhance SitemapExtractor with improved parameter handling and c…
a-klos Jun 5, 2025
0c07986
feat: add settings parameter to DefaultSourceUploader tests for impro…
a-klos Jun 5, 2025
dad0687
refactor: improve readability of mocks setup in DefaultSourceUploader…
a-klos Jun 5, 2025
c88430f
feat: refactor page summary creation logic for improved grouping and …
a-klos Jun 6, 2025
442dda4
refactor: remove redundant summarization method and streamline summar…
a-klos Jun 6, 2025
cda5ff5
refactor: simplify summary creation logic by removing redundant varia…
a-klos Jun 6, 2025
14ca90a
refactor: remove redundant whitespace and streamline summary creation…
a-klos Jun 6, 2025
74f0034
Merge branch 'main' into fix/summarizer
a-klos Jun 10, 2025
3d03a60
chore: merge
a-klos Jun 11, 2025
@@ -27,22 +27,6 @@ class PageSummaryEnhancer(SummaryEnhancer):
     BASE64_IMAGE_KEY = "base64_image"
     DEFAULT_PAGE_NR = 1
 
-    async def _acreate_summary(self, information: list[Document], config: Optional[RunnableConfig]) -> list[Document]:
-        # group infos by page, defaulting to page 1 if no page metadata
-        if self._chunker_settings:
-            filtered_information = [
-                info for info in information if len(info.page_content) > self._chunker_settings.max_size
-            ]
-        else:
-            filtered_information = information
-        grouped = [
-            [info for info in filtered_information if info.metadata.get("page", self.DEFAULT_PAGE_NR) == page]
-            for page in {info_piece.metadata.get("page", self.DEFAULT_PAGE_NR) for info_piece in filtered_information}
-        ]
-
-        summary_tasks = [self._asummarize_page(info_group, config) for info_group in tqdm(grouped)]
-        return await gather(*summary_tasks)
-
     async def _asummarize_page(self, page_pieces: list[Document], config: Optional[RunnableConfig]) -> Document:
         full_page_content = " ".join([piece.page_content for piece in page_pieces])
         summary = await self._summarizer.ainvoke(full_page_content, config)
@@ -52,3 +36,26 @@ async def _asummarize_page(self, page_pieces: list[Document], config: Optional[R
         meta["type"] = ContentType.SUMMARY.value
 
         return Document(metadata=meta, page_content=summary)
+
+    async def _acreate_summary(self, information: list[Document], config: Optional[RunnableConfig]) -> list[Document]:
+        distinct_pages = []
+        for info in information:
+            if info.metadata.get("page", self.DEFAULT_PAGE_NR) not in distinct_pages:
+                distinct_pages.append(info.metadata.get("page", self.DEFAULT_PAGE_NR))
+
+        grouped = []
+        for page in distinct_pages:
+            group = []
+            for compare_info in information:
+                if compare_info.metadata.get("page", self.DEFAULT_PAGE_NR) == page:
+                    group.append(compare_info)
+            if (
+                self._chunker_settings
+                and len(" ".join([item.page_content for item in group])) < self._chunker_settings.max_size
+            ):
+                continue
+            grouped.append(group)
+
+        summary_tasks = [self._asummarize_page(info_group, config) for info_group in tqdm(grouped)]
+
+        return await gather(*summary_tasks)