Fix/summarizer #16

Open

wants to merge 60 commits into base: main from fix/summarizer

Commits (60)

695d9e1
api change
MelvinKl May 9, 2025
cae32ed
api change
MelvinKl May 9, 2025
1a7b9d7
switch to one uploader for all types
MelvinKl May 9, 2025
7e4a9d0
extractor mostly working
MelvinKl May 13, 2025
b32d7c3
it works
MelvinKl May 14, 2025
2e591c3
wip
MelvinKl May 15, 2025
cf8b892
wip
MelvinKl May 15, 2025
8ee912c
wip
MelvinKl May 15, 2025
4062537
wip
MelvinKl May 15, 2025
96b6d10
wip
MelvinKl May 15, 2025
f10aa41
wip
MelvinKl May 15, 2025
96e53e7
fix
MelvinKl May 16, 2025
a1f8fee
black
MelvinKl May 16, 2025
54f3c32
linting
MelvinKl May 16, 2025
0aa4d92
wip
MelvinKl May 16, 2025
9f99eeb
name change
MelvinKl May 16, 2025
82d27d1
lint
MelvinKl May 19, 2025
c752478
reset poetry.lock
MelvinKl May 19, 2025
ee8f3c7
fix tests
MelvinKl May 20, 2025
ef8dd20
update doc for admin api
MelvinKl May 20, 2025
a86f76c
black
MelvinKl May 20, 2025
acde7e5
extractor comments
MelvinKl May 20, 2025
5e6ca9b
fix: minor bugs
a-klos May 23, 2025
c5c537b
refactor: remove unused utility modules and tests
a-klos May 23, 2025
0133c00
docs: enhance module docstrings and method descriptions across the ad…
a-klos May 23, 2025
4bfd3f1
working sample
a-klos May 28, 2025
c07d939
refactor: improve threading model in DefaultSourceUploader and update…
a-klos May 28, 2025
a46b4fd
feat: add timeout parameter to file and source upload methods and enh…
a-klos May 28, 2025
e7599d1
feat: implement UploaderBase class and enhance document deletion logi…
a-klos Jun 2, 2025
7f1df26
refactor: add TODO for implementing timeout in thread handling for fi…
a-klos Jun 2, 2025
8a6d4f1
refactor: remove unused asyncio import from default_file_uploader_tes…
a-klos Jun 2, 2025
5af5c76
refactor: remove unused thread management documentation and example f…
a-klos Jun 2, 2025
fa2f928
chore: update poetry.lock and pyproject.toml for dependency version c…
a-klos Jun 2, 2025
8dc7990
chore: Update README.md
a-klos Jun 2, 2025
57788eb
fix: correct spelling of 'arbitrary' in README and update query param…
a-klos Jun 2, 2025
f5cf59b
Merge branch 'onapitorulethemall' of github.com:stackitcloud/rag-core…
a-klos Jun 2, 2025
d942cf7
Update README.md
a-klos Jun 2, 2025
21d10d9
refactor: remove unused import and enhance query parameter descriptio…
a-klos Jun 2, 2025
baaad61
Merge branch 'onapitorulethemall' of github.com:stackitcloud/rag-core…
a-klos Jun 2, 2025
bc503f2
refactor: remove timeout parameter from DefaultFileUploader and delet…
a-klos Jun 2, 2025
a5523fb
refactor: remove unused thread_diagnostics.py file
a-klos Jun 2, 2025
5dafd3e
feat: add SourceUploaderSettings for configurable timeout and refacto…
a-klos Jun 2, 2025
4ab029b
refactor: remove unused import of Optional in default_file_uploader.py
a-klos Jun 2, 2025
7f53875
feat: add SourceUploaderSettings to DependencyContainer and update up…
a-klos Jun 2, 2025
97bdb25
docs: update README to clarify upload behavior and default timeout co…
a-klos Jun 2, 2025
a6209da
feat: implement SitemapExtractor and SitemapLangchainDocument2Informa…
a-klos Jun 3, 2025
4465591
feat: add SitemapExtractor and SitemapLangchainDocument2InformationPi…
a-klos Jun 3, 2025
0cbd5e3
feat: add fake-useragent dependency and enhance SitemapExtractor to h…
a-klos Jun 3, 2025
e8e55fc
chore: resolve merge conflicts
a-klos Jun 3, 2025
387e465
feat: enhance SitemapExtractor to support JSON header templates and i…
a-klos Jun 4, 2025
93534ca
feat: add comprehensive test suite for SitemapExtractor class
a-klos Jun 4, 2025
3d13637
feat: enhance DependencyContainer and SitemapExtractor with custom pa…
a-klos Jun 5, 2025
4674a67
feat: enhance SitemapExtractor with improved parameter handling and c…
a-klos Jun 5, 2025
0c07986
feat: add settings parameter to DefaultSourceUploader tests for impro…
a-klos Jun 5, 2025
dad0687
refactor: improve readability of mocks setup in DefaultSourceUploader…
a-klos Jun 5, 2025
c88430f
feat: refactor page summary creation logic for improved grouping and …
a-klos Jun 6, 2025
442dda4
refactor: remove redundant summarization method and streamline summar…
a-klos Jun 6, 2025
cda5ff5
refactor: simplify summary creation logic by removing redundant varia…
a-klos Jun 6, 2025
14ca90a
refactor: remove redundant whitespace and streamline summary creation…
a-klos Jun 6, 2025
74f0034
Merge branch 'main' into fix/summarizer
a-klos Jun 10, 2025
20 changes: 16 additions & 4 deletions README.md
@@ -1,6 +1,7 @@
# RAG Core library

This repository contains the core of the STACKIT RAG template.
It provides comprehensive document extraction capabilities including support for files (PDF, DOCX, XML), web sources via sitemaps, and Confluence pages.
It consists of the following Python packages:

- [`1. Rag Core API`](#1-rag-core-api)
@@ -143,7 +144,7 @@ The extracted information will be summarized using a LLM. The summary, as well a
#### `/upload_source`

Loads all the content from an arbitrary non-file source using the [document-extractor](#3-extractor-api-lib).
The `type`of the source needs to correspond to an extractor in the [document-extractor](#3-extractor-api-lib).
The `type` of the source needs to correspond to an extractor in the [document-extractor](#3-extractor-api-lib). Supported types include `confluence` for Confluence pages and `sitemap` for web content via XML sitemaps.
The extracted information will be summarized using an LLM. The summary, as well as the unrefined extracted document, will be uploaded to the [rag-core-api](#1-rag-core-api). A timeout for the upload is configured; it defaults to 3600 seconds (1 hour) and can be adjusted via values in the helm chart.
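
As a rough sketch, an upload call against this endpoint could look like the following. The path and the `type` values are documented above; the payload field names (`name`, `kwargs`) and the base URL are assumptions here, so the generated OpenAPI spec is the authoritative reference.

```python
# Hypothetical sketch of a /upload_source call. The endpoint and the
# source types are documented in this README; the payload field names
# (name, kwargs) and the host are assumptions, not the verified schema.
import requests

payload = {
    "name": "product-docs",  # assumed: display name of the source
    "type": "sitemap",       # must match a registered extractor
    "kwargs": [              # assumed: extractor parameters as key/value pairs
        {"key": "web_path", "value": "https://example.com/sitemap.xml"},
    ],
}
response = requests.post("http://admin-api.local/upload_source", json=payload)
response.raise_for_status()
```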

### 2.3 Replaceable parts
@@ -169,8 +170,7 @@ The extracted information will be summarized using LLM. The summary, as well as

## 3. Extractor API Lib

The Extractor Library contains components that provide document parsing capabilities for various file formats. It also includes a default `dependency_container`, that is pre-configured and is a good starting point for most use-cases.
This API should not be exposed by ingress and only used for internally.
The Extractor Library contains components that provide document parsing capabilities for various file formats and web sources. It supports extracting content from PDF, DOCX, and XML files, as well as from web pages via sitemaps and from Confluence pages. It also includes a default `dependency_container` that is pre-configured and is a good starting point for most use-cases. This API should not be exposed via ingress and should only be used internally.


The following endpoints are provided by the *extractor-api-lib*:
@@ -206,12 +206,21 @@ The following types of information will be extracted:
#### `/extract_from_source`

This endpoint will extract data from a non-file source.
The type of information that is extracted will vary depending on the source, the following types of information can be extracted:
The type of information that is extracted will vary depending on the source. Supported sources include `confluence` for Confluence pages and `sitemap` for web pages via XML sitemaps.
The following types of information can be extracted:

- `TEXT`: plain text
- `TABLE`: data in tabular form found in the document
- `IMAGE`: image found in the document

For sitemap sources, additional parameters can be provided, e.g.:
- `web_path`: The URL of the XML sitemap to crawl
- `filter_urls`: JSON array of URL patterns to filter pages (optional)
- `header_template`: JSON object for custom HTTP headers (optional)

Technically, all parameters of the `SitemapLoader` from LangChain can be provided.
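
For orientation, a minimal sketch of what these parameters map to on the LangChain side, assuming `langchain_community`'s `SitemapLoader` (this is an illustration, not the extractor's actual code):

```python
# Minimal sketch: the sitemap parameters above are passed through to
# LangChain's SitemapLoader. Illustrative only; URLs are placeholders.
from langchain_community.document_loaders import SitemapLoader

loader = SitemapLoader(
    web_path="https://example.com/sitemap.xml",    # URL of the XML sitemap
    filter_urls=["https://example.com/docs/.*"],   # optional URL patterns
    header_template={"User-Agent": "rag-extractor/1.0"},  # optional custom headers
)
documents = loader.load()  # one Document per crawled page
```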


### 3.3 Replaceable parts

| Name | Type | Default | Notes |
@@ -226,6 +235,9 @@ The type of information that is extracted will vary depending on the source, the
| file_extractor | [`extractor_api_lib.api_endpoints.file_extractor.FileExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/file_extractor.py) | [`extractor_api_lib.impl.api_endpoints.default_file_extractor.DefaultFileExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/default_file_extractor.py) | Implementation of the `/extract_from_file` endpoint. Uses *general_extractor*. |
| general_source_extractor | [`extractor_api_lib.api_endpoints.source_extractor.SourceExtractor`](./extractor-api-lib/src/extractor_api_lib/api_endpoints/source_extractor.py) | [`extractor_api_lib.impl.api_endpoints.general_source_extractor.GeneralSourceExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/api_endpoints/general_source_extractor.py) | Implementation of the `/extract_from_source` endpoint. Will decide the correct extractor for the source. |
| confluence_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) | [`extractor_api_lib.impl.extractors.confluence_extractor.ConfluenceExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/confluence_extractor.py) | Implementation of an extractor for the source `confluence`. |
| sitemap_extractor | [`extractor_api_lib.extractors.information_extractor.InformationExtractor`](./extractor-api-lib/src/extractor_api_lib/extractors/information_extractor.py) | [`extractor_api_lib.impl.extractors.sitemap_extractor.SitemapExtractor`](./extractor-api-lib/src/extractor_api_lib/impl/extractors/sitemap_extractor.py) | Implementation of an extractor for the source `sitemap`. Supports XML sitemap crawling with configurable parameters including URL filtering, custom headers, and crawling depth. Uses LangChain's SitemapLoader with support for custom parsing and meta functions via dependency injection. |
| sitemap_parsing_function | `dependency_injector.providers.Factory[Callable]` | [`extractor_api_lib.impl.utils.sitemap_extractor_utils.custom_sitemap_parser_function`](./extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py) | Custom parsing function for sitemap content extraction. Used by the sitemap extractor to parse HTML content from web pages. Can be replaced to customize how web page content is processed and extracted. |
| sitemap_meta_function | `dependency_injector.providers.Factory[Callable]` | [`extractor_api_lib.impl.utils.sitemap_extractor_utils.custom_sitemap_meta_function`](./extractor-api-lib/src/extractor_api_lib/impl/utils/sitemap_extractor_utils.py) | Custom meta function for sitemap content processing. Used by the sitemap extractor to extract metadata from web pages. Can be replaced to customize how metadata is extracted and structured from web content. |
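
Because these are `dependency_injector` providers, swapping in a custom parsing or meta function follows the library's standard override pattern. A sketch, assuming the parsing function receives the parsed page content (a BeautifulSoup object) as in LangChain's `SitemapLoader`:

```python
# Sketch: overriding the default sitemap parsing function. The override()
# call is standard dependency_injector API; the parser body is a placeholder.
from dependency_injector import providers


def my_parsing_function(content) -> str:
    # assumed signature: receives a BeautifulSoup object, returns page text
    return content.get_text()


container = DependencyContainer()
container.sitemap_parsing_function.override(providers.Factory(lambda: my_parsing_function))
```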

## 4. RAG Core Lib

@@ -27,22 +27,6 @@ class PageSummaryEnhancer(SummaryEnhancer):
BASE64_IMAGE_KEY = "base64_image"
DEFAULT_PAGE_NR = 1

async def _acreate_summary(self, information: list[Document], config: Optional[RunnableConfig]) -> list[Document]:
# group infos by page, defaulting to page 1 if no page metadata
if self._chunker_settings:
filtered_information = [
info for info in information if len(info.page_content) > self._chunker_settings.max_size
]
else:
filtered_information = information
grouped = [
[info for info in filtered_information if info.metadata.get("page", self.DEFAULT_PAGE_NR) == page]
for page in {info_piece.metadata.get("page", self.DEFAULT_PAGE_NR) for info_piece in filtered_information}
]

summary_tasks = [self._asummarize_page(info_group, config) for info_group in tqdm(grouped)]
return await gather(*summary_tasks)

async def _asummarize_page(self, page_pieces: list[Document], config: Optional[RunnableConfig]) -> Document:
full_page_content = " ".join([piece.page_content for piece in page_pieces])
summary = await self._summarizer.ainvoke(full_page_content, config)
@@ -52,3 +36,26 @@ async def _asummarize_page(self, page_pieces: list[Document], config: Optional[R
meta["type"] = ContentType.SUMMARY.value

return Document(metadata=meta, page_content=summary)

async def _acreate_summary(self, information: list[Document], config: Optional[RunnableConfig]) -> list[Document]:
distinct_pages = []
for info in information:
if info.metadata.get("page", self.DEFAULT_PAGE_NR) not in distinct_pages:
distinct_pages.append(info.metadata.get("page", self.DEFAULT_PAGE_NR))

grouped = []
for page in distinct_pages:
group = []
for compare_info in information:
if compare_info.metadata.get("page", self.DEFAULT_PAGE_NR) == page:
group.append(compare_info)
if (
self._chunker_settings
and len(" ".join([item.page_content for item in group])) < self._chunker_settings.max_size
):
continue
grouped.append(group)

summary_tasks = [self._asummarize_page(info_group, config) for info_group in tqdm(grouped)]

return await gather(*summary_tasks)
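
The net effect of the reworked `_acreate_summary` is that information pieces are grouped per page (in order of first appearance, defaulting to page 1) and a group is only summarized when its combined content reaches the chunker's `max_size`; shorter groups are skipped. A condensed, self-contained sketch of that grouping rule, using plain dicts in place of `Document` objects:

```python
# Condensed illustration of the grouping logic above. The threshold stands
# in for chunker_settings.max_size; values here are arbitrary.
def group_pages(information: list[dict], max_size: int = 20) -> list[list[dict]]:
    distinct_pages = []
    for info in information:
        page = info.get("page", 1)
        if page not in distinct_pages:
            distinct_pages.append(page)

    grouped = []
    for page in distinct_pages:
        group = [i for i in information if i.get("page", 1) == page]
        # groups whose combined text stays below the threshold are skipped
        if len(" ".join(i["text"] for i in group)) < max_size:
            continue
        grouped.append(group)
    return grouped


pieces = [
    {"page": 1, "text": "short"},                       # below threshold, skipped
    {"page": 2, "text": "a considerably longer text"},  # long enough, kept
]
assert group_pages(pieces) == [[pieces[1]]]
```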
103 changes: 91 additions & 12 deletions admin-api-lib/tests/default_source_uploader_test.py
@@ -23,12 +23,31 @@ def mocks():
document_deleter.adelete_document = AsyncMock()
rag_api = MagicMock()
information_mapper = MagicMock()
return extractor_api, key_value_store, information_enhancer, chunker, document_deleter, rag_api, information_mapper
settings = MagicMock()
return (
extractor_api,
key_value_store,
information_enhancer,
chunker,
document_deleter,
rag_api,
information_mapper,
settings,
)


@pytest.mark.asyncio
async def test_handle_source_upload_success(mocks):
extractor_api, key_value_store, information_enhancer, chunker, document_deleter, rag_api, information_mapper = mocks
(
extractor_api,
key_value_store,
information_enhancer,
chunker,
document_deleter,
rag_api,
information_mapper,
settings,
) = mocks
# Setup mocks
dummy_piece = MagicMock()
extractor_api.extract_from_source.return_value = [dummy_piece]
@@ -47,6 +66,7 @@ async def test_handle_source_upload_success(mocks):
document_deleter,
rag_api,
information_mapper,
settings=settings,
)

await uploader._handle_source_upload("source1", "type1", [])
@@ -58,7 +78,16 @@ async def test_handle_source_upload_success(mocks):

@pytest.mark.asyncio
async def test_handle_source_upload_no_info_pieces(mocks):
extractor_api, key_value_store, information_enhancer, chunker, document_deleter, rag_api, information_mapper = mocks
(
extractor_api,
key_value_store,
information_enhancer,
chunker,
document_deleter,
rag_api,
information_mapper,
settings,
) = mocks
extractor_api.extract_from_source.return_value = []

uploader = DefaultSourceUploader(
@@ -69,6 +98,7 @@ async def test_handle_source_upload_no_info_pieces(mocks):
document_deleter,
rag_api,
information_mapper,
settings=settings,
)
await uploader._handle_source_upload("source2", "type2", [])

@@ -79,13 +109,29 @@ async def test_handle_source_upload_no_info_pieces(mocks):

@pytest.mark.asyncio
async def test_upload_source_already_processing_raises_error(mocks):
extractor_api, key_value_store, information_enhancer, chunker, document_deleter, rag_api, information_mapper = mocks
(
extractor_api,
key_value_store,
information_enhancer,
chunker,
document_deleter,
rag_api,
information_mapper,
settings,
) = mocks
source_type = "typeX"
name = "Doc Name"
source_name = f"{source_type}:{sanitize_document_name(name)}"
key_value_store.get_all.return_value = [(source_name, Status.PROCESSING)]
uploader = DefaultSourceUploader(
extractor_api, key_value_store, information_enhancer, chunker, document_deleter, rag_api, information_mapper
extractor_api,
key_value_store,
information_enhancer,
chunker,
document_deleter,
rag_api,
information_mapper,
settings,
)
with pytest.raises(HTTPException):
# use default timeout
@@ -95,18 +141,35 @@ async def test_upload_source_already_processing_raises_error(mocks):

@pytest.mark.asyncio
async def test_upload_source_no_timeout(mocks, monkeypatch):
extractor_api, key_value_store, information_enhancer, chunker, document_deleter, rag_api, information_mapper = mocks
(
extractor_api,
key_value_store,
information_enhancer,
chunker,
document_deleter,
rag_api,
information_mapper,
settings,
) = mocks
key_value_store.get_all.return_value = []
source_type = "typeZ"
name = "quick"
# patch Thread so no actual background work is done
dummy_thread = MagicMock()
monkeypatch.setattr(default_source_uploader, "Thread", lambda *args, **kwargs: dummy_thread)
uploader = DefaultSourceUploader(
extractor_api, key_value_store, information_enhancer, chunker, document_deleter, rag_api, information_mapper
extractor_api,
key_value_store,
information_enhancer,
chunker,
document_deleter,
rag_api,
information_mapper,
settings,
)
# should not raise
await uploader.upload_source(source_type, name, [], timeout=1.0)
settings.timeout = 1.0
await uploader.upload_source(source_type, name, [])
# only PROCESSING status upserted, no ERROR
assert any(call.args[1] == Status.PROCESSING for call in key_value_store.upsert.call_args_list)
assert not any(call.args[1] == Status.ERROR for call in key_value_store.upsert.call_args_list)
@@ -115,7 +178,16 @@ async def test_upload_source_no_timeout(mocks, monkeypatch):

@pytest.mark.asyncio
async def test_upload_source_timeout_error(mocks, monkeypatch):
extractor_api, key_value_store, information_enhancer, chunker, document_deleter, rag_api, information_mapper = mocks
(
extractor_api,
key_value_store,
information_enhancer,
chunker,
document_deleter,
rag_api,
information_mapper,
settings,
) = mocks
key_value_store.get_all.return_value = []
source_type = "typeTimeout"
name = "slow"
@@ -141,11 +213,18 @@ def is_alive(self):

monkeypatch.setattr(default_source_uploader, "Thread", FakeThread)
uploader = DefaultSourceUploader(
extractor_api, key_value_store, information_enhancer, chunker, document_deleter, rag_api, information_mapper
extractor_api,
key_value_store,
information_enhancer,
chunker,
document_deleter,
rag_api,
information_mapper,
settings,
)
# no exception should be raised; timeout path sets ERROR status

await uploader.upload_source(source_type, name, [], timeout=1.0)
settings.timeout = 1.0
await uploader.upload_source(source_type, name, [])
# first call marks PROCESSING, second marks ERROR
calls = [call.args for call in key_value_store.upsert.call_args_list]
assert (source_name, Status.PROCESSING) in calls
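The timeout itself now comes from `SourceUploaderSettings` instead of a per-call argument (the tests above set `settings.timeout`). A plausible shape for that settings class, assuming the pydantic-settings pattern; the actual definition is not part of this diff, so the field and env-prefix names are assumptions:

```python
# Hypothetical sketch of SourceUploaderSettings. The README documents a
# 3600-second default; the env prefix is an assumption.
from pydantic_settings import BaseSettings


class SourceUploaderSettings(BaseSettings):
    timeout: float = 3600.0  # documented default: 1 hour

    class Config:
        env_prefix = "SOURCE_UPLOADER_"
```
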
14 changes: 13 additions & 1 deletion extractor-api-lib/poetry.lock

Some generated files are not rendered by default.

3 changes: 2 additions & 1 deletion extractor-api-lib/pyproject.toml
@@ -28,7 +28,7 @@ per-file-ignores = """
./src/extractor_api_lib/impl/extractor_api_impl.py: B008,
./src/extractor_api_lib/container.py: CCE002,CCE001,
./src/extractor_api_lib/apis/extractor_api_base.py: WOT001,
./tests/*: S101,
./tests/*: S101,E501,
"""

[tool.black]
@@ -93,6 +93,7 @@ langchain-community = "^0.3.23"
atlassian-python-api = "^4.0.3"
markdownify = "^1.1.0"
langchain-core = "0.3.63"
fake-useragent = "^2.2.0"

[tool.poetry.group.dev.dependencies]
pytest = "^8.3.5"
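The new `fake-useragent` dependency (added above) suggests the `SitemapExtractor` can generate a browser-like `User-Agent` when none is supplied, per the commit "add fake-useragent dependency and enhance SitemapExtractor to h…". A sketch of the library's basic use; how the extractor actually wires it in is not shown in this diff:

```python
# Sketch: producing a randomized, real-looking User-Agent header with
# fake-useragent. Its use inside SitemapExtractor is an assumption.
from fake_useragent import UserAgent

ua = UserAgent()
header_template = {"User-Agent": ua.random}  # random browser User-Agent string
```
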
24 changes: 20 additions & 4 deletions extractor-api-lib/src/extractor_api_lib/dependency_container.py
@@ -1,32 +1,41 @@
"""Module for dependency injection container for managing application dependencies."""

from dependency_injector.containers import DeclarativeContainer
from dependency_injector.providers import List, Singleton # noqa: WOT001
from dependency_injector.providers import Factory, List, Singleton # noqa: WOT001

from extractor_api_lib.impl.api_endpoints.general_source_extractor import GeneralSourceExtractor
from extractor_api_lib.impl.extractors.confluence_extractor import ConfluenceExtractor
from extractor_api_lib.impl.extractors.file_extractors.ms_docs_extractor import MSDocsExtractor
from extractor_api_lib.impl.extractors.file_extractors.pdf_extractor import PDFExtractor
from extractor_api_lib.impl.extractors.file_extractors.xml_extractor import XMLExtractor
from extractor_api_lib.impl.api_endpoints.general_file_extractor import GeneralFileExtractor
from extractor_api_lib.impl.extractors.sitemap_extractor import SitemapExtractor
from extractor_api_lib.impl.file_services.s3_service import S3Service
from extractor_api_lib.impl.mapper.confluence_langchain_document2information_piece import (
ConfluenceLangchainDocument2InformationPiece,
)
from extractor_api_lib.impl.mapper.internal2external_information_piece import (
Internal2ExternalInformationPiece,
)
from extractor_api_lib.impl.mapper.sitemap_document2information_piece import SitemapLangchainDocument2InformationPiece
from extractor_api_lib.impl.settings.pdf_extractor_settings import PDFExtractorSettings
from extractor_api_lib.impl.settings.s3_settings import S3Settings
from extractor_api_lib.impl.table_converter.dataframe2markdown import DataFrame2Markdown
from extractor_api_lib.impl.utils.sitemap_extractor_utils import (
custom_sitemap_meta_function,
custom_sitemap_parser_function,
)


class DependencyContainer(DeclarativeContainer):
"""Dependency injection container for managing application dependencies."""

# Settings
settings_s3 = Singleton(S3Settings)
settings_pdf_extractor = Singleton(PDFExtractorSettings)
settings_s3 = S3Settings()
settings_pdf_extractor = PDFExtractorSettings()

sitemap_parsing_function = Factory(lambda: custom_sitemap_parser_function)
sitemap_meta_function = Factory(lambda: custom_sitemap_meta_function)

database_converter = Singleton(DataFrame2Markdown)
file_service = Singleton(S3Service, settings_s3)
@@ -36,13 +45,20 @@ class DependencyContainer(DeclarativeContainer):

intern2external = Singleton(Internal2ExternalInformationPiece)
langchain_document2information_piece = Singleton(ConfluenceLangchainDocument2InformationPiece)
sitemap_document2information_piece = Singleton(SitemapLangchainDocument2InformationPiece)
file_extractors = List(pdf_extractor, ms_docs_extractor, xml_extractor)

general_file_extractor = Singleton(GeneralFileExtractor, file_service, file_extractors, intern2external)
confluence_extractor = Singleton(ConfluenceExtractor, mapper=langchain_document2information_piece)

sitemap_extractor = Singleton(
SitemapExtractor,
mapper=sitemap_document2information_piece,
parsing_function=sitemap_parsing_function,
meta_function=sitemap_meta_function,
)
source_extractor = Singleton(
GeneralSourceExtractor,
mapper=intern2external,
available_extractors=List(confluence_extractor),
available_extractors=List(confluence_extractor, sitemap_extractor),
)