Conversation
… openai extra for langextract
…ability Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
Co-authored-by: Copilot <[email protected]>
Feature/langextract
* ✨ Add GPU-enabled Ollama service to compose stack * 🔧 Add Make targets for managing Ollama service and models * 🔧 Add launch configuration and task for starting Ollama service * ✨ Implement LLM providers module with Ollama adapter and shared abstractions * ✅ Add unit tests for LLM providers including DummyProvider and OllamaLLMProvider * 📝 Document Ollama provider usage via notebook demo * 🐛 Fix tokenizer encoding by removing unnecessary special tokens flag * ♻️ Refactor chunk handling in LLMProvider to use _append_chunk method for consistency and improved readability * ✨ Enhance Ollama provider docs and DRY response building for sync/async calls * ♻️ Refactor OllamaLLMProvider to reuse AsyncClient instance for improved efficiency * 📝 Add async examples to OllamaLLMProvider notebook * ✅ Add async coverage for OllamaLLMProvider and tighten chunking tests * ♻️ Refactor OllamaLLMProvider to remove async client caching and streamline client instantiation
* Update .gitignore to exclude entity disambiguation experiment directories and modify Jupyter notebook execution counts and output handling * Refactor Makefile for improved service management and update .gitignore to exclude specific experiment directories. Add new Jupyter notebooks for entity disambiguation metrics and documentation. * Adjust example data for consistency in entity representation. * Refactor entity disambiguation notebooks to standardize attribute naming and improve metric evaluation. Update role attribute from 'rol' to 'role' for consistency across examples and documentation. Adjust evaluation function to return both score and metrics. * Add evaluation metrics for entity disambiguation - Introduced new metrics module for evaluating entity disambiguation performance, including functions for alias normalization, Jaccard similarity, and greedy matching. - Implemented main evaluation function to compute scores and metrics from gold and predicted entities. - Added Jupyter notebooks for practical examples and evaluation results, including normalized and non-normalized text evaluations. - Updated documentation to reflect changes in function signatures and outputs. * 🔧 Expand Makefile: add API management targets (api-run, api-stop, api-logs, api-full-run) for smoother service control * ♻️ Refactor metrics.py: clarify docstrings, align type hints, and polish logging * ✏️ Fix role attribute reference in evaluation metric documentation for consistency * 🔧 Add CanonicalEntities class to represent a collection of canonical entities * 📝 Update entity disambiguation notebooks: clean up imports, adjust paths, and streamline API calls for improved clarity and functionality --------- Co-authored-by: padonizetti Co-authored-by: jansaldo
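The metrics commit above mentions alias normalization, Jaccard similarity, and greedy matching between gold and predicted entities. The following is only a minimal sketch of how those three pieces could fit together, not the code from `metrics.py`. The function names, the accent-stripping normalization, and the `threshold` parameter are all assumptions made for illustration.

```python
from __future__ import annotations

import unicodedata


def normalize_alias(alias: str) -> str:
    """Lowercase, strip accents, collapse whitespace (assumed normalization)."""
    decomposed = unicodedata.normalize("NFKD", alias)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_match(
    gold: list[list[str]],
    pred: list[list[str]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the PR
) -> list[tuple[int, int, float]]:
    """Pair gold and predicted entities one-to-one, best Jaccard score first."""
    gold_sets = [{normalize_alias(x) for x in aliases} for aliases in gold]
    pred_sets = [{normalize_alias(x) for x in aliases} for aliases in pred]
    # Score every (gold, pred) pair, then visit pairs from best to worst.
    candidates = sorted(
        (
            (jaccard(g, p), gi, pi)
            for gi, g in enumerate(gold_sets)
            for pi, p in enumerate(pred_sets)
        ),
        key=lambda item: (-item[0], item[1], item[2]),
    )
    used_gold: set[int] = set()
    used_pred: set[int] = set()
    pairs: list[tuple[int, int, float]] = []
    for score, gi, pi in candidates:
        if score < threshold:
            break  # remaining candidates are all below the cut-off
        if gi in used_gold or pi in used_pred:
            continue  # each entity may be matched at most once
        used_gold.add(gi)
        used_pred.add(pi)
        pairs.append((gi, pi, score))
    return pairs


if __name__ == "__main__":
    gold = [["María Pérez", "M. Pérez"], ["Juan Gómez"]]
    pred = [["juan gomez"], ["maria perez", "m. perez", "la imputada"]]
    for gi, pi, score in greedy_match(gold, pred):
        print(f"gold[{gi}] <-> pred[{pi}] jaccard={score:.2f}")
```

Greedy matching is simple and deterministic but not guaranteed to be globally optimal. An assignment solver would be the alternative if that matters for the reported score.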
* ✨ feat: Add Streamlit app for document summarization experiments * Add statistical analysis notebook for summarization performance evaluation (visualized gaps in performance between CPU and CUDA models, LLM hallucinations) * 🎨 Quantitative and qualitative analysis of summaries: descriptive analysis by features, model comparison, gap analysis (CPU-CUDA), garbage detection/outliers, analysis by document, visualizations. * 🔒️ Clear all outputs * 🎨 Improve summary analysis per document: cuda vs llama (same model), gemma vs llama (cuda), same document phi3 vs. phi4. Tokens-per-second gap. * ✨ Add YAML utility functions for loading and saving data * Merge dev into main for v1.1.12 (#57) * Update README.md * 🐛 bugfix: Fix XML special character escaping in DocAnonymizer * ➕ build(deps): Add python-docx package * ✨ feat: Add watermark and hyperlink functionality to document anonymization * ✨ feat: Install Archivo font in Dockerfile * 🎨 refactor: Improve Dockerfile structure and comments for clarity * ⏪ revert: Remove Archivo font installation from Dockerfile * 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock * 🐛 Improve get_extension logic to fix document extraction issues on Windows and remove python-magic dependency * 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility * 🔧 Update Makefile targets for improved Docker workflow * 🔖 feat: Bump aymurai package version to 1.1.12 * ♻️ Harden get_extension with header scans and zip safeguards * 🔧 Extend document extraction timeout to 30s * 🔧 Refactor Docker workflow to build and push images using docker/build-push-action * 🔧 Fix workflow step order to correctly extract tag name before building Docker images * 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds * ⏪ Revert Docker workflow to extract tag name and use it for image versioning * Update .github/workflows/build-docker-image.yml Co-authored-by: Copilot <[email protected]> * ✏️ Remove incomplete comment Co-authored-by: Copilot <[email protected]> --------- Co-authored-by: jed <[email protected]> Co-authored-by: Copilot <[email protected]> * ✨ Add GPU-enabled Ollama service to compose stack * 🔧 Add Make targets for managing Ollama service and models * 🔧 Add launch configuration and task for starting Ollama service * 🔧 Add system prompts for document summarization * 📝 Add summarization benchmark notebook * 🚚 Move statistical analysis notebook to summarization folder * ✨ Implement LLM providers module with Ollama adapter and shared abstractions * ✅ Add unit tests for LLM providers including DummyProvider and OllamaLLMProvider * 📝 Document Ollama provider usage via notebook demo * 🐛 Fix tokenizer encoding by removing unnecessary special tokens flag * ♻️ Refactor chunk handling in LLMProvider to use _append_chunk method for consistency and improved readability * ✨ Enhance Ollama provider docs and DRY response building for sync/async calls * ♻️ Refactor OllamaLLMProvider to reuse AsyncClient instance for improved efficiency * 📝 Add async examples to OllamaLLMProvider notebook * ✅ Add async coverage for OllamaLLMProvider and tighten chunking tests * ➕ Add tiktoken dependency to pyproject.toml and update version in uv.lock * 🔧 Enhance summarization prompts with additional information extraction and entity identification details * ✨ Add LLM summarization router * 📝 Add notebook for the summarization endpoint * ✏️ Fix formatting of keys in summarization defaults for consistency * ➕ Add dspy dependency and update related packages in project configuration * 🚧 WIP: Add prompt optimization notebook for summarization experiments --------- Co-authored-by: Sofi <[email protected]> Co-authored-by: jed <[email protected]> Co-authored-by: Copilot <[email protected]>
#64) * Merge dev into main for v1.1.12 (#57) * Update README.md * 🐛 bugfix: Fix XML special character escaping in DocAnonymizer * ➕ build(deps): Add python-docx package * ✨ feat: Add watermark and hyperlink functionality to document anonymization * ✨ feat: Install Archivo font in Dockerfile * 🎨 refactor: Improve Dockerfile structure and comments for clarity * ⏪ revert: Remove Archivo font installation from Dockerfile * 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock * 🐛 Improve get_extension logic to fix document extraction issues on Windows and remove python-magic dependency * 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility * 🔧 Update Makefile targets for improved Docker workflow * 🔖 feat: Bump aymurai package version to 1.1.12 * ♻️ Harden get_extension with header scans and zip safeguards * 🔧 Extend document extraction timeout to 30s * 🔧 Refactor Docker workflow to build and push images using docker/build-push-action * 🔧 Fix workflow step order to correctly extract tag name before building Docker images * 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds * ⏪ Revert Docker workflow to extract tag name and use it for image versioning * Update .github/workflows/build-docker-image.yml * ✏️ Remove incomplete comment --------- * ♻️ refactor: Restructure USEM module with factory pattern and multiple encoder backends - Add BaseSentenceEncoder abstract base class for encoder interface - Implement factory pattern with EncoderType enum and create_encoder function - Add sentence-transformers encoder implementations (DistilUSE, MultilingualMiniLM) - Move TensorFlow implementation to tensorflow_encoder.py - Add lazy loading for encoder implementations via __getattr__ - Add auto-detection for Apple Silicon compatibility (defaults * 🚚 Rename test sentence encoders mac notebook * 📌 Sync dependencies ---------
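The USEM refactor commit above names a `BaseSentenceEncoder` abstract base class, an `EncoderType` enum, and a `create_encoder` factory with lazily loaded implementations. A rough sketch of that shape follows, assuming a registry of zero-argument loaders as a stand-in for the `__getattr__`-based lazy loading. The enum members, the `encode` signature, and the toy encoder are hypothetical and are not the module's real API.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from enum import Enum
from typing import Callable


class BaseSentenceEncoder(ABC):
    """Common interface every encoder backend is assumed to implement."""

    @abstractmethod
    def encode(self, sentences: list[str]) -> list[list[float]]:
        """Return one embedding vector per input sentence."""


class EncoderType(str, Enum):
    # Member names and values are illustrative guesses.
    DISTILUSE = "distiluse"
    MULTILINGUAL_MINILM = "multilingual-minilm"


class _ToyEncoder(BaseSentenceEncoder):
    """Placeholder backend: embeds each sentence as its character count."""

    def __init__(self, name: str) -> None:
        self.name = name

    def encode(self, sentences: list[str]) -> list[list[float]]:
        return [[float(len(sentence))] for sentence in sentences]


# Loaders run only when an encoder is requested, so a real backend could
# defer its heavy imports (for example sentence-transformers) until then.
_LOADERS: dict[EncoderType, Callable[[], BaseSentenceEncoder]] = {
    EncoderType.DISTILUSE: lambda: _ToyEncoder("distiluse"),
    EncoderType.MULTILINGUAL_MINILM: lambda: _ToyEncoder("multilingual-minilm"),
}


def create_encoder(encoder_type: EncoderType | str) -> BaseSentenceEncoder:
    """Build the encoder for `encoder_type`; unknown values raise ValueError."""
    key = EncoderType(encoder_type)
    return _LOADERS[key]()


if __name__ == "__main__":
    encoder = create_encoder("distiluse")
    print(type(encoder).__name__, encoder.encode(["hola", "mundo"]))
```

Keeping construction behind `create_encoder` means callers depend only on the abstract interface, which is what makes swapping the TensorFlow backend for a sentence-transformers one a configuration change.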
* 🔧 Configure VSCode Python env and Copilot scopes * 🔧 Include resources/llm in .dockerignore * 📌 Update dependencies in pyproject.toml and uv.lock * 🔧 Update Dockerfile and devcontainer.json to install additional PDF tooling * ♻️ Refactor Makefile and docker-compose.yml for improved service configuration and flexibility * 🚧 FIXME: Remove DecisionConv1dBinRegex model from pipeline configuration for dependencies update compatibility * 🔧 Set weights_only=False for torch.load compatibility * ✨ Enhance PDF extraction with marker integration and improved text processing * 🔧 Update run_safe_text_extraction to allow indefinite timeout by default * ✨ Add warm_marker_models function to initialize marker-pdf artifacts at startup * 🔥 Remove unused environment variables and rename TRANSFORMERS_CACHE to HF_HOME * 🔧 Improve service stopping logic for Ollama and API services in Makefile * 🔖 Bump aymurai package version to 2.0.0-alpha.1 * 🔧 Update HF_HOME path and remove HF_DATASETS_CACHE variable in .env.common * 🔧 Update OLLAMA_HOST for GPU-enabled services to point to ollama-gpu * 🔧 Simplify marker model warming logic by removing error handling * ♻️ Refactor text extraction into modular format-specific extractors * ✅ Add unit tests for document extraction and error handling * ➕ Add marker-pdf stack and drop textract * 🔧 Enhance PDF extraction with caching mechanism * 📝 Improve cache utility functions with enhanced docstrings and type hints * 🔧 Enhance cache key generation in PdfExtractor for improved stability and performance * 🔖 Update aymurai package version to 2.0.0a2.dev9
* 🩹 Ensure consistent entity attributes in reformat_entity function and reorder imports * 📝 Update subcategories exploration notebook * ⚗️ Add TensorFlow deprecation experiment notebook * ♻️ Refactor entity subcategorization: Remove USEMSubcategorizer, add SentenceTransformerSubcategorizer - Removed the USEMSubcategorizer implementation from `usem.py`. - Introduced new Jupyter notebooks for testing and evaluating the SentenceTransformerSubcategorizer. - Updated the pipeline configuration to utilize SentenceTransformerSubcategorizer with local embeddings instead of remote URLs. * ♻️ Refactor download function: Replace gdown with requests for improved file downloading * 🔥 Remove empty peft model module * ➖ Remove TensorFlow and gdown dependencies from pyproject.toml * 📌 Update uv.lock * ♻️ Refactor sentence encoder module: Remove unused dependencies and streamline factory functions * 🔖 Update aymurai package version to 2.0.0a3.dev9
…bdirectories and non-IPYNB files
…used dependencies
…ambiguation options in LabelPolicy
…s across multiple modules
…ymizer/anonymizer.py for release/v1.5.0 compatibility 🔥 Removed `llm` disambiguation label policy for release/v1.5.0 compatibility
…entity_disambiguation/core.py discarding the role assignment for release/v1.5.0 compatibility 🎨 Changed aymurai/api/endpoints/routers/anonymizer/anonymizer.py, discarding all validations related to LLM disambiguation, for release/v1.5.0 compatibility 🎨 Minor changes in the remaining documents related to experimentation with the release/v1.5.0 API
…actor config passthrough and restore fixed timeout
…nto release/v1.5.0
…ents for backward compatibility
…74) * Merge dev into main for v1.1.12 (#57) * Update README.md * 🐛 bugfix: Fix XML special character escaping in DocAnonymizer * ➕ build(deps): Add python-docx package * ✨ feat: Add watermark and hyperlink functionality to document anonymization * ✨ feat: Install Archivo font in Dockerfile * 🎨 refactor: Improve Dockerfile structure and comments for clarity * ⏪ revert: Remove Archivo font installation from Dockerfile * 🔖 feat: Update aymurai package version to 1.1.11 in uv.lock * 🐛 Improve get_extension logic to fix document extraction issues on Windows and remove python-magic dependency * 🔧 Update Dockerfile to use 'bullseye' variant for Python images for improved compatibility * 🔧 Update Makefile targets for improved Docker workflow * 🔖 feat: Bump aymurai package version to 1.1.12 * ♻️ Harden get_extension with header scans and zip safeguards * 🔧 Extend document extraction timeout to 30s * 🔧 Refactor Docker workflow to build and push images using docker/build-push-action * 🔧 Fix workflow step order to correctly extract tag name before building Docker images * 🔧 Remove tag extraction step and use github.ref_name directly for Docker image builds * ⏪ Revert Docker workflow to extract tag name and use it for image versioning * Update .github/workflows/build-docker-image.yml Co-authored-by: Copilot <[email protected]> * ✏️ Remove incomplete comment Co-authored-by: Copilot <[email protected]> --------- Co-authored-by: jed <[email protected]> Co-authored-by: Copilot <[email protected]> * WIP: feat(decision): ✨ integrate TinyEmbeddingBagClassifier for decision detection (#67) * feat(decision): ✨ integrate TinyEmbeddingBagClassifier for decision detection - Introduced a new model class `DecisionEmbeddingBagBinRegex` using `TinyEmbeddingBagClassifier`. - Updated model loading and saving mechanisms to support safetensors format. - Added a new training notebook for the embedding bag classifier. - Modified the pipeline configuration to include the new model. 
* ⚡️ Remove unidecode usage to avoid double normalization in model_input_from_text * 📝 Add type hints and docstrings for clarity in DecisionEmbeddingBagBinRegex and TinyEmbeddingBagClassifier * 🔧 Refactor import statements for safetensors to remove try-except block * 🔥 Remove Conv1dTextClassifier, Tokenizer and SpanishTokenizer implementations * 🐛 Fix gen_aymurai_entity call by removing unused category parameter * 🔖 Update aymurai package version to 2.0.0a4.dev1 * 🔀 cherry-pick(decision): modernize decision model and upgrade ML dependencies Cherry-pick TinyEmbeddingBagClassifier (safetensors) replacing Conv1d model. Remove dead deps (torchtext, pytorch-lightning), upgrade torch to 2.x and flair to 0.15.1. * 🐛 cherry-pick(fix): datapublic and anonymizer crash when use_cache is disabled * test(infra): rewrite test infrastructure with architecture guide standards - Delete old test files (test_document_extract.py, test_anonymizer_predict.py, test_datapublic_predict.py) - Create new directory structure: tests/integration/pipelines/, tests/api/routers/{anonymizer,datapublic,misc}/ - Rewrite tests/conftest.py: - Set env vars at module level (RESOURCES_BASEPATH=resources, SQLALCHEMY_DATABASE_URI=sqlite:///:memory:) - Remove torch mock and lazy loader - Direct imports from production code - Clean fixtures: db_engine (session-scoped), db_session (function-scoped), client (with dependency override) - Test data builders: build_data_item(), build_label(), build_anonymization_paragraph(), build_datapublic_paragraph() - Update pyproject.toml with [tool.pytest.ini_options]: strict-markers, integration/slow markers Verification: uv run python -c 'import tests.conftest' succeeds, pytest collection clean * test(conftest): add pipeline loading helpers and mock factories for API tests Wave 2 complete: integration pipeline conftest + API router conftest Integration pipeline conftest: - PIPELINE_CONFIGS dict for flair-anonymizer and full-paragraph - load_test_pipeline() helper with 
print_config=False - Session-scoped fixtures for both pipelines (expensive model loading) - build_pipeline_input() test data builder - sample_text fixture with Spanish legal text API router conftest: - build_mock_pipeline() factory with MagicMock - Mock preprocess/predict_single/postprocess methods - build_processed_data_item() test data builder - Re-exports builders from root conftest * test(api): add document extract endpoint tests with mocked extraction * test(api): add anonymizer and datapublic endpoint tests with mocked pipelines * test(integration): add pipeline integration tests for flair-anonymizer and full-paragraph * ✅ test: refactor test infrastructure and add integration tests - Reorganize test conftest files to proper hierarchy (tests/api/conftest.py) - Add pytest to dependency groups in pyproject.toml - Refactor API router tests to use centralized fixtures and builders - Add real document extraction tests with DOCX/PDF generators - Improve pipeline integration tests with fixture-based stages - Fix label serialization to use model_dump(mode="json") - Update UUID generation for datapublic tests to use uuid.uuid5 - Add cache path environment setup for integration tests - Clean up imports and remove unused dependencies - Remove empty test file (document_extract.py) This refactoring improves test maintainability, adds proper integration testing without excessive mocking, and establishes consistent test utilities across the codebase. * 👷 ci(github): add pytest workflow for CI integration - Introduced a new GitHub Actions workflow for running pytest. - Configured to trigger on pull requests and manual dispatch. - Supports multiple OS and Python versions for comprehensive testing. * 👷fix(tests): fix env variable DISKCACHE_ROOT * 👷 ci(github): remove deprecated PR tests workflow & fix env variable - Deleted the old PR tests workflow file. - This cleanup helps streamline CI processes and reduces redundancy. 
* ci(github): 👷 add pipeline download and integration tests to CI workflow - Introduced a new script for downloading pipelines. - Updated the pytest workflow to include running API and pipeline tests. - Enhanced test execution with improved output formatting and failure limits. * fix(tests): 🐛 avoid context manager in TestClient to skip app startup - Changed TestClient usage to prevent app lifespan startup during tests. - Ensured proper cleanup by closing the client after use. - This improves test performance and reliability. * 👷 ci(github): add RESOURCES_BASEPATH environment variable for pipeline tests - Added RESOURCES_BASEPATH to the environment variables for both downloading pipelines data and running pipeline tests. - This change ensures that the necessary resource paths are correctly set during the CI workflow execution. * 👷 ci(github): update RESOURCES_BASEPATH for pipeline data download - Changed RESOURCES_BASEPATH from /tmp to resources in the pipeline download step. - Ensures the correct path is used for resource access during tests. * chore(pyproject): 🔧 add environment markers for platform compatibility - Introduced required-environments for tool.uv to specify platform requirements. - Updated resolution-markers and required-markers in uv.lock for better dependency management. - Added tensorflow-io-gcs-filesystem with specific markers for Windows and Linux. * ci(github): 👷 configure es_AR locale for Ubuntu runners - Added steps to configure the es_AR locale on Ubuntu. - Ensures proper locale settings for tests running in the CI environment. * 👷 ci(github): add AYMURAI_CACHE_BASEPATH environment variable for pipeline tests - Introduced AYMURAI_CACHE_BASEPATH to the environment variables for both pipeline download and pipeline tests. - This change ensures that the correct cache path is utilized during the execution of the tests. 
* 🐛 fix(dependencies): adjust textract dependency for platform compatibility - Added conditional dependency for textract based on the operating system. - Specified different sources for textract depending on whether the platform is Windows or not. * 🔥 chore(opencode): remove opencode.json configuration file - Deleted the opencode.json file as it is no longer needed. - This change helps to clean up the repository and remove obsolete configurations. * 🚚 Update pipeline path for datapublic in scripts, notebooks and tests * 📝 docs: replace Black references with Ruff in CONTRIBUTING and Alembic hook examples * 🔧 Add backslash to default CACHE_BASEPATH value * 🔧 Update cache path retrieval to use settings for consistency * ➖ Remove textract dependencies and update documentation for extract_document function * ✅ Update integration tests and add new test cases for anonymizer and datapublic flows * 🔥 chore(test): remove legacy /test dir and standardize sample doc path to /resources/data/sample/document-01.docx * 🔧 Update UV_VERSION to latest in devcontainer Dockerfile * 🔧 Update dependency installation command to include all groups * 📌 Update uv.lock * 🐛 Fix CACHE_BASEPATH env alias resolution for CI pipeline downloads
* ✨ feat(extractors): use pymupdf layout for pdf text extraction * ✨ feat(normalization): enhance document normalization to preserve paragraph structure * 📝 docs: document default values for extractor and normalization helpers * 🩹 fix(extractors): use pymupdf4llm.to_text with page_chunks for pdf paragraphs * ♻️ Add DOCX and PDF anonymizer modules - Implemented DocxAnonymizer class to handle anonymization of DOCX documents by replacing sensitive data with label tokens. This includes functionality for unzipping documents, parsing XML, editing content, and adding watermarks. - Developed PdfAnonymizer class for anonymizing PDF documents, utilizing pymupdf for document manipulation. This includes layout parsing, font caching, redaction operations, and watermarking. * 🔧 Enhance PDF and DOCX handling in anonymization process * 📝 Update backend module references for document rendering in README * ✅ Update tests to use DOCX format for document anonymization and enhance mock behavior * ✨ Add end-to-end PDF anonymization notebook with PyMuPDF and AymurAI API * ♻️ Rework PDF anonymization for precise spans and widget handling * 🔧 Update model_dump calls to exclude None values for improved data handling * 📝 Add docstrings to label replacement functions * ♻️ Refactor watermark handling and optimize PDF token aliasing * ✅ Add integration tests for merging fragmented numeric labels and excluding null alt attributes in PDF anonymization * ➖ Remove opencv-python-headless dependency from project requirements * ♻️ Implement paragraph splitting function to enhance document text extraction * 🔧 Update dependency installation command to prevent Python downloads * 🔥 Remove redundant tests for merging fragmented numeric labels and PDF anonymization * ♻️ Refactor anonymizer tests to use DOCX format and enhance mock functionality * 🔧 Add xfail marker for PDF extraction test on Windows due to tensor type issue * ✨ Enhance PDF anonymization by adding cleanup rects, removing overlapping links, 
and scrubbing metadata * 🔧 Remove redundant return statement in _label_replacement_text function * ♻️ Refactor anonymization module: split pdf and docx internals by format * ✅ Add integration tests for PDF and DOCX anonymizers, including metadata scrubbing and link preservation * ✨ Add watermark layout adjustments to avoid footer content overlap in PDF anonymization * ✅ Add integration test to ensure watermark is positioned away from footer content in PDF anonymization * 🩹 Fix: read docx xml as utf-8 across platforms * ✅ Add Windows-specific xfail marker for PDF tests and implement UTF-8 XML reading test
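The last fix above ("read docx xml as utf-8 across platforms") does not show its code here. The sketch below illustrates the general idea under stated assumptions: a DOCX file is a ZIP archive, and decoding its XML members explicitly as UTF-8 avoids depending on the platform's default text encoding, which on Windows is often not UTF-8. The helper names and the in-memory fake document are hypothetical.

```python
import io
import zipfile


def read_docx_xml(docx_bytes: bytes, member: str = "word/document.xml") -> str:
    """Read one XML part of a DOCX archive, always decoding it as UTF-8."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        raw = archive.read(member)
    # Explicit decode: never rely on the locale's preferred encoding.
    return raw.decode("utf-8")


def make_fake_docx(text: str) -> bytes:
    """Build a tiny ZIP that mimics the part layout of a DOCX (not a valid document)."""
    xml = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f"<w:document><w:body><w:p><w:t>{text}</w:t></w:p></w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml.encode("utf-8"))
    return buffer.getvalue()


if __name__ == "__main__":
    data = make_fake_docx("Señor Núñez declaró ante el juzgado")
    print(read_docx_xml(data))
```

When the XML has been unzipped to disk instead, the equivalent precaution is to pass `encoding="utf-8"` to `open()` rather than relying on the default.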
Not up to standards ⛔

🔴 Issues

| Category | Results |
|---|---|
| ErrorProne | 3 high |
| Security | 93 high, 3 medium, 1 minor |

🟢 Metrics

| Metric | Results |
|---|---|
| Complexity | 1119 |
| Duplication | 27 |
No description provided.