
Pr/1#2

Merged
sologuy merged 171 commits into main from pr/1 on Dec 1, 2025

Conversation

sologuy (Owner) commented Dec 1, 2025

No description provided.

…Chrome-based browsers: Chromium, Ungoogled-chromium, Brave, etc).

Signed-off-by: Stephen L. <[email protected]>
…ripping + translate to English

Signed-off-by: Stephen L. <[email protected]>
…ompt is in English and improved to avoid filler sentences

Signed-off-by: Stephen L. <[email protected]>
Signed-off-by: Stephen L. <[email protected]>
… module) + add command-line entry points

Signed-off-by: Stephen L. <[email protected]>
…ltiplatform detection + Fetch bookmarks from all installed browsers.

Optional arguments to specify a single browser to fetch bookmarks, or a custom profile path.

Signed-off-by: Stephen L. <[email protected]>
…figuration files + provide default toml config file (using ollama and gemma3:1b)

gemma3:1b was chosen as the default model because it works well for summarization and can run on pretty much any machine.

Signed-off-by: Stephen L. <[email protected]>
…w total number of skipped keys in search engine

Signed-off-by: Stephen L. <[email protected]>
Indexing in the search engine did not deduplicate correctly (all bookmarks were skipped) because `guid` and `id` were used as keys even when set to the literal string "N/A". This placeholder value is now treated as empty.

Signed-off-by: Stephen L. <[email protected]>
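The fix can be illustrated as: treat the "N/A" placeholder the same as a missing identifier before building the deduplication key (a sketch, not the project's exact code):

```python
def dedup_key(bookmark: dict) -> str:
    """Build a deduplication key, ignoring placeholder 'N/A' identifiers."""
    def clean(value):
        # Treat the literal "N/A" placeholder the same as a missing value.
        return None if value in (None, "", "N/A") else value

    guid = clean(bookmark.get("guid"))
    bid = clean(bookmark.get("id"))
    # Fall back to the URL when no real identifier is available, so distinct
    # bookmarks are not all collapsed onto the same "N/A" key and skipped.
    return guid or bid or bookmark.get("url", "")
```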
…hash check) + --limit pertains to only new bookmarks

This allows for incremental updates. The original behavior, rebuilding the whole index each time, is still available with --rebuild.

Signed-off-by: Stephen L. <[email protected]>
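One way to implement such an incremental update is to hash each bookmark's URL and skip entries whose hash is already indexed, counting `--limit` against new entries only (an illustrative sketch; field and function names are assumptions):

```python
import hashlib

def url_hash(url: str) -> str:
    """Stable hash of a bookmark URL, used to detect already-indexed entries."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def select_new(bookmarks, indexed_hashes, limit=None):
    """Return only bookmarks not yet in the index; `limit` counts new ones only."""
    new = [b for b in bookmarks if url_hash(b["url"]) not in indexed_hashes]
    return new[:limit] if limit is not None else new
```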
Now crawling can be interrupted and continued by just restarting crawl.py

Signed-off-by: Stephen L. <[email protected]>
…ving intermediate results in crawl.py

Signed-off-by: Stephen L. <[email protected]>
…ehavior than before)

Use --no-update to restore the previous default behavior of not updating the index (i.e., use the existing index; this is faster when there is a large pending bookmarks JSON update and the user is in a hurry).

Signed-off-by: Stephen L. <[email protected]>
Summaries can all be recomputed with argument --force-recompute-summaries

Signed-off-by: Stephen L. <[email protected]>
…y + time-based flushing (every minute by default)

Signed-off-by: Stephen L. <[email protected]>
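A time-based flush like the one described could be structured as below (a sketch with an injectable clock for testability; the real implementation's internals are not shown on this page):

```python
import time

class BatchedWriter:
    """Buffer items and flush either when the batch is full or when
    `interval` seconds have elapsed since the last flush (default: 60s)."""

    def __init__(self, flush_fn, batch_size=100, interval=60.0, clock=time.monotonic):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.interval = interval
        self.clock = clock
        self.buffer = []
        self.last_flush = clock()

    def add(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.batch_size or self.clock() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []
        self.last_flush = self.clock()
```

Passing a fake clock makes the time-based trigger deterministic in tests, which matters for the kind of test suites added later in this PR.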
Signed-off-by: Stephen L. <[email protected]>
…cial processing for specific conditions defined in the modules (can be based on URL, title, content, etc.)
- Add YouTube custom parser (fetches subtitles/transcript as content).
- Include all bookmarks by default even if their content is unreachable (previously they were silently skipped and logged to failed_urls.json; the old behavior can be restored with `--skip-unreachable`).
- Future-proof dependencies by not pinning each requirement to a specific version.

Signed-off-by: Stephen L. <[email protected]>
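The per-module match conditions described above could be modeled roughly like this (names such as `matches` and `parse` are assumptions for illustration, not the project's actual API):

```python
class CustomParser:
    """A parser that opts in based on the bookmark's URL, title, or content."""
    def __init__(self, name, matches, parse):
        self.name = name
        self.matches = matches  # predicate(bookmark) -> bool
        self.parse = parse      # bookmark -> extracted content

def apply_custom_parsers(bookmark, parsers):
    """Return the first matching parser's output, or None to use the default path."""
    for parser in parsers:
        if parser.matches(bookmark):
            return parser.parse(bookmark)
    return None

# Hypothetical YouTube parser: match on URL, return a transcript placeholder.
youtube = CustomParser(
    "youtube",
    matches=lambda b: "youtube.com/watch" in b.get("url", ""),
    parse=lambda b: f"transcript of {b['url']}",
)
```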
google-labs-jules Bot and others added 29 commits November 30, 2025 23:40
In `.github/workflows/releases-ci-cd.yml`, restricts the artifact upload step to run only on the `ubuntu-latest` job within the matrix strategy.

This prevents the HTTP 409 Conflict error ("an artifact with this name already exists") that occurs when multiple parallel jobs (Ubuntu, Windows, macOS) attempt to upload an artifact with the same name (`artifact`) in `actions/upload-artifact@v4`. Since the package is pure Python, the built distribution is identical across platforms, so a single upload is sufficient.
With the new upload-artifact action, artifacts are shared across OS environments, so only one job can upload a given artifact name now.
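The restriction reads roughly like this in the workflow (a sketch; the step name and `matrix.os` variable name are assumptions):

```yaml
      - name: Upload built distribution
        # Only one matrix job uploads: the pure-Python wheel/sdist is
        # identical on all OSes, and upload-artifact@v4 rejects duplicates.
        if: matrix.os == 'ubuntu-latest'
        uses: actions/upload-artifact@v4
        with:
          path: dist/
```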
Signed-off-by: Stephen L. <[email protected]>
- Added `tests/test_crawl_extended.py` covering utility functions and LMDB operations in `crawl.py`.
- Added `tests/test_fuzzy_bookmark_search_extended.py` covering fuzzy search, LMDB loading, and API endpoints.
- Added `tests/test_index_extended.py` covering bookmark extraction logic.
- Added `tests/test_zhihu_parser_extended.py` covering the Zhihu custom parser.
- Added `tests/test_build_app_extended.py` covering the build script.
- Improved mocking strategy to isolate tests from file system and network.
- Added `sys.path` modification to all new test files to ensure root modules can be imported.
- Updated `tests/test_crawl_extended.py` to correctly handle file locking tests on Windows by conditionally patching `msvcrt.locking` instead of `fcntl.flock`.
- Improved `tests/test_index_extended.py` to avoid patching `builtins.dir` by properly configuring mock module attributes.
- Refactored `tests/test_build_app_extended.py` to remove empty test blocks and cover PyInstaller installation logic.
- Added comprehensive unit tests for `crawl.py`, `fuzzy_bookmark_search.py`, `index.py`, `custom_parsers/zhihu.py`, and `build_app.py` achieving significantly higher coverage.
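The platform-conditional locking patch mentioned in the list above can be sketched as follows (assuming the code under test calls `msvcrt.locking` on Windows and `fcntl.flock` elsewhere):

```python
import sys
from unittest import mock

def lock_patch_target() -> str:
    """Pick the lock function to patch based on the current platform."""
    return "msvcrt.locking" if sys.platform == "win32" else "fcntl.flock"

def patched_lock():
    # Patch the platform-appropriate locking primitive inside a test;
    # mock.patch resolves the dotted target lazily, on __enter__.
    return mock.patch(lock_patch_target())
```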
Prevents dependency confusion by installing dependencies from PyPI before installing the package from TestPyPI. This avoids picking up broken or malicious packages (e.g., a fake `fastapi` 1.0) from TestPyPI.
- Added `tests/test_suspended_tabs_parser.py` covering `custom_parsers/a_suspended_tabs.py`.
- Updated `tests/test_youtube_parser.py` with corrected regex matching (11-char ID) and improved mocking for `TextFormatter`.
- Updated `tests/test_crawl_extended.py` to cover `load_custom_parsers`, `ModelConfig`, `call_ollama_api`, `call_qwen_api`, `call_deepseek_api`, `resize_lmdb_database`, and `init_lmdb`.
- Updated `tests/test_fuzzy_bookmark_search_extended.py` to test failure scenarios for `safe_lmdb_operation`.
- Updated `tests/test_index_extended.py` to cover `get_bookmarks` sorting and logic.
- Updated `tests/test_build_app_extended.py` to cover `install_pyinstaller` failure path and `build_executable` edge cases.
- Improved overall branch coverage significantly.
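The corrected 11-character ID matching mentioned in the list above can be sketched as follows (the actual regex used by the project's YouTube parser is not shown on this page):

```python
import re

# YouTube video IDs are exactly 11 characters: letters, digits, '-' and '_'.
YOUTUBE_ID = re.compile(r"(?:v=|youtu\.be/)([A-Za-z0-9_-]{11})")

def extract_video_id(url: str):
    """Return the 11-character video ID, or None if the URL has none."""
    m = YOUTUBE_ID.search(url)
    return m.group(1) if m else None
```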
This commit introduces `tests/test_fuzzy_coverage.py` to significantly increase the branch coverage of `fuzzy_bookmark_search.py`.
The new tests cover:
- Edge cases in LMDB initialization and error handling (including specific exception types).
- Fallback mechanisms when the database is unavailable or corrupt.
- Pagination and error handling in the search API.
- Indexing logic, including updates and duplicate detection.
- The main execution flow and CLI argument parsing.

This brings the coverage of `fuzzy_bookmark_search.py` to approximately 92%.
Added `tests/test_crawl_advanced.py` which includes tests for:
- Data sanitization and pickling with recursion handling.
- Disk space and LMDB existence checks.
- LMDB backup functionality (including platform-specific locking mocks).
- Custom parser loading and filtering.
- Signal handling.
- Encoding fixes.
- Selenium fetching (Zhihu and general cases).
- `fetch_webpage_content` logic including deduplication and error handling.
- `main` execution flow with arguments.
- Secondary index updates.

This brings `crawl.py` coverage to 72% when combined with existing tests.
Fixed issues with mocking global `lmdb_env` by mocking `safe_lmdb_operation` directly.
Handled platform-specific constants (`HAS_MSVC`) in tests.
Added `tests/test_crawl_expert.py` covering:
- `parallel_fetch_bookmarks` with synchronous execution to verify flushing logic and item processing.
- `init_webdriver` and `prepare_webdriver` execution paths (previously mocked out).
- `fix_encoding` heuristics with detailed cases.
- `apply_custom_parsers` logic.
- `test_api_connection` branches for different models.
- Full `main` execution flow with mocked components.
- `resize_lmdb_database` retry logic.

Refined `tests/test_crawl_advanced.py` to fix mocking issues with `lmdb_env` global state and argument order in patches.
Combined coverage increased to ~77% (statement coverage).
… structure in test_crawl_advanced.py

The issue was that the recursion limit in `safe_pickle` wasn't high enough for deeply nested structures on the local machine (Windows, Python 3.12.7), even though it worked in cloud environments.

Changes:
1. **Increased recursion limit** in `crawl.py` `safe_pickle` function from 10000 to 20000 to handle deeper recursion.
2. **Reduced test depth** in `tests/test_crawl_advanced.py` `test_safe_pickle_recursion` from 2000 to 1000 levels to make the test more reasonable while still testing recursion limit adjustment.

This resolves the platform-specific recursion issue.
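The recursion-limit adjustment described above can be sketched as follows (a simplified `safe_pickle`; the real function in `crawl.py` may differ in its error handling):

```python
import pickle
import sys

def safe_pickle(obj, limit: int = 20000) -> bytes:
    """Pickle an object after temporarily raising the recursion limit,
    so deeply nested structures don't raise RecursionError mid-dump."""
    old = sys.getrecursionlimit()
    sys.setrecursionlimit(max(old, limit))
    try:
        return pickle.dumps(obj)
    finally:
        # Always restore the original limit, even if pickling fails.
        sys.setrecursionlimit(old)
```

Raising the limit only for the duration of the dump keeps the rest of the program at the interpreter's default, which is why the test depth could be reduced while still exercising the adjustment.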
- Enhanced .gitignore with categorized rules for config, DB, backups, logs, and IDE files
- Comprehensive update to Chinese documentation (README-CN.md):
  * Added multi-browser support details (Chrome, Firefox, Edge, Safari, etc.)
  * Added installation options (binary, PyPI, from source)
  * Added fuzzy search feature documentation
  * Updated output files description (LMDB, Whoosh index)
  * Added custom parser architecture explanation
  * Added author info and recommended third-party tools
- Fixed fuzzy search command documentation in README.MD

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@sologuy sologuy merged commit 7710379 into main Dec 1, 2025
6 checks passed