Merged
171 commits
3b48534
add: -p argument to specify a custom profile path (allows supporting …
lrq3000 Nov 11, 2025
c3254a8
add: run_with_env.py to easily use a .env config file before running …
lrq3000 Nov 11, 2025
17c652b
docs: clarify to remove inline comments in env file
lrq3000 Nov 11, 2025
fb912d2
fix: env file runner now supports inline comments and blank spaces st…
lrq3000 Nov 11, 2025
6c14f9b
docs: start translating comments and progress texts in crawl.py from …
lrq3000 Nov 11, 2025
a004a86
docs: update suggested model
lrq3000 Nov 11, 2025
f1398cb
fix: translate all strings from Chinese to English + summarization pr…
lrq3000 Nov 11, 2025
0c7eaf5
add: summarization prompt autodetects language from webpage's content
lrq3000 Nov 11, 2025
533312d
feat: add local fuzzy search web app with ultra fast indexing via Whoosh
lrq3000 Nov 13, 2025
888cc21
docs: update README.MD to mention the new search app
lrq3000 Nov 13, 2025
80145e2
feat: add measure scripts
lrq3000 Nov 13, 2025
4056adf
feat: add pyproject.toml (make the project into a standardized Python…
lrq3000 Nov 13, 2025
a51ce41
feat: Add support for other browsers beyond Chrome, with automatic mu…
lrq3000 Nov 13, 2025
da92feb
docs: update README.MD to reflect the new cross-browsers and cross-pl…
lrq3000 Nov 13, 2025
22ea1fd
feat: remove the need for environment variables, replaced by TOML con…
lrq3000 Nov 13, 2025
ee1761a
feat: add pagination and show total number of results and lookup time…
lrq3000 Nov 14, 2025
06732a3
feat: always print the total number of bookmarks in the terminal at l…
lrq3000 Nov 14, 2025
75a13d5
feat: skip already processed keys (deduplicate when indexing) and sho…
lrq3000 Nov 14, 2025
22bf9e0
fix: deduplicating indexing in search engine
lrq3000 Nov 14, 2025
35423b6
feat: crawl.py deduplicates/skip already processed URLs or contents (…
lrq3000 Nov 14, 2025
51d1bcc
feat: flush intermediate indexing results regularly in crawl.py
lrq3000 Nov 14, 2025
9028319
feat: add argument --flush-batch-size to control batch size before sa…
lrq3000 Nov 14, 2025
5a6e4ed
feat: search engine defaults to updating the index (inverse default b…
lrq3000 Nov 14, 2025
5bd3b0c
feat: skip existing summaries by default in crawl.py
lrq3000 Nov 14, 2025
f63b7c8
fix: intermediate flushing of indexing now works correctly in crawl.p…
lrq3000 Nov 15, 2025
bfeffa2
fix: rewrite flushing to avoid the use of nonlocal variables
lrq3000 Nov 15, 2025
896ac18
chores: better comment
lrq3000 Nov 15, 2025
393e1cb
feat: add buttons to display full summaries and content and json in s…
lrq3000 Nov 15, 2025
9b52034
feat: major update: add custom parsers modular architecture to do spe…
lrq3000 Nov 15, 2025
b1e7f74
fix: crawling can now be interrupted despite parallel workers with fl…
lrq3000 Nov 15, 2025
3a27bbd
feat: Major update: migrate to ZODB instead of JSON for crawling inde…
lrq3000 Nov 15, 2025
8de1c9e
chores: forgot to remove `run_with_env.py`
lrq3000 Nov 15, 2025
4114073
docs: add TODO.md
lrq3000 Nov 15, 2025
f57f00f
fix: ZODB crashing when no database present + bump v0.2.1
lrq3000 Nov 15, 2025
f28cd49
fix: various issues with ZODB implementation in crawl.py (now it work…
lrq3000 Nov 15, 2025
754d9b0
fix: summaries generation skips incomplete bookmarks (missing content…
lrq3000 Nov 15, 2025
738c6fe
fix: ZODB flushing issues again
lrq3000 Nov 16, 2025
f820f6a
feat: migrate from ZODB to LMDB, allows constant time insertions and …
lrq3000 Nov 16, 2025
957f294
feat: improve summarization prompt to be more concise (avoid starting…
lrq3000 Nov 16, 2025
58f88cb
docs: update generated files and mention custom parsers and context w…
lrq3000 Nov 16, 2025
8c0d838
chores: bump v0.2.4
lrq3000 Nov 16, 2025
99757c7
chores: update prints and inline comments to mention LMDB instead of …
lrq3000 Nov 16, 2025
de66cab
feat: modularize zhihu content parsing to a custom parser (so that no…
lrq3000 Nov 16, 2025
ff90406
fix: increase batch size in search engine for increased speed
lrq3000 Nov 16, 2025
d144332
feat: Improve robustness and configurability of LMDB
lrq3000 Nov 16, 2025
3b05a7d
chores: bump v0.2.5
lrq3000 Nov 16, 2025
448705b
feat: adding advanced search criteria: bookmark creation date and dom…
lrq3000 Nov 16, 2025
fe590a8
chores: bump v0.2.6
lrq3000 Nov 16, 2025
ca7b0a0
feat: Systematize custom parsers call order by alphabetical order
lrq3000 Nov 17, 2025
73b201e
feat: add suspended tabs custom parser (Chrome only)
lrq3000 Nov 17, 2025
f202edb
chores: bump v0.2.7
lrq3000 Nov 17, 2025
f15ba9e
fix: sanitize bookmarks to remove non serializable objects such as se…
lrq3000 Nov 17, 2025
ccea0ff
chore: bump version 0.2.8
lrq3000 Nov 17, 2025
3cc488a
feat: add automatic dynamic resizing LMDB database in crawl.py, reduc…
lrq3000 Nov 22, 2025
94c55ca
chores: bump version 0.2.9b
lrq3000 Nov 22, 2025
155c0cc
fix: maximum recursion error preventing flushing LMDB on-disk because…
lrq3000 Nov 23, 2025
d1677d9
feat: keep bookmark's title by default (with user modifications) and …
lrq3000 Nov 23, 2025
16e2160
chores: bump version 0.2.10b
lrq3000 Nov 23, 2025
ecf17a4
fix: console printing errors
lrq3000 Nov 23, 2025
46f500b
chores: bump version 0.3.0
lrq3000 Nov 23, 2025
8c9ab60
docs: add authorship information in README.MD
lrq3000 Nov 23, 2025
8fbfb65
feat: dynamic LMDB resizing in fuzzy_bookmark_search.py
lrq3000 Nov 22, 2025
5b1fd03
Revert "feat: dynamic LMDB resizing in fuzzy_bookmark_search.py"
lrq3000 Nov 23, 2025
2376d30
fix: fuzzy_bookmark_search.py opens the LMDB database in read-only mode
lrq3000 Nov 23, 2025
2f17108
fix: LMDB cursor now accesses bookmark data instead of unpickling met…
lrq3000 Nov 24, 2025
5f7898f
chores: bump version 0.3.1
lrq3000 Nov 24, 2025
92a340b
docs: mention modular architecture and scaling in README.MD
lrq3000 Nov 24, 2025
8147b9f
build: add pyInstaller and GitHub Workflow automatic builds (for Wind…
lrq3000 Nov 24, 2025
a1f4391
docs: add build instructions in BUILD.md
lrq3000 Nov 24, 2025
69b0b96
feat: skip Selenium fetching if Chrome/Chromium browser absent
lrq3000 Nov 24, 2025
682453b
chores: bump version 0.3.2
lrq3000 Nov 24, 2025
c4f421f
build: drop py 3.6-3.8 and add py 3.11-3.12 in ci
lrq3000 Nov 24, 2025
4a0d541
build: drop py 3.12, add windows target platform
lrq3000 Nov 24, 2025
b5214bc
Update build system to generate multiple executables
google-labs-jules[bot] Nov 24, 2025
57a92fa
Fix ChromeDriver initialization in frozen environment
google-labs-jules[bot] Nov 24, 2025
a22d4b0
Fix Windows execution issues for frozen binaries
google-labs-jules[bot] Nov 24, 2025
c82199c
refactor: fuzzy_bookmark_search.py only opens LMDB db in read-only, r…
lrq3000 Nov 24, 2025
3a60faf
Merge pull request #1 from lrq3000/multi-binaries-build
lrq3000 Nov 24, 2025
2db6be3
chore: bump version 0.3.3
lrq3000 Nov 24, 2025
a94f6cf
refactor: no globals variables in fuzzy_bookmark_search.py (refactore…
lrq3000 Nov 24, 2025
c9a4992
fix: fuzzy_bookmark_search.py successfully opens lmdb after refactoring
lrq3000 Nov 25, 2025
4324d8c
chore: bump version 0.3.4
lrq3000 Nov 25, 2025
195862e
fix: index.py failing to extract bookmarks from Firefox because of No…
lrq3000 Nov 26, 2025
58e74aa
feat: dynamically retrieve the list of supported browsers by browser-…
lrq3000 Nov 26, 2025
292ed95
chore: bump version 0.3.5
lrq3000 Nov 26, 2025
6117c06
fix: youtube transcripts are now searchable in the Whoosh index (and …
lrq3000 Nov 26, 2025
3140125
chore: bump version 0.3.6
lrq3000 Nov 26, 2025
9cbec7f
feat: implement configurable semi-random delays in crawl.py
lrq3000 Nov 26, 2025
12256b1
docs: update TODO.md
lrq3000 Nov 26, 2025
45a7f77
chore: bump version 0.3.7
lrq3000 Nov 26, 2025
0464506
fix: print waiting time in crawl.py
lrq3000 Nov 26, 2025
a3c73b7
chore: bump version 0.3.8
lrq3000 Nov 26, 2025
c44c911
Fix: Improve parallelism in bookmark crawling
google-labs-jules[bot] Nov 26, 2025
f96f521
Feat: Add worker thread IDs to crawl logs and fix parallelism
google-labs-jules[bot] Nov 26, 2025
62fa0ba
Fix: Ensure true parallelism in bookmark crawling
google-labs-jules[bot] Nov 26, 2025
1fdd7db
Fix: Resolve parallelism bottlenecks in bookmark crawler
google-labs-jules[bot] Nov 26, 2025
25faf9a
Fix: Refactor for true parallelism in bookmark crawler (true fix for …
google-labs-jules[bot] Nov 26, 2025
87c9a7e
chore: bump version 0.4.0
lrq3000 Nov 26, 2025
4300fb8
Merge pull request #2 from lrq3000/fix-parallel-crawl
lrq3000 Nov 26, 2025
8f15c7e
docs: update install instructions
lrq3000 Nov 26, 2025
a420858
docs: multi browsers support
lrq3000 Nov 27, 2025
ad52762
chores: add AGENTS.md
lrq3000 Nov 30, 2025
c9a9432
fix: continue parallel processing refactoring, delete/refactor stubs …
lrq3000 Nov 30, 2025
e800b01
chore: bump v0.4.1
lrq3000 Nov 30, 2025
249de7e
build: add continuous integration, code coverage (codecov) and contin…
lrq3000 Nov 30, 2025
c84a180
fix: initalize variable to avoid scope issues
lrq3000 Nov 30, 2025
880a310
build: fix ci build bugs
lrq3000 Nov 30, 2025
6040f55
build: rename build.yml -> build-exe.yml + build for latest python re…
lrq3000 Nov 30, 2025
b9339b8
build: workflows deduplication and renaming for clarity
lrq3000 Nov 30, 2025
ae7a04a
build: add coverage to the test dependencies for github actions to work
lrq3000 Nov 30, 2025
7e2787e
chore: comment
lrq3000 Nov 30, 2025
7394f34
fix: change default summarization model from gemma3:1b (hallucinates …
lrq3000 Nov 30, 2025
9fd3127
fix: print when no subtitles found in youtube parser
lrq3000 Nov 30, 2025
05a2b38
build: do not skip installing the module
lrq3000 Nov 30, 2025
97bcc46
build: clarify step name
lrq3000 Nov 30, 2025
62c9066
feat: rework youtube transcripts parser to try English first then oth…
lrq3000 Nov 30, 2025
34c8899
chore: bump version 0.4.2
lrq3000 Nov 30, 2025
48cb145
Fix unit tests failures and coverage configuration
google-labs-jules[bot] Nov 30, 2025
ff22429
docs: add 3rd-party tools + default ollama model
lrq3000 Nov 30, 2025
1e80753
feat: add CLI param --parsers to allow to exclude select custom parsers
lrq3000 Nov 30, 2025
5b0ebcb
Fix YouTube parser transcript fetching by using a browser User-Agent
google-labs-jules[bot] Nov 30, 2025
bf8c40b
Fix custom parsers loading in frozen environment (PyInstaller)
google-labs-jules[bot] Nov 30, 2025
ec994d4
Address PR feedback: Fix import order and exception raising
google-labs-jules[bot] Nov 30, 2025
dd15c35
Refactor tests: improve clarity and unused variable handling
google-labs-jules[bot] Nov 30, 2025
1161e35
Fix custom parsers detection and loading in frozen environments (#7)
lrq3000 Nov 30, 2025
e329109
Merge branch 'main' into fix-unit-tests-coverage
lrq3000 Nov 30, 2025
e2f7190
Fix unit tests isolation and coverage config (#5)
lrq3000 Nov 30, 2025
315d191
Refactor YouTube parser to use context manager for requests.Session
google-labs-jules[bot] Nov 30, 2025
f35323a
build: install `build` dependency
lrq3000 Nov 30, 2025
20069c8
build: install twine dependency
lrq3000 Nov 30, 2025
aaea17b
Resolve merge conflicts in custom_parsers/youtube.py and update logic
google-labs-jules[bot] Nov 30, 2025
6c94210
build: add validate-pyproject dependency
lrq3000 Nov 30, 2025
02de074
Merge branch 'main' into fix-youtube-parser-ua
lrq3000 Nov 30, 2025
44e47eb
Fix YouTube parser blocking issue by adding User-Agent (#6)
lrq3000 Nov 30, 2025
e722540
CI: configure bandit to fail only on high severity issues
google-labs-jules[bot] Nov 30, 2025
f9215b0
Update CI to fail bandit only on high severity issues (#8)
lrq3000 Nov 30, 2025
0d2c5df
chore: bump version 0.4.3
lrq3000 Nov 30, 2025
9d2dbc3
add as co-author, not just maintainer at this point
lrq3000 Nov 30, 2025
029c136
build: update upload/download-artifact to v4
lrq3000 Nov 30, 2025
ec8cedc
chore: bump version 0.4.3.post1
lrq3000 Nov 30, 2025
0c98fd7
build: fix coverage issues
lrq3000 Nov 30, 2025
65a63e7
chore: bump version 0.4.3.post2
lrq3000 Nov 30, 2025
772a377
Fix artifact upload conflict in CI release workflow
google-labs-jules[bot] Nov 30, 2025
8bb8c00
Fix CI artifact name conflict in releases workflow (#9)
lrq3000 Nov 30, 2025
2e463fd
chore: bump v0.4.3.post3
lrq3000 Nov 30, 2025
d3cdac8
Add comprehensive unit tests to increase branch coverage
google-labs-jules[bot] Nov 30, 2025
2fd5008
build: update to install the correct pythonic name for this app (book…
lrq3000 Dec 1, 2025
6f4994d
chore: bump version 0.4.3.post4
lrq3000 Dec 1, 2025
bcec996
Fix import errors in tests and increase code coverage
google-labs-jules[bot] Dec 1, 2025
5302159
Merge branch 'main' into fix-test-imports-and-coverage
lrq3000 Dec 1, 2025
b291566
Fix test import errors and improve coverage (#11)
lrq3000 Dec 1, 2025
7bc58e9
Merge branch 'main' into unit-tests-coverage
lrq3000 Dec 1, 2025
6a51afa
Fix dependency confusion in TestPyPi CI job
google-labs-jules[bot] Dec 1, 2025
7f8c1b8
Fix TestPyPi installation failure in CI (#12)
lrq3000 Dec 1, 2025
dc01796
chore: bump version 0.4.3.post5
lrq3000 Dec 1, 2025
d5e84b4
docs: remove redundant license shield
lrq3000 Dec 1, 2025
a549e88
Add extensive unit tests for core modules and fix parser tests
google-labs-jules[bot] Dec 1, 2025
0c48a19
Merge branch 'main' into unit-tests-coverage
lrq3000 Dec 1, 2025
89dd821
Add unit tests to increase branch coverage (#10)
lrq3000 Dec 1, 2025
6ed890c
Merge branch 'main' into increase-coverage-more-tests
lrq3000 Dec 1, 2025
260c137
Add more unit tests to further increase coverage (#13)
lrq3000 Dec 1, 2025
c8e7077
Add comprehensive unit tests for crawl.py to increase branch coverage
google-labs-jules[bot] Dec 1, 2025
0f5da0c
Add unit tests for crawl.py coverage improvement (#14)
lrq3000 Dec 1, 2025
628d2bd
Add comprehensive unit tests for fuzzy_bookmark_search.py
google-labs-jules[bot] Dec 1, 2025
a11af67
Increase test coverage for fuzzy_bookmark_search.py to 92% (#16)
lrq3000 Dec 1, 2025
758c88f
Add comprehensive unit tests for crawl.py to increase coverage
google-labs-jules[bot] Dec 1, 2025
0fae51f
Increase crawl.py test coverage with advanced unit tests (#17)
lrq3000 Dec 1, 2025
e966085
Add expert-level unit tests to crawl.py to maximize coverage
google-labs-jules[bot] Dec 1, 2025
95f89b3
Add expert tests to maximize crawl.py coverage (#18)
lrq3000 Dec 1, 2025
e3924a4
fix: Increase recursion limit in safe_pickle and adjust test for deep…
lrq3000 Dec 1, 2025
993aed3
docs: Update documentation and improve .gitignore rules
sologuy Dec 1, 2025
39 changes: 39 additions & 0 deletions .github/workflows/build-exe.yml
@@ -0,0 +1,39 @@
name: Build Executable App Across OSes

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  workflow_dispatch:

jobs:
  build:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "*" # "*" = last stable python version

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install pyinstaller

      - name: Build executable
        run: python build_app.py

      - name: Upload artifacts
        uses: actions/upload-artifact@v4
        with:
          name: BookmarkSummarizer-Binaries-${{ matrix.os }}
          path: dist/
76 changes: 76 additions & 0 deletions .github/workflows/ci-build.yml
@@ -0,0 +1,76 @@
# This workflow will install Python dependencies and run tests with a variety of Python versions
# It uses the Python Package GitHub Actions workflow.
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python
# and https://www.youtube.com/watch?v=l6fV09z5XHk

name: Continuous integration for each commit and pull request

on:
  push:
    branches:
      - main # $default-branch only works in Workflows templates, not in Workflows, see https://stackoverflow.com/questions/64781462/github-actions-default-branch-variable
  pull_request:
    branches:
      - main
  workflow_dispatch:

jobs:
  build:
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        python-version: ["*"] # check the list of versions: https://github.com/actions/python-versions/releases and https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md -- note that "*" represents the latest stable version of Python
        os: [ ubuntu-latest, windows-latest, macos-latest ] # jobs that run on Windows and macOS runners that GitHub hosts consume minutes at 2 and 10 times the rate that jobs on Linux runners consume respectively. But it's free for public OSS repositories.
    steps:
    - uses: actions/checkout@v4
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v5
      with:
        python-version: ${{ matrix.python-version }}
        cache: 'pip'
    # You can test your matrix by printing the current Python version
    - name: Display Python version
      run: |
        python -c "import sys; print(sys.version)"
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
    - name: Install this Python app
      run: |
        python -m pip install --upgrade --editable .[test] --verbose --use-pep517
    - name: Test with pytest
      run: |
        #coverage run --branch -m pytest . -v # Do NOT do that, because coverage is already run in pytest as specified in pyproject.toml, so this calls two nested instances of coverage, hence this will glitch out!
        pytest -v # run tests with coverage (as specified in pyproject.toml) and save the coverage as html and xml
        coverage report -m
    - name: Upload coverage to Codecov
      uses: codecov/codecov-action@v5
      with:
        token: ${{ secrets.CODECOV_TOKEN }} # now required even for public repos, and also advised to avoid rate-limiting API by GitHub which makes the upload fails randomly: https://community.codecov.com/t/upload-issues-unable-to-locate-build-via-github-actions-api/3954/9 and https://github.com/codecov/codecov-action/issues/598
        #directory: ./coverage/reports/
        env_vars: OS,PYTHON
        fail_ci_if_error: false
        #files: ./coverage1.xml,./coverage2.xml
        flags: unittests
        name: codecov-umbrella
        verbose: true
    - name: Build sdist (necessary for the other tests below)
      if: ${{ matrix.python-version == '*' }}
      run: |
        pip install --upgrade build
        python -sBm build
    - name: Twine check
      if: ${{ matrix.python-version == '*' }}
      run: |
        pip install --upgrade twine
        twine check "dist/*"
    - name: pyproject.toml validity
      if: ${{ matrix.python-version == '*' }}
      run: |
        pip install --upgrade validate-pyproject
        validate-pyproject pyproject.toml -v
    - name: Check for potential security issues
      run: |
        pip install --upgrade bandit
        bandit -r . -x ./tests -lll
42 changes: 0 additions & 42 deletions .github/workflows/python-ci.yml

This file was deleted.

123 changes: 123 additions & 0 deletions .github/workflows/releases-ci-cd.yml
@@ -0,0 +1,123 @@
# This workflow will test the module and then upload to PyPi, when triggered by the creation of a new GitHub Release
# It uses the Python Package GitHub Actions workflow.
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python
# and https://www.youtube.com/watch?v=l6fV09z5XHk
# and https://py-pkgs.org/08-ci-cd#uploading-to-testpypi-and-pypi

name: Releases test, coverage and upload to Test PyPi and PyPi

# Build only on creation of new releases
on:
  # push: # build on every commit push
  # pull_request: # build on every pull request
  release: # build on every releases
    types:
      - published # use published, not released and prereleased, because prereleased is not triggered if created from a draft: https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#release
  workflow_dispatch:

jobs:
  testbuild:
    name: Unit test and building
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        python-version: ["*"] # check the list of versions: https://github.com/actions/python-versions/releases and https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md -- note that "*" represents the latest stable version of Python
        os: [ ubuntu-latest, windows-latest, macos-latest ] # jobs that run on Windows and macOS runners that GitHub hosts consume minutes at 2 and 10 times the rate that jobs on Linux runners consume respectively. But it's free for public OSS repositories.
    steps:
    - uses: actions/checkout@v4
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v5
      with:
        python-version: ${{ matrix.python-version }}
        cache: 'pip'
    # You can test your matrix by printing the current Python version
    - name: Display Python version
      run: |
        python -c "import sys; print(sys.version)"
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        #python -m pip install pytest pytest-cov # done in setup.cfg for Py2 or pyproject.toml for Py3
        #if [ ${{ matrix.python-version }} <= 3.7 ]; then python -m pip install 'coverage<4'; else python -m pip install coverage; fi
        #if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Install this module
      #if: ${{ matrix.python-version >= 3 }} # does not work on dynamic versions, see: https://github.com/actions/setup-python/issues/644
      # Do not import testmeta, they make the build fails somehow, because some dependencies are unavailable on Py2
      run: |
        #python -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple --upgrade --editable .[test] --verbose --use-pep517
        # Here we do NOT build against test.pypi.org but only the real pypi because we want to test before shipping whether users with a normal pypi version can install our package!
        python -m pip install --upgrade --editable .[test] --verbose --use-pep517
    - name: Test with pytest
      run: |
        #coverage run --branch -m pytest . -v # Do NOT do that, because coverage is already run in pytest as specified in pyproject.toml, so this calls two nested instances of coverage, hence this will glitch out!
        pytest -v
        coverage report -m
    - name: Build source distribution and wheel
      run: |
        python -m pip install --upgrade build
        python -sBm build
    - name: Save dist/ content for reuse in other GitHub Workflow blocks
      if: matrix.os == 'ubuntu-latest'
      uses: actions/upload-artifact@v4
      with:
        path: dist/*

  upload_test_pypi: # Upload to TestPyPi first to ensure that the release is OK (we will try to download it and install it afterwards), as recommended in https://py-pkgs.org/08-ci-cd#uploading-to-testpypi-and-pypi
    name: Upload to TestPyPi
    needs: [testbuild]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Unpack default artifact into dist/
        uses: actions/download-artifact@v4
        with:
          # unpacks default artifact into dist/
          # if `name: artifact` is omitted, the action will create extra parent dir
          name: artifact
          path: dist

      - name: Upload to TestPyPi
        uses: pypa/[email protected]
        with:
          user: __token__
          password: ${{ secrets.TEST_PYPI_API_TOKEN }}
          repository_url: https://test.pypi.org/legacy/
          # To test: repository_url: https://test.pypi.org/legacy/ # and also change token: ${{ secrets.PYPI_API_TOKEN }} to secrets.TEST_PYPI_API_TOKEN # for more infos on registering and using TestPyPi, read: https://py-pkgs.org/08-ci-cd#uploading-to-testpypi-and-pypi -- remove the repository_url to upload to the real PyPi

      - name: Test install from TestPyPI
        run: |
          python -m pip install --upgrade pip
          # First install dependencies from the real PyPI by installing the local package
          # This avoids dependency confusion attacks (e.g. FASTAPI 1.0 on TestPyPI)
          python -m pip install .
          # Then uninstall the local package but keep dependencies
          python -m pip uninstall bookmark-summarizer -y
          # Finally install the package from TestPyPI without dependencies (since they are already installed)
          python -m pip install \
            --index-url https://test.pypi.org/simple/ \
            --no-deps \
            bookmark-summarizer

  upload_pypi: # Upload to the real PyPi if everything else worked before, as suggested in: https://py-pkgs.org/08-ci-cd#uploading-to-testpypi-and-pypi
    name: Upload to the real PyPi
    needs: [testbuild, upload_test_pypi]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          # unpacks default artifact into dist/
          # if `name: artifact` is omitted, the action will create extra parent dir
          name: artifact
          path: dist

      - uses: pypa/[email protected]
        with:
          user: __token__
          password: ${{ secrets.PYPI_API_TOKEN }}

      - name: Test install from PyPI
        run: |
          python -m pip install --upgrade pip
          pip uninstall bookmark-summarizer -y
          pip install --upgrade bookmark-summarizer
33 changes: 33 additions & 0 deletions .gitignore
@@ -181,5 +181,38 @@ failed_urls*.json
failed_urls.txt
.env

# Configuration files (contain API keys, must be ignored)
*.toml
!default_config.toml
!pyproject.toml

# Database and index files
*.lmdb
*.lmdb/
bookmark_index.lmdb/
whoosh_index/

# Backup directories
backups/
backup/
*.backup

# Temporary and debugging files
check_index.py
debug_*.py
test_*.py
!tests/test_*.py

# Log files
crawl_errors.log
*.log

# IDEs and editors
.vscode/
.idea/
*.swp
*.swo
*~

# Files that must not be ignored
!requirements.txt
28 changes: 28 additions & 0 deletions AGENTS.md
@@ -0,0 +1,28 @@
Apply these instructions in any language; translate into the appropriate language before responding.

Before answering, consider what senior expert knowledge would best fit, then adopt the persona of the most relevant human expert for the question, and explicitly mention which expert you chose. For example, for relationship issues, become a couples therapist. You can combine personas if both are highly relevant.

When you are asked to solve a problem but there is no straightforward solution, offer to be creative to find multiple innovative solutions.

Be extremely detailed and comprehensive. Err on the side of including too much information rather than too little, unless the user has requested brevity. Provide background, logic, alternatives, implications, and expert context in your answers.

Be honest, transparent, and thorough. Assume the user needs highly reliable, decision-critical information, so take the time to check for gaps, biases, or false assumptions.

When the user asks for a solution, be innovative but pragmatic and mindful of minimizing algorithmic complexity, and you can suggest multiple alternatives if there is no obviously optimal solution that is well established for this type of problem.

Always check whether it is impossible to achieve what the user wants to do. In this case, clearly state so, then adopt a creative persona, and offer multiple alternative solutions for the underlying problem, then ask the user which solution they would prefer.

Always try to minimize the changes to the bare minimum. Avoid any unnecessary changes, except if they improve readability or functionality. For example, if changing a function's name would improve neither readability nor functionality, just keep it as it is.

To achieve minimization, always think about multiple different ways to reach your objective, as there are not only different conceptual ways, but also once a conceptual way is chosen, there are multiple implementations possible to achieve the same purpose. Always try to choose the implementation that would lead to the least changes in the codebase, unless the user states this approach was already tried and failed.

The user likes literate programming, hence add as many pertinent and non-trivial comments as possible to your changes.

In case of bugs:
* feel free to experiment with the API directly yourself via command-line to check if it works as you expect,
* and always check whether the variables used indeed exist and contain the values they are supposed to at run-time.

Try to be innovative, and to think in a first principles way. Suggest several options when brainstorming solutions or when the solution to a problem is not obvious.

When orchestrating a new plan of action, first investigate the cause of the stated problem and how to best fix it by reading the source files and potentially by running a few CLI commands (no more than 3), make a detailed plan with one or several solutions offered, and ask the user to validate it before doing any edit.

3 changes: 3 additions & 0 deletions BUILD.md
@@ -0,0 +1,3 @@
# BUILD INSTRUCTIONS

To build native executables locally (for your own OS, e.g. Windows), run `python build_app.py`. For cross-platform builds, push to GitHub to trigger the Actions workflow.
25 changes: 24 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
-# 更新日志
+# Changelog

All notable changes to the BookmarkSummarizer project will be documented in this file.

@@ -7,6 +7,29 @@

## [Unreleased]

### 0.4.1

Refactored crawl.py for parallel processing.

An intentionally sequential code path was triggered whenever `--limit` was set, and it was the primary cause of the non-parallel behavior. It was replaced with a single, unified parallel implementation that now correctly handles both limited and unlimited crawls.

* **Parallel Bookmark Processing:** The processing logic now resides in the `_crawl_bookmark` worker function, which is called for every bookmark within the `ThreadPoolExecutor`. This ensures all bookmarks are processed concurrently.
* **Partial Flushing:** The periodic flushing is handled within the main `for future in as_completed(futures):` loop. It checks the time elapsed since the last flush and writes the latest batch of results to disk, preserving the exact same data-saving functionality as before.
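
The two points above can be sketched as follows. This is a minimal illustration only, not the actual code of crawl.py: the body of `_crawl_bookmark`, the `crawl_all` helper, the `flush` callback and the `flush_interval` parameter are assumptions made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def _crawl_bookmark(bookmark):
    """Hypothetical worker: stands in for fetching and parsing one bookmark."""
    return {"url": bookmark["url"], "content": f"content of {bookmark['url']}"}


def crawl_all(bookmarks, flush, max_workers=4, flush_interval=60.0, limit=None):
    """Crawl bookmarks concurrently and hand results to `flush` in batches."""
    if limit is not None:
        # A limit only trims the input; the processing path stays parallel.
        bookmarks = bookmarks[:limit]
    pending = []  # results collected since the last flush
    last_flush = time.monotonic()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(_crawl_bookmark, b) for b in bookmarks]
        for future in as_completed(futures):
            pending.append(future.result())
            # Time-based partial flush: write the latest batch and start a new one.
            if time.monotonic() - last_flush >= flush_interval:
                flush(pending)
                pending = []
                last_flush = time.monotonic()
    if pending:
        flush(pending)  # final flush for whatever remains after the loop


# Usage: collect flushed batches in a list instead of writing to disk.
saved = []
crawl_all([{"url": "https://example.com/a"}, {"url": "https://example.com/b"}], saved.extend)
```

Flushing from the main loop rather than from the workers keeps all writes in a single thread, which is one simple way to avoid concurrent writers; whether crawl.py relies on this is not stated in the changelog.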

### 0.3.1

Big bundle of updates, with various new features and bugfixes:
* Translate the whole project from Chinese to English, including the summarization prompt; language autodetection was added so that each summary is written in the language of the webpage's content.
* Add support for other browsers; by default, bookmarks are now imported from all installed browsers at once. A single browser can still be specified using an argument.
* Add a very fast fuzzy search engine with a web app GUI and pagination support. Both indexing and lookup are fast and intended to scale to millions of bookmarks; everything is stored on disk, so RAM is not an issue.
* Indexing resuming and deduplication (also implemented for summarization) and atomic intermediate flushing, so the database can be updated incrementally, or a run can be interrupted and continued later. This is especially important for those with a LOT of bookmarks (like me! Because I use bookmarks as a past browsing sessions saver/dump).
* Pythonic packaging with pyproject.toml, so this app can be published on PyPI and easily installed through `pip install`.
* CLI entrypoints are created on pip install for the main scripts: index.py, crawl.py and fuzzy_bookmark_search.py.
* An LMDB database for the content crawling and the summaries, and a Whoosh database for fast fuzzy searching. Both databases scale dynamically along with the number of bookmarks (the crawling database doubles in size each time the bookmarks' content gets too close to the database's total size). The LMDB database is out-of-core, so it is extremely scalable: it can grow far beyond the RAM available on the user's system, and only a fraction of RAM is needed to create a view into it, so the RAM footprint remains minimal (a few dozen to hundreds of MB) even when the database is dozens of GB (and a few GB of RAM to access a multi-TB database).
* Changed the default settings for the summaries to use ollama and qwen3:1.7b, which is very effective. Alternatively, qwen3:0.6b also produces acceptable summaries, albeit less accurate and with a shorter context window.
* Modular architecture: custom parsers can be added without modifying the core logic by adding python files in custom_parsers. For example, custom parsers are provided to extract YouTube transcripts as content to summarize, and suspended tabs that got bookmarked are transparently unsuspended to fetch the true target page content.
* A lot of bugfixes here and there, and additional verbose outputs.
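
The doubling policy described for the crawling database can be sketched as below. This is a hedged illustration, not the project's implementation: the function name `next_map_size`, the `incoming_bytes` parameter and the 0.8 threshold are assumptions, since the changelog only says the size is multiplied by 2 when the content gets too close to the total size.

```python
def next_map_size(current_size, used_bytes, incoming_bytes=0, threshold=0.8):
    """Return a map size that keeps usage at or below `threshold` of capacity.

    The size is doubled repeatedly until the data already stored plus the
    data about to be written fits under the threshold. All names and the
    default threshold are illustrative assumptions.
    """
    if current_size <= 0:
        raise ValueError("current_size must be positive")
    new_size = current_size
    while used_bytes + incoming_bytes > threshold * new_size:
        new_size *= 2  # grow geometrically, so resizes stay rare as data grows
    return new_size


# Usage: 1 GiB map holding 900 MiB, with a 200 MiB batch about to be flushed.
MIB = 1024 * 1024
print(next_map_size(1024 * MIB, 900 * MIB, incoming_bytes=200 * MIB) // MIB)
```

With the py-lmdb package, the returned value would typically be applied through `Environment.set_mapsize()` before retrying the write. Doubling keeps the number of resizes logarithmic in the final database size.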

### Added
- Initial version development
- Support extracting URLs from Chrome bookmarks