(fix): Bunkrr child item fix and improve scraping performance #1689
Was213xzc wants to merge 59 commits into
Conversation
`/redirect/?to=<b64_link>`
- fix 404 not found errors
refactor: add `nextjs` flight data parsing utils
feat: add upload.ee support
refactor: simplify css iget API
fix: update CDN overrides (bunkr)
feat: add filester support
fix: download of multipage profile albums (Chevereto)
feat: use original name (turbovid)
fix: handle additional redirect links (Xenforo)
merge from master
merge from dev
fix: series name and chapter selection (Toonily)
fix(bunkr): fixes bunkr 404 not found, switch download API from id-based to slug-based endpoint
Pull request overview
This PR fixes Bunkrr album scraping correctness (child items + pagination) while improving throughput on large albums, and adds performance optimizations to history lookups plus regression coverage for Fileditch Turnstile handling.
Changes:
- Add Bunkrr album pagination discovery with batched parallel page fetching and slug de-duping.
- Speed up history lookups by adding SQLite indexes and optimizing referer-completion checks.
- Add/extend tests for Bunkrr pagination behavior, history-table indexing, and Fileditch Turnstile detection.
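As context for the history-index change, here is a minimal, standalone sketch of how SQLite indexes speed up referer lookups. The table and column names below are assumptions inferred from the file summaries in this PR, not CDL's actual schema, and the index names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema mirroring the described "media" history table.
conn.execute("CREATE TABLE media (referer TEXT, domain TEXT, completed INTEGER)")
# Without an index, every referer lookup scans the whole table; with one,
# SQLite can binary-search the index instead.
conn.execute("CREATE INDEX IF NOT EXISTS idx_media_referer ON media (referer)")
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_media_referer_domain ON media (referer, domain)"
)

# EXPLAIN QUERY PLAN shows whether the query uses an index or a full scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT 1 FROM media WHERE referer = ? LIMIT 1", ("x",)
).fetchall()
# The detail column of the plan should mention the index rather than "SCAN".
print(plan)
```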
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `cyberdrop_dl/crawlers/bunkrr.py` | Adds album page discovery/batching, pagination URL normalization, and download-src resolution changes. |
| `cyberdrop_dl/crawlers/fileditch.py` | Detects Cloudflare Turnstile challenge pages and raises `DDOSGuardError`. |
| `cyberdrop_dl/database/tables/history.py` | Creates history-table indexes on startup; optimizes the `check_complete_by_referer` query. |
| `cyberdrop_dl/database/tables/definitions.py` | Defines the `create_history_indexes` DDL script. |
| `tests/test_bunkrr.py` | Unit coverage for the new Bunkrr helpers plus pagination/behavior regressions. |
| `tests/crawlers/test_cases/bunkr.py` | Adds an additional Bunkrr crawler integration test case for album child-count coverage. |
| `tests/test_fileditch_turnstile.py` | Adds regression tests for Turnstile challenge detection. |
| `tests/test_history_table.py` | Verifies index creation and referer-completion-check behavior. |
| `.gitignore` | Ignores pytest/ruff caches and local crawler debugging artifacts. |
This is a good change. Can you please move these database changes to a different PR?
```diff
 if domain is None:
-    query = "SELECT completed FROM media WHERE referer = ?"
+    query = "SELECT 1 FROM media WHERE referer = ? and completed != 0 LIMIT 1"
     params = (str(referer),)
 else:
-    query = "SELECT completed FROM media WHERE referer = ? and domain = ?"
+    query = "SELECT 1 FROM media WHERE referer = ? and domain = ? and completed != 0 LIMIT 1"
     params = str(referer), domain

 cursor = await self.db_conn.execute(query, params)
-if domain is None:
-    rows = await cursor.fetchall()
-else:
-    row = await cursor.fetchone()
-    if row is None:
-        return False
-    rows = [row]
-return bool(rows and any(row[0] != 0 for row in rows))
+return await cursor.fetchone() is not None
```
This is great as well. Please move to a database PR
I don't understand why you made pagination changes to bunkr. We can get all files in an album with a single request using the advanced query param, which is what CDL already does.
No need for pagination
I did not realize there was an advanced query param that did this; I was encountering errors with it failing to scrape items on other pages.
```python
soup = await self.request_soup(scrape_item.url)
_check_turnstile(soup)
```
This is no longer required, but if you want to force it, CDL has a dedicated module for that:
```diff
 soup = await self.request_soup(scrape_item.url)
-_check_turnstile(soup)
+from cyberdrop_dl import ddos_guard
+await ddos_guard.check(soup)
```
I will cherry-pick the database changes onto a new PR.
Fixes
- Album pagination: users no longer need to manually append `?page=2`, `?page=3`, etc. to `urls.txt`.

Features

Why

Large Bunkrr albums can contain downloadable items across child album links and many paginated pages. Previously, users had to manually enumerate page URLs, and large `urls.txt` runs could spend too much time scraping/queueing due to repeated page and history lookups.

Notes
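The batched parallel page fetching with slug de-duping described in this PR can be sketched as follows. This is an illustrative standalone example, not CDL's crawler code: `fetch_page` stands in for a real HTTP request, and the batch size and function names are assumptions:

```python
import asyncio

async def fetch_page(page: int) -> list[str]:
    """Stand-in for fetching one album page and extracting item slugs."""
    await asyncio.sleep(0)  # placeholder for network I/O
    return [f"slug-{page}", "slug-shared"]  # "slug-shared" repeats across pages

async def collect_slugs(total_pages: int, batch_size: int = 5) -> list[str]:
    seen: set[str] = set()
    ordered: list[str] = []
    for start in range(1, total_pages + 1, batch_size):
        batch = range(start, min(start + batch_size, total_pages + 1))
        # Fetch one batch of pages concurrently, then de-dupe in page order
        # so each item is queued exactly once.
        for slugs in await asyncio.gather(*(fetch_page(p) for p in batch)):
            for slug in slugs:
                if slug not in seen:
                    seen.add(slug)
                    ordered.append(slug)
    return ordered

print(asyncio.run(collect_slugs(7)))
```

Batching bounds concurrency (at most `batch_size` requests in flight) while still overlapping the per-page latency that a sequential crawl would pay in full.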