(fix): Bunkrr child item fix and improve scraping performance#1689

Draft
Was213xzc wants to merge 59 commits into Cyberdrop-DL:main from Was213xzc:bunkrr-child-item-fix

Conversation

Contributor

@Was213xzc Was213xzc commented Apr 17, 2026

Fixes

  • Fix Bunkrr child album discovery so downloadable items nested under album links are scraped correctly.
  • Fix Bunkrr album pagination so users do not need to manually add ?page=2, ?page=3, etc. to urls.txt.
  • Preserve Fileditch Turnstile handling with regression coverage.

Features

  • Add batched/parallel Bunkrr album page discovery to improve scraping throughput on large albums.
  • Add history-table indexes to speed up repeated scrape/queue lookups.
  • Optimize completed-referer checks to avoid scanning every matching history row.
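
The batched/parallel page discovery described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual implementation: the `fetch_page`, `discover_album_items`, and `BATCH_SIZE` names are assumptions, and the real crawler's HTTP and parsing logic is stubbed out.

```python
import asyncio

# Hypothetical batch size; the PR does not state the real value.
BATCH_SIZE = 5

async def fetch_page(url: str) -> list[str]:
    # Stand-in for the real HTTP request + HTML parse; returns item slugs.
    await asyncio.sleep(0)
    return [f"{url}#item"]

async def discover_album_items(base_url: str, total_pages: int) -> list[str]:
    seen: set[str] = set()
    items: list[str] = []
    page_urls = [f"{base_url}?page={n}" for n in range(1, total_pages + 1)]
    for start in range(0, len(page_urls), BATCH_SIZE):
        batch = page_urls[start : start + BATCH_SIZE]
        # Fetch one batch of album pages concurrently, then de-dupe slugs
        # so the same item queued from two pages is only scraped once.
        for slugs in await asyncio.gather(*(fetch_page(u) for u in batch)):
            for slug in slugs:
                if slug not in seen:
                    seen.add(slug)
                    items.append(slug)
    return items

result = asyncio.run(discover_album_items("https://example.com/a/abc", 12))
```

Batching bounds the number of in-flight requests per album, which keeps throughput high on large albums without opening one connection per page all at once.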

Why

Large Bunkrr albums can contain downloadable items across child album links and many paginated pages. Previously, users had to manually enumerate page URLs, and large urls.txt runs could spend too much time scraping/queueing due to repeated page and history lookups.

Notes

  • This PR intentionally combines the child-item fix and Bunkrr performance work because the batching/parallelism depends on the pagination and child-album discovery changes.
  • I don't have hard numbers for the performance improvements, but in testing, scraping was noticeably faster than the baseline.

jbsparrow and others added 19 commits March 13, 2026 11:55
refactor: add `nextjs` flight data parsing utils
fix: download of multipage profile albums (Chevereto)
fix: handle additional redirect links (Xenforo)
merge from dev
fix: series name and chapter selection (Toonily)
fix(bunkr): fixes bunkr 404 not found, switch download API from id-based to slug-based endpoint
@Was213xzc Was213xzc changed the title (fix): Bunkrr child item fix (fix): Bunkrr child item fix and improve scraping performance Apr 18, 2026

Copilot AI left a comment


Pull request overview

This PR fixes Bunkrr album scraping correctness (child items + pagination) while improving throughput on large albums, and adds performance optimizations to history lookups plus regression coverage for Fileditch Turnstile handling.

Changes:

  • Add Bunkrr album pagination discovery with batched parallel page fetching and slug de-duping.
  • Speed up history lookups by adding SQLite indexes and optimizing referer-completion checks.
  • Add/extend tests for Bunkrr pagination behavior, history-table indexing, and Fileditch Turnstile detection.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 1 comment.

Summary per file:

  • cyberdrop_dl/crawlers/bunkrr.py: Adds album page discovery/batching, pagination URL normalization, and download-src resolution changes.
  • cyberdrop_dl/crawlers/fileditch.py: Detects Cloudflare Turnstile challenge pages and raises DDOSGuardError.
  • cyberdrop_dl/database/tables/history.py: Creates history-table indexes on startup; optimizes the check_complete_by_referer query.
  • cyberdrop_dl/database/tables/definitions.py: Defines the create_history_indexes DDL script.
  • tests/test_bunkrr.py: Unit coverage for new Bunkrr helpers plus pagination/behavior regressions.
  • tests/crawlers/test_cases/bunkr.py: Adds an additional Bunkrr crawler integration test case for album child-count coverage.
  • tests/test_fileditch_turnstile.py: Adds regression tests for Turnstile challenge detection.
  • tests/test_history_table.py: Verifies index creation and referer completion check behavior.
  • .gitignore: Ignores pytest/ruff caches and local crawler debugging artifacts.
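
The create_history_indexes DDL referenced above could look like the sketch below. The index names and column choices here are assumptions for illustration; the actual definitions in definitions.py may differ.

```python
import sqlite3

# Hypothetical index DDL; real index names/columns in definitions.py may differ.
CREATE_HISTORY_INDEXES = """
CREATE INDEX IF NOT EXISTS idx_media_referer ON media (referer);
CREATE INDEX IF NOT EXISTS idx_media_referer_domain ON media (referer, domain);
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media (referer TEXT, domain TEXT, completed INTEGER)")
conn.executescript(CREATE_HISTORY_INDEXES)
index_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
```

`CREATE INDEX IF NOT EXISTS` makes the script safe to run on every startup, which matches the "creates indexes on startup" behavior described for history.py.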


Comment thread on cyberdrop_dl/crawlers/bunkrr.py (Outdated)
@NTFSvolume NTFSvolume self-assigned this Apr 22, 2026
Member


This is a good change. Can you please move these database changes to a different PR?

Comment on lines 132 to +140

```diff
  if domain is None:
-     query = "SELECT completed FROM media WHERE referer = ?"
+     query = "SELECT 1 FROM media WHERE referer = ? and completed != 0 LIMIT 1"
      params = (str(referer),)
  else:
-     query = "SELECT completed FROM media WHERE referer = ? and domain = ?"
+     query = "SELECT 1 FROM media WHERE referer = ? and domain = ? and completed != 0 LIMIT 1"
      params = str(referer), domain

  cursor = await self.db_conn.execute(query, params)
- if domain is None:
-     rows = await cursor.fetchall()
- else:
-     row = await cursor.fetchone()
-     if row is None:
-         return False
-     rows = [row]
- return bool(rows and any(row[0] != 0 for row in rows))
+ return await cursor.fetchone() is not None
```
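
The rewritten query can be exercised against an in-memory SQLite table to confirm it matches the old "any row with completed != 0" semantics. This is a minimal standalone sketch, not the PR's code: the table here is simplified and the helper name mirrors check_complete_by_referer only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media (referer TEXT, domain TEXT, completed INTEGER)")
conn.executemany(
    "INSERT INTO media VALUES (?, ?, ?)",
    [
        ("https://example.com/x", "bunkrr", 0),  # incomplete row
        ("https://example.com/x", "bunkrr", 1),  # completed row, same referer
        ("https://example.com/y", "bunkrr", 0),  # referer never completed
    ],
)

def check_complete_by_referer(referer: str) -> bool:
    # Optimized form: SQLite stops at the first completed row (LIMIT 1)
    # instead of returning every matching row for Python to scan.
    row = conn.execute(
        "SELECT 1 FROM media WHERE referer = ? AND completed != 0 LIMIT 1",
        (referer,),
    ).fetchone()
    return row is not None

done_x = check_complete_by_referer("https://example.com/x")
done_y = check_complete_by_referer("https://example.com/y")
```

With an index on referer, this turns a full scan of matching rows into a single indexed probe that exits on the first hit.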
Member


This is great as well. Please move to a database PR

Member


I don't understand why you made pagination changes to bunkr. We can get all files in an album with a single request using the advanced query param, which is what CDL already does.

No need for pagination

Contributor Author

@Was213xzc Was213xzc Apr 22, 2026


I did not realize there was an advanced query param that did this; I was encountering errors where it failed to scrape items on other pages.

Comment on lines 42 to +43
```diff
  soup = await self.request_soup(scrape_item.url)
+ _check_turnstile(soup)
```
Member


This is no longer required but if you want to force it, CDL has a dedicated module for that

Suggested change

```diff
  soup = await self.request_soup(scrape_item.url)
- _check_turnstile(soup)
+ from cyberdrop_dl import ddos_guard
+ await ddos_guard.check(soup)
```

@NTFSvolume
Member

I will cherry-pick the database changes into a new PR

@NTFSvolume NTFSvolume changed the base branch from dev to main April 29, 2026 04:21
@NTFSvolume NTFSvolume marked this pull request as draft April 30, 2026 17:25


5 participants