refactor: Refactor Download Management to Prevent Task Starvation #1713

Open
Barbarella6666666 wants to merge 17 commits into Cyberdrop-DL:main from Barbarella6666666:fix-immediate-downloads

Conversation

@Barbarella6666666
Contributor

This is my take on trying to solve #1702

Replaces the TaskGroup in scrape_mapper.py with an unbounded asyncio.Queue and a background task dispatcher. This architectural shift ensures that file downloads are scheduled immediately, solving the starvation issues encountered when scraping high volumes of data.

Instead of batching downloads within a TaskGroup, we now pipe them into an internal queue. A dedicated dispatcher task consumes this queue in a loop, spawning a new independent asyncio.Task for every download (a sketch follows the list below).

  • Immediate Start: Downloads begin the moment the scraper identifies a file.
  • Fair Scheduling: Each download is an independent task, preventing the scraper loop from "starving" the download processes.
  • Controlled Concurrency: We still rely on the existing three-level semaphore system (server, domain, global) to handle rate limiting.
  • Graceful Shutdown: The system uses a None sentinel to signal the dispatcher to stop only after the queue is empty.
  • Minimal Footprint: Changes are localized to scrape_mapper.py.
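
A minimal sketch of the pattern described above, assuming the queue and dispatcher live on the scrape mapper. `_pending_downloads` and `create_download_task` appear later in this thread; everything else is illustrative, not the PR's exact code:

```python
import asyncio
from collections.abc import Coroutine


class ScrapeMapper:
    def __init__(self) -> None:
        # Unbounded handoff queue between the scraper and the dispatcher.
        self._pending_downloads: asyncio.Queue[Coroutine | None] = asyncio.Queue()

    def create_download_task(self, coro: Coroutine) -> None:
        # Called by the scraper the moment a file is identified.
        self._pending_downloads.put_nowait(coro)

    async def _download_dispatcher(self) -> None:
        active: set[asyncio.Task] = set()
        while True:
            coro = await self._pending_downloads.get()
            if coro is None:  # sentinel: scraping finished, stop dispatching
                break
            # Every download becomes its own independent task; the existing
            # server/domain/global semaphores still bound real concurrency.
            task = asyncio.create_task(coro)
            active.add(task)
            task.add_done_callback(active.discard)
        if active:
            await asyncio.gather(*active, return_exceptions=True)
```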

Discarded Approaches

  • Fixed Worker Pool: Rejected because a few workers stuck on a single domain's semaphore would block downloads for all other domains (head-of-line blocking).
  • asyncio.PriorityQueue: Rejected because asyncio doesn't support native task-level priority. Reimplementing the scheduler would be overly invasive and could lead to scraper starvation.
  • Scraping Semaphore: Rejected because limiting scraping tasks reduces overall throughput without fixing the underlying scheduling conflict.
  • Thread Pool: Rejected because downloads are I/O-bound and already use aiohttp. Adding threads would break integration with existing asyncio semaphores and introduce unnecessary complexity.

Downloads were starved because the asyncio scheduler prioritized the many scraping tasks over download tasks in the same event loop. Introduce an asyncio.Queue as the handoff mechanism so download coroutines are dispatched independently of scraping pressure. A single dispatcher task drains the asyncio.Queue and spawns each download as its own asyncio.Task. This preserves the original concurrency model where semaphores (server/domain/global) control parallelism, while ensuring downloads are no longer starved by scraping tasks.

Replace the downloads TaskGroup context manager with the new dispatcher task. On shutdown, a None sentinel is sent to the queue so the dispatcher drains remaining downloads before exiting. The TaskGroups dataclass no longer needs a downloads field.

wait_until_scrape_is_done is a UI notification, not a download. It was piggybacking on the downloads TaskGroup; now it runs as an independent asyncio.create_task so it does not occupy the download queue.
@NTFSvolume
Member

Yeah, this is the approach I mentioned on #1702.

I didn't explain why, but the reason I didn't want to use a queue is because we lose all the error handling that the task group does for us on cancellation (which we never do, but the user could hit Ctrl+C in the middle of a run) or error. It will also use significantly more memory because we will have a bunch of free-flying coros on the queue not attached to any task until the next loop iteration.

This will work though, so I'm not completely against it. I will do a proper review and give it a try later today.

We will need to add logic to cancel all tasks in the queue on error to get similar behavior to a taskgroup (maybe use a taskgroup within?) and prevent the classic "coroutine was never awaited" error.

Ideally, we should use a queue of MediaItem objects instead of coros, and create the coro+task from within the queue consumer task, but that will require a bigger refactor.
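
A rough sketch of that MediaItem-based variant. `_download(media_item)` is referenced later in this thread; the consumer structure itself is hypothetical:

```python
async def _download_consumer(self) -> None:
    # The queue holds MediaItem objects, not coroutines, so nothing sits
    # "free-flying" in memory: a coroutine only exists once a task wraps it.
    async with asyncio.TaskGroup() as tg:
        while True:
            media_item = await self._pending_downloads.get()
            if media_item is None:  # sentinel: scraping finished
                break
            # Coroutine and task are created together, inside the consumer,
            # so cancellation can never leave an unawaited coroutine behind.
            tg.create_task(self._download(media_item))
```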

@NTFSvolume NTFSvolume self-requested a review April 25, 2026 12:34
Member

@NTFSvolume NTFSvolume left a comment


This is still not getting the desired behavior for me. A bunch of pages get scraped before downloads start. I think we need eager tasks for this to work.

Also, as I expected, it is swallowing any `CTRL + C` I hit

Comment thread cyberdrop_dl/scrape_mapper.py Outdated
Comment on lines +166 to +175
while True:
    coro = await self._pending_downloads.get()
    if coro is None:
        break
    task = asyncio.create_task(coro)
    active.add(task)
    task.add_done_callback(active.discard)

if active:
    await asyncio.gather(*active, return_exceptions=True)
Member


You should wait on the done event. We can actually use a taskgroup

Suggested change

-while True:
+while not self._done.is_set():
     coro = await self._pending_downloads.get()
     if coro is None:
         break
     task = asyncio.create_task(coro)
     active.add(task)
     task.add_done_callback(active.discard)

 if active:
-    await asyncio.gather(*active, return_exceptions=True)
+    async with asyncio.TaskGroup() as tg:
+        for pending in active:
+            tg.create_task(pending)

Comment thread cyberdrop_dl/scrape_mapper.py Outdated
Comment on lines +223 to +231
@@ -211,7 +227,8 @@ async def wait_until_scrape_is_done() -> None:
         (crawler.DOMAIN, count) for crawler in self._factory if (count := len(crawler._scraped_items))
     )

-    self.create_download_task(wait_until_scrape_is_done())
+    task = asyncio.create_task(wait_until_scrape_is_done())
+    background_tasks.add(task)
Member


We can move the hide_scrape_panel call to _download_dispatcher

@NTFSvolume
Member

Added a commit so we can simulate downloads without actually downloading anything.

I tested with https://boards.4chan.org/p/ but any supported site that has pagination should be good for testing. I got hundreds of pages scraped before any download began.

Check self._done.is_set() in the loop condition instead of while True. Drain remaining queued coroutines after the loop exits. Use TaskGroup instead of asyncio.gather for the final active tasks, restoring proper error propagation and Ctrl+C handling.

The dispatcher already knows when scraping is done (it receives the None sentinel). Move the UI callback there instead of using a standalone task. Collect url_count stats after the scrape loop in run().

On Python 3.12+, set asyncio.eager_task_factory on the event loop so that tasks created via asyncio.create_task() begin executing immediately instead of waiting for the next scheduler cycle. This is the key fix for downloads not starting while scraping is active. On Python 3.11, fall back to default behavior (no eager scheduling available).

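On 3.12+ that opt-in is a single call on the running loop. A minimal sketch with the 3.11 fallback described above (the helper name is illustrative; `asyncio.eager_task_factory` is the real stdlib API):

```python
import asyncio
import sys


def enable_eager_tasks(loop: asyncio.AbstractEventLoop) -> None:
    # Python 3.12+: tasks begin executing synchronously inside
    # asyncio.create_task(), instead of waiting for the next scheduler cycle.
    if sys.version_info >= (3, 12):
        loop.set_task_factory(asyncio.eager_task_factory)
    # Python 3.11: no eager scheduling exists; the default factory applies.
```
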
On CancelledError or KeyboardInterrupt, cancel all active download tasks and close any unawaited coroutines still in the queue to avoid 'coroutine was never awaited' warnings.

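A sketch of that cleanup path, reusing the dispatcher's `active` set and `_pending_downloads` queue from earlier in the thread; the method name is hypothetical:

```python
def _abort_downloads(self, active: set[asyncio.Task]) -> None:
    # Cancel every download task that is already running...
    for task in active:
        task.cancel()
    # ...then close coroutines still parked in the queue; they were never
    # wrapped in a task and would otherwise warn at garbage collection.
    while not self._pending_downloads.empty():
        coro = self._pending_downloads.get_nowait()
        if coro is not None:
            coro.close()
```
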
asyncio.shield() returns a Future, not a coroutine, so tg.create_task(asyncio.shield(task)) raises TypeError. The active set contains Tasks that are already running, not coroutines, so we simply gather them directly.

The _fake_download method simulated download progress with a fixed 10GB size, preventing actual downloads from running. Removed the method, its call in _download, and the random import that was only used by it.

The finally block always drained all active downloads before exiting, effectively swallowing KeyboardInterrupt. Split into except/else/finally: on error or Ctrl+C, cancel the dispatcher and propagate immediately; on normal exit, drain the queue and wait for downloads to finish.

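Illustratively, the split could look like this; `_run_scrapers` and `dispatcher_task` are hypothetical names, and the exception pairing mirrors the commit message rather than guaranteed asyncio behavior:

```python
try:
    await self._run_scrapers()  # hypothetical: the scraping phase
except (asyncio.CancelledError, KeyboardInterrupt):
    # Interrupt: stop dispatching right away and propagate, instead of
    # draining every queued download first (the old finally-only behavior).
    dispatcher_task.cancel()
    raise
else:
    # Normal exit: send the sentinel so the dispatcher drains the queue,
    # then wait for the remaining downloads to finish.
    await self._pending_downloads.put(None)
    await dispatcher_task
```
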
Instead of relying on eager_task_factory (Python 3.12+ only), spawn a fixed pool of download workers that consume from the queue. The number of workers is read from max_simultaneous_downloads config. This works on Python 3.11+ and respects the user's concurrency limit.

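A sketch of the worker-pool shape; the sentinel re-queue trick is an assumption for letting all workers shut down, not taken from the PR:

```python
async def _download_worker(self) -> None:
    while True:
        coro = await self._pending_downloads.get()
        if coro is None:
            # Re-queue the sentinel so sibling workers also exit.
            self._pending_downloads.put_nowait(None)
            break
        await coro  # one download at a time per worker


def _spawn_workers(self, count: int) -> list[asyncio.Task]:
    # count is read from max_simultaneous_downloads in the config.
    return [asyncio.create_task(self._download_worker()) for _ in range(count)]
```

The review that follows explains why this shape was rejected: a worker blocked on one site's semaphore still holds its pool slot, so a handful of slow downloads from a single CDN can stall every other site.
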
The download queue counter only counted items waiting on semaphores inside the downloader. Items still in the pending queue waiting for a worker were not counted, causing the UI to show the worker count instead of the actual queue size.
@Barbarella6666666
Contributor Author

  • Replaced the single dispatcher loop with a pool of download workers that consume from the queue, so downloads start immediately. The number of workers matches max_simultaneous_downloads from the config.
  • Fixed Ctrl+C handling — split the finally block in __call__ into except/else/finally so the dispatcher is cancelled immediately on interrupt instead of draining all active downloads.
  • Removed leftover _fake_download test method from crawler.py.

@NTFSvolume NTFSvolume self-assigned this Apr 28, 2026
@NTFSvolume NTFSvolume self-requested a review April 28, 2026 23:17
Member

@NTFSvolume NTFSvolume left a comment


We can't use workers either, because now the number of concurrent downloads will have the most relevance in the concurrency decision, when it should be the least relevant.

For example, if we use a config max of 10 concurrent downloads and the first 10 downloads CDL finds are from bunkr, all from the same bunkr CDN, CDL will only download 1 file at a time.

All the download workers will be blocked by the bunkr lock. Downloads from other sites won't start until all downloads from bunkr are finished.

@NTFSvolume
Member

NTFSvolume commented May 1, 2026

This is the current logic to start downloads:

server = (media_item.debrid_link or media_item.url).host
server_limit, domain_limit, global_limit = (
    self.client.server_limiter(media_item.domain, server),
    self._semaphore,
    self.manager.client_manager.global_download_slots,
)
async with server_limit, domain_limit, global_limit:
    self.processed_items.add(media_item.db_path)
    self.waiting_items -= 1
    yield

The locks go from most restrictive (narrow scope) to least restrictive (wide scope):
site server lock → site concurrent downloads limit → config concurrent downloads limit

Putting all downloads on the same queue with a fixed number of workers will make downloads from different sites block each other.

@NTFSvolume
Member

I'm thinking of just dropping Python 3.11 support. We can force eager tasks with 3.12+ and just let the event loop manage them as best as it can, instead of trying to fix this ourselves.

@jbsparrow, do you think it's OK if we drop 3.11?

@jbsparrow
Member

jbsparrow commented May 3, 2026

> I'm thinking of just dropping Python 3.11 support. We can force eager tasks with 3.12+ and just let the event loop manage them as best as it can, instead of trying to fix this ourselves.
>
> @jbsparrow, do you think it's OK if we drop 3.11?

Honestly I think that this is the best solution. It's unfortunate to drop support for a Python version, but I think that our users are fine upgrading. We don't want to implement some hacky solution for a few versions when we will inevitably be upgrading eventually anyways.

This solution is much simpler and easier for us. I think it should be implemented in v10.0, along with the database migration system. I will work on that more today and upload the branch.

@Barbarella6666666
Contributor Author

> I'm thinking of just dropping Python 3.11 support. We can force eager tasks with 3.12+ and just let the event loop manage them as best as it can, instead of trying to fix this ourselves.
>
> @jbsparrow, do you think it's OK if we drop 3.11?

That is the best solution for sure.
