Enforce strict NCBI request spacing and persist downloaded sizes#18
Open
Conversation
### Motivation

- The `size` column in the `files` state DB was left empty for streamed downloads and EMPTY efetch responses, making monitoring and bookkeeping inaccurate.
- When a download is streamed rather than fully saved first, the byte count should still be recorded (the character/byte length of the stream).
- EMPTY/terminal efetch responses should record their downloaded size so terminal rows in the `files` table are populated and useful for diagnostics.

### Description

- Add `update_size` to `db.py`, which updates the `size` and `updated_at` fields of an existing row in the `files` table via `UPDATE` rather than re-upserting (see the sketch after this description).
- Modify `download_accession` in `worker.py` to accumulate `total_bytes` while streaming, return that value on success, and attach `size` to `EmptyResultError` when the response is terminal.
- Update `scheduler.py` to call `db.update_size(file_id, size_bytes)` after a successful download, and to persist `exc.size` when catching `EmptyResultError` before marking the job `EMPTY`.
- Preserve existing cleanup and error semantics (temp/out file removal, `RateLimitError` handling, and marking exhausted retries as `FAILED`).

### Testing

- No automated tests were run because exercising `efetch` behavior requires network-dependent calls, which were not made in this environment.
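A minimal sketch of the two size-tracking pieces described above, assuming a sqlite3-backed state DB and an aiohttp streamed response; the names mirror the PR text, but the bodies are illustrative rather than the repository's actual implementation.

```python
import sqlite3
import time

import aiohttp


def update_size(conn: sqlite3.Connection, file_id: int, size: int) -> None:
    # UPDATE the existing row rather than re-upserting, so status,
    # tries, and other columns are left untouched. The timestamp
    # format (epoch seconds) is an assumption for illustration.
    conn.execute(
        "UPDATE files SET size = ?, updated_at = ? WHERE id = ?",
        (size, time.time(), file_id),
    )
    conn.commit()


async def stream_to_file(resp: aiohttp.ClientResponse, tmp_path: str) -> int:
    # Accumulate total_bytes while streaming, so the size is known
    # even though the body is never held in memory all at once.
    total_bytes = 0
    with open(tmp_path, "wb") as fh:
        async for chunk in resp.content.iter_chunked(1 << 16):
            total_bytes += len(chunk)
            fh.write(chunk)
    return total_bytes
```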
### Motivation

- Ensure accurate bookkeeping by recording the byte size for streamed downloads and terminal/EMPTY efetch responses, so the `size` column of the `files` state DB is populated.
- Prevent bursts that exceed the configured NCBI requests-per-second limit by switching to strict request spacing, avoiding rate-limit errors.
- Support large `max_total_workers` counts while guaranteeing a hard cap on request rate via centralized throttling and connection limits.

### Description

- Add a small package for the accession sketcher with `db.py`, `scheduler.py`, `worker.py`, and `utils.py`, plus CLI scaffolding and a sample `config.yaml`.
- Implement a `DB.update_size(file_id, size)` helper and use it from the scheduler to persist download byte counts.
- Change `download_accession` in `worker.py` to accumulate `total_bytes` while streaming, return that value on success, and attach `size` to `EmptyResultError` for terminal efetch responses.
- Replace the token-bucket limiter with a strict-spacing `RequestRateLimiter` in `utils.py` (one request per interval; see the sketch after this description), and ensure concurrency is also bounded by a network semaphore and `aiohttp` connector limits so `requests_per_second` is enforced even with many workers.

### Testing

- No automated tests were run because exercising `efetch` behavior is network-dependent and was not executed in this environment.
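As a rough illustration of strict request spacing, a limiter like the one below hands out slots exactly one interval apart no matter how many workers call it concurrently. The class name follows the PR's `RequestRateLimiter`, but this body is an assumption, not the actual code from `utils.py`.

```python
import asyncio
import time


class RequestRateLimiter:
    """Allow at most one request per 1/requests_per_second interval."""

    def __init__(self, requests_per_second: float):
        self.interval = 1.0 / requests_per_second
        self._lock = asyncio.Lock()
        self._next_allowed = 0.0

    async def acquire(self) -> None:
        # Reserve the next slot under the lock, then sleep outside it,
        # so concurrent callers queue up behind strictly spaced slots
        # instead of bursting like a refilled token bucket.
        async with self._lock:
            now = time.monotonic()
            wait = self._next_allowed - now
            self._next_allowed = max(now, self._next_allowed) + self.interval
        if wait > 0:
            await asyncio.sleep(wait)
```

Unlike a token bucket, which can release several queued tokens at once after an idle period, this scheme never allows two requests closer together than `interval`, which is the hard cap the PR describes.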
### Motivation

- Reduce NCBI request load by allowing a single `efetch` call to request multiple accessions, so large queues can be processed under a strict requests-per-second cap.
- Persist precise byte sizes for streamed downloads at the per-accession level, so the `size` column of the `files` DB is populated for bookkeeping and retry logic.
- Keep sketching throughput high with many workers while guaranteeing a hard cap on request rate via centralized throttling and controlled concurrency.

### Description

- Add `DB.claim_batch(limit, ...)` and `DB.get_tries(file_id)` to atomically claim multiple `PENDING` (or eligible `ERROR`) rows and to inspect try counts.
- Replace the single-accession downloader with `download_accessions(...)`, which issues one `efetch` for a comma-separated batch, streams and parses FASTA records, writes per-accession `.fasta` files, computes per-accession sizes, and returns `paths`, `sizes`, and per-accession `missing` reasons (a sketch of the splitting step follows this description).
- Update the scheduler worker to claim batches, build `accession_paths`, call `download_accessions(...)`, update sizes via `DB.update_size`, mark each accession `EMPTY`/`ERROR`/`SKETCHING`/`DONE`, and run `sourmash` per file; add a `batch_size` config option and default.
- Introduce `RequestRateLimiter` and gate each efetch through it; add package scaffolding (`__init__.py`, `__main__.py`, `main.py`), `utils.py`, `requirements.txt`, a sample `accession` list, `run.sh`, and an updated `config.yaml` with `batch_size`.

### Testing

- No automated tests were executed because exercising `efetch` behavior is network-dependent and was not run in this environment.
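The per-accession splitting step could look roughly like the sketch below. It assumes the accession is the first whitespace-delimited token of each FASTA header line; the real `download_accessions(...)` streams over HTTP and also tracks `missing` reasons, which this sketch omits.

```python
from pathlib import Path


def split_fasta(text: str, out_dir: Path) -> dict[str, int]:
    """Write one .fasta file per record; return accession -> byte size."""
    out_dir.mkdir(parents=True, exist_ok=True)
    sizes: dict[str, int] = {}
    accession, lines = None, []

    def flush() -> None:
        # Write the accumulated record, if any, and record its size.
        if accession is None:
            return
        data = "\n".join(lines) + "\n"
        (out_dir / f"{accession}.fasta").write_text(data)
        sizes[accession] = len(data.encode())

    for line in text.splitlines():
        if line.startswith(">"):
            flush()
            # e.g. ">NC_000001.11 Homo sapiens ..." -> "NC_000001.11"
            accession = line[1:].split()[0]
            lines = [line]
        elif accession is not None:
            lines.append(line)
    flush()
    return sizes
```

Comparing the returned keys against the requested batch is one way the scheduler could decide which accessions to mark `EMPTY` versus `DONE`.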
Author
This works, but will not be used: there are 52M accessions, and NCBI rate limits mean it won't be feasible to fetch every human accession. The last commit, 69aff00, borked things; it worked with the previous commit.