Enforce strict NCBI request spacing and persist downloaded sizes#18
Open
Conversation
### Motivation

- The `size` column in the `files` state DB was left empty for streamed downloads and EMPTY efetch responses, making monitoring and bookkeeping inaccurate.
- When a download is streamed rather than fully saved first, the byte count should still be recorded (the character/byte length of the stream).
- EMPTY/terminal efetch responses should record their downloaded size so terminal rows in the `files` table are populated and useful for diagnostics.

### Description

- Add `update_size` to `db.py`, which updates the `size` and `updated_at` fields of an existing row in the `files` table via `UPDATE` rather than re-upserting (see the sketch after this description).
- Modify `download_accession` in `worker.py` to accumulate `total_bytes` while streaming, return that value on success, and attach `size` to `EmptyResultError` when the response is terminal.
- Update `scheduler.py` to call `db.update_size(file_id, size_bytes)` after a successful download, and to persist `exc.size` when catching `EmptyResultError` before marking the job `EMPTY`.
- Preserve existing cleanup and error semantics (temp/out file removal, `RateLimitError` handling, and marking exhausted retries as `FAILED`).

### Testing

- No automated tests were run because exercising `efetch` behavior requires network-dependent calls, which were not made in this environment.
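A minimal sketch of the two size-tracking pieces described above, assuming a sqlite3-backed state DB and an aiohttp streamed response; the names mirror the PR text, but the bodies are illustrative rather than the repository's actual implementation.

```python
import sqlite3
import time

import aiohttp


def update_size(conn: sqlite3.Connection, file_id: int, size: int) -> None:
    # UPDATE the existing row rather than re-upserting, so status,
    # tries, and other columns are left untouched. The timestamp
    # format (epoch seconds) is an assumption for illustration.
    conn.execute(
        "UPDATE files SET size = ?, updated_at = ? WHERE id = ?",
        (size, time.time(), file_id),
    )
    conn.commit()


async def stream_to_file(resp: aiohttp.ClientResponse, tmp_path: str) -> int:
    # Accumulate total_bytes while streaming, so the size is known
    # even though the body is never held in memory all at once.
    total_bytes = 0
    with open(tmp_path, "wb") as fh:
        async for chunk in resp.content.iter_chunked(1 << 16):
            total_bytes += len(chunk)
            fh.write(chunk)
    return total_bytes
```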
### Motivation

- Ensure accurate bookkeeping by recording the byte size for streamed downloads and terminal/EMPTY efetch responses, so the `size` column of the `files` state DB is populated.
- Prevent bursts that exceed the configured NCBI requests-per-second limit by switching to strict request spacing, avoiding rate-limit errors.
- Support large `max_total_workers` counts while guaranteeing a hard cap on request rate via centralized throttling and connection limits.

### Description

- Add a small package for the accession sketcher with `db.py`, `scheduler.py`, `worker.py`, and `utils.py`, plus CLI scaffolding and a sample `config.yaml`.
- Implement a `DB.update_size(file_id, size)` helper and use it from the scheduler to persist download byte counts.
- Change `download_accession` in `worker.py` to accumulate `total_bytes` while streaming, return that value on success, and attach `size` to `EmptyResultError` for terminal efetch responses.
- Replace the token-bucket limiter with a strict-spacing `RequestRateLimiter` in `utils.py` (one request per interval; see the sketch after this description), and ensure concurrency is also bounded by a network semaphore and `aiohttp` connector limits so `requests_per_second` is enforced even with many workers.

### Testing

- No automated tests were run because exercising `efetch` behavior is network-dependent and was not executed in this environment.
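As a rough illustration of strict request spacing, a limiter like the one below hands out slots exactly one interval apart no matter how many workers call it concurrently. The class name follows the PR's `RequestRateLimiter`, but this body is an assumption, not the actual code from `utils.py`.

```python
import asyncio
import time


class RequestRateLimiter:
    """Allow at most one request per 1/requests_per_second interval."""

    def __init__(self, requests_per_second: float):
        self.interval = 1.0 / requests_per_second
        self._lock = asyncio.Lock()
        self._next_allowed = 0.0

    async def acquire(self) -> None:
        # Reserve the next slot under the lock, then sleep outside it,
        # so concurrent callers queue up behind strictly spaced slots
        # instead of bursting like a refilled token bucket.
        async with self._lock:
            now = time.monotonic()
            wait = self._next_allowed - now
            self._next_allowed = max(now, self._next_allowed) + self.interval
        if wait > 0:
            await asyncio.sleep(wait)
```

Unlike a token bucket, which can release several queued tokens at once after an idle period, this scheme never allows two requests closer together than `interval`, which is the hard cap the PR describes.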
### Motivation

- Reduce NCBI request load by allowing a single `efetch` call to request multiple accessions, so large queues can be processed under a strict requests-per-second cap.
- Persist precise byte sizes for streamed downloads at the per-accession level, so the `size` column of the `files` DB is populated for bookkeeping and retry logic.
- Keep sketching throughput high with many workers while guaranteeing a hard cap on request rate via centralized throttling and controlled concurrency.

### Description

- Add `DB.claim_batch(limit, ...)` and `DB.get_tries(file_id)` to atomically claim multiple `PENDING` (or eligible `ERROR`) rows and to inspect try counts.
- Replace the single-accession downloader with `download_accessions(...)`, which issues one `efetch` for a comma-separated batch, streams and parses FASTA records, writes per-accession `.fasta` files, computes per-accession sizes, and returns `paths`, `sizes`, and per-accession `missing` reasons (a sketch of the splitting step follows this description).
- Update the scheduler worker to claim batches, build `accession_paths`, call `download_accessions(...)`, update sizes via `DB.update_size`, mark each accession `EMPTY`/`ERROR`/`SKETCHING`/`DONE`, and run `sourmash` per file; add a `batch_size` config option and default.
- Introduce `RequestRateLimiter` and gate each efetch through it; add package scaffolding (`__init__.py`, `__main__.py`, `main.py`), `utils.py`, `requirements.txt`, a sample `accession` list, `run.sh`, and an updated `config.yaml` with `batch_size`.

### Testing

- No automated tests were executed because exercising `efetch` behavior is network-dependent and was not run in this environment.
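The per-accession splitting step could look roughly like the sketch below. It assumes the accession is the first whitespace-delimited token of each FASTA header line; the real `download_accessions(...)` streams over HTTP and also tracks `missing` reasons, which this sketch omits.

```python
from pathlib import Path


def split_fasta(text: str, out_dir: Path) -> dict[str, int]:
    """Write one .fasta file per record; return accession -> byte size."""
    out_dir.mkdir(parents=True, exist_ok=True)
    sizes: dict[str, int] = {}
    accession, lines = None, []

    def flush() -> None:
        # Write the accumulated record, if any, and record its size.
        if accession is None:
            return
        data = "\n".join(lines) + "\n"
        (out_dir / f"{accession}.fasta").write_text(data)
        sizes[accession] = len(data.encode())

    for line in text.splitlines():
        if line.startswith(">"):
            flush()
            # e.g. ">NC_000001.11 Homo sapiens ..." -> "NC_000001.11"
            accession = line[1:].split()[0]
            lines = [line]
        elif accession is not None:
            lines.append(line)
    flush()
    return sizes
```

Comparing the returned keys against the requested batch is one way the scheduler could decide which accessions to mark `EMPTY` versus `DONE`.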
Author
This works, but will not be used: there are 52M accessions, and NCBI rate limits mean it won't be feasible to fetch every human accession. The last commit, 69aff00, borked things; it worked with the previous commit.