Enforce strict NCBI request spacing and persist downloaded sizes#18

Open
dkoslicki wants to merge 3 commits into main from
codex/create-accession_sketcher-package-bddatp

Conversation

@dkoslicki
Member

Motivation

  • Ensure accurate bookkeeping by recording the byte size for streamed downloads and terminal/EMPTY efetch responses so the `files` state DB `size` column is populated.
  • Prevent bursts that exceed the configured NCBI requests-per-second cap by switching to strict request spacing, avoiding rate-limit errors.
  • Support large `max_total_workers` counts while guaranteeing a hard cap on the request rate via centralized throttling and connection limits.

Description

  • Add a small package for the accession sketcher with `db.py`, `scheduler.py`, `worker.py`, and `utils.py`, plus CLI scaffolding and a sample `config.yaml`.
  • Implement a `DB.update_size(file_id, size)` helper and use it from the scheduler to persist download byte counts.
  • Change `download_accession` in `worker.py` to accumulate `total_bytes` while streaming, return that value on success, and attach `size` to `EmptyResultError` for terminal efetch responses.
  • Replace the token-bucket limiter with a strict-spacing `RequestRateLimiter` in `utils.py` (one request per interval), and bound concurrency with a network semaphore and `aiohttp` connector limits so `requests_per_second` is enforced even with many workers.
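The strict-spacing limiter described above could look roughly like the following minimal sketch. This is a hypothetical illustration, not the actual `utils.py` code; the class name matches the PR text, but the method names and internals are assumptions:

```python
import asyncio
import time


class RequestRateLimiter:
    """Strict request spacing: at most one acquire() per interval.

    Unlike a token bucket, this never allows a burst; every request is
    spaced at least 1/requests_per_second apart, even under many workers.
    (Hypothetical sketch; real implementation details may differ.)
    """

    def __init__(self, requests_per_second: float) -> None:
        self._interval = 1.0 / requests_per_second
        self._lock = asyncio.Lock()   # serializes all acquirers
        self._next_allowed = 0.0      # monotonic time of the next free slot

    async def acquire(self) -> None:
        async with self._lock:
            now = time.monotonic()
            wait = self._next_allowed - now
            if wait > 0:
                await asyncio.sleep(wait)
            # Reserve the next slot one full interval after this request.
            self._next_allowed = max(now, self._next_allowed) + self._interval


async def demo() -> float:
    # 10 req/s => 0.1 s spacing; 5 acquires take at least ~0.4 s total.
    limiter = RequestRateLimiter(requests_per_second=10)
    start = time.monotonic()
    for _ in range(5):
        await limiter.acquire()
    return time.monotonic() - start
```

Because every caller funnels through one lock and one `_next_allowed` timestamp, the cap holds regardless of how many workers share the limiter, which is the point of combining it with the semaphore and connector limits.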

Testing

  • No automated tests were run; exercising efetch behavior is network-dependent, and no network calls were made in this environment.

Codex Task

### Motivation
- The `size` column in the `files` state DB was left empty for streamed downloads and EMPTY efetch responses, making monitoring and bookkeeping inaccurate.
- When a download is streamed rather than fully saved first, the byte count should still be recorded (character/byte length of the stream).
- EMPTY/terminal efetch responses should record their downloaded size so terminal rows in the `files` table are populated and useful for diagnostics.

### Description
- Add `update_size` to `db.py` which updates the `size` and `updated_at` fields for an existing row in the `files` table via `UPDATE` rather than re-upserting.
- Modify `download_accession` in `worker.py` to accumulate `total_bytes` while streaming, return that value on success, and attach `size` to `EmptyResultError` when the response is terminal.
- Update `scheduler.py` to call `db.update_size(file_id, size_bytes)` after a successful download and to persist `exc.size` when catching `EmptyResultError`, then mark the job `EMPTY`.
- Preserve existing cleanup and error semantics (temp/out file removal, `RateLimitError` handling, and marking exhausted retries as `FAILED`).
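The `update_size` helper described above might be sketched like this against an in-memory SQLite DB. The schema here is a hypothetical minimal stand-in (only `id`, `size`, `updated_at`); the real `files` table presumably has more columns:

```python
import sqlite3
import time


def update_size(conn: sqlite3.Connection, file_id: int, size: int) -> None:
    """Persist a downloaded byte count on an existing `files` row.

    Uses UPDATE rather than re-upserting, so other columns (state,
    tries, etc.) on the row are left untouched.
    """
    conn.execute(
        "UPDATE files SET size = ?, updated_at = ? WHERE id = ?",
        (size, time.time(), file_id),
    )
    conn.commit()


# Demo with an assumed minimal schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (id INTEGER PRIMARY KEY, size INTEGER, updated_at REAL)"
)
conn.execute("INSERT INTO files (id, size, updated_at) VALUES (1, NULL, 0)")
update_size(conn, file_id=1, size=4096)
```

The scheduler would call this once per successful download (or with `exc.size` for terminal EMPTY responses) so every terminal row carries a populated `size`.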

### Testing
- No automated tests were run; exercising `efetch` behavior requires network-dependent calls, which were not made in this environment.
### Motivation
- Reduce NCBI request load by allowing a single `efetch` call to request multiple accessions so large queues can be processed under a strict requests-per-second cap.
- Persist precise byte sizes for streamed downloads at the per-accession level so the `files` DB `size` column is populated for bookkeeping and retry logic.
- Keep high sketching throughput with many workers while guaranteeing a hard cap on request rate via centralized throttling and controlled concurrency.

### Description
- Add `DB.claim_batch(limit, ...)` and `DB.get_tries(file_id)` to atomically claim multiple `PENDING` (or eligible `ERROR`) rows and to inspect try counts.
- Replace the single-accession downloader with `download_accessions(...)`, which issues one `efetch` for a comma-separated batch, streams and parses FASTA records, writes per-accession `.fasta` files, computes per-accession sizes, and returns `paths`, `sizes`, and per-accession `missing` reasons.
- Update the scheduler worker to claim batches, build `accession_paths`, call `download_accessions(...)`, update sizes via `DB.update_size`, mark `EMPTY`/`ERROR`/`SKETCHING`/`DONE` per accession, and run `sourmash` per file; add `batch_size` config and default.
- Introduce `RequestRateLimiter` and ensure each efetch is gated through it, add package scaffolding (`__init__.py`, `__main__.py`, `main.py`), `utils.py`, `requirements.txt`, sample `accession` list, `run.sh`, and updated `config.yaml` with `batch_size`.
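The core parsing step inside the batched downloader, splitting one multi-record FASTA response back into per-accession payloads with sizes and `missing` reasons, could be sketched as below. This is a hypothetical illustration of the idea; the function name, return shape, and the "not in response" reason string are assumptions, not the actual `worker.py` code:

```python
from io import BytesIO


def split_fasta_batch(stream, requested):
    """Split one batched efetch FASTA response into per-accession payloads.

    `stream` is a file-like object over the response body; `requested` is
    the accession list that was sent as a comma-separated efetch query.
    Returns (records, sizes, missing): raw bytes per accession (header
    line included), byte counts, and reasons for absent accessions.
    """
    records: dict[str, bytes] = {}
    current = None
    for line in stream:
        if line.startswith(b">"):
            # FASTA headers look like ">ACC.1 description"; key on accession.
            current = line[1:].split()[0].decode()
            records[current] = b""
        if current is not None:
            records[current] += line
    sizes = {acc: len(data) for acc, data in records.items()}
    missing = {acc: "not in response" for acc in requested if acc not in records}
    return records, sizes, missing


# Demo: a two-record response for a three-accession request.
body = BytesIO(b">A1.1 demo\nACGT\n>B2.1 demo\nGGCC\n")
records, sizes, missing = split_fasta_batch(body, ["A1.1", "B2.1", "C3.1"])
```

Per-accession sizes computed this way are what `DB.update_size` would persist, and entries in `missing` are what the scheduler would mark `EMPTY` (or `ERROR`) per accession.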

### Testing
- No automated tests were executed; exercising `efetch` behavior is network-dependent and was not attempted in this environment.
@dkoslicki
Member Author

This works, but it will not be used: there are 52M accessions, and NCBI rate limits make it infeasible to fetch every human accession. The last commit (69aff00) broke things; the previous commit worked.
