
Treat NCBI EMPTY RESULT as terminal; exit scheduler when no retryable work remains (#16)

Draft
dkoslicki wants to merge 3 commits into main from codex/create-accession_sketcher-package

Conversation

@dkoslicki
Member

Motivation

  • Some NCBI efetch responses are terminal (e.g. EMPTY RESULT) and should be recorded as a terminal EMPTY status rather than retried and recorded as ERROR.
  • The scheduler could hang indefinitely when only terminal errors remained because it counted all ERROR rows as pending work and kept waiting for retryable work.

Description

  • Add EmptyResultError in worker.py, raised when the downloaded response is empty or contains an EMPTY RESULT / FAILURE payload; existing RateLimitError semantics are preserved.
  • Update scheduler.py to catch EmptyResultError, mark the job EMPTY, clean up temp/output files, and stop retrying that accession.
  • Add count_claimable to db.py to return the number of truly claimable jobs (pending plus retryable ERROR entries) and use it in the scheduler loop to decide when to exit.
  • Mark jobs that exhaust their retry budget as FAILED so they are no longer considered retryable, and adjust the pending/claimable calculation used by the monitor/exit logic.
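The claimable count described above can be sketched as a single SQL query. This is an illustrative version only: the table name `jobs`, the `attempts` column, and the retry limit are assumptions, not necessarily what `db.py` actually uses.

```python
import sqlite3

RETRY_LIMIT = 3  # hypothetical retry budget; the real value comes from config


def count_claimable(conn: sqlite3.Connection, retry_limit: int = RETRY_LIMIT) -> int:
    """Count jobs the scheduler could still hand to a worker:
    PENDING rows plus ERROR rows that have retries left.
    Terminal statuses (EMPTY, FAILED, DONE) are excluded, so the
    scheduler loop can exit once this reaches zero."""
    row = conn.execute(
        """
        SELECT COUNT(*) FROM jobs
        WHERE status = 'PENDING'
           OR (status = 'ERROR' AND attempts < ?)
        """,
        (retry_limit,),
    ).fetchone()
    return row[0]
```

Because EMPTY and FAILED rows never match either branch, a backlog consisting only of terminal errors counts as zero claimable work rather than keeping the loop alive.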

Testing

  • No automated tests were run; end-to-end validation requires network efetch calls, which are environment-dependent and could not be executed here.

Codex Task

### Motivation
- Provide a pipeline analogous to the existing `stream_sketcher` that ingests lists of NCBI accessions and runs the download->sketch->discard pipeline on each accession.
- Support highly-parallel, restartable, rate-limited downloads from NCBI (with retries, backoff, and graceful failure handling).
- Accept accession lists derived from files like `nucl_gb.accession2taxid`/`nucl_wgs.accession2taxid` and use `efetch` (HTTP API) to stream FASTA payloads for sketching.

### Description
- Add a new package `shared_scripts/accession_sketcher` containing an `accession_sketcher` module with `scheduler.py`, `worker.py`, `db.py`, and `utils.py` to mirror the architecture of `stream_sketcher` and coordinate ingestion, state, and workers.
- Implement `download_accession`, which streams FASTA from NCBI `efetch` over HTTP and validates that the payload starts with `>`, plus a `RequestRateLimiter` that enforces a configurable requests-per-second cap (defaults to 10 rps when `ncbi_api_key` is present, otherwise 3 rps).
- Add an SQLite-backed job state DB (`DB`) with atomic claiming, reset of stuck jobs, upsert, and status tracking, and worker logic that sketches using `sourmash` with retries, backoff, and output sharding.
- Include sample `config.yaml`, an example accession list, `requirements.txt`, and `run.sh` to run the new pipeline; configuration supports `ncbi_api_key`, `requests_per_second`, concurrency, timeouts, and sourmash params.
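A rate limiter of the kind described above can be sketched as an interval-spacing asyncio primitive shared across worker tasks. This is one plausible shape, not necessarily how `RequestRateLimiter` in `worker.py` is actually implemented.

```python
import asyncio
import time


class RequestRateLimiter:
    """Space requests at least 1/rps seconds apart across all tasks.

    Each caller reserves the next available time slot under a lock,
    then sleeps outside the lock until that slot arrives, so waiting
    tasks do not block each other's bookkeeping."""

    def __init__(self, requests_per_second: float):
        self._interval = 1.0 / requests_per_second
        self._lock = asyncio.Lock()
        self._next_slot = 0.0  # monotonic time of the next free slot

    async def acquire(self) -> None:
        async with self._lock:
            now = time.monotonic()
            wait = self._next_slot - now
            self._next_slot = max(now, self._next_slot) + self._interval
        if wait > 0:
            await asyncio.sleep(wait)
```

A worker would call `await limiter.acquire()` immediately before each `efetch` request; constructing the limiter with 10 or 3 rps depending on whether `ncbi_api_key` is set would match the defaults described above.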

### Testing
- Installed `sourmash` via `pip install sourmash` successfully to enable sketching calls.
- Ran the pipeline with `python -m accession_sketcher --config config.yaml`, which successfully loaded and enqueued 16 sample accessions and started workers as evidenced by logs showing `Loaded 16 accessions` and worker monitor messages.
- The run could not complete end-to-end in this environment because HTTP requests to `eutils.ncbi.nlm.nih.gov` failed with `ClientConnectorError: Network is unreachable`, causing download attempts to retry and surface network errors in logs.
- The code was exercised up to the network I/O stage (rate limiting and retry logic invoked) and no local exceptions were observed in parsing, DB operations, or the sketch invocation scaffolding prior to the network failure.
@dkoslicki
Member Author

This works, but it will not be used: there are 52M accessions, and NCBI rate limits make it infeasible to fetch every human accession.
