Treat NCBI EMPTY RESULT as terminal; exit scheduler when no retryable work remains #16
Draft
### Motivation

- Provide a pipeline analogous to the existing `stream_sketcher` that ingests lists of NCBI accessions and runs the download->sketch->discard pipeline on each accession.
- Support highly parallel, restartable, rate-limited downloads from NCBI (with retries, backoff, and graceful failure handling).
- Accept accession lists derived from files like `nucl_gb.accession2taxid`/`nucl_wgs.accession2taxid` and use `efetch` (HTTP API) to stream FASTA payloads for sketching.

### Description

- Add a new package `shared_scripts/accession_sketcher` containing an `accession_sketcher` module with `scheduler.py`, `worker.py`, `db.py`, and `utils.py`, mirroring the architecture of `stream_sketcher`, to coordinate ingestion, state, and workers.
- Implement `download_accession`, which uses NCBI `efetch` over HTTP to stream FASTA and validates that the payload starts with `>`, and a `RequestRateLimiter` that enforces a configurable requests-per-second limit (default 10 rps when `ncbi_api_key` is present, otherwise 3 rps).
- Add an SQLite-backed job state DB (`DB`) with atomic claiming, reset of stuck jobs, upsert, and status tracking, plus worker logic that sketches using `sourmash` with retries, backoff, and output sharding.
- Include a sample `config.yaml`, an example accession list, `requirements.txt`, and `run.sh` to run the new pipeline; configuration supports `ncbi_api_key`, `requests_per_second`, concurrency, timeouts, and sourmash params.

### Testing

- Installed `sourmash` via `pip install sourmash` successfully to enable sketching calls.
- Ran the pipeline with `python -m accession_sketcher --config config.yaml`, which loaded and enqueued 16 sample accessions and started workers, as shown by the `Loaded 16 accessions` log line and the worker monitor messages.
- The run could not complete end-to-end in this environment because HTTP requests to `eutils.ncbi.nlm.nih.gov` failed with `ClientConnectorError: Network is unreachable`, causing download attempts to retry and surface network errors in the logs.
- The code was exercised up to the network I/O stage (rate limiting and retry logic were invoked), and no local exceptions were observed in parsing, DB operations, or the sketch invocation scaffolding before the network failure.
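The rate-limiting behaviour described above can be illustrated with a minimal sketch. It assumes an `asyncio`-based pipeline (suggested by the `ClientConnectorError` in the logs, but not confirmed here). The class body, the `default_rps` helper, and the `acquire` method name are hypothetical and are not taken from the PR's code; only the 10 rps / 3 rps defaults come from the description.

```python
import asyncio
import time
from typing import Optional


def default_rps(ncbi_api_key: Optional[str]) -> float:
    """Defaults stated in the PR description: 10 rps with an API key, else 3 rps."""
    return 10.0 if ncbi_api_key else 3.0


class RequestRateLimiter:
    """Hypothetical sketch: space request starts at least 1/rps seconds apart."""

    def __init__(self, rps: float) -> None:
        if rps <= 0:
            raise ValueError("rps must be positive")
        self._interval = 1.0 / rps
        self._lock = asyncio.Lock()
        self._next_slot = 0.0  # monotonic time at which the next request may start

    async def acquire(self) -> None:
        # Reserve a distinct time slot under the lock, then sleep outside it
        # so that other callers can reserve their own slots in the meantime.
        async with self._lock:
            now = time.monotonic()
            wait = self._next_slot - now
            self._next_slot = max(now, self._next_slot) + self._interval
        if wait > 0:
            await asyncio.sleep(wait)


async def demo() -> float:
    # Create the limiter inside the running loop; 6 requests at 50 rps
    # should take about 5 intervals of 0.02 s.
    limiter = RequestRateLimiter(rps=50)
    start = time.monotonic()
    await asyncio.gather(*(limiter.acquire() for _ in range(6)))
    return time.monotonic() - start


elapsed = asyncio.run(demo())
print(f"6 acquisitions took {elapsed:.2f} s")
```

Each worker would call `await limiter.acquire()` immediately before issuing its `efetch` request, so a single shared limiter bounds the aggregate request rate regardless of worker concurrency.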
… work remains

### Motivation

- Some NCBI `efetch` responses are terminal (e.g. `EMPTY RESULT`) and should be recorded as a terminal `EMPTY` status rather than retried and recorded as `ERROR`.
- The scheduler could hang indefinitely when only terminal errors remained because it counted all `ERROR` rows as pending work and kept waiting for retryable work.

### Description

- Add `EmptyResultError` in `worker.py` and raise it when the downloaded response is empty or contains `EMPTY RESULT` / `FAILURE` payloads, and preserve existing `RateLimitError` semantics.
- Update `scheduler.py` to catch `EmptyResultError`, mark the job `EMPTY`, clean up temp/output files, and stop retrying that accession.
- Add `count_claimable` to `db.py` to return the number of truly claimable jobs (pending plus retryable `ERROR` entries) and use it in the scheduler loop to decide when to exit.
- Mark jobs that exhaust their retry budget as `FAILED` so they are no longer considered retryable, and adjust the pending/claimable calculation used by the monitor/exit logic.

### Testing

- No automated tests were run because end-to-end validation requires network `efetch` calls which are environment-dependent and were not executed here.
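The status handling described above can be sketched against an in-memory SQLite database. This is an illustration under assumptions, not the PR's code. The `jobs` schema, the `PENDING` status name, the `check_payload` and `record_failure` helpers, the retry budget of 3, and the placeholder accessions `ACC_A`/`ACC_B` are all hypothetical. Only `EmptyResultError`, `count_claimable`, and the `EMPTY`/`ERROR`/`FAILED` statuses are named in the description. A real scheduler would presumably also wait for in-flight jobs before exiting.

```python
import sqlite3


class EmptyResultError(Exception):
    """Terminal outcome: NCBI has nothing to return, so a retry cannot succeed."""


def check_payload(accession: str, payload: str) -> str:
    """Return the payload if it looks like FASTA; raise for terminal or malformed bodies."""
    body = payload.strip()
    # Substring matching is an assumption; the PR only names the two markers.
    if not body or "EMPTY RESULT" in body or "FAILURE" in body:
        raise EmptyResultError(accession)
    if not body.startswith(">"):
        raise ValueError(f"{accession}: response is not FASTA")  # retryable
    return body


def record_failure(conn, accession, exc, max_attempts=3):
    """Map an exception to EMPTY (terminal), ERROR (retryable) or FAILED (budget spent)."""
    if isinstance(exc, EmptyResultError):
        conn.execute("UPDATE jobs SET status = 'EMPTY' WHERE accession = ?", (accession,))
        return
    # SET expressions see the pre-update row, so `attempts + 1` is the new count.
    conn.execute(
        "UPDATE jobs SET attempts = attempts + 1, "
        "status = CASE WHEN attempts + 1 >= ? THEN 'FAILED' ELSE 'ERROR' END "
        "WHERE accession = ?",
        (max_attempts, accession),
    )


def count_claimable(conn) -> int:
    """Jobs a worker could still pick up: PENDING plus retryable ERROR rows."""
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM jobs WHERE status IN ('PENDING', 'ERROR')"
    ).fetchone()
    return n


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (accession TEXT PRIMARY KEY, "
    "status TEXT NOT NULL DEFAULT 'PENDING', attempts INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany("INSERT INTO jobs (accession) VALUES (?)", [("ACC_A",), ("ACC_B",)])

# ACC_A: the response body is an EMPTY RESULT marker, so it becomes terminal at once.
try:
    check_payload("ACC_A", "EMPTY RESULT\n")
except EmptyResultError as exc:
    record_failure(conn, "ACC_A", exc)

# ACC_B: a transient error repeats until the retry budget is spent.
# The loop ends because FAILED and EMPTY rows are not counted as claimable.
while count_claimable(conn) > 0:
    record_failure(conn, "ACC_B", ConnectionError("network unreachable"))

statuses = dict(conn.execute("SELECT accession, status FROM jobs"))
print(statuses)
```

Because only `PENDING` and `ERROR` rows count as claimable, the loop terminates once every job has reached a terminal status, which is the exit condition the description attributes to the scheduler.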
This works, but will not be used as there are 52M accessions and NCBI rate limits mean it won't be feasible to get every human accession.
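A back-of-the-envelope estimate makes the scale concrete. It assumes one `efetch` request per accession, sustained at the default limits from the PR description (10 rps with an API key, 3 rps without), and it ignores retries and download time. That gives roughly 60 days and 200 days of continuous requests, respectively. The helper name `days_to_fetch` is illustrative only.

```python
ACCESSIONS = 52_000_000   # figure quoted in the comment above
SECONDS_PER_DAY = 86_400


def days_to_fetch(n_requests: int, rps: float) -> float:
    """Lower bound on wall-clock days when requests are capped at `rps` per second."""
    return n_requests / rps / SECONDS_PER_DAY


for label, rps in [("with API key", 10), ("without API key", 3)]:
    print(f"{label}: about {days_to_fetch(ACCESSIONS, rps):.0f} days at {rps} rps")
```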