
Treat NCBI EMPTY RESULT as terminal; exit scheduler when no retryable work remains (#16)

Draft
dkoslicki wants to merge 3 commits into main from codex/create-accession_sketcher-package

Conversation

@dkoslicki
Member

Motivation

  • Some NCBI efetch responses are terminal (e.g. EMPTY RESULT) and should be recorded as a terminal EMPTY status rather than retried and recorded as ERROR.
  • The scheduler could hang indefinitely when only terminal errors remained because it counted all ERROR rows as pending work and kept waiting for retryable work.

Description

  • Add EmptyResultError in worker.py, raised when the downloaded response is empty or contains an EMPTY RESULT / FAILURE payload; existing RateLimitError semantics are preserved.
  • Update scheduler.py to catch EmptyResultError, mark the job EMPTY, clean up temp/output files, and stop retrying that accession.
  • Add count_claimable to db.py to return the number of truly claimable jobs (pending plus retryable ERROR entries) and use it in the scheduler loop to decide when to exit.
  • Mark jobs that exhaust their retry budget as FAILED so they are no longer considered retryable, and adjust the pending/claimable calculation used by the monitor/exit logic.
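The claimable count described above can be sketched as a single SQL query. This is an illustrative version only: the table name `jobs`, the `attempts` column, and the retry limit are assumptions, not necessarily what `db.py` actually uses.

```python
import sqlite3

RETRY_LIMIT = 3  # hypothetical retry budget; the real value comes from config


def count_claimable(conn: sqlite3.Connection, retry_limit: int = RETRY_LIMIT) -> int:
    """Count jobs the scheduler could still hand to a worker:
    PENDING rows plus ERROR rows that have retries left.
    Terminal statuses (EMPTY, FAILED, DONE) are excluded, so the
    scheduler loop can exit once this reaches zero."""
    row = conn.execute(
        """
        SELECT COUNT(*) FROM jobs
        WHERE status = 'PENDING'
           OR (status = 'ERROR' AND attempts < ?)
        """,
        (retry_limit,),
    ).fetchone()
    return row[0]
```

Because EMPTY and FAILED rows never match either branch, a backlog consisting only of terminal errors counts as zero claimable work rather than keeping the loop alive.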

Testing

  • No automated tests were run; end-to-end validation requires network efetch calls, which are environment-dependent and could not be executed here.

Codex Task

### Motivation
- Provide a pipeline analogous to the existing `stream_sketcher` that ingests lists of NCBI accessions and runs the download->sketch->discard pipeline on each accession.
- Support highly-parallel, restartable, rate-limited downloads from NCBI (with retries, backoff, and graceful failure handling).
- Accept accession lists derived from files like `nucl_gb.accession2taxid`/`nucl_wgs.accession2taxid` and use `efetch` (HTTP API) to stream FASTA payloads for sketching.

### Description
- Add a new package `shared_scripts/accession_sketcher` containing an `accession_sketcher` module with `scheduler.py`, `worker.py`, `db.py`, and `utils.py` to mirror the architecture of `stream_sketcher` and coordinate ingestion, state, and workers.
- Implement `download_accession`, which streams FASTA from NCBI `efetch` over HTTP and validates that the payload starts with `>`, plus a `RequestRateLimiter` that enforces a configurable requests-per-second cap (defaults to 10 rps when `ncbi_api_key` is present, otherwise 3 rps).
- Add an SQLite-backed job state DB (`DB`) with atomic claiming, reset of stuck jobs, upsert, and status tracking, and worker logic that sketches using `sourmash` with retries, backoff, and output sharding.
- Include sample `config.yaml`, an example accession list, `requirements.txt`, and `run.sh` to run the new pipeline; configuration supports `ncbi_api_key`, `requests_per_second`, concurrency, timeouts, and sourmash params.
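A rate limiter of the kind described above can be sketched as an interval-spacing asyncio primitive shared across worker tasks. This is one plausible shape, not necessarily how `RequestRateLimiter` in `worker.py` is actually implemented.

```python
import asyncio
import time


class RequestRateLimiter:
    """Space requests at least 1/rps seconds apart across all tasks.

    Each caller reserves the next available time slot under a lock,
    then sleeps outside the lock until that slot arrives, so waiting
    tasks do not block each other's bookkeeping."""

    def __init__(self, requests_per_second: float):
        self._interval = 1.0 / requests_per_second
        self._lock = asyncio.Lock()
        self._next_slot = 0.0  # monotonic time of the next free slot

    async def acquire(self) -> None:
        async with self._lock:
            now = time.monotonic()
            wait = self._next_slot - now
            self._next_slot = max(now, self._next_slot) + self._interval
        if wait > 0:
            await asyncio.sleep(wait)
```

A worker would call `await limiter.acquire()` immediately before each `efetch` request; constructing the limiter with 10 or 3 rps depending on whether `ncbi_api_key` is set would match the defaults described above.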

### Testing
- Installed `sourmash` via `pip install sourmash` successfully to enable sketching calls.
- Ran the pipeline with `python -m accession_sketcher --config config.yaml`, which successfully loaded and enqueued 16 sample accessions and started workers as evidenced by logs showing `Loaded 16 accessions` and worker monitor messages.
- The run could not complete end-to-end in this environment because HTTP requests to `eutils.ncbi.nlm.nih.gov` failed with `ClientConnectorError: Network is unreachable`, causing download attempts to retry and surface network errors in logs.
- The code was exercised up to the network I/O stage (rate limiting and retry logic invoked) and no local exceptions were observed in parsing, DB operations, or the sketch invocation scaffolding prior to the network failure.
@dkoslicki
Member Author

This works, but it will not be used: there are 52M accessions, and NCBI rate limits make it infeasible to fetch every human accession.
