
Release v1.5.2 with metrics #253

Merged
gaurav merged 37 commits into ci from release-v1.5.2-with-metrics
Apr 8, 2026
Conversation

gaurav (Collaborator) commented Apr 7, 2026

Adds some metrics for tracking performance on the Solr database:

  • Adds a ?full=true mode to the /status endpoint which provides more detailed information from Solr, including memory/CPU information and cache information.
  • Maintains a "recent times" deque which is used to continuously report on recent queries and to warn when query time is going up (reimplements part of PR #231, "Track recent times").
  • Logs queries along with time taken, for tracking performance in the long term (incorporates PR #230, "Add logging", which has been merged into branch master but not into branch ci).
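A minimal sketch of the "recent times" idea described above. All names here (RECENT_TIMES_SIZE, record_query_time, the nearest-rank percentile helper) are hypothetical illustrations, not the PR's actual implementation:

```python
import math
import statistics
from collections import deque

RECENT_TIMES_SIZE = 1000  # hypothetical cap; the PR's actual default may differ

recent_times_ms = deque(maxlen=RECENT_TIMES_SIZE)

def record_query_time(duration_ms: float) -> None:
    """Append a query duration; the deque drops the oldest entry when full."""
    recent_times_ms.append(duration_ms)

def percentile(sorted_times: list, q: float) -> float:
    """Nearest-rank percentile over an already-sorted list."""
    index = max(0, math.ceil(len(sorted_times) * q) - 1)
    return sorted_times[index]

def recent_times_summary() -> dict:
    """Summarise recent query latencies for a /status-style response."""
    if not recent_times_ms:
        return {'count': 0}
    times = sorted(recent_times_ms)
    return {
        'count': len(times),
        'mean_ms': statistics.fmean(times),
        'p50_ms': percentile(times, 0.50),
        'p95_ms': percentile(times, 0.95),
    }
```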

Also some unrelated changes:

  • Renames reverse_lookup() to curie_lookup() and replaces lookup_names_(get|post) with synonyms_(get|post) for clarity.
  • Fixes a typo in data-loading/README.md.

gaurav and others added 8 commits April 7, 2026 15:38
- /status now fetches JVM heap, CPU load, and OS memory from Solr's
  admin/info/system endpoint (in parallel with existing core status call)
- /status now fetches filterCache and queryResultCache hit ratios and
  eviction counts from Solr's MBeans endpoint
- recent_queries in /status now includes p50/p95/p99 percentiles
  alongside the existing mean, to distinguish GC pauses from sustained load
- lookup() now emits WARNING instead of INFO for queries exceeding
  SLOW_QUERY_THRESHOLD_MS (default 500ms, configurable via env var)
- Added documentation/Performance.md with a diagnostic decision tree
  explaining how to use these metrics to identify CPU vs memory pressure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
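The WARNING-vs-INFO behaviour described above can be sketched as follows. The constant name, default, and env var come from the commit message; the surrounding function is a hypothetical illustration, not the PR's actual code:

```python
import logging
import os

# Name and 500ms default taken from the commit message; configurable via env var.
SLOW_QUERY_THRESHOLD_MS = float(os.environ.get('SLOW_QUERY_THRESHOLD_MS', 500))

logger = logging.getLogger(__name__)

def log_query(query: str, duration_ms: float) -> None:
    """Log at WARNING when a query exceeds the slow threshold, INFO otherwise."""
    level = logging.WARNING if duration_ms > SLOW_QUERY_THRESHOLD_MS else logging.INFO
    logger.log(level, 'lookup(%r) took %.1f ms', query, duration_ms)
```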
Tracks query start timestamps in a separate deque (RECENT_QUERY_TIMESTAMPS_COUNT,
default 50k entries) independent of the latency deque. /status now reports
queries_last_60s, queries_per_second_last_60s, queries_last_300s, and
queries_per_second_last_300s under recent_queries.rate.

The large deque size ensures rate estimates remain meaningful at high query
rates (500 qps fills 1000 entries in 2s but covers 100s with 50k entries).
Rate is computed by scanning from newest to oldest and stopping at the window
boundary, so it's O(window_size) not O(deque_size).

Updated documentation/Performance.md to reflect all implemented metrics and
added query rate as Step 1 in the diagnostic decision tree.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Index stats (numDocs, segmentCount, size, etc.), jvm, os, and cache are
now nested under 'solr' in the response, making it clear which fields
come from the database vs. the Python frontend. 'recent_queries' remains
at the top level as it is tracked by the Python process.

Updated documentation/Performance.md to reflect the new structure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the two separate deques (recent_query_times and
recent_query_timestamps) with a single query_log deque of
(timestamp_s, duration_ms) tuples, controlled by QUERY_LOG_SIZE
(default 50,000).

Expands recent_queries.rate with:
- history_span_seconds: how much history the log covers
- time_since_last_query_seconds: staleness indicator
- queries_last_10s / queries_per_second_last_10s: spike detection
- inter_arrival_ms: mean, median, min, max, p95 gaps between queries

recent_times_ms is now capped at the 1000 most recent entries in the
response to avoid large payloads from the larger log.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The raw list of durations is redundant now that mean, p50, p95, and p99
are reported, and the full data is available in query_log for deeper
analysis. Keeping a list of up to 1000 floats inline in a status
response is noisy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI (Contributor) left a comment

Pull request overview

This PR prepares the v1.5.2 release by extending /status with performance/diagnostic metrics, adding operational documentation, and updating release/version metadata.

Changes:

  • Expand /status to include recent_queries metrics and additional Solr JVM/OS/cache diagnostics.
  • Add a performance diagnostics guide documenting /status fields and log interpretation.
  • Bump OpenAPI version to 1.5.2 and adjust the release workflow trigger.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Summary per file:

  • api/server.py: Adds query timing/logging, query-rate/latency aggregation, and Solr sysinfo/cache collection surfaced via /status.
  • tests/test_status.py: Adds a /status endpoint test.
  • documentation/Performance.md: New guide explaining /status metrics and operational troubleshooting steps.
  • api/resources/openapi.yml: Version bump to 1.5.2.
  • .github/workflows/release-name-resolution.yml: Adds pull_request trigger to the release workflow.


Comment on lines +21 to +28
assert status['version'] > 1
assert status['size'] != ''
assert status['startTime']

# Count the specific number of test documents we load.
assert status['numDocs'] == 89
assert status['maxDoc'] == 89
assert status['deletedDocs'] == 0
Copilot AI commented Apr 7, 2026
This test asserts legacy top-level /status fields (e.g., numDocs/maxDoc/deletedDocs/version/size/startTime). In the updated /status response those fields are nested under the solr key, so these assertions will fail. Update the test to read from status['solr'][...] (and adjust the version assertion accordingly).

Suggested change:
- assert status['version'] > 1
- assert status['size'] != ''
- assert status['startTime']
- # Count the specific number of test documents we load.
- assert status['numDocs'] == 89
- assert status['maxDoc'] == 89
- assert status['deletedDocs'] == 0
+ assert 'solr' in status
+ assert status['solr']['version'] != ''
+ assert status['solr']['size'] != ''
+ assert status['solr']['startTime']
+ # Count the specific number of test documents we load.
+ assert status['solr']['numDocs'] == 89
+ assert status['solr']['maxDoc'] == 89
+ assert status['solr']['deletedDocs'] == 0

Comment on lines 3 to 6
on:
pull_request:
release:
types: [published]
Copilot AI commented Apr 7, 2026
The release workflow now triggers on pull_request, but the job always logs into GHCR and runs docker/build-push-action with push: true. On PRs this can publish images unintentionally and also references github.event.release.target_commitish, which won't exist for PR events. Consider removing the pull_request trigger or gating the push/build-args to release events only (e.g., if: github.event_name == 'release').
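One way to apply the reviewer's gating suggestion is a job-level condition (a sketch; the job name is hypothetical, and the PR ultimately removed the pull_request trigger instead):

```yaml
jobs:
  build-and-push:
    # Run the GHCR login and docker/build-push-action only for release events,
    # so a pull_request trigger cannot publish images or dereference
    # github.event.release.target_commitish.
    if: github.event_name == 'release'
    runs-on: ubuntu-latest
```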

gaurav and others added 2 commits April 7, 2026 16:21
Concurrent requests complete in a different order than they started,
so insertion order in query_log does not reflect arrival order. Sort
the snapshot by timestamp before computing inter-arrival gaps.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- inter_arrival_ms: guard requires >= 3 timestamps (>= 2 gaps) since
  statistics.quantiles needs at least 2 data points; previously crashed
  with exactly 2 log entries
- test_status.py: update field access to match new 'solr' key structure;
  fix status['version'] > 1 (version is a string, not an int)
- release workflow: remove accidental pull_request trigger that would
  have run the Docker publish job on every PR

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
gaurav and others added 11 commits April 8, 2026 11:27
Calculates used percentage from free/total values, returning None
when either value is missing or invalid (zero total, negative free,
free > total).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces SolrClient with static parse_* methods (parse_jvm, parse_os,
parse_cache, parse_index) and async fetch_* methods, plus a high-level
fetch_status() that orchestrates all three Solr calls concurrently.
status() in server.py shrinks from ~200 to ~100 lines.

Also fixes the broken cache stat extraction: Solr stores stats under
namespaced keys (CACHE.searcher.<name>.hitratio) not bare keys, and
there is no maxSize stat. Adds lookups and hits fields instead.

14 new unit tests in tests/test_solr.py cover all parsers without
requiring a running Solr instance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reports what fraction of logged queries fall into each performance tier:
ideal (<100ms), fine (<SLOW_QUERY_THRESHOLD_MS), slow (<1000ms), and
very_slow (>=1000ms). Includes slow_threshold_ms so the split point is
self-documenting in the response.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
By default /status now makes only one Solr call (cores endpoint),
returning basic index stats (numDocs, startTime, etc.) with jvm, os,
and cache as null. Pass ?full=true to restore the previous behaviour
of fetching all three endpoints concurrently.

This makes the default path suitable for frequent Kubernetes liveness
probes without hammering Solr with sysinfo and mbeans requests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
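The one-call default versus three-concurrent-calls ?full=true split can be sketched like this. The fetch_* stubs stand in for SolrClient's async fetchers; their bodies and return values here are placeholders, not the real API:

```python
import asyncio

# Stubs standing in for SolrClient's async fetch_* methods; real ones
# would hit Solr's cores, sysinfo, and mbeans endpoints.
async def fetch_index() -> dict:
    return {'numDocs': 89}

async def fetch_jvm() -> dict:
    return {'heap_used_percent': 40.0}

async def fetch_cache() -> dict:
    return {'filterCache': {'hitratio': 0.9}}

async def status(full: bool = False) -> dict:
    if not full:
        # Single cheap Solr call: suitable for frequent Kubernetes
        # liveness probes without hammering sysinfo/mbeans.
        return {'solr': {'index': await fetch_index(), 'jvm': None, 'cache': None}}
    # full=true: all three Solr calls run concurrently.
    index, jvm, cache = await asyncio.gather(fetch_index(), fetch_jvm(), fetch_cache())
    return {'solr': {'index': index, 'jvm': jvm, 'cache': cache}}
```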
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace four separate generator-sum passes over durations with a single
for-loop (O(4n) → O(n)). Extract magic numbers 100 and 1000 as
IDEAL_QUERY_THRESHOLD_MS and VERY_SLOW_QUERY_THRESHOLD_MS to match the
existing SLOW_QUERY_THRESHOLD_MS constant pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Performance.md: replace `size/maxSize` cache table row with the three
  fields actually returned by parse_cache() (size, lookups, hits);
  maxSize is not in the API response
- data-loading/README.md: fix typo "repical" → "replica" in backup path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@gaurav gaurav marked this pull request as ready for review April 8, 2026 20:09
@gaurav gaurav merged commit 9788e6f into ci Apr 8, 2026
1 check passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in NameRes sprints Apr 8, 2026
Labels: None yet
Projects: Status: Done
2 participants