
Release v1.5.2 with metrics #253

Merged
gaurav merged 37 commits into ci from release-v1.5.2-with-metrics
Apr 8, 2026
Conversation

gaurav (Collaborator) commented Apr 7, 2026

Adds some metrics for tracking performance on the Solr database:

  • Adds a ?full=true mode to the /status endpoint which provides more detailed information from Solr, including memory/CPU information and cache information.
  • Maintains a "recent times" deque which is used to continuously report on recent queries and to warn when query time is going up (reimplements part of PR #231, "Track recent times").
  • Logs queries along with time taken, for tracking performance in the long term (incorporates PR #230, "Add logging", which has been merged into branch master but not into branch ci).
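A minimal sketch of the "recent times" idea described above. All names here (RECENT_TIMES_SIZE, record_query_time, the nearest-rank percentile helper) are hypothetical illustrations, not the PR's actual implementation:

```python
import math
import statistics
from collections import deque

RECENT_TIMES_SIZE = 1000  # hypothetical cap; the PR's actual default may differ

recent_times_ms = deque(maxlen=RECENT_TIMES_SIZE)

def record_query_time(duration_ms: float) -> None:
    """Append a query duration; the deque drops the oldest entry when full."""
    recent_times_ms.append(duration_ms)

def percentile(sorted_times: list, q: float) -> float:
    """Nearest-rank percentile over an already-sorted list."""
    index = max(0, math.ceil(len(sorted_times) * q) - 1)
    return sorted_times[index]

def recent_times_summary() -> dict:
    """Summarise recent query latencies for a /status-style response."""
    if not recent_times_ms:
        return {'count': 0}
    times = sorted(recent_times_ms)
    return {
        'count': len(times),
        'mean_ms': statistics.fmean(times),
        'p50_ms': percentile(times, 0.50),
        'p95_ms': percentile(times, 0.95),
    }
```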

Also some unrelated changes:

  • Renames reverse_lookup() to curie_lookup() and replaces lookup_names_(get|post) with synonyms_(get|post) for clarity.
  • Fixes a typo in data-loading/README.md.

gaurav and others added 8 commits April 7, 2026 15:38
- /status now fetches JVM heap, CPU load, and OS memory from Solr's
  admin/info/system endpoint (in parallel with existing core status call)
- /status now fetches filterCache and queryResultCache hit ratios and
  eviction counts from Solr's MBeans endpoint
- recent_queries in /status now includes p50/p95/p99 percentiles
  alongside the existing mean, to distinguish GC pauses from sustained load
- lookup() now emits WARNING instead of INFO for queries exceeding
  SLOW_QUERY_THRESHOLD_MS (default 500ms, configurable via env var)
- Added documentation/Performance.md with a diagnostic decision tree
  explaining how to use these metrics to identify CPU vs memory pressure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
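The WARNING-vs-INFO behaviour described above can be sketched as follows. The constant name, default, and env var come from the commit message; the surrounding function is a hypothetical illustration, not the PR's actual code:

```python
import logging
import os

# Name and 500ms default taken from the commit message; configurable via env var.
SLOW_QUERY_THRESHOLD_MS = float(os.environ.get('SLOW_QUERY_THRESHOLD_MS', 500))

logger = logging.getLogger(__name__)

def log_query(query: str, duration_ms: float) -> None:
    """Log at WARNING when a query exceeds the slow threshold, INFO otherwise."""
    level = logging.WARNING if duration_ms > SLOW_QUERY_THRESHOLD_MS else logging.INFO
    logger.log(level, 'lookup(%r) took %.1f ms', query, duration_ms)
```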
Tracks query start timestamps in a separate deque (RECENT_QUERY_TIMESTAMPS_COUNT,
default 50k entries) independent of the latency deque. /status now reports
queries_last_60s, queries_per_second_last_60s, queries_last_300s, and
queries_per_second_last_300s under recent_queries.rate.

The large deque size ensures rate estimates remain meaningful at high query
rates (500 qps fills 1000 entries in 2s but covers 100s with 50k entries).
Rate is computed by scanning from newest to oldest and stopping at the window
boundary, so it's O(window_size) not O(deque_size).

Updated documentation/Performance.md to reflect all implemented metrics and
added query rate as Step 1 in the diagnostic decision tree.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Index stats (numDocs, segmentCount, size, etc.), jvm, os, and cache are
now nested under 'solr' in the response, making it clear which fields
come from the database vs. the Python frontend. 'recent_queries' remains
at the top level as it is tracked by the Python process.

Updated documentation/Performance.md to reflect the new structure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the two separate deques (recent_query_times and
recent_query_timestamps) with a single query_log deque of
(timestamp_s, duration_ms) tuples, controlled by QUERY_LOG_SIZE
(default 50,000).

Expands recent_queries.rate with:
- history_span_seconds: how much history the log covers
- time_since_last_query_seconds: staleness indicator
- queries_last_10s / queries_per_second_last_10s: spike detection
- inter_arrival_ms: mean, median, min, max, p95 gaps between queries

recent_times_ms is now capped at the 1000 most recent entries in the
response to avoid large payloads from the larger log.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The raw list of durations is redundant now that mean, p50, p95, and p99
are reported, and the full data is available in query_log for deeper
analysis. Keeping a list of up to 1000 floats inline in a status
response is noisy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI (Contributor) left a comment

Pull request overview

This PR prepares the v1.5.2 release by extending /status with performance/diagnostic metrics, adding operational documentation, and updating release/version metadata.

Changes:

  • Expand /status to include recent_queries metrics and additional Solr JVM/OS/cache diagnostics.
  • Add a performance diagnostics guide documenting /status fields and log interpretation.
  • Bump OpenAPI version to 1.5.2 and adjust the release workflow trigger.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Summary per file:

  • api/server.py: Adds query timing/logging, query-rate/latency aggregation, and Solr sysinfo/cache collection surfaced via /status.
  • tests/test_status.py: Adds a /status endpoint test.
  • documentation/Performance.md: New guide explaining /status metrics and operational troubleshooting steps.
  • api/resources/openapi.yml: Version bump to 1.5.2.
  • .github/workflows/release-name-resolution.yml: Adds pull_request trigger to the release workflow.


Comment on lines +21 to +28
assert status['version'] > 1
assert status['size'] != ''
assert status['startTime']

# Count the specific number of test documents we load.
assert status['numDocs'] == 89
assert status['maxDoc'] == 89
assert status['deletedDocs'] == 0
Copilot AI commented Apr 7, 2026
This test asserts legacy top-level /status fields (e.g., numDocs/maxDoc/deletedDocs/version/size/startTime). In the updated /status response those fields are nested under the solr key, so these assertions will fail. Update the test to read from status['solr'][...] (and adjust the version assertion accordingly).

Suggested change:
- assert status['version'] > 1
- assert status['size'] != ''
- assert status['startTime']
- # Count the specific number of test documents we load.
- assert status['numDocs'] == 89
- assert status['maxDoc'] == 89
- assert status['deletedDocs'] == 0
+ assert 'solr' in status
+ assert status['solr']['version'] != ''
+ assert status['solr']['size'] != ''
+ assert status['solr']['startTime']
+ # Count the specific number of test documents we load.
+ assert status['solr']['numDocs'] == 89
+ assert status['solr']['maxDoc'] == 89
+ assert status['solr']['deletedDocs'] == 0

Comment on lines 3 to 6
on:
pull_request:
release:
types: [published]
Copilot AI commented Apr 7, 2026
The release workflow now triggers on pull_request, but the job always logs into GHCR and runs docker/build-push-action with push: true. On PRs this can publish images unintentionally and also references github.event.release.target_commitish, which won't exist for PR events. Consider removing the pull_request trigger or gating the push/build-args to release events only (e.g., if: github.event_name == 'release').
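One way to apply the reviewer's gating suggestion is a job-level condition (a sketch; the job name is hypothetical, and the PR ultimately removed the pull_request trigger instead):

```yaml
jobs:
  build-and-push:
    # Run the GHCR login and docker/build-push-action only for release events,
    # so a pull_request trigger cannot publish images or dereference
    # github.event.release.target_commitish.
    if: github.event_name == 'release'
    runs-on: ubuntu-latest
```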

gaurav and others added 2 commits April 7, 2026 16:21
Concurrent requests complete in a different order than they started,
so insertion order in query_log does not reflect arrival order. Sort
the snapshot by timestamp before computing inter-arrival gaps.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- inter_arrival_ms: guard requires >= 3 timestamps (>= 2 gaps) since
  statistics.quantiles needs at least 2 data points; previously crashed
  with exactly 2 log entries
- test_status.py: update field access to match new 'solr' key structure;
  fix status['version'] > 1 (version is a string, not an int)
- release workflow: remove accidental pull_request trigger that would
  have run the Docker publish job on every PR

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
gaurav and others added 11 commits April 8, 2026 11:27
Calculates used percentage from free/total values, returning None
when either value is missing or invalid (zero total, negative free,
free > total).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces SolrClient with static parse_* methods (parse_jvm, parse_os,
parse_cache, parse_index) and async fetch_* methods, plus a high-level
fetch_status() that orchestrates all three Solr calls concurrently.
status() in server.py shrinks from ~200 to ~100 lines.

Also fixes the broken cache stat extraction: Solr stores stats under
namespaced keys (CACHE.searcher.<name>.hitratio) not bare keys, and
there is no maxSize stat. Adds lookups and hits fields instead.

14 new unit tests in tests/test_solr.py cover all parsers without
requiring a running Solr instance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reports what fraction of logged queries fall into each performance tier:
ideal (<100ms), fine (<SLOW_QUERY_THRESHOLD_MS), slow (<1000ms), and
very_slow (>=1000ms). Includes slow_threshold_ms so the split point is
self-documenting in the response.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
By default /status now makes only one Solr call (cores endpoint),
returning basic index stats (numDocs, startTime, etc.) with jvm, os,
and cache as null. Pass ?full=true to restore the previous behaviour
of fetching all three endpoints concurrently.

This makes the default path suitable for frequent Kubernetes liveness
probes without hammering Solr with sysinfo and mbeans requests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
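The one-call default versus three-concurrent-calls ?full=true split can be sketched like this. The fetch_* stubs stand in for SolrClient's async fetchers; their bodies and return values here are placeholders, not the real API:

```python
import asyncio

# Stubs standing in for SolrClient's async fetch_* methods; real ones
# would hit Solr's cores, sysinfo, and mbeans endpoints.
async def fetch_index() -> dict:
    return {'numDocs': 89}

async def fetch_jvm() -> dict:
    return {'heap_used_percent': 40.0}

async def fetch_cache() -> dict:
    return {'filterCache': {'hitratio': 0.9}}

async def status(full: bool = False) -> dict:
    if not full:
        # Single cheap Solr call: suitable for frequent Kubernetes
        # liveness probes without hammering sysinfo/mbeans.
        return {'solr': {'index': await fetch_index(), 'jvm': None, 'cache': None}}
    # full=true: all three Solr calls run concurrently.
    index, jvm, cache = await asyncio.gather(fetch_index(), fetch_jvm(), fetch_cache())
    return {'solr': {'index': index, 'jvm': jvm, 'cache': cache}}
```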
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace four separate generator-sum passes over durations with a single
for-loop (O(4n) → O(n)). Extract magic numbers 100 and 1000 as
IDEAL_QUERY_THRESHOLD_MS and VERY_SLOW_QUERY_THRESHOLD_MS to match the
existing SLOW_QUERY_THRESHOLD_MS constant pattern.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Performance.md: replace `size/maxSize` cache table row with the three
  fields actually returned by parse_cache() (size, lookups, hits);
  maxSize is not in the API response
- data-loading/README.md: fix typo "repical" → "replica" in backup path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@gaurav gaurav marked this pull request as ready for review April 8, 2026 20:09
@gaurav gaurav merged commit 9788e6f into ci Apr 8, 2026
1 check passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in NameRes sprints Apr 8, 2026
Labels: None yet
Projects: Status: Done
2 participants