Reconcile registry model states on auth changes#2121
Reconcile registry model states on auth changes#2121destinoantagonista-wq wants to merge 3 commits intorouter-for-me:mainfrom
Conversation
Add Manager.ReconcileRegistryModelStates to clear stale per-model runtime failures for models currently registered in the global model registry. The method finds models supported for an auth, resets non-clean ModelState entries, updates aggregated availability, persists changes, and pushes a snapshot to the scheduler. Introduce modelStateIsClean helper to determine when a model state needs resetting. Call ReconcileRegistryModelStates from Service paths that register/refresh models (applyCoreAuthAddOrUpdate and refreshModelRegistrationForAuth) to keep the scheduler and global registry aligned after model re-registration.
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a mechanism to maintain consistency between the scheduler and the global model registry, particularly after authentication-related model re-registrations. By actively reconciling model states, it prevents scenarios where models might appear available in the registry but remain blocked by the scheduler due to outdated failure states, thereby improving the reliability and responsiveness of model availability. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a reconciliation mechanism for model states in the auth manager to prevent stale failure states from blocking available models. The implementation is generally sound, but I've identified a potential data inconsistency issue due to an ignored error during persistence, and a minor logic simplification that would improve code clarity. The new ReconcileRegistryModelStates function is correctly integrated into the service's auth update and refresh flows.
sdk/cliproxy/auth/conductor.go
Outdated
| auth.Status = StatusActive | ||
| } | ||
| auth.UpdatedAt = now | ||
| _ = m.persist(ctx, auth) |
There was a problem hiding this comment.
The error returned by m.persist(ctx, auth) is being ignored. If persistence fails, the in-memory state of the auth manager will be out of sync with the backing store. This could lead to inconsistent behavior or loss of state changes after a service restart. The error should be handled, for example by logging it.
if err := m.persist(ctx, auth); err != nil {
logEntryWithRequestID(ctx).WithField("auth_id", auth.ID).Warnf("failed to persist auth changes during model state reconciliation: %v", err)
}| baseModel := canonicalModelKey(modelKey) | ||
| if baseModel == "" { | ||
| baseModel = strings.TrimSpace(modelKey) | ||
| } |
There was a problem hiding this comment.
The logic to determine baseModel has a potentially confusing fallback. If canonicalModelKey(modelKey) returns an empty string, it falls back to strings.TrimSpace(modelKey). The supported map is keyed by canonical model keys. If canonicalModelKey fails (returns empty), it's likely that the model key is invalid or not in a form that can be canonicalized. Falling back to the raw (but trimmed) modelKey will likely result in a lookup miss in the supported map anyway if the key contains suffixes or other non-canonical parts. This makes the code harder to reason about. It would be clearer to skip reconciliation for model keys that cannot be canonicalized.
| baseModel := canonicalModelKey(modelKey) | |
| if baseModel == "" { | |
| baseModel = strings.TrimSpace(modelKey) | |
| } | |
| baseModel := canonicalModelKey(modelKey) | |
| if baseModel == "" { | |
| continue | |
| } |
xkonjin
left a comment
There was a problem hiding this comment.
Code Review
Summary: Clean and well-motivated fix. The reconciliation logic correctly handles the case where models reappear in the registry after a refresh but the scheduler still has stale failure states for them. A few issues worth addressing:
🔴 Bug: persist error silently swallowed
_ = m.persist(ctx, auth)The persist call is inside the lock section and the error is discarded entirely. If persistence fails (e.g. disk I/O, serialization error), the in-memory state gets reset but the on-disk state is left stale. On restart, the old failure state would be restored, defeating the reconciliation.
Suggestion: log the error at warning level at minimum, even if you don't want to return it:
if err := m.persist(ctx, auth); err != nil {
m.log.Warn("reconcile: persist failed", "authID", authID, "err", err)
}🟡 Race: m.persist called under lock
Calling m.persist while holding m.mu is risky if persist does any I/O. This can block other operations that need the lock. The pattern elsewhere in this codebase (e.g. applyCoreAuthAddOrUpdate) typically clones first and persists outside the lock. Consider:
m.mu.Lock()
// ... mutate auth ...
snapshot = auth.Clone()
m.mu.Unlock()
if err := m.persist(ctx, snapshot); err != nil {
// log
}
// then upsert scheduler🟡 context.Background() in refreshModelRegistrationForAuth
The two new calls in service.go use context.Background() instead of propagating a real context. If the calling context has a deadline or cancellation, it won't flow through to persist. Use the caller's context where available, or at least context.WithTimeout.
💡 canonicalModelKey double-call on same modelKey
baseModel := canonicalModelKey(modelKey)
if baseModel == "" {
baseModel = strings.TrimSpace(modelKey)
}If canonicalModelKey already calls strings.TrimSpace internally (which it likely does for normalization), this fallback is a no-op. Worth verifying the invariants — if canonicalModelKey can legitimately return empty for a valid key, that case should be documented.
💡 No test coverage for the reconcile path
The PR lacks tests for ReconcileRegistryModelStates. Given the subtlety of the state transitions (quota reset, error reset, status back to Active), a table-driven test covering the main cases (no supported models, already-clean states, mixed clean/dirty, persist error) would add meaningful confidence.
Overall: the fix is correct in principle, but the persist-under-lock + silent error pattern needs addressing before merge.
luispater
left a comment
There was a problem hiding this comment.
Summary
This moves model-state cleanup into the auth re-registration paths, which is the right direction. However, there is still a stale-state hole when the failed model disappears or is renamed during registry refresh.
Key findings
- Blocking:
ReconcileRegistryModelStates()only resets entries for models still returned byGetModelsForClient(). If the stale failure belongs to a model that was removed or renamed during refresh, the loop skips it,changedstays false, and auth-level availability/status is never recomputed. That leaves staleUnavailable/StatusMessage/LastErroron the auth even though the refreshed registry snapshot is healthy. This still affects management output and websocket availability fallback paths.
Test plan
- Reviewed the changed paths in
sdk/cliproxy/auth/conductor.goandsdk/cliproxy/service.go - Ran
go test ./sdk/cliproxy/auth ./sdk/cliproxy/...
Follow-ups
- I think this should either prune unsupported
ModelStatesduring reconciliation or recompute aggregated auth state using only the currently supported registry models before persisting.
|
Updated this PR with the follow-up fixes from review. What changed:
Validation:
I did not apply the canonicalModelKey simplification suggestion because canonicalModelKey already returns the trimmed raw model when suffix parsing does not change it, and the current branch intentionally prunes unsupported entries instead of skipping them. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e08f68ed7c
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if !hasModelError(auth, now) { | ||
| auth.LastError = nil | ||
| auth.StatusMessage = "" | ||
| auth.Status = StatusActive | ||
| } |
There was a problem hiding this comment.
Keep disabled auths from being marked active
When ReconcileRegistryModelStates removes or resets stale ModelStates, it unconditionally sets auth.Status = StatusActive if no model errors remain. This path is now invoked from applyCoreAuthAddOrUpdate/refreshModelRegistrationForAuth after model registration, including cases where an auth is intentionally disabled (auth.Disabled == true), so a disabled auth can end up persisted/reporting as active despite still being disabled. That status regression is user-visible in management output and can break workflows that rely on status semantics; this reset should be skipped when the auth is disabled (or already StatusDisabled).
Useful? React with 👍 / 👎.
Add Manager.ReconcileRegistryModelStates to clear stale per-model runtime failures for models currently registered in the global model registry. The method finds models supported for an auth, resets non-clean ModelState entries, updates aggregated availability, persists changes, and pushes a snapshot to the scheduler. Introduce modelStateIsClean helper to determine when a model state needs resetting. Call ReconcileRegistryModelStates from Service paths that register/refresh models (applyCoreAuthAddOrUpdate and refreshModelRegistrationForAuth) to keep the scheduler and global registry aligned after model re-registration.