Skip to content

Reconcile registry model states on auth changes#2121

Open
destinoantagonista-wq wants to merge 3 commits intorouter-for-me:mainfrom
destinoantagonista-wq:main
Open

Reconcile registry model states on auth changes#2121
destinoantagonista-wq wants to merge 3 commits intorouter-for-me:mainfrom
destinoantagonista-wq:main

Conversation

@destinoantagonista-wq
Copy link

Add Manager.ReconcileRegistryModelStates to clear stale per-model runtime failures for models currently registered in the global model registry. The method finds models supported for an auth, resets non-clean ModelState entries, updates aggregated availability, persists changes, and pushes a snapshot to the scheduler. Introduce modelStateIsClean helper to determine when a model state needs resetting. Call ReconcileRegistryModelStates from Service paths that register/refresh models (applyCoreAuthAddOrUpdate and refreshModelRegistrationForAuth) to keep the scheduler and global registry aligned after model re-registration.

Add Manager.ReconcileRegistryModelStates to clear stale per-model runtime failures for models currently registered in the global model registry. The method finds models supported for an auth, resets non-clean ModelState entries, updates aggregated availability, persists changes, and pushes a snapshot to the scheduler. Introduce modelStateIsClean helper to determine when a model state needs resetting. Call ReconcileRegistryModelStates from Service paths that register/refresh models (applyCoreAuthAddOrUpdate and refreshModelRegistrationForAuth) to keep the scheduler and global registry aligned after model re-registration.
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a mechanism to maintain consistency between the scheduler and the global model registry, particularly after authentication-related model re-registrations. By actively reconciling model states, it prevents scenarios where models might appear available in the registry but remain blocked by the scheduler due to outdated failure states, thereby improving the reliability and responsiveness of model availability.

Highlights

  • New reconciliation logic: Introduced Manager.ReconcileRegistryModelStates to clear stale per-model runtime failures for models registered in the global model registry, ensuring alignment between the scheduler and registry after model re-registration.
  • State cleanliness check: Added a modelStateIsClean helper function to determine when a model state requires resetting.
  • Integration into service paths: Integrated calls to ReconcileRegistryModelStates within applyCoreAuthAddOrUpdate and refreshModelRegistrationForAuth service paths to proactively reconcile model states upon auth changes.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • sdk/cliproxy/auth/conductor.go
    • Added ReconcileRegistryModelStates method to clear stale model states.
    • Added modelStateIsClean helper function to check if a model state is clean.
  • sdk/cliproxy/service.go
    • Called s.coreManager.ReconcileRegistryModelStates in applyCoreAuthAddOrUpdate.
    • Called s.coreManager.ReconcileRegistryModelStates in refreshModelRegistrationForAuth (twice).
Activity
  • No human activity has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a reconciliation mechanism for model states in the auth manager to prevent stale failure states from blocking available models. The implementation is generally sound, but I've identified a potential data inconsistency issue due to an ignored error during persistence, and a minor logic simplification that would improve code clarity. The new ReconcileRegistryModelStates function is correctly integrated into the service's auth update and refresh flows.

auth.Status = StatusActive
}
auth.UpdatedAt = now
_ = m.persist(ctx, auth)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The error returned by m.persist(ctx, auth) is being ignored. If persistence fails, the in-memory state of the auth manager will be out of sync with the backing store. This could lead to inconsistent behavior or loss of state changes after a service restart. The error should be handled, for example by logging it.

            if err := m.persist(ctx, auth); err != nil {
                logEntryWithRequestID(ctx).WithField("auth_id", auth.ID).Warnf("failed to persist auth changes during model state reconciliation: %v", err)
            }

Comment on lines +279 to +282
baseModel := canonicalModelKey(modelKey)
if baseModel == "" {
baseModel = strings.TrimSpace(modelKey)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic to determine baseModel has a potentially confusing fallback. If canonicalModelKey(modelKey) returns an empty string, it falls back to strings.TrimSpace(modelKey). The supported map is keyed by canonical model keys. If canonicalModelKey fails (returns empty), it's likely that the model key is invalid or not in a form that can be canonicalized. Falling back to the raw (but trimmed) modelKey will likely result in a lookup miss in the supported map anyway if the key contains suffixes or other non-canonical parts. This makes the code harder to reason about. It would be clearer to skip reconciliation for model keys that cannot be canonicalized.

Suggested change
baseModel := canonicalModelKey(modelKey)
if baseModel == "" {
baseModel = strings.TrimSpace(modelKey)
}
baseModel := canonicalModelKey(modelKey)
if baseModel == "" {
continue
}

Copy link

@xkonjin xkonjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Summary: Clean and well-motivated fix. The reconciliation logic correctly handles the case where models reappear in the registry after a refresh but the scheduler still has stale failure states for them. A few issues worth addressing:

🔴 Bug: persist error silently swallowed

_ = m.persist(ctx, auth)

The persist call is inside the lock section and the error is discarded entirely. If persistence fails (e.g. disk I/O, serialization error), the in-memory state gets reset but the on-disk state is left stale. On restart, the old failure state would be restored, defeating the reconciliation.

Suggestion: log the error at warning level at minimum, even if you don't want to return it:

if err := m.persist(ctx, auth); err != nil {
    m.log.Warn("reconcile: persist failed", "authID", authID, "err", err)
}

🟡 Race: m.persist called under lock

Calling m.persist while holding m.mu is risky if persist does any I/O. This can block other operations that need the lock. The pattern elsewhere in this codebase (e.g. applyCoreAuthAddOrUpdate) typically clones first and persists outside the lock. Consider:

m.mu.Lock()
// ... mutate auth ...
snapshot = auth.Clone()
m.mu.Unlock()

if err := m.persist(ctx, snapshot); err != nil {
    // log
}
// then upsert scheduler

🟡 context.Background() in refreshModelRegistrationForAuth

The two new calls in service.go use context.Background() instead of propagating a real context. If the calling context has a deadline or cancellation, it won't flow through to persist. Use the caller's context where available, or at least context.WithTimeout.

💡 canonicalModelKey double-call on same modelKey

baseModel := canonicalModelKey(modelKey)
if baseModel == "" {
    baseModel = strings.TrimSpace(modelKey)
}

If canonicalModelKey already calls strings.TrimSpace internally (which it likely does for normalization), this fallback is a no-op. Worth verifying the invariants — if canonicalModelKey can legitimately return empty for a valid key, that case should be documented.

💡 No test coverage for the reconcile path

The PR lacks tests for ReconcileRegistryModelStates. Given the subtlety of the state transitions (quota reset, error reset, status back to Active), a table-driven test covering the main cases (no supported models, already-clean states, mixed clean/dirty, persist error) would add meaningful confidence.

Overall: the fix is correct in principle, but the persist-under-lock + silent error pattern needs addressing before merge.

Copy link
Collaborator

@luispater luispater left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary
This moves model-state cleanup into the auth re-registration paths, which is the right direction. However, there is still a stale-state hole when the failed model disappears or is renamed during registry refresh.

Key findings

  • Blocking: ReconcileRegistryModelStates() only resets entries for models still returned by GetModelsForClient(). If the stale failure belongs to a model that was removed or renamed during refresh, the loop skips it, changed stays false, and auth-level availability/status is never recomputed. That leaves stale Unavailable / StatusMessage / LastError on the auth even though the refreshed registry snapshot is healthy. This still affects management output and websocket availability fallback paths.

Test plan

  • Reviewed the changed paths in sdk/cliproxy/auth/conductor.go and sdk/cliproxy/service.go
  • Ran go test ./sdk/cliproxy/auth ./sdk/cliproxy/...

Follow-ups

  • I think this should either prune unsupported ModelStates during reconciliation or recompute aggregated auth state using only the currently supported registry models before persisting.

@destinoantagonista-wq
Copy link
Author

destinoantagonista-wq commented Mar 14, 2026

Updated this PR with the follow-up fixes from review.

What changed:

  • prune unsupported ModelStates during reconciliation so renamed/removed models cannot keep stale auth-level status behind
  • recompute and clear aggregated auth availability when reconciliation leaves no active model states
  • stop swallowing persist errors in the reconcile path; persist failures now log a warning with request_id/auth_id context

Validation:

  • go test ./sdk/cliproxy/auth ./sdk/cliproxy/...

I did not apply the canonicalModelKey simplification suggestion because canonicalModelKey already returns the trimmed raw model when suffix parsing does not change it, and the current branch intentionally prunes unsupported entries instead of skipping them.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e08f68ed7c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +295 to +299
if !hasModelError(auth, now) {
auth.LastError = nil
auth.StatusMessage = ""
auth.Status = StatusActive
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep disabled auths from being marked active

When ReconcileRegistryModelStates removes or resets stale ModelStates, it unconditionally sets auth.Status = StatusActive if no model errors remain. This path is now invoked from applyCoreAuthAddOrUpdate/refreshModelRegistrationForAuth after model registration, including cases where an auth is intentionally disabled (auth.Disabled == true), so a disabled auth can end up persisted/reporting as active despite still being disabled. That status regression is user-visible in management output and can break workflows that rely on status semantics; this reset should be skipped when the auth is disabled (or already StatusDisabled).

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants