A/B Testing #480
base: main
Changes from 33 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,62 @@ | ||
| # Submit76 deployment config | ||
| # Deploy with: | ||
| # ./configs/submit76/deploy-submit76.sh --live | ||
| # | ||
| # Target: mohoney@submit76.mit.edu | ||
| # Ollama must be running on submit76 at localhost:7870 with gpt-oss:120b and qwen3:32b pulled. | ||
|
|
||
| name: my_archi | ||
|
|
||
| services: | ||
| chat_app: | ||
| agent_class: CMSCompOpsAgent | ||
| agents_dir: examples/agents | ||
| default_provider: local | ||
| default_model: "qwen3:32b" | ||
| providers: | ||
| local: | ||
| enabled: true | ||
| base_url: http://localhost:7870 | ||
| mode: ollama | ||
| default_model: "qwen3:32b" | ||
| models: | ||
| - "gpt-oss:120b" | ||
| - "qwen3:32b" | ||
| port: 7865 | ||
| external_port: 7865 | ||
| ab_testing: | ||
| enabled: true | ||
| pool: | ||
| champion: default | ||
| variants: | ||
| - label: default | ||
| provider: local | ||
| model: "qwen3:32b" | ||
| - label: gpt-oss-120b | ||
| provider: local | ||
| model: "gpt-oss:120b" | ||
| postgres: | ||
| port: 5435 | ||
| data_manager: | ||
| port: 7878 | ||
| external_port: 7878 | ||
| auth: | ||
| enabled: true | ||
|
|
||
| data_manager: | ||
| sources: | ||
| jira: | ||
| enabled: true | ||
| max_tickets: 10 | ||
| url: https://its.cern.ch/jira/ | ||
| projects: | ||
| - "CMSPROD" | ||
| links: | ||
| input_lists: | ||
| - /home/submit/pmlugato/random_configs/lists/sso_git.list | ||
| redmine: | ||
| url: https://cleo.mit.edu | ||
| project: emails-to-ticket | ||
| projects: | ||
| - emails-to-ticket | ||
| embedding_name: HuggingFaceEmbeddings |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -97,6 +97,36 @@ services: | |
| - alerts:manage | ||
| ``` | ||
|
|
||
| #### `services.chat_app.auth` | ||
|
|
||
| Authentication can be enabled with SSO or basic auth. | ||
|
|
||
| For RBAC-managed admin access, use SSO plus `auth_roles`. Basic auth supports identity-only login, but it does not assign RBAC roles. | ||
|
|
||
| ```yaml | ||
| services: | ||
| chat_app: | ||
| auth: | ||
| enabled: true | ||
| basic: | ||
| enabled: true | ||
| auth_roles: | ||
| default_role: base-user | ||
| roles: | ||
| base-user: | ||
| permissions: | ||
| - chat:query | ||
| - ab:participate | ||
| ab-reviewer: | ||
| permissions: | ||
| - documents:view | ||
| - ab:view | ||
| - ab:metrics | ||
| ab-admin: | ||
| permissions: | ||
| - ab:manage | ||
| ``` | ||
|
|
||
| #### Provider Configuration | ||
|
|
||
| ```yaml | ||
|
|
@@ -257,6 +287,93 @@ data_manager: | |
|
|
||
| --- | ||
|
|
||
| ## A/B Testing Pool | ||
|
|
||
| Archi supports champion/challenger A/B testing via a server-side variant pool. When configured, the system automatically pairs the champion agent against a random challenger for each comparison. Users vote on which response is better, and aggregate metrics are tracked per variant. | ||
|
|
||
| Configure A/B testing under `services.chat_app.ab_testing`: | ||
|
|
||
| ```yaml | ||
| services: | ||
| chat_app: | ||
| ab_testing: | ||
| enabled: true | ||
| sample_rate: 0.25 | ||
| disclosure_mode: post_vote_reveal | ||
| default_trace_mode: minimal | ||
| max_pending_per_conversation: 1 | ||
| target_roles: [] | ||
| target_permissions: [] | ||
| pool: | ||
| champion: default | ||
| variants: | ||
| - label: default | ||
| agent_spec: default.md | ||
| - label: creative | ||
| agent_spec: default.md | ||
| provider: openai | ||
| model: gpt-4o | ||
| recursion_limit: 30 | ||
| - label: concise | ||
| agent_spec: concise.md | ||
| provider: anthropic | ||
| model: claude-sonnet-4-20250514 | ||
| num_documents_to_retrieve: 3 | ||
| ``` | ||
|
|
||
| `services.ab_testing` is deprecated and no longer loaded. Use `services.chat_app.ab_testing` only. | ||
|
|
||
| If `enabled: true` is set before the A/B pool is fully configured, Archi starts successfully but keeps A/B inactive until setup is completed in the admin UI. Missing champion/variants or unresolved A/B agent-spec records are surfaced as warnings instead of blocking startup. | ||
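The "warn instead of block" behaviour described above can be sketched roughly as follows. This is a minimal illustration, not Archi's actual loader: `ab_pool_warnings` and `ab_is_active` are hypothetical names, the rules are taken from the field descriptions in this document, and the check for unresolved agent-spec records in PostgreSQL is left out.

```python
from typing import Any


def ab_pool_warnings(ab_cfg: dict[str, Any]) -> list[str]:
    """Return human-readable warnings; an empty list means the pool can activate.

    Hypothetical helper: it only inspects the YAML-shaped dict and does not
    check whether agent specs resolve in the database-backed catalog.
    """
    warnings: list[str] = []
    pool = ab_cfg.get("pool") or {}
    variants = pool.get("variants") or []
    labels = [v.get("label") for v in variants if v.get("label")]

    if len(variants) < 2:
        warnings.append("at least two variants are required")
    if len(labels) != len(variants):
        warnings.append("every variant needs a label")
    if len(set(labels)) != len(labels):
        warnings.append("variant labels must be unique")
    if any(not v.get("agent_spec") for v in variants):
        warnings.append("every variant needs an agent_spec")

    champion = pool.get("champion")
    if not champion:
        warnings.append("no champion configured")
    elif champion not in labels:
        warnings.append(f"champion {champion!r} does not match any variant label")
    return warnings


def ab_is_active(ab_cfg: dict[str, Any]) -> bool:
    """A/B runs only when enabled and the pool has no outstanding warnings."""
    return bool(ab_cfg.get("enabled")) and not ab_pool_warnings(ab_cfg)
```

Under this sketch, a config with `enabled: true` and an incomplete pool yields a non-empty warning list and `ab_is_active` returns `False`, while startup itself is never interrupted.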
|
|
||
| The runtime source of truth for A/B agent specs is now PostgreSQL. The optional `ab_agents_dir` path is treated only as a legacy import source during reconciliation; runtime A/B loading never falls back to reading staged container markdown files directly. | ||
|
|
||
| New A/B specs created through the admin UI are stored in the same PostgreSQL-backed catalog. Existing A/B specs are not edited through the admin page; if you need a changed prompt or tool selection, create a new A/B spec and point the variant at that new catalog entry. | ||
|
|
||
| ### Variant Fields | ||
|
|
||
| | Field | Type | Default | Description | | ||
| |-------|------|---------|-------------| | ||
| | `label` | string | *required* | Unique human-facing variant label used in the UI and metrics | | ||
| | `agent_spec` | string | *required* | A/B agent-spec filename resolved from the database-backed A/B catalog | | ||
| | `provider` | string | `null` | Override LLM provider | | ||
| | `model` | string | `null` | Override LLM model | | ||
| | `num_documents_to_retrieve` | int | `null` | Override retriever document count | | ||
| | `recursion_limit` | int | `null` | Override agent recursion limit | | ||
|
|
||
| ### Pool Fields | ||
|
|
||
| | Field | Type | Default | Description | | ||
| |-------|------|---------|-------------| | ||
| | `enabled` | boolean | `false` | Enable the experiment pool | | ||
| | `ab_agents_dir` | string | `/root/archi/ab_agents` | Optional legacy import directory for migrating A/B markdown specs into the DB catalog | | ||
| | `sample_rate` | float | `1.0` | Fraction of eligible turns that should run A/B | | ||
| | `disclosure_mode` | string | `post_vote_reveal` | One of `blind`, `post_vote_reveal`, `named` | | ||
| | `default_trace_mode` | string | `minimal` | One of `minimal`, `normal`, `verbose` | | ||
| | `max_pending_per_conversation` | int | `1` | Maximum unresolved comparisons per conversation | | ||
| | `target_roles` | list[string] | `[]` | Restrict participation to matching RBAC roles | | ||
| | `target_permissions` | list[string] | `[]` | Restrict participation to matching permissions | | ||
|
|
||
| The `champion` field must reference an existing variant `label`, and at least two variants are required before the experiment becomes active. Variants that define only `name` (without `label`) are not supported. When a user enables A/B mode in the chat UI, the pool takes over: the champion always appears in one arm, and a randomly chosen challenger is placed in the other. Arm positions (A vs B) are randomized per comparison. | ||
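The pairing rule in the paragraph above (champion in one arm, random challenger in the other, randomized positions) could look something like the sketch below. `pick_arms` is a hypothetical helper written for illustration; the real selection code in Archi may differ, for example in how it weights challengers.

```python
import random
from typing import Optional


def pick_arms(
    champion: str,
    labels: list[str],
    rng: Optional[random.Random] = None,
) -> dict[str, str]:
    """Pair the champion with one random challenger and shuffle arm positions.

    Hypothetical sketch: challengers are drawn uniformly, which is an
    assumption and not something the documentation specifies.
    """
    if rng is None:
        rng = random.Random()
    challengers = [label for label in labels if label != champion]
    if champion not in labels or not challengers:
        raise ValueError("need the champion plus at least one other variant")

    challenger = rng.choice(challengers)
    arms = [champion, challenger]
    rng.shuffle(arms)  # randomize which variant is shown as A and which as B
    return {"A": arms[0], "B": arms[1]}
```

Passing a seeded `random.Random` makes a comparison reproducible, which can be handy when testing the vote-recording path.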
|
|
||
| ### A/B RBAC and User Preference | ||
|
|
||
| Use RBAC to separate participation, read-only review, metrics access, and write access: | ||
|
|
||
| | Permission | Purpose | | ||
| |------------|---------| | ||
| | `ab:participate` | Makes a user eligible for A/B comparisons and shows the per-user sampling slider in chat settings | | ||
| | `ab:view` | Allows read-only access to the A/B admin page | | ||
| | `ab:metrics` | Allows access to aggregate A/B metrics | | ||
| | `ab:manage` | Allows editing variants, A/B agent specs, and experiment settings | | ||
|
|
||
| `services.chat_app.ab_testing.sample_rate` remains the deployment default. Users with `ab:participate` can override that default per account with a `0..1` slider in chat settings. | ||
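One way to read the relationship between the deployment default and the per-user slider is sketched here. Both function names are hypothetical, and clamping the slider value to the `0..1` range is an assumption made for safety, not documented behaviour.

```python
import random
from typing import Optional


def effective_sample_rate(deployment_rate: float, user_override: Optional[float]) -> float:
    """Use the per-user slider value when set, else the deployment default."""
    rate = deployment_rate if user_override is None else user_override
    return min(1.0, max(0.0, rate))  # assumption: out-of-range values are clamped


def should_run_ab(rate: float, rng: Optional[random.Random] = None) -> bool:
    """Decide whether a single eligible turn becomes an A/B comparison."""
    if rng is None:
        rng = random.Random()
    return rng.random() < rate
```

With `sample_rate: 0.25` and no override, about one eligible turn in four would be sampled; a user who sets the slider to `0` would never be sampled under this sketch.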
|
|
||
| Users with `ab:view`, `ab:metrics`, or `ab:manage` can open the dedicated A/B Testing page from the data viewer and from chat settings. Users with `ab:participate` but not A/B page access still get the personal sampling slider in chat settings. | ||
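The access rules above reduce to two simple permission checks. The following is an illustrative sketch only: the helper names are hypothetical, and it assumes a user's permissions have already been resolved from their RBAC roles into a flat set of strings.

```python
# Permissions that grant access to the dedicated A/B Testing page.
AB_PAGE_PERMISSIONS = frozenset({"ab:view", "ab:metrics", "ab:manage"})


def can_open_ab_page(permissions: set[str]) -> bool:
    """Any one of the view, metrics, or manage permissions opens the page."""
    return bool(AB_PAGE_PERMISSIONS & set(permissions))


def shows_sampling_slider(permissions: set[str]) -> bool:
    """The personal sampling slider depends only on ab:participate."""
    return "ab:participate" in permissions
```

Applied to the example roles from the `auth_roles` block earlier in this document, `base-user` (`chat:query`, `ab:participate`) would get the slider but not the page, while `ab-reviewer` (`documents:view`, `ab:view`, `ab:metrics`) would get the page but no slider.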
|
|
||
| Variant metrics (wins, losses, ties) are tracked in the `ab_variant_metrics` database table and available via `GET /api/ab/metrics`. | ||
|
|
||
| --- | ||
|
|
||
| ## Agent Configuration Model | ||
|
|
||
| Archi no longer uses a top-level `archi:` block in standard deployment YAML. | ||
|
|
||