feat: Add llm-eval-harness kit#179
LLM-as-judge eval harness: scores a system prompt against a golden set across faithfulness/relevancy/correctness and gates on pass rate.
- judge + run-target Lamatic flow prompts
- Next.js app: orchestrate loop (run-target -> judge), gate banner, expandable per-case results table
- eval utils: HTML-entity decode, defensive JSON parsing, app-side gate recompute, bounded concurrency
- add tsconfig.json + postcss.config.mjs (absent in scaffold)
WIP: lamatic.config / README / agent.md metadata and exported flow files still to be finalised.
- add judge + run-target flow definitions with their prompts and model-config stubs (@references resolve)
- rewrite lamatic.config.ts, README, agent.md, and constitution for the eval-harness kit (problem/approach/results/tradeoffs)
- remove content-generation leftovers; rename app package to llm-eval-harness
- redesign as a dark eval dashboard (pass-rate ring, score chips, glassy cards)
- real-time golden-set JSON validation with inline errors; Run gated on validity
- auto-open the first failed case; stay collapsed when all cases pass
- recover scores from malformed judge JSON (e.g. unescaped quotes in reasoning)
- cap textarea growth and scroll internally
- drop next.config typescript.ignoreBuildErrors (the app type-checks clean)
- type golden-set case items instead of relying on implicit any
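The commits above mention HTML-entity decoding and recovering scores from malformed judge JSON. The following is a minimal sketch of how such a recovery step might look, not the kit's actual code: the field names mirror the three dimensions named in the description, while `JudgeScores`, `decodeEntities`, `pickScores`, and `parseJudgeOutput` are hypothetical names and the score scale is left unspecified.

```typescript
// Hypothetical shape: the PR names the three dimensions but not the exact field names or scale.
type JudgeScores = { faithfulness: number; relevancy: number; correctness: number };

const DIMENSIONS = ["faithfulness", "relevancy", "correctness"] as const;

// Decode the few HTML entities most likely to wrap a JSON payload.
function decodeEntities(text: string): string {
  return text
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&"); // last, so "&amp;quot;" is decoded only once
}

// Accept a parsed value only if all three scores are finite numbers.
function pickScores(value: unknown): JudgeScores | null {
  if (typeof value !== "object" || value === null) return null;
  const record = value as Record<string, unknown>;
  const out: Partial<JudgeScores> = {};
  for (const dim of DIMENSIONS) {
    const n = record[dim];
    if (typeof n !== "number" || !Number.isFinite(n)) return null;
    out[dim] = n;
  }
  return out as JudgeScores;
}

function parseJudgeOutput(raw: string): JudgeScores | null {
  const text = decodeEntities(raw);

  // First attempt: strict JSON between the outermost braces.
  const start = text.indexOf("{");
  const end = text.lastIndexOf("}");
  if (start !== -1 && end > start) {
    try {
      const scores = pickScores(JSON.parse(text.slice(start, end + 1)));
      if (scores) return scores;
    } catch {
      // Malformed JSON (e.g. unescaped quotes in the reasoning): fall through.
    }
  }

  // Fallback: pull each numeric score out individually with a regex.
  const recovered: Partial<JudgeScores> = {};
  for (const dim of DIMENSIONS) {
    const match = text.match(new RegExp(`"${dim}"\\s*:\\s*(-?\\d+(?:\\.\\d+)?)`));
    if (!match) return null;
    recovered[dim] = Number(match[1]);
  }
  return recovered as JudgeScores;
}
```

Returning `null` rather than throwing lets the caller decide whether an unparseable verdict counts as a failed case or an error.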
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Repository UI (base), Organization UI (inherited)
Review profile: ASSERTIVE
Plan: Pro
Run ID:
📒 Files selected for processing (4)
Walkthrough
This PR adds the
Changes
LLM Eval Harness Kit and Next.js App
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
:robot_face: AgentKit Structural Validation
New Contributions Detected
Check Results
Actionable comments posted: 26
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
kits/llm-eval-harness/apps/components/ui/use-toast.ts (1)
1-192: ⚠️ Potential issue | 🟡 Minor
Remove duplicate dead code: components/ui/use-toast.ts is unused.
This file duplicates kits/llm-eval-harness/apps/hooks/use-toast.ts identically and is never imported anywhere in the codebase. The canonical implementation lives in hooks. Delete this file to eliminate code duplication and future confusion about which path is authoritative.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@kits/llm-eval-harness/apps/components/ui/use-toast.ts` around lines 1 - 192, Delete the entire file `kits/llm-eval-harness/apps/components/ui/use-toast.ts` as it is a duplicate of the canonical implementation in `kits/llm-eval-harness/apps/hooks/use-toast.ts`. This file containing the exports of useToast, toast, and the reducer function is not imported anywhere in the codebase and creates confusion about which version is authoritative. Use the hooks directory version as the single source of truth.
kits/llm-eval-harness/apps/components/ui/use-mobile.tsx (1)
1-20: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win
Mission note: eliminate duplicate hook logic to prevent drift.
useIsMobile is duplicated here and in kits/llm-eval-harness/apps/hooks/use-mobile.ts (Lines 1-20). Keep one canonical implementation and re-export it from this UI path.
Proposed consolidation
-import * as React from 'react'
-
-const MOBILE_BREAKPOINT = 768
-
-export function useIsMobile() {
-  const [isMobile, setIsMobile] = React.useState<boolean | undefined>(undefined)
-
-  React.useEffect(() => {
-    const mql = window.matchMedia(`(max-width: ${MOBILE_BREAKPOINT - 1}px)`)
-    const onChange = () => {
-      setIsMobile(window.innerWidth < MOBILE_BREAKPOINT)
-    }
-    mql.addEventListener('change', onChange)
-    setIsMobile(window.innerWidth < MOBILE_BREAKPOINT)
-    return () => mql.removeEventListener('change', onChange)
-  }, [])
-
-  return !!isMobile
-}
+export { useIsMobile } from '@/hooks/use-mobile'
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@kits/llm-eval-harness/apps/components/ui/use-mobile.tsx` around lines 1 - 20, The useIsMobile function is duplicated across two files which risks implementation drift. Remove the useIsMobile implementation from the UI components file (use-mobile.tsx) and instead import and re-export it from the canonical location in kits/llm-eval-harness/apps/hooks/use-mobile.ts to maintain a single source of truth for this hook logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@kits/llm-eval-harness/apps/actions/orchestrate.ts`:
- Around line 3-4: Import the lamatic.config file from the parent kit directory
(../../lamatic.config) at the top of the file alongside the existing imports
from `@/lib/lamatic-client` and `@/lib/eval`. Replace all references to environment
variables for resolving judge and runTarget flow IDs (including the code block
at lines 23-30) with reads from the imported lamatic.config step definitions
instead, ensuring the flow IDs are sourced from the kit's declared configuration
rather than environment variables.
- Around line 47-60: The runEvaluation function accepts the threshold parameter
without validation, which allows NaN or out-of-range values to be passed to
computeAggregate and cause invalid gate verdicts. Add threshold validation in
the try block alongside the existing systemPrompt and cases validation, ensuring
threshold is a finite number and within a valid numeric range (such as 0 to 100
for percentage-based comparisons). Use the same error-throwing pattern as the
systemPrompt.trim() and Array.isArray(cases) checks to provide clear validation
messages.
In `@kits/llm-eval-harness/apps/app/layout.tsx`:
- Around line 6-7: The Geist and Geist_Mono font variables are being created but
not utilized in the layout. Remove the underscore prefixes from the variable
names `_geist` and `_geistMono` to indicate they are being used, then apply
these font variables to the root layout element by incorporating their variable
names into the body className or style configuration so the loaded Google Fonts
are actually applied instead of remaining unused.
- Around line 1-44: The CLAUDE.md guidelines document is out of sync with the
current codebase. Update the mission parameters in CLAUDE.md to authorize
Next.js 16.0.0 and React 19.2.0 instead of the currently listed Next.js 14-15
and React 18, since these newer versions are now the standard practice across
the kit ecosystem and are actively being used in the current deployment as
evidenced by the layout.tsx file and broader kit ecosystem confirmation.
In `@kits/llm-eval-harness/apps/app/page.tsx`:
- Line 153: The onChange handler for threshold input at line 153 allows NaN
values to corrupt the state when users type intermediate invalid characters like
"-", "e", or during empty input edits. Modify the onChange callback to check if
Number(e.target.value) produces NaN, and if it does, either skip the state
update or default to a safe value like 0 instead of allowing the NaN to
propagate through Math.max and Math.min into the setThreshold state.
- Around line 30-36: The validation loop checking parsed items is too permissive
and allows blank strings and unvalidated optional fields to pass through before
casting to GoldenCase[]. Enhance the validation condition to not only check that
input and criteria are strings but also verify they are non-empty strings.
Additionally add type validation for optional fields like id and reference if
they exist in the item object, ensuring they match the expected types defined in
GoldenCase. Apply these stricter checks within the loop condition before the
final cast to GoldenCase[] occurs at the return statement.
- Around line 20-70: Replace the manual useState-based form state management in
the EvalHarnessPage component with react-hook-form and zod. Create a Zod schema
that validates systemPrompt (non-empty string), goldenSet (valid JSON array with
objects containing input and criteria string fields), and threshold (number)
fields, then use the useForm hook with zodResolver to initialize the form.
Remove or integrate the validateGoldenSet function into the schema validation,
update the handleRun method to use the form submission handler pattern, and
modify loadExample to use form.setValue instead of individual setters. This
ensures the form follows project standards and maintains validation consistency
with server expectations.
In `@kits/llm-eval-harness/apps/components/results-table.tsx`:
- Around line 51-55: Add accessibility attributes to expose the row expansion
state to assistive technology. On the toggle button element, add the
aria-expanded attribute set to the isOpen boolean value to indicate whether the
row is currently expanded, and add an aria-controls attribute with a unique
identifier string. Then locate the corresponding details panel for this row
(referenced in lines 86-94) and add an id attribute to it that matches the value
used in the button's aria-controls attribute. This enables screen readers to
track the open and closed states of expandable rows.
- Line 50: The Fragment key at line 50 uses `result.case.id ?? index` which
causes collisions when IDs are null or repeat across rows, leading to state
reuse and incorrect expansion behavior. Replace the key with a collision-proof
value that combines both the ID and index to ensure uniqueness, such as
concatenating result.case.id with the index parameter, so that even when IDs
repeat or are undefined, each row maintains its own distinct key and state.
In `@kits/llm-eval-harness/apps/components/theme-provider.tsx`:
- Around line 1-11: The kit app in kits/llm-eval-harness/apps/ declares Next.js
16.0.0 and React 19.2.0, which violate the repository guideline requiring
Next.js 14–15 and React 18. Update the package.json dependencies to downgrade
next to a compliant version within the 14–15 range and react to version 18,
while keeping all other dependencies (Tailwind CSS v4+, react-hook-form, zod,
lucide-react, lamatic) aligned as they currently are.
In `@kits/llm-eval-harness/apps/components/ui/breadcrumb.tsx`:
- Around line 52-62: The BreadcrumbPage function incorrectly assigns role="link"
to a non-interactive span element, which creates an accessibility breach because
spans are not keyboard-navigable by default and assistive technology expects
link roles to be focusable and activatable. Remove the role="link" attribute
from the span since the aria-current="page" attribute already provides
sufficient semantic information to screen readers to indicate the current page,
and a non-interactive span is the correct choice here.
In `@kits/llm-eval-harness/apps/components/ui/button.tsx`:
- Around line 39-58: The Button function component does not handle the ref prop,
which breaks when CalendarDayButton attempts to pass a ref for focus management.
To fix this for React 19, add ref as a destructured parameter in the Button
function signature alongside asChild, and then pass it directly to the Comp
component using the ref attribute. This allows ref to be treated as a standard
prop without needing forwardRef.
In `@kits/llm-eval-harness/apps/components/ui/carousel.tsx`:
- Around line 96-105: In the React.useEffect hook, the event listener registered
with api.on('reInit', onSelect) is missing cleanup in the return function. Add a
corresponding api.off('reInit', onSelect) call in the cleanup function alongside
the existing api.off('select', onSelect) call to ensure both event listeners are
properly removed when the component unmounts or dependencies change. This
prevents stale handlers from being retained and avoids duplicate state updates
across lifecycle transitions.
In `@kits/llm-eval-harness/apps/components/ui/chart.tsx`:
- Around line 235-239: The conditional check `item.value &&` in the tooltip
render logic treats 0 as falsy, causing zero-valued series to be hidden from the
tooltip display. Replace the truthy check with an explicit null/undefined check
instead, such as checking if `item.value != null` or `typeof item.value !==
'undefined'`, to ensure that legitimate zero values are rendered in the span
element alongside the toLocaleString() formatting.
In `@kits/llm-eval-harness/apps/components/ui/form.tsx`:
- Around line 45-66: The useFormField hook accesses fieldContext.name before
validating that fieldContext exists, and the validation check itself is flawed
because the context default value is an empty object which is always truthy.
Move the validation check for fieldContext to the beginning of the hook, before
any context properties are accessed, and update the validation logic to check if
fieldContext.name is defined rather than just checking if fieldContext exists,
since an empty object will not trigger the current falsy check. This ensures the
hook properly detects when it is called outside the FormField component wrapper.
In `@kits/llm-eval-harness/apps/components/ui/kbd.tsx`:
- Around line 18-26: The KbdGroup component is currently rendering a `<kbd>`
element, which creates semantically incorrect nested kbd tags when it contains
child Kbd components. Replace the `<kbd>` element in the KbdGroup function with
a `<div>` or `<span>` element instead, while keeping all other attributes
(data-slot, className with the cn() utility, and props spreading) exactly the
same. This preserves the visual grouping behavior while maintaining proper HTML
semantics.
In `@kits/llm-eval-harness/apps/components/ui/menubar.tsx`:
- Around line 227-240: The MenubarSubTrigger component is using the Tailwind
utility class outline-none in its className string, but other interactive
elements in this file (referenced at lines 59, 106, 124, and 148) use
outline-hidden instead. Replace outline-none with outline-hidden in the
className prop of the MenubarPrimitive.SubTrigger to maintain consistency across
the component and align with Tailwind v4 best practices for complete outline
removal on custom focus-styled components.
In `@kits/llm-eval-harness/apps/components/ui/pagination.tsx`:
- Around line 102-117: The PaginationEllipsis function has a contradiction where
the parent span element is marked with aria-hidden but contains a child span
with sr-only class that provides "More pages" text for screen readers. When a
parent is aria-hidden, assistive technologies cannot read its children. Remove
the aria-hidden attribute from the outer span element in PaginationEllipsis so
that the screen reader text inside the sr-only span will be properly announced
to users relying on assistive technology.
In `@kits/llm-eval-harness/apps/components/ui/table.tsx`:
- Around line 42-52: In the TableFooter function, the arbitrary Tailwind variant
in the className prop contains an invalid CSS pseudo-class selector. Replace the
incorrect `:last` pseudo-class with `:last-child` in the string
`[&>tr]:last:border-b-0` to ensure the border-bottom utility is properly applied
only to the last table row. The selector should read
`[&>tr]:last-child:border-b-0` to correctly target the final child element and
remove its bottom border.
In `@kits/llm-eval-harness/apps/components/ui/toggle.tsx`:
- Around line 9-29: The toggleVariants constant definition contains outline-none
in its base variant string, but Tailwind v4 changed this utility to only remove
the outline-style property rather than the entire outline. Locate the
outline-none utility in the base variant string of the toggleVariants constant
and replace it with outline-hidden to ensure complete outline removal and
maintain proper visual presentation in Tailwind v4.
In `@kits/llm-eval-harness/apps/components/ui/tooltip.tsx`:
- Around line 21-29: The Tooltip component wraps its content in a
TooltipProvider, but since sidebar.tsx already has a TooltipProvider at the
SidebarProvider level, this creates nested providers which is inefficient.
Export the base TooltipPrimitive.Root as a separate named export (such as
TooltipRoot) in the tooltip.tsx file alongside the existing Tooltip component,
then update sidebar.tsx to use this TooltipRoot component directly instead of
the Tooltip component to avoid the unnecessary provider nesting while still
maintaining tooltip functionality.
In `@kits/llm-eval-harness/apps/hooks/use-toast.ts`:
- Around line 174-183: The useEffect hook in the useToast function has an
incorrect dependency array that includes state, causing unnecessary
re-subscriptions on every state change. Since the effect only uses the stable
setState function from React.useState and never actually references the state
value, change the dependency array from [state] to an empty array []. This
prevents the listener from being unsubscribed and re-subscribed on state
changes, eliminating the race window where dispatched updates could be missed.
In `@kits/llm-eval-harness/apps/lib/eval.ts`:
- Around line 123-141: The mapWithConcurrency function lacks validation for the
limit parameter, which can accept 0, negative, or NaN values. When limit is
invalid, Math.min(limit, items.length) will create zero workers, causing the
function to return an array with uninitialized slots without throwing an error.
Add an explicit check at the start of the mapWithConcurrency function to
validate that limit is a positive integer, and throw an appropriate error if the
validation fails before proceeding with worker creation.
In `@kits/llm-eval-harness/apps/next.config.mjs`:
- Around line 3-5: The images configuration object in next.config.mjs has
unoptimized set to true, which disables Next.js's automatic image optimization.
Verify the actual deployment target for this kit - if it will be deployed to
Vercel or a similar platform that supports automatic image optimization,
consider removing the unoptimized: true setting to enable performance
optimization. If the kit is intended for static export or a specific platform
that requires this setting, add a comment documenting the reason for disabling
image optimization.
In `@kits/llm-eval-harness/apps/package.json`:
- Line 53: The lamatic dependency in package.json is pinned to "latest" which
creates non-deterministic builds and risks pulling breaking changes. Replace the
"latest" value for the lamatic package with a specific version number or a
version range using semantic versioning (for example, using caret notation like
"^1.2.3" to allow compatible minor and patch updates while locking the major
version). Check the current resolved version in
node_modules/lamatic/package.json to determine the appropriate version to pin
to.
In `@kits/llm-eval-harness/prompts/judge_system.md`:
- Around line 50-59: The JSON code block in judge_system.md starting at line 50
is missing a language fence specification, which violates markdownlint rule
MD040. Add `json` immediately after the opening triple backticks (change ``` to
```json) to specify the code block language and resolve the linting error.
---
Outside diff comments:
In `@kits/llm-eval-harness/apps/components/ui/use-mobile.tsx`:
- Around line 1-20: The useIsMobile function is duplicated across two files
which risks implementation drift. Remove the useIsMobile implementation from the
UI components file (use-mobile.tsx) and instead import and re-export it from the
canonical location in kits/llm-eval-harness/apps/hooks/use-mobile.ts to maintain
a single source of truth for this hook logic.
In `@kits/llm-eval-harness/apps/components/ui/use-toast.ts`:
- Around line 1-192: Delete the entire file
`kits/llm-eval-harness/apps/components/ui/use-toast.ts` as it is a duplicate of
the canonical implementation in `kits/llm-eval-harness/apps/hooks/use-toast.ts`.
This file containing the exports of useToast, toast, and the reducer function is
not imported anywhere in the codebase and creates confusion about which version
is authoritative. Use the hooks directory version as the single source of truth.
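For the `mapWithConcurrency` finding listed above (lib/eval.ts), a bounded-concurrency mapper with an up-front limit check could look roughly like the sketch below. This assumes a worker-pool shape similar to what the review describes (`Math.min(limit, items.length)` workers); the real function's signature and error type are not shown in the review, so both are assumptions here.

```typescript
// Sketch only: signature and error type are assumed, not taken from the kit.
function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  // Reject 0, negatives, fractions, and NaN before any worker is created;
  // otherwise zero workers would run and the result slots would stay empty.
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError(`limit must be a positive integer, got ${limit}`);
  }

  const results = new Array<R>(items.length);
  let next = 0; // shared cursor: each worker claims the next unprocessed index

  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++;
      results[index] = await fn(items[index], index);
    }
  };

  const workers = Array.from({ length: Math.min(limit, items.length) }, () => worker());
  return Promise.all(workers).then(() => results);
}
```

Because the check runs synchronously before the promise is created, an invalid limit surfaces as a thrown `RangeError` at the call site rather than as a silently sparse array.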
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI (base), Organization UI (inherited)
Review profile: ASSERTIVE
Plan: Pro
Run ID: 23a86a12-90c8-44ec-83a2-f10ae8469787
⛔ Files ignored due to path filters (1)
kits/llm-eval-harness/apps/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (91)
kits/llm-eval-harness/.gitignore
kits/llm-eval-harness/README.md
kits/llm-eval-harness/agent.md
kits/llm-eval-harness/apps/.env.example
kits/llm-eval-harness/apps/.gitignore
kits/llm-eval-harness/apps/README.md
kits/llm-eval-harness/apps/actions/orchestrate.ts
kits/llm-eval-harness/apps/app/globals.css
kits/llm-eval-harness/apps/app/layout.tsx
kits/llm-eval-harness/apps/app/page.tsx
kits/llm-eval-harness/apps/components.json
kits/llm-eval-harness/apps/components/gate-banner.tsx
kits/llm-eval-harness/apps/components/results-table.tsx
kits/llm-eval-harness/apps/components/theme-provider.tsx
kits/llm-eval-harness/apps/components/ui/accordion.tsx
kits/llm-eval-harness/apps/components/ui/alert-dialog.tsx
kits/llm-eval-harness/apps/components/ui/alert.tsx
kits/llm-eval-harness/apps/components/ui/aspect-ratio.tsx
kits/llm-eval-harness/apps/components/ui/avatar.tsx
kits/llm-eval-harness/apps/components/ui/badge.tsx
kits/llm-eval-harness/apps/components/ui/breadcrumb.tsx
kits/llm-eval-harness/apps/components/ui/button-group.tsx
kits/llm-eval-harness/apps/components/ui/button.tsx
kits/llm-eval-harness/apps/components/ui/calendar.tsx
kits/llm-eval-harness/apps/components/ui/card.tsx
kits/llm-eval-harness/apps/components/ui/carousel.tsx
kits/llm-eval-harness/apps/components/ui/chart.tsx
kits/llm-eval-harness/apps/components/ui/checkbox.tsx
kits/llm-eval-harness/apps/components/ui/collapsible.tsx
kits/llm-eval-harness/apps/components/ui/command.tsx
kits/llm-eval-harness/apps/components/ui/context-menu.tsx
kits/llm-eval-harness/apps/components/ui/dialog.tsx
kits/llm-eval-harness/apps/components/ui/drawer.tsx
kits/llm-eval-harness/apps/components/ui/dropdown-menu.tsx
kits/llm-eval-harness/apps/components/ui/empty.tsx
kits/llm-eval-harness/apps/components/ui/field.tsx
kits/llm-eval-harness/apps/components/ui/form.tsx
kits/llm-eval-harness/apps/components/ui/hover-card.tsx
kits/llm-eval-harness/apps/components/ui/input-group.tsx
kits/llm-eval-harness/apps/components/ui/input-otp.tsx
kits/llm-eval-harness/apps/components/ui/input.tsx
kits/llm-eval-harness/apps/components/ui/item.tsx
kits/llm-eval-harness/apps/components/ui/kbd.tsx
kits/llm-eval-harness/apps/components/ui/label.tsx
kits/llm-eval-harness/apps/components/ui/menubar.tsx
kits/llm-eval-harness/apps/components/ui/navigation-menu.tsx
kits/llm-eval-harness/apps/components/ui/pagination.tsx
kits/llm-eval-harness/apps/components/ui/popover.tsx
kits/llm-eval-harness/apps/components/ui/progress.tsx
kits/llm-eval-harness/apps/components/ui/radio-group.tsx
kits/llm-eval-harness/apps/components/ui/resizable.tsx
kits/llm-eval-harness/apps/components/ui/scroll-area.tsx
kits/llm-eval-harness/apps/components/ui/select.tsx
kits/llm-eval-harness/apps/components/ui/separator.tsx
kits/llm-eval-harness/apps/components/ui/sheet.tsx
kits/llm-eval-harness/apps/components/ui/sidebar.tsx
kits/llm-eval-harness/apps/components/ui/skeleton.tsx
kits/llm-eval-harness/apps/components/ui/slider.tsx
kits/llm-eval-harness/apps/components/ui/sonner.tsx
kits/llm-eval-harness/apps/components/ui/spinner.tsx
kits/llm-eval-harness/apps/components/ui/switch.tsx
kits/llm-eval-harness/apps/components/ui/table.tsx
kits/llm-eval-harness/apps/components/ui/tabs.tsx
kits/llm-eval-harness/apps/components/ui/textarea.tsx
kits/llm-eval-harness/apps/components/ui/toast.tsx
kits/llm-eval-harness/apps/components/ui/toaster.tsx
kits/llm-eval-harness/apps/components/ui/toggle-group.tsx
kits/llm-eval-harness/apps/components/ui/toggle.tsx
kits/llm-eval-harness/apps/components/ui/tooltip.tsx
kits/llm-eval-harness/apps/components/ui/use-mobile.tsx
kits/llm-eval-harness/apps/components/ui/use-toast.ts
kits/llm-eval-harness/apps/hooks/use-mobile.ts
kits/llm-eval-harness/apps/hooks/use-toast.ts
kits/llm-eval-harness/apps/lib/eval.ts
kits/llm-eval-harness/apps/lib/lamatic-client.ts
kits/llm-eval-harness/apps/lib/types.ts
kits/llm-eval-harness/apps/lib/utils.ts
kits/llm-eval-harness/apps/next.config.mjs
kits/llm-eval-harness/apps/package.json
kits/llm-eval-harness/apps/postcss.config.mjs
kits/llm-eval-harness/apps/tsconfig.json
kits/llm-eval-harness/constitutions/default.md
kits/llm-eval-harness/flows/judge.ts
kits/llm-eval-harness/flows/run-target.ts
kits/llm-eval-harness/lamatic.config.ts
kits/llm-eval-harness/model-configs/judge.ts
kits/llm-eval-harness/model-configs/run-target.ts
kits/llm-eval-harness/prompts/judge_system.md
kits/llm-eval-harness/prompts/judge_user.md
kits/llm-eval-harness/prompts/run-target_system.md
kits/llm-eval-harness/prompts/run-target_user.md
  React.useEffect(() => {
    listeners.push(setState)
    return () => {
      const index = listeners.indexOf(setState)
      if (index > -1) {
        listeners.splice(index, 1)
      }
    }
  }, [state])
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Find use-toast effects that re-subscribe on state:"
rg -n --type=ts --type=tsx '\}, \[state\]\)' kits/llm-eval-harness/apps/hooks/use-toast.ts kits/llm-eval-harness/apps/components/ui/use-toast.ts -C3
Repository: Lamatic/AgentKit
Length of output: 157
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check if file exists
if [ -f "kits/llm-eval-harness/apps/hooks/use-toast.ts" ]; then
  echo "=== File found, reading lines 174-200 ==="
  sed -n '174,200p' kits/llm-eval-harness/apps/hooks/use-toast.ts | cat -n
  echo ""
  echo "=== Full file context (first 50 lines) ==="
  head -50 kits/llm-eval-harness/apps/hooks/use-toast.ts | cat -n
else
  echo "File not found at expected path. Searching for use-toast files:"
  find . -name "*use-toast*" -type f 2>/dev/null || echo "No files found"
fi
Repository: Lamatic/AgentKit
Length of output: 1996
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Search for the exact pattern with simpler approach
echo "=== Searching for useEffect with state dependency ==="
rg -n "useEffect" kits/llm-eval-harness/apps/hooks/ -A 5 -B 1
echo ""
echo "=== Listing all files in hooks directory ==="
find kits/llm-eval-harness/apps/hooks -type f 2>/dev/null || echo "Directory not found"
Repository: Lamatic/AgentKit
Length of output: 1376
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Full use-toast hook around the useEffect ==="
sed -n '150,192p' kits/llm-eval-harness/apps/hooks/use-toast.ts | cat -n
Repository: Lamatic/AgentKit
Length of output: 1155
Mission critical: eliminate listener subscription churn that creates notification gaps.
Your useToast hook re-subscribes the listener on every state change (line 182). Since setState from React.useState is stable and the effect never actually uses the state value—only setState—the dependency array should be empty. Each unnecessary re-run unsubscribes and re-subscribes the same listener, creating a race window where dispatched updates could be missed before re-subscription completes.
Suggested patch
   React.useEffect(() => {
     listeners.push(setState)
     return () => {
       const index = listeners.indexOf(setState)
       if (index > -1) {
         listeners.splice(index, 1)
       }
     }
-  }, [state])
+  }, [])
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@kits/llm-eval-harness/apps/hooks/use-toast.ts` around lines 174 - 183, The
useEffect hook in the useToast function has an incorrect dependency array that
includes state, causing unnecessary re-subscriptions on every state change.
Since the effect only uses the stable setState function from React.useState and
never actually references the state value, change the dependency array from
[state] to an empty array []. This prevents the listener from being unsubscribed
and re-subscribed on state changes, eliminating the race window where dispatched
updates could be missed.
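The subscribe-once pattern this comment argues for can be illustrated outside React. The sketch below is a framework-free approximation under stated assumptions: `createStore`, `subscribe`, `dispatch`, and `listenerCount` are hypothetical names, not the hook's API. It shows a listener registered once and removed by the returned cleanup function, which is what an effect with an empty dependency array would do for the lifetime of the component.

```typescript
type Listener<S> = (state: S) => void;

// Hypothetical minimal store; the real hook keeps a module-level listeners array.
function createStore<S>(initial: S) {
  let state = initial;
  const listeners: Listener<S>[] = [];

  return {
    // Register once; the returned function plays the role of the effect cleanup.
    subscribe(listener: Listener<S>): () => void {
      listeners.push(listener);
      return () => {
        const index = listeners.indexOf(listener);
        if (index > -1) {
          listeners.splice(index, 1);
        }
      };
    },
    dispatch(next: S): void {
      state = next;
      // Iterate over a copy so a listener that unsubscribes during dispatch
      // does not shift the array under the loop.
      for (const listener of listeners.slice()) {
        listener(state);
      }
    },
    listenerCount: (): number => listeners.length,
  };
}
```

In the hook itself, the effect body would correspond to `subscribe(setState)` and the effect's return value to the unsubscribe function.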
…fig-driven flows)
- remove ~50 unused shadcn boilerplate components + hooks + theme-provider (keep only button/input/label/textarea that are actually used)
- resolve flow IDs from lamatic.config step definitions (widen turbopack root so the orchestrate action can import ../../lamatic.config)
- validate threshold (server action + input) and the concurrency limit
- stricter golden-set validation: non-empty input/criteria, typed optional fields
- results rows: aria-expanded/aria-controls + collision-proof keys
- Button forwards ref (React 19); wire Geist fonts; pin lamatic ^0.3.2; json-fence the judge prompt; drop unused images.unoptimized
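The threshold validation mentioned in this commit (server action plus input) might be sketched as two small helpers. This is an illustration only: `assertValidThreshold` and `nextThreshold` are hypothetical names, and the 0 to 100 range follows the review's "such as 0 to 100" suggestion rather than anything confirmed in the code.

```typescript
// Server-side guard (assumed range 0-100): reject anything that is not a finite in-range number.
function assertValidThreshold(threshold: unknown): number {
  if (typeof threshold !== "number" || !Number.isFinite(threshold)) {
    throw new Error("threshold must be a finite number");
  }
  if (threshold < 0 || threshold > 100) {
    throw new Error("threshold must be between 0 and 100");
  }
  return threshold;
}

// Input-side guard: keep the previous value while the user is mid-edit
// ("", "-", "e"), otherwise clamp the parsed number into range.
function nextThreshold(rawInput: string, previous: number): number {
  const parsed = Number(rawInput);
  // Number("") is 0, so blank input needs its own check.
  if (rawInput.trim() === "" || !Number.isFinite(parsed)) {
    return previous;
  }
  return Math.min(100, Math.max(0, parsed));
}
```

An `onChange` handler could then call something like `setThreshold((prev) => nextThreshold(e.target.value, prev))`, so NaN never reaches state.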
Thanks for the detailed review — addressed in Most of the UI-component findings are resolved by removing the unused shadcn boilerplate: this kit only ships the four components it actually uses ( Also fixed:
Two I intentionally left, happy to change if you'd prefer otherwise:
♻️ Duplicate comments (1)
kits/llm-eval-harness/apps/app/page.tsx (1)
20-83: 📐 Maintainability & Code Quality | 🟠 Major | 🏗️ Heavy lift
Mission directive: replace manual form control with react-hook-form + zod before launch.
Lines 20-83 still rely on hand-rolled validation/state (validateGoldenSet + multiple useStates). That leaves this form outside the required app form contract and increases drift risk versus server-side rules. As per coding guidelines, kits/*/apps/**/*.{ts,tsx,js,jsx} must use react-hook-form + zod for forms.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@kits/llm-eval-harness/apps/app/page.tsx` around lines 20 - 83, Replace the manual form validation and state management in the EvalHarnessPage component with react-hook-form and zod. Remove the validateGoldenSet function and the multiple useState declarations for systemPrompt, goldenSet, threshold, etc., then create a zod schema that encapsulates the validation logic currently in validateGoldenSet (checking for non-empty strings, valid JSON array format, required fields like input and criteria, and optional fields like reference and id). Use useForm hook from react-hook-form with the zod schema via zodResolver to manage form state and validation, replacing the manual validation logic in handleRun and the useMemo call with the form's built-in validation methods.
Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@kits/llm-eval-harness/apps/app/page.tsx`:
- Around line 20-83: Replace the manual form validation and state management in
the EvalHarnessPage component with react-hook-form and zod. Remove the
validateGoldenSet function and the multiple useState declarations for
systemPrompt, goldenSet, threshold, etc., then create a zod schema that
encapsulates the validation logic currently in validateGoldenSet (checking for
non-empty strings, valid JSON array format, required fields like input and
criteria, and optional fields like reference and id). Use useForm hook from
react-hook-form with the zod schema via zodResolver to manage form state and
validation, replacing the manual validation logic in handleRun and the useMemo
call with the form's built-in validation methods.
⛔ Files ignored due to path filters (1)
- kits/llm-eval-harness/apps/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (11)
- kits/llm-eval-harness/apps/actions/orchestrate.ts
- kits/llm-eval-harness/apps/app/globals.css
- kits/llm-eval-harness/apps/app/layout.tsx
- kits/llm-eval-harness/apps/app/page.tsx
- kits/llm-eval-harness/apps/components/results-table.tsx
- kits/llm-eval-harness/apps/components/ui/button.tsx
- kits/llm-eval-harness/apps/lib/eval.ts
- kits/llm-eval-harness/apps/lib/lamatic-client.ts
- kits/llm-eval-harness/apps/next.config.mjs
- kits/llm-eval-harness/apps/package.json
- kits/llm-eval-harness/prompts/judge_system.md
💤 Files with no reviewable changes (1)
- kits/llm-eval-harness/apps/lib/lamatic-client.ts
- replace manual useState/validateGoldenSet with useForm + zodResolver
- zod schema validates systemPrompt, golden-set JSON (array of cases), threshold
- Input/Textarea forward ref so RHF register attaches correctly
- preserves the live validation UX (status pill, inline errors, gated Run button)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@kits/llm-eval-harness/apps/app/page.tsx`:
- Around line 21-23: The `input` and `criteria` validation in the schema uses
`z.string().min(1)`, which accepts whitespace-only strings like `" "`, so
semantically empty cases pass through and waste evaluation runs. Update both the
`input` field and `criteria` field validation to ensure they contain actual
non-whitespace content by either using `.trim().min(1)` to remove and validate
whitespace, or adding a `.refine()` check that explicitly validates the string
contains at least one non-whitespace character after trimming.
In `@kits/llm-eval-harness/apps/components/ui/input.tsx`:
- Around line 5-8: The Input component accepts ref as a destructured prop,
which only works in React 19 and conflicts with the React 18 guideline
requirement. Resolve this by first confirming the target React version
with your team. If React 18 is enforced, wrap the Input component with
React.forwardRef, remove ref from the function parameter destructuring, accept
it as the second parameter in the forwardRef callback, and pass it to the input
element. Additionally, ensure package.json React version matches the confirmed
target. If React 19 adoption is approved, update the project guidelines to
reflect the actual React version being used.
📒 Files selected for processing (3)
- kits/llm-eval-harness/apps/app/page.tsx
- kits/llm-eval-harness/apps/components/ui/input.tsx
- kits/llm-eval-harness/apps/components/ui/textarea.tsx
- convert Button/Input/Textarea to React.forwardRef (compatible with React 18 and 19; lets RHF register attach a ref without version-specific patterns)
- zod schema: trim input/criteria so whitespace-only cases are rejected
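The whitespace finding can be reproduced without zod. The sketch below is a hand-rolled check, not the kit's actual schema: `GoldenCaseInput`, `isNonEmptyNaive`, `isNonEmptyTrimmed`, and `validateCase` are hypothetical names, and only the field names (`input`, `criteria`, `reference`, `id`) come from the review. It illustrates why a length-only check accepts `"   "` while a trim-then-check rejects it.

```typescript
// Hypothetical shape of one golden-set case. Field names follow the review;
// everything else in this block is an assumption for illustration.
interface GoldenCaseInput {
  input: string;
  criteria: string;
  reference?: string;
  id?: string;
}

// Length-only check, similar in spirit to a bare min(1): whitespace passes.
function isNonEmptyNaive(value: string): boolean {
  return value.length >= 1;
}

// Trim first, then check length: whitespace-only strings are rejected.
function isNonEmptyTrimmed(value: string): boolean {
  return value.trim().length >= 1;
}

// Returns human-readable errors; an empty list means the case is valid.
function validateCase(c: GoldenCaseInput, index: number): string[] {
  const errors: string[] = [];
  if (!isNonEmptyTrimmed(c.input)) errors.push(`case ${index}: "input" is empty`);
  if (!isNonEmptyTrimmed(c.criteria)) errors.push(`case ${index}: "criteria" is empty`);
  return errors;
}

const blank: GoldenCaseInput = { input: "   ", criteria: "Must cite the refund policy" };
console.log(isNonEmptyNaive(blank.input)); // the naive check lets the blank input through
console.log(validateCase(blank, 0));       // the trimmed check reports "input" as empty
```

Rejecting such cases at validation time avoids spending a target call and a judge call on an input that carries no content.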
/validate
📡 Running Studio validation; results will appear here shortly.
Studio Runtime Validation (Phase 2)
✅ Studio validation passed. The kit loaded successfully in Lamatic Studio. This PR is ready for final review and merge.
@coderabbitai review
✅ Action performed
Review finished.
What this adds
An LLM Eval Harness kit — it scores an LLM system prompt against a golden set using an LLM-as-judge, then applies a CI-style pass/fail gate. Submitted for the agentkit-challenge.
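As a rough illustration of that loop, the sketch below runs each golden-set case through a target step and then a judge step with bounded concurrency. It is not the kit's `orchestrate.ts`: `runTarget` and `judge` are local stubs standing in for the Lamatic flows, and `Case`, `Verdict`, `mapWithConcurrency`, and `evaluate` are hypothetical names.

```typescript
// Hypothetical types and stubs: the real kit calls Lamatic flows here.
type Case = { id: string; input: string; criteria: string };
type Verdict = { id: string; output: string; pass: boolean };

// Stand-in for the run-target flow (assumption: returns the answer as text).
async function runTarget(systemPrompt: string, input: string): Promise<string> {
  return `[${systemPrompt}] ${input}`;
}

// Stand-in for the judge flow (assumption: a toy substring rule, not an LLM call).
async function judge(output: string, criteria: string): Promise<boolean> {
  return output.includes(criteria);
}

// Run `worker` over `items` with at most `limit` calls in flight.
// Results keep the order of `items`.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claimed synchronously, so no two lanes share an index
      results[i] = await worker(items[i]);
    }
  }
  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// For each case: run the prompt under test, then judge its output.
async function evaluate(systemPrompt: string, cases: Case[], limit = 3): Promise<Verdict[]> {
  return mapWithConcurrency(cases, limit, async (c) => {
    const output = await runTarget(systemPrompt, c.input);
    const pass = await judge(output, c.criteria);
    return { id: c.id, output, pass };
  });
}
```

Capping the number of in-flight cases keeps a large golden set from firing every flow call at once, which matters when the provider rate-limits requests.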
Problem
Tweaking a prompt or swapping a model can silently regress output quality — hallucinations, over-promising, drift — and you usually find out in production. This makes that quality measurable and repeatable.
How it works
- `run-target` flow runs the system-prompt-under-test on each golden-set case input.
- `judge` flow (LLM-as-judge) scores the output on faithfulness, relevancy, and correctness (0–5). Faithfulness is a veto, so a fluent-but-hallucinated answer can't pass.
Stack
Lamatic flows (Groq `llama-3.3-70b-versatile`) · Next.js 16 / React 19 · Tailwind v4 · shadcn/ui. Verified locally end-to-end; `npm run build` type-checks clean.
Notes / tradeoffs
Label: `agentkit-challenge`
Files Added
Kit / Lamatic (15 files)
- `kits/llm-eval-harness/.gitignore`: ignore local Lamatic/Node artifacts and env files
- `kits/llm-eval-harness/README.md`: kit overview, judge scoring + gate logic, usage, and assumptions (single-provider v1 via Groq)
- `kits/llm-eval-harness/agent.md`: detailed “quality-gate” agent docs for `run-target` + `judge`, JSON I/O expectations, and failure modes
- `kits/llm-eval-harness/constitutions/default.md`: judge constitution (strict JSON-only scoring, faithfulness veto, determinism, no answer rewriting)
- `kits/llm-eval-harness/lamatic.config.ts`: kit config wiring required steps `judge`/`run-target` to env flow IDs
- `kits/llm-eval-harness/flows/judge.ts`: LLM-as-judge flow: trigger → judge LLM → response mapping to `{ answer }`
- `kits/llm-eval-harness/flows/run-target.ts`: system-under-test flow: trigger → target LLM → response mapping to `{ answer }`
- `kits/llm-eval-harness/model-configs/judge.ts`: judge model configuration wiring
- `kits/llm-eval-harness/model-configs/run-target.ts`: run-target model configuration wiring
- `kits/llm-eval-harness/prompts/judge_system.md`: judge rubric + strict JSON output contract (`faithfulness`, `relevancy`, `correctness`, `overall`, `pass`, `reasoning`)
- `kits/llm-eval-harness/prompts/judge_user.md`: judge prompt template (INPUT/OUTPUT/CRITERIA/REFERENCE sections)
- `kits/llm-eval-harness/prompts/run-target_system.md`: run-target system prompt template (injects `systemPrompt`)
- `kits/llm-eval-harness/prompts/run-target_user.md`: run-target user prompt template (injects golden `input`)
- `kits/llm-eval-harness/apps/.env.example`: env placeholders for `JUDGE_FLOW`, `RUN_TARGET_FLOW`, and Lamatic Studio credentials
- `kits/llm-eval-harness/flows/**` and prompt/model-configs together define the evaluation pipeline used by the app
Next.js app (selected files listed from provided changes)
- `kits/llm-eval-harness/apps/.gitignore`: ignore Next.js/Node build artifacts and local env files
- `kits/llm-eval-harness/apps/README.md`: app setup + env vars + internal module/component map
- `kits/llm-eval-harness/apps/actions/orchestrate.ts`: server action to run `run-target` then `judge` for each golden case, aggregate results, and apply threshold gating
- `kits/llm-eval-harness/apps/app/globals.css`: Tailwind theme tokens + base styles (Geist-themed)
- `kits/llm-eval-harness/apps/app/layout.tsx`: root layout (metadata, fonts, analytics)
- `kits/llm-eval-harness/apps/app/page.tsx`: evaluation UI: validate system prompt + golden set JSON, run evaluation, and render gate/table
- `kits/llm-eval-harness/apps/components.json`: shadcn/ui + Tailwind/RSC configuration
- `kits/llm-eval-harness/apps/components/gate-banner.tsx`: “GATE PASSED/FAILED” banner with pass-rate ring + stats
- `kits/llm-eval-harness/apps/components/results-table.tsx`: expandable per-case results table (scores, reasoning, output)
- `kits/llm-eval-harness/apps/components/ui/button.tsx`: shadcn Button (ref-forwarding for React 19)
- `kits/llm-eval-harness/apps/components/ui/input.tsx`: shadcn Input (ref-forwarding)
- `kits/llm-eval-harness/apps/components/ui/label.tsx`: shadcn Label wrapper
- `kits/llm-eval-harness/apps/components/ui/textarea.tsx`: shadcn Textarea (ref-forwarding)
- `kits/llm-eval-harness/apps/lib/types.ts`: `GoldenCase`, `JudgeResult`, `CaseResult`, `RunAggregate`
- `kits/llm-eval-harness/apps/lib/eval.ts`: judge output normalization/parsing, concurrency helper, aggregation, and sample golden set
- `kits/llm-eval-harness/apps/lib/lamatic-client.ts`: cached Lamatic client initialization from env vars
- `kits/llm-eval-harness/apps/lib/utils.ts`: `cn()` class-name merge helper
- `kits/llm-eval-harness/apps/next.config.mjs`: Next/Turbopack config
- `kits/llm-eval-harness/apps/package.json`: app dependencies/scripts
- `kits/llm-eval-harness/apps/postcss.config.mjs`: Tailwind PostCSS config
- `kits/llm-eval-harness/apps/tsconfig.json`: strict TS + app-router config
Flow overview (node types + high level behavior)
- `run-target` flow (`kits/llm-eval-harness/flows/run-target.ts`)
  - Receives `{ systemPrompt, input }` from the app.
  - Runs the target LLM under `systemPrompt`.
  - Returns `answer` (mapped for the caller).
- `judge` flow (`kits/llm-eval-harness/flows/judge.ts`)
  - Receives `{ input, output, criteria, reference? }`.
  - Returns `answer` containing the three dimension scores plus `overall`, `pass`, and brief reasoning; faithfulness is treated as a veto by the gate logic.
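The veto and the pass-rate gate described above can be sketched as follows. This is an assumption-laden illustration, not the code in `lib/eval.ts`: the score field names come from the judge's JSON contract, but `MIN_FAITHFULNESS`, `MIN_AVERAGE`, `parseScores`, `casePasses`, and `gate` are hypothetical, and the numeric cut-offs are invented for the example. The kit reportedly also recovers scores from malformed judge JSON; this sketch does not attempt that and simply treats unparseable output as a failed case.

```typescript
// Field names follow the judge's JSON contract; the numeric cut-offs below
// are assumptions for illustration, not the kit's real values.
interface JudgeScores {
  faithfulness: number;
  relevancy: number;
  correctness: number;
}

const MIN_FAITHFULNESS = 3; // hypothetical veto floor on the 0-5 scale
const MIN_AVERAGE = 3.5;    // hypothetical per-case bar

// Parse the judge's `answer` string. Returns null instead of throwing on
// invalid JSON or on scores outside 0-5.
function parseScores(answer: string): JudgeScores | null {
  let raw: unknown;
  try {
    raw = JSON.parse(answer);
  } catch {
    return null;
  }
  if (typeof raw !== "object" || raw === null) return null;
  const rec = raw as Record<string, unknown>;
  const keys = ["faithfulness", "relevancy", "correctness"] as const;
  for (const k of keys) {
    const v = rec[k];
    if (typeof v !== "number" || !Number.isFinite(v) || v < 0 || v > 5) return null;
  }
  return {
    faithfulness: rec.faithfulness as number,
    relevancy: rec.relevancy as number,
    correctness: rec.correctness as number,
  };
}

// Recompute the verdict app-side: low faithfulness fails the case outright,
// regardless of the other two scores.
function casePasses(s: JudgeScores): boolean {
  if (s.faithfulness < MIN_FAITHFULNESS) return false; // veto
  const average = (s.faithfulness + s.relevancy + s.correctness) / 3;
  return average >= MIN_AVERAGE;
}

// Gate the run: the fraction of passing cases must reach `threshold` (0-1).
// Unparseable judge output counts as a failed case.
function gate(answers: string[], threshold: number): { passRate: number; passed: boolean } {
  if (answers.length === 0) return { passRate: 0, passed: false };
  const passes = answers.filter((a) => {
    const s = parseScores(a);
    return s !== null && casePasses(s);
  }).length;
  const passRate = passes / answers.length;
  return { passRate, passed: passRate >= threshold };
}
```

Recomputing the verdict from the raw scores, rather than trusting the judge's own `pass` field, means a judge that scores faithfulness low but still reports `pass: true` cannot slip a case through the gate.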