Add Simulated LLM user for true multi turn conversation #230

Merged
prernakakkar-google merged 10 commits into main from gemini-cli-evals on Jan 27, 2026

Conversation

@prernakakkar-google

Collaborator

This PR introduces a "Simulated User" capability to the evaluation framework, enabling fully automated multi-turn testing of the Gemini CLI agent. Instead of relying on static single-turn datasets, the evaluator can now simulate a user who follows a conversation plan, reacts to the agent's responses, and provides subsequent prompts until the task is complete.
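The loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `run_conversation`, `scripted_user`, and `echo_agent` are hypothetical names, the scripted user stands in for the LLM-backed one, and the placeholder agent stands in for a Gemini CLI invocation.

```python
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (user prompt, agent response)


def run_conversation(
    first_prompt: str,
    agent: Callable[[str], str],
    next_user_prompt: Callable[[List[Turn]], Optional[str]],
    max_turns: int = 3,
) -> List[Turn]:
    """Alternate between the agent and a simulated user until the user
    stops (returns None) or the turn budget is exhausted."""
    history: List[Turn] = []
    prompt: Optional[str] = first_prompt
    for _ in range(max_turns):
        response = agent(prompt)
        history.append((prompt, response))
        # The simulated user sees the full history and decides what to ask next.
        prompt = next_user_prompt(history)
        if prompt is None:
            break
    return history


def scripted_user(plan: List[str]) -> Callable[[List[Turn]], Optional[str]]:
    """Hypothetical stand-in for the LLM-backed user: walks a fixed plan,
    then terminates the conversation."""
    def next_prompt(history: List[Turn]) -> Optional[str]:
        # The first prompt is already in history, so plan step i follows turn i + 1.
        step = len(history) - 1
        return plan[step] if step < len(plan) else None
    return next_prompt


def echo_agent(prompt: str) -> str:
    """Placeholder agent; a real harness would call the agent under test here."""
    return f"agent reply to: {prompt}"


turns = run_conversation(
    "list all instances in project astana-evaluation",
    echo_agent,
    scripted_user(["What is the state of the nl2code instance?"]),
    max_turns=3,
)
```

With a one-step plan and a budget of three turns, the sketch runs two turns and then stops because the simulated user has nothing further to ask.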

Example run:

(.venv) prernakakkar@prernakakkar:~/senseai/evalbench$ ./evalbench/run.sh
I0120 14:39:52.576589 139753019032384 evalbench.py:36] EvalBench v1.0.0
I0120 14:39:52.578478 139753019032384 evalbench.py:50] Loaded Configurations in datasets/gemini-cli-tools/example_run_config.yaml
I0120 14:39:52.578758 139753019032384 __init__.py:10] Orchestrator Type: geminicli
I0120 14:39:52.578840 139753019032384 agentorchestrator.py:30] Starting Gemini CLI evaluation
I0120 14:39:52.618841 139753019032384 agentevaluator.py:61] Running Gemini CLI evaluation
I0120 14:39:52.619386 139749620709056 agentgenwork.py:70] Turn 1/3 - Prompt: list all instances in project astana-evaluation
I0120 14:40:05.433665 139749620709056 agentgenwork.py:74] Turn 1/3 - Gemini CLI exit code: 0
I0120 14:40:05.433794 139749620709056 agentgenwork.py:75] Turn 1/3 - Gemini CLI stdout: {
  "session_id": "a0941637-2043-48a1-9d28-5bdc86d7bd45",
  "response": "The Cloud SQL instances in project `astana-evaluation` are: `test56`, `test110`, `trte`, `tesy`, `magic`, `nl2code`, `testing-instance`, `agd`, `test-cloudsql-mysql-instance`, `test300`, `test400`, `test19`, and `test-cloudsql-sql-server-instance`.",
  "stats": {
    "models": {
      "gemini-2.5-flash-lite": {
        "api": {
          "totalRequests": 1,
          "totalErrors": 0,
          "totalLatencyMs": 1174
        },
        "tokens": {
          "input": 3781,
          "prompt": 3781,
          "candidates": 59,
          "total": 3916,
          "cached": 0,
          "thoughts": 76,
          "tool": 0
        }
      },
      "gemini-2.5-flash": {
        "api": {
          "totalRequests": 2,
          "totalErrors": 0,
          "totalLatencyMs": 3161
        },
        "tokens": {
          "input": 19557,
          "prompt": 19557,
          "candidates": 96,
          "total": 19696,
          "cached": 0,
          "thoughts": 43,
          "tool": 0
        }
      }
    },
    "tools": {
      "totalCalls": 1,
      "totalSuccess": 1,
      "totalFail": 0,
      "totalDurationMs": 597,
      "totalDecisions": {
        "accept": 0,
        "reject": 0,
        "modify": 0,
        "auto_accept": 1
      },
      "byName": {
        "list_instances": {
          "count": 1,
          "success": 1,
          "fail": 0,
          "durationMs": 597,
          "decisions": {
            "accept": 0,
            "reject": 0,
            "modify": 0,
            "auto_accept": 1
          }
        }
      }
    },
    "files": {
      "totalLinesAdded": 0,
      "totalLinesRemoved": 0
    }
  }
}
I0120 14:40:05.433892 139749620709056 agentgenwork.py:76] Turn 1/3 - Gemini CLI stderr: YOLO mode is enabled. All tool calls will be automatically approved.
Loaded cached credentials.
Loading extension: cloud-sql-postgresql
The --prompt (-p) flag has been deprecated and will be removed in a future version. Please use a positional argument for your prompt. See gemini --help for more information.

I0120 14:40:23.768228 139749620709056 agentgenwork.py:70] Turn 2/3 - Prompt: What is the state of the nl2code instance?
I0120 14:40:42.373557 139749620709056 agentgenwork.py:74] Turn 2/3 - Gemini CLI exit code: 0
I0120 14:40:42.373666 139749620709056 agentgenwork.py:75] Turn 2/3 - Gemini CLI stdout: {
  "session_id": "a0941637-2043-48a1-9d28-5bdc86d7bd45",
  "response": "The `nl2code` instance is in a `RUNNABLE` state.",
  "stats": {
    "models": {
      "gemini-2.5-flash-lite": {
        "api": {
          "totalRequests": 1,
          "totalErrors": 0,
          "totalLatencyMs": 1512
        },
        "tokens": {
          "input": 3880,
          "prompt": 3880,
          "candidates": 75,
          "total": 4053,
          "cached": 0,
          "thoughts": 98,
          "tool": 0
        }
      },
      "gemini-2.5-flash": {
        "api": {
          "totalRequests": 2,
          "totalErrors": 0,
          "totalLatencyMs": 5717
        },
        "tokens": {
          "input": 9009,
          "prompt": 21659,
          "candidates": 30,
          "total": 21802,
          "cached": 12650,
          "thoughts": 113,
          "tool": 0
        }
      }
    },
    "tools": {
      "totalCalls": 1,
      "totalSuccess": 1,
      "totalFail": 0,
      "totalDurationMs": 553,
      "totalDecisions": {
        "accept": 0,
        "reject": 0,
        "modify": 0,
        "auto_accept": 1
      },
      "byName": {
        "get_instance": {
          "count": 1,
          "success": 1,
          "fail": 0,
          "durationMs": 553,
          "decisions": {
            "accept": 0,
            "reject": 0,
            "modify": 0,
            "auto_accept": 1
          }
        }
      }
    },
    "files": {
      "totalLinesAdded": 0,
      "totalLinesRemoved": 0
    }
  }
}
I0120 14:40:42.373727 139749620709056 agentgenwork.py:76] Turn 2/3 - Gemini CLI stderr: YOLO mode is enabled. All tool calls will be automatically approved.
Loaded cached credentials.
Loading extension: cloud-sql-postgresql
The --prompt (-p) flag has been deprecated and will be removed in a future version. Please use a positional argument for your prompt. See gemini --help for more information.

I0120 14:40:58.027757 139749620709056 agentgenwork.py:105] Simulated user terminated conversation.
I0120 14:40:58.035415 139753019032384 report.py:25] Total Prompts: 1.
I0120 14:40:58.037858 139753019032384 report.py:35] Prompt Errors: 0.
I0120 14:40:58.038292 139753019032384 report.py:36] SQLGen Errors: 0.
I0120 14:40:58.038634 139753019032384 report.py:37] SQLExec Gen Errors: 0.
I0120 14:40:58.038959 139753019032384 report.py:38] Golden Errors: 0.
I0120 14:40:58.040027 139753019032384 analyzer.py:34] trajectory_matcher:       1/1 = 100.0%
I0120 14:40:58.040869 139753019032384 analyzer.py:34] executable:       1/1 = 100.0%
I0120 14:40:58.046607 139753019032384 csv.py:31] Created csv configs.csv for StoreType.CONFIGS in directory results/0fbde191-95e7-4adb-b243-e0a72f91adcd
I0120 14:40:58.048658 139753019032384 csv.py:31] Created csv evals.csv for StoreType.EVALS in directory results/0fbde191-95e7-4adb-b243-e0a72f91adcd
I0120 14:40:58.049929 139753019032384 csv.py:31] Created csv scores.csv for StoreType.SCORES in directory results/0fbde191-95e7-4adb-b243-e0a72f91adcd
I0120 14:40:58.050867 139753019032384 csv.py:31] Created csv summary.csv for StoreType.SUMMARY in directory results/0fbde191-95e7-4adb-b243-e0a72f91adcd
Finished Job ID 0fbde191-95e7-4adb-b243-e0a72f91adcd
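
For readers who want to post-process a turn, here is a small sketch of pulling a per-turn summary out of the JSON the CLI prints to stdout. The field names are taken from the log above; `summarize_turn` is a hypothetical helper, and it assumes stdout contains only the JSON object.

```python
import json
from typing import Any, Dict


def summarize_turn(stdout: str) -> Dict[str, Any]:
    """Extract session id, response text, total tokens across models,
    and the names of the tools called from one turn's JSON output."""
    payload = json.loads(stdout)
    stats = payload.get("stats", {})
    models = stats.get("models", {})
    tools = stats.get("tools", {})
    return {
        "session_id": payload.get("session_id"),
        "response": payload.get("response"),
        # Sum the per-model "total" token counts.
        "total_tokens": sum(
            m.get("tokens", {}).get("total", 0) for m in models.values()
        ),
        "tools_called": sorted(tools.get("byName", {})),
    }


# Trimmed copy of the turn 2 output shown in the log above.
sample = json.dumps({
    "session_id": "a0941637-2043-48a1-9d28-5bdc86d7bd45",
    "response": "The `nl2code` instance is in a `RUNNABLE` state.",
    "stats": {
        "models": {
            "gemini-2.5-flash-lite": {"tokens": {"total": 4053}},
            "gemini-2.5-flash": {"tokens": {"total": 21802}},
        },
        "tools": {"byName": {"get_instance": {"count": 1, "success": 1}}},
    },
})

summary = summarize_turn(sample)
print(summary["total_tokens"], summary["tools_called"])  # 25855 ['get_instance']
```

Both turns in the run report the same `session_id`, so a summary like this can also be used to check that consecutive turns belong to one conversation.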

Comment thread evalbench/evaluator/simulateduser.py Outdated
Comment thread evalbench/work/agentgenwork.py Outdated
Comment thread evalbench/work/agentgenwork.py Outdated
Comment thread evalbench/work/agentscorework.py Outdated
Comment thread evalbench/work/agentgenwork.py Outdated
Comment thread evalbench/work/agentgenwork.py Outdated
IsmailMehdi previously approved these changes Jan 23, 2026

@IsmailMehdi IsmailMehdi left a comment

Collaborator

Nice work

@IsmailMehdi

Collaborator

/gcbrun

@IsmailMehdi

Collaborator

/gcbrun

@prernakakkar-google merged commit f307148 into main on Jan 27, 2026
2 of 3 checks passed

2 participants