Add Simulated LLM user for true multi turn conversation by prernakakkar-google · Pull Request #230 · GoogleCloudPlatform/evalbench

prernakakkar-google · 2026-01-20T14:52:40Z

This PR introduces a "Simulated User" capability to the evaluation framework, enabling fully automated multi-turn testing of the Gemini CLI agent. Instead of relying on static single-turn datasets, the evaluator can now simulate a user who follows a conversation plan, reacts to the agent's responses, and provides subsequent prompts until the task is complete.
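
The loop described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `run_conversation`, `fake_agent`, `scripted_user`, and the `TERMINATE` sentinel are hypothetical names, and the agent and simulated user are replaced by stand-in functions so the sketch runs without an LLM or the Gemini CLI.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical sentinel; the real simulated user may signal completion differently.
TERMINATE = "TERMINATE"

Turn = Tuple[str, str]  # (user prompt, agent response)


def run_conversation(
    first_prompt: str,
    plan: str,
    agent: Callable[[str], str],
    simulated_user: Callable[[str, List[Turn]], Optional[str]],
    max_turns: int = 3,
) -> List[Turn]:
    """Alternate between the agent and the simulated user until the user
    terminates or the turn budget is exhausted."""
    history: List[Turn] = []
    prompt = first_prompt
    for turn in range(1, max_turns + 1):
        response = agent(prompt)
        history.append((prompt, response))
        if turn == max_turns:
            break  # turn budget exhausted; do not ask for another prompt
        next_prompt = simulated_user(plan, history)
        if next_prompt is None or next_prompt.strip() == TERMINATE:
            break  # simulated user considers the task complete
        prompt = next_prompt
    return history


# Stand-ins (assumptions) so the sketch is runnable on its own.
def fake_agent(prompt: str) -> str:
    return f"(agent reply to: {prompt})"


def scripted_user(plan: str, history: List[Turn]) -> Optional[str]:
    follow_ups = ["What is the state of the nl2code instance?"]
    asked = len(history) - 1  # number of follow-ups already issued
    return follow_ups[asked] if asked < len(follow_ups) else TERMINATE


history = run_conversation(
    first_prompt="list all instances in project astana-evaluation",
    plan="List the instances, then check the state of nl2code.",
    agent=fake_agent,
    simulated_user=scripted_user,
)
for i, (prompt, response) in enumerate(history, start=1):
    print(f"Turn {i}: {prompt} -> {response}")
```

With a budget of three turns and a user that stops after one follow-up, the sketch ends after two turns, which mirrors the shape of the example run (two turns out of three, then termination by the simulated user).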

Example run:

(.venv) prernakakkar@prernakakkar:~/senseai/evalbench$ ./evalbench/run.sh
I0120 14:39:52.576589 139753019032384 evalbench.py:36] EvalBench v1.0.0
I0120 14:39:52.578478 139753019032384 evalbench.py:50] Loaded Configurations in datasets/gemini-cli-tools/example_run_config.yaml
I0120 14:39:52.578758 139753019032384 __init__.py:10] Orchestrator Type: geminicli
I0120 14:39:52.578840 139753019032384 agentorchestrator.py:30] Starting Gemini CLI evaluation
I0120 14:39:52.618841 139753019032384 agentevaluator.py:61] Running Gemini CLI evaluation
I0120 14:39:52.619386 139749620709056 agentgenwork.py:70] Turn 1/3 - Prompt: list all instances in project astana-evaluation
I0120 14:40:05.433665 139749620709056 agentgenwork.py:74] Turn 1/3 - Gemini CLI exit code: 0
I0120 14:40:05.433794 139749620709056 agentgenwork.py:75] Turn 1/3 - Gemini CLI stdout: {
  "session_id": "a0941637-2043-48a1-9d28-5bdc86d7bd45",
  "response": "The Cloud SQL instances in project `astana-evaluation` are: `test56`, `test110`, `trte`, `tesy`, `magic`, `nl2code`, `testing-instance`, `agd`, `test-cloudsql-mysql-instance`, `test300`, `test400`, `test19`, and `test-cloudsql-sql-server-instance`.",
  "stats": {
    "models": {
      "gemini-2.5-flash-lite": {
        "api": {
          "totalRequests": 1,
          "totalErrors": 0,
          "totalLatencyMs": 1174
        },
        "tokens": {
          "input": 3781,
          "prompt": 3781,
          "candidates": 59,
          "total": 3916,
          "cached": 0,
          "thoughts": 76,
          "tool": 0
        }
      },
      "gemini-2.5-flash": {
        "api": {
          "totalRequests": 2,
          "totalErrors": 0,
          "totalLatencyMs": 3161
        },
        "tokens": {
          "input": 19557,
          "prompt": 19557,
          "candidates": 96,
          "total": 19696,
          "cached": 0,
          "thoughts": 43,
          "tool": 0
        }
      }
    },
    "tools": {
      "totalCalls": 1,
      "totalSuccess": 1,
      "totalFail": 0,
      "totalDurationMs": 597,
      "totalDecisions": {
        "accept": 0,
        "reject": 0,
        "modify": 0,
        "auto_accept": 1
      },
      "byName": {
        "list_instances": {
          "count": 1,
          "success": 1,
          "fail": 0,
          "durationMs": 597,
          "decisions": {
            "accept": 0,
            "reject": 0,
            "modify": 0,
            "auto_accept": 1
          }
        }
      }
    },
    "files": {
      "totalLinesAdded": 0,
      "totalLinesRemoved": 0
    }
  }
}
I0120 14:40:05.433892 139749620709056 agentgenwork.py:76] Turn 1/3 - Gemini CLI stderr: YOLO mode is enabled. All tool calls will be automatically approved.
Loaded cached credentials.
Loading extension: cloud-sql-postgresql
The --prompt (-p) flag has been deprecated and will be removed in a future version. Please use a positional argument for your prompt. See gemini --help for more information.

I0120 14:40:23.768228 139749620709056 agentgenwork.py:70] Turn 2/3 - Prompt: What is the state of the nl2code instance?
I0120 14:40:42.373557 139749620709056 agentgenwork.py:74] Turn 2/3 - Gemini CLI exit code: 0
I0120 14:40:42.373666 139749620709056 agentgenwork.py:75] Turn 2/3 - Gemini CLI stdout: {
  "session_id": "a0941637-2043-48a1-9d28-5bdc86d7bd45",
  "response": "The `nl2code` instance is in a `RUNNABLE` state.",
  "stats": {
    "models": {
      "gemini-2.5-flash-lite": {
        "api": {
          "totalRequests": 1,
          "totalErrors": 0,
          "totalLatencyMs": 1512
        },
        "tokens": {
          "input": 3880,
          "prompt": 3880,
          "candidates": 75,
          "total": 4053,
          "cached": 0,
          "thoughts": 98,
          "tool": 0
        }
      },
      "gemini-2.5-flash": {
        "api": {
          "totalRequests": 2,
          "totalErrors": 0,
          "totalLatencyMs": 5717
        },
        "tokens": {
          "input": 9009,
          "prompt": 21659,
          "candidates": 30,
          "total": 21802,
          "cached": 12650,
          "thoughts": 113,
          "tool": 0
        }
      }
    },
    "tools": {
      "totalCalls": 1,
      "totalSuccess": 1,
      "totalFail": 0,
      "totalDurationMs": 553,
      "totalDecisions": {
        "accept": 0,
        "reject": 0,
        "modify": 0,
        "auto_accept": 1
      },
      "byName": {
        "get_instance": {
          "count": 1,
          "success": 1,
          "fail": 0,
          "durationMs": 553,
          "decisions": {
            "accept": 0,
            "reject": 0,
            "modify": 0,
            "auto_accept": 1
          }
        }
      }
    },
    "files": {
      "totalLinesAdded": 0,
      "totalLinesRemoved": 0
    }
  }
}
I0120 14:40:42.373727 139749620709056 agentgenwork.py:76] Turn 2/3 - Gemini CLI stderr: YOLO mode is enabled. All tool calls will be automatically approved.
Loaded cached credentials.
Loading extension: cloud-sql-postgresql
The --prompt (-p) flag has been deprecated and will be removed in a future version. Please use a positional argument for your prompt. See gemini --help for more information.

I0120 14:40:58.027757 139749620709056 agentgenwork.py:105] Simulated user terminated conversation.
I0120 14:40:58.035415 139753019032384 report.py:25] Total Prompts: 1.
I0120 14:40:58.037858 139753019032384 report.py:35] Prompt Errors: 0.
I0120 14:40:58.038292 139753019032384 report.py:36] SQLGen Errors: 0.
I0120 14:40:58.038634 139753019032384 report.py:37] SQLExec Gen Errors: 0.
I0120 14:40:58.038959 139753019032384 report.py:38] Golden Errors: 0.
I0120 14:40:58.040027 139753019032384 analyzer.py:34] trajectory_matcher:       1/1 = 100.0%
I0120 14:40:58.040869 139753019032384 analyzer.py:34] executable:       1/1 = 100.0%
I0120 14:40:58.046607 139753019032384 csv.py:31] Created csv configs.csv for StoreType.CONFIGS in directory results/0fbde191-95e7-4adb-b243-e0a72f91adcd
I0120 14:40:58.048658 139753019032384 csv.py:31] Created csv evals.csv for StoreType.EVALS in directory results/0fbde191-95e7-4adb-b243-e0a72f91adcd
I0120 14:40:58.049929 139753019032384 csv.py:31] Created csv scores.csv for StoreType.SCORES in directory results/0fbde191-95e7-4adb-b243-e0a72f91adcd
I0120 14:40:58.050867 139753019032384 csv.py:31] Created csv summary.csv for StoreType.SUMMARY in directory results/0fbde191-95e7-4adb-b243-e0a72f91adcd
Finished Job ID 0fbde191-95e7-4adb-b243-e0a72f91adcd
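
Each turn's stdout in the run above is a JSON document with `session_id`, `response`, and `stats` keys, and both turns report the same `session_id`. As a hedged sketch of how such output could be reduced to the fields relevant for scoring, the snippet below parses an abbreviated copy of the Turn 2 stdout. `summarize_turn` is a hypothetical helper, not a function from this PR, and the sample omits most of the `stats` fields shown in the log.

```python
import json
from typing import Any, Dict

# Abbreviated copy of the Turn 2 stdout from the run above (most stats omitted).
TURN_2_STDOUT = """
{
  "session_id": "a0941637-2043-48a1-9d28-5bdc86d7bd45",
  "response": "The `nl2code` instance is in a `RUNNABLE` state.",
  "stats": {
    "models": {
      "gemini-2.5-flash-lite": {"tokens": {"total": 4053}},
      "gemini-2.5-flash": {"tokens": {"total": 21802}}
    },
    "tools": {
      "totalCalls": 1,
      "byName": {"get_instance": {"count": 1, "success": 1, "fail": 0}}
    }
  }
}
"""


def summarize_turn(stdout: str) -> Dict[str, Any]:
    """Extract the session id, response text, per-tool call counts, and the
    token total summed across models from one turn's JSON stdout."""
    payload = json.loads(stdout)
    stats = payload.get("stats", {})
    tools = stats.get("tools", {}).get("byName", {})
    models = stats.get("models", {})
    return {
        "session_id": payload.get("session_id"),
        "response": payload.get("response"),
        "tool_calls": {name: info.get("count", 0) for name, info in tools.items()},
        "total_tokens": sum(
            m.get("tokens", {}).get("total", 0) for m in models.values()
        ),
    }


summary = summarize_turn(TURN_2_STDOUT)
print(summary["tool_calls"])    # {'get_instance': 1}
print(summary["total_tokens"])  # 25855
```

Note that `byName` records call counts per tool, not call order, so a summary like this can tell which tools were used in a turn but not the sequence within that turn.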

IsmailMehdi

Nice work

IsmailMehdi · 2026-01-23T21:21:09Z

/gcbrun

IsmailMehdi · 2026-01-27T16:51:19Z

/gcbrun

Add Simulated LLM user for true multi turn conversation

a6715d7

prernakakkar-google requested review from IsmailMehdi and mahyareb as code owners January 20, 2026 14:52

prernakakkar-google added 2 commits January 20, 2026 15:00

fix lint issues

9778784

formatting

adb569a

prernakakkar-google requested review from helloeve and kurtisvg January 20, 2026 15:10

IsmailMehdi reviewed Jan 20, 2026

Comment thread evalbench/evaluator/simulateduser.py Outdated

Comment thread evalbench/work/agentgenwork.py Outdated

Comment thread evalbench/work/agentgenwork.py Outdated

prernakakkar-google added 4 commits January 22, 2026 13:35

Resolve comments

c0aa43d

Merge branch 'main' into gemini-cli-evals

6f0973a

fix linting issues

8b61ed2

fix lint

95ba108

IsmailMehdi reviewed Jan 22, 2026

Comment thread evalbench/work/agentscorework.py Outdated

IsmailMehdi reviewed Jan 22, 2026

Comment thread evalbench/work/agentgenwork.py Outdated

Comment thread evalbench/work/agentgenwork.py Outdated

prernakakkar-google added 2 commits January 23, 2026 17:48

refactoring

f1c20a2

fix lint

6f5703a

IsmailMehdi previously approved these changes Jan 23, 2026

Merge branch 'main' into gemini-cli-evals

8c5e225

prernakakkar-google dismissed IsmailMehdi’s stale review via 8c5e225 January 27, 2026 05:54

prernakakkar-google requested a review from IsmailMehdi January 27, 2026 05:55

IsmailMehdi approved these changes Jan 27, 2026

prernakakkar-google merged commit f307148 into main Jan 27, 2026
2 of 3 checks passed

prernakakkar-google merged 10 commits into main from gemini-cli-evals

2 participants
