Commit e40ff56
Merge pull request #137 from cnblogs/support-gui-plus
feat: add gui sample
2 parents 49ec0a7 + 0c32ea8

File tree: 3 files changed, +387 −2 lines

README.md

Lines changed: 116 additions & 0 deletions
@@ -1326,6 +1326,122 @@ Ola!
Salam!
```

### GUI

Use the `gui-plus` model to generate standardized operation instructions from a screenshot and the user's intent.

Currently, the main capabilities are configured through the system prompt, where you define the actions available to the model and the JSON output format.

Sample system prompt:

```markdown
## 1. Core Role
You are an expert AI Vision Operation Agent. Your task is to analyze computer screenshots, understand user instructions, and then break the task down into single, precise GUI atomic operations.

## 2. [CRITICAL] JSON Schema & Absolute Rules
Your output **must** be a JSON object that strictly adheres to the following rules. **Any deviation will result in failure**.

- **[R1] Strict JSON**: Your response **must** be *and only be* a JSON object. Do not add any text, comments, or explanations before or after the JSON code block.
- **[R2] Strict `thought` Structure**: The `thought` field must contain a single sentence briefly describing your thought process. For example: "The user wants to open the browser. I see the Chrome icon on the desktop, so the next step is to click it."
- **[R3] Precise `action` Value**: The value of the `action` field **must** be an uppercase string defined in `## 3. Toolset` (e.g., `"CLICK"`, `"TYPE"`). No leading/trailing spaces or case variations are allowed.
- **[R4] Strict `parameters` Structure**: The structure of the `parameters` object **must** be **perfectly identical** to the template defined for the selected action in `## 3. Toolset`. Key names and value types must match exactly.

## 3. Toolset (Available Actions)

### CLICK
- **Description**: Click on the screen.
- **Parameters Template**:
  {
    "x": <integer>,
    "y": <integer>,
    "description": "<string, optional: a short string describing what you are clicking on, e.g., 'Chrome browser icon' or 'Login button'>"
  }

### TYPE
- **Description**: Type text.
- **Parameters Template**:
  {
    "text": "<string>",
    "needs_enter": <boolean>
  }

### SCROLL
- **Description**: Scroll the window.
- **Parameters Template**:
  {
    "direction": "<'up' or 'down'>",
    "amount": "<'small', 'medium', or 'large'>"
  }

### KEY_PRESS
- **Description**: Press a function key.
- **Parameters Template**:
  {
    "key": "<string: e.g., 'enter', 'esc', 'alt+f4'>"
  }

### FINISH
- **Description**: Task completed successfully.
- **Parameters Template**:
  {
    "message": "<string: a summary of the task completion>"
  }

### FAIL
- **Description**: Task cannot be completed.
- **Parameters Template**:
  {
    "reason": "<string: a clear explanation of the failure reason>"
  }

## 4. Thinking and Decision Framework
Before generating each action, strictly follow this think-then-verify process:

1. **Goal Analysis**: What is the user's ultimate goal?
2. **Screen Observation (Grounded Observation)**: Carefully analyze the screenshot. Your decisions must be based on visual evidence present in the screenshot. If you cannot see an element, you cannot interact with it.
3. **Action Decision**: Based on the goal and the visible elements, select the most appropriate tool.
4. **Construct Output**:
   a. Record your thought process in the `thought` field.
   b. Select an `action`.
   c. Precisely copy the `parameters` template for that action and fill in the values.
5. **Final Verification (Self-Correction)**: Before outputting, perform a final check:
   - Is my response pure JSON?
   - Is the `action` value correct (uppercase, no spaces)?
   - Is the `parameters` structure 100% identical to the template? For example, for `CLICK`, are there separate `x` and `y` keys, and are their values integers?
```
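
Rules R1 through R4 are mechanical, so the client can verify a reply before acting on it. Below is a minimal validation sketch in Python; the `validate_reply` helper and its `ACTION_PARAMS` table are illustrative only (not part of the DashScope SDK), with action names and required keys mirroring the sample prompt's toolset:

```python
import json

# Allowed actions and their required parameter keys, mirroring the
# sample system prompt's toolset ("description" on CLICK is optional).
ACTION_PARAMS = {
    "CLICK": {"x", "y"},
    "TYPE": {"text", "needs_enter"},
    "SCROLL": {"direction", "amount"},
    "KEY_PRESS": {"key"},
    "FINISH": {"message"},
    "FAIL": {"reason"},
}

def validate_reply(raw: str) -> dict:
    """Parse a model reply and enforce rules R1-R4; raise ValueError on any deviation."""
    reply = json.loads(raw)                      # R1: must be pure JSON
    if not isinstance(reply.get("thought"), str):
        raise ValueError("R2: 'thought' must be a string")
    action = reply.get("action")
    if action not in ACTION_PARAMS:              # R3: exact uppercase action name
        raise ValueError(f"R3: unknown action {action!r}")
    required = ACTION_PARAMS[action]
    given = set(reply.get("parameters", {}))
    if not required <= given:                    # R4: template keys must be present
        raise ValueError(f"R4: missing keys {required - given}")
    return reply

reply = validate_reply('{"thought": "Click the icon.", "action": "CLICK", '
                       '"parameters": {"x": 1089, "y": 123}}')
print(reply["action"])  # CLICK
```

Rejecting malformed replies early lets you re-prompt the model instead of performing a wrong or undefined operation.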

Request:

```csharp
var messages = new List<MultimodalMessage>
{
    MultimodalMessage.System([MultimodalMessageContent.TextContent(SystemPrompt)]),
    MultimodalMessage.User(
    [
        MultimodalMessageContent.ImageContent(
            "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png"),
        MultimodalMessageContent.TextContent("Open browser")
    ])
};
var completion = client.GetMultimodalGenerationStreamAsync(
    new ModelRequest<MultimodalInput, IMultimodalParameters>()
    {
        Model = "gui-plus",
        Input = new MultimodalInput() { Messages = messages },
        Parameters = new MultimodalParameters() { IncrementalOutput = true, }
    });
```

Response:

```json
{
  "thought": "The user wants to open the browser. I see the Google Chrome icon on the desktop, so the next step is to click it.",
  "action": "CLICK",
  "parameters": {
    "x": 1089,
    "y": 123
  }
}
```

Then execute the operation the model returns, and reply with the next screenshot and your next intent.

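The execute-and-reply cycle just described can be sketched as a small dispatch loop that maps each returned action to a local handler. The sketch below is client-side Python; the handlers are hypothetical stand-ins for your own automation layer (for example, a desktop automation library), not part of the SDK:

```python
# Client-side dispatch loop sketch. The handlers below are hypothetical
# stand-ins; wire them to a real automation library in practice.
executed = []  # record of performed operations, for illustration

HANDLERS = {
    "CLICK":     lambda p: executed.append(("click", p["x"], p["y"])),
    "TYPE":      lambda p: executed.append(("type", p["text"])),
    "SCROLL":    lambda p: executed.append(("scroll", p["direction"])),
    "KEY_PRESS": lambda p: executed.append(("key", p["key"])),
}

def step(reply: dict) -> bool:
    """Execute one model reply; return True while the task is still running."""
    action, params = reply["action"], reply.get("parameters", {})
    if action in ("FINISH", "FAIL"):      # terminal actions end the loop
        return False
    HANDLERS[action](params)              # perform the atomic operation
    # ...then capture a fresh screenshot and send it with the next request
    return True

running = step({"thought": "...", "action": "CLICK", "parameters": {"x": 1089, "y": 123}})
print(running, executed)  # True [('click', 1089, 123)]
```

Looping `step` until it returns False gives the full agent cycle: screenshot in, atomic operation out, repeat until the model emits a terminal action.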
## Text-to-Speech
Create a speech synthesis session using `dashScopeClient.CreateSpeechSynthesizerSocketSessionAsync()`.

README.zh-Hans.md

Lines changed: 121 additions & 2 deletions
@@ -97,7 +97,6 @@ public class YourService(IDashScopeClient client)
 - [Tool Calling](#工具调用)
 - [Prefix Completion](#前缀续写)
 - [Long Context (Qwen-Long)](#长上下文(Qwen-Long))
-
 - [Multimodal](#多模态) - QWen-VL, QVQ, etc., supporting reasoning, visual understanding, OCR, and audio understanding
 - [Visual Understanding/Reasoning](#视觉理解/推理) - image and video input and understanding, with reasoning mode
 - [Text Extraction](#文字提取) - OCR tasks such as reading tables, documents, and formulas
@@ -108,7 +107,7 @@ public class YourService(IDashScopeClient client)
 - [Formula Recognition](#公式识别)
 - [General Text Recognition](#通用文本识别)
 - [Multilingual Recognition](#多语言识别)
-
+- [GUI Interaction](#界面交互)
 - [Speech Synthesis](#语音合成) - CosyVoice, Sambert, etc., for TTS scenarios
 - [Image Generation](#图像生成) - wanx2.1, etc., for text-to-image, portrait style repaint, and more
 - [Application Invocation](#应用调用)
@@ -3212,6 +3211,126 @@ Ola!
Salam!
```

### GUI Interaction

Use the `gui-plus` model to generate standardized operation instructions from a screenshot and the user's intent.

Currently, the main capabilities are configured through the system prompt, where you define the actions available to the model and the JSON output format.

Sample system prompt:

```markdown
3223+
## 1. 核心角色 (Core Role)
3224+
你是一个顶级的AI视觉操作代理。你的任务是分析电脑屏幕截图,理解用户的指令,然后将任务分解为单一、精确的GUI原子操作。
3225+
3226+
## 2. [CRITICAL] JSON Schema & 绝对规则
3227+
你的输出**必须**是一个严格符合以下规则的JSON对象。**任何偏差都将导致失败**。
3228+
3229+
- **[R1] 严格的JSON**: 你的回复**必须**是且**只能是**一个JSON对象。禁止在JSON代码块前后添加任何文本、注释或解释。
3230+
- **[R2] 严格的Parameters结构**:`thought`对象的结构: "在这里用一句话简要描述你的思考过程。例如:用户想打开浏览器,我看到了桌面上的Chrome浏览器图标,所以下一步是点击它。"
3231+
- **[R3] 精确的Action值**: `action`字段的值**必须**是`## 3. 工具集`中定义的一个大写字符串(例如 `"CLICK"`, `"TYPE"`),不允许有任何前导/后置空格或大小写变化。
3232+
- **[R4] 严格的Parameters结构**: `parameters`对象的结构**必须**与所选Action在`## 3. 工具集`中定义的模板**完全一致**。键名、值类型都必须精确匹配。
3233+
3234+
## 3. 工具集 (Available Actions)
3235+
### CLICK
3236+
- **功能**: 单击屏幕。
3237+
- **Parameters模板**:
3238+
{
3239+
"x": <integer>,
3240+
"y": <integer>,
3241+
"description": "<string, optional: (可选) 一个简短的字符串,描述你点击的是什么,例如 "Chrome浏览器图标" 或 "登录按钮"。>"
3242+
}
3243+
3244+
### TYPE
3245+
- **功能**: 输入文本。
3246+
- **Parameters模板**:
3247+
{
3248+
"text": "<string>",
3249+
"needs_enter": <boolean>
3250+
}
3251+
3252+
### SCROLL
3253+
- **功能**: 滚动窗口。
3254+
- **Parameters模板**:
3255+
{
3256+
"direction": "<'up' or 'down'>",
3257+
"amount": "<'small', 'medium', or 'large'>"
3258+
}
3259+
3260+
### KEY_PRESS
3261+
- **功能**: 按下功能键。
3262+
- **Parameters模板**:
3263+
{
3264+
"key": "<string: e.g., 'enter', 'esc', 'alt+f4'>"
3265+
}
3266+
3267+
### FINISH
3268+
- **功能**: 任务成功完成。
3269+
- **Parameters模板**:
3270+
{
3271+
"message": "<string: 总结任务完成情况>"
3272+
}
3273+
3274+
### FAILE
3275+
- **功能**: 任务无法完成。
3276+
- **Parameters模板**:
3277+
{
3278+
"reason": "<string: 清晰解释失败原因>"
3279+
}
3280+
3281+
## 4. 思维与决策框架
3282+
在生成每一步操作前,请严格遵循以下思考-验证流程:
3283+
3284+
目标分析: 用户的最终目标是什么?
3285+
屏幕观察 (Grounded Observation): 仔细分析截图。你的决策必须基于截图中存在的视觉证据。 如果你看不见某个元素,你就不能与它交互。
3286+
行动决策: 基于目标和可见的元素,选择最合适的工具。
3287+
构建输出:
3288+
a. 在thought字段中记录你的思考。
3289+
b. 选择一个action。
3290+
c. 精确复制该action的parameters模板,并填充值。
3291+
最终验证 (Self-Correction): 在输出前,最后检查一遍:
3292+
我的回复是纯粹的JSON吗?
3293+
action的值是否正确无误(大写、无空格)?
3294+
parameters的结构是否与模板100%一致?例如,对于CLICK,是否有独立的x和y键,并且它们的值都是整数?
3295+
```

Before sending the request, the first message must be the pre-configured system prompt, and the second message carries the screenshot and the user's intent. For example:

```csharp
3300+
var messages = new List<MultimodalMessage>
3301+
{
3302+
MultimodalMessage.System([MultimodalMessageContent.TextContent(SystemPrompt)]),
3303+
MultimodalMessage.User(
3304+
[
3305+
MultimodalMessageContent.ImageContent(
3306+
"https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png"),
3307+
MultimodalMessageContent.TextContent("帮我打开浏览器")
3308+
])
3309+
};
3310+
var completion = client.GetMultimodalGenerationStreamAsync(
3311+
new ModelRequest<MultimodalInput, IMultimodalParameters>()
3312+
{
3313+
Model = "gui-plus",
3314+
Input = new MultimodalInput() { Messages = messages },
3315+
Parameters = new MultimodalParameters() { IncrementalOutput = true, }
3316+
});
3317+
```

After the request is sent, the model returns its result as JSON, which you can parse yourself:

```json
{
  "thought": "The user wants to open the browser. I see the Google Chrome icon on the desktop, so the next step is to click it.",
  "action": "CLICK",
  "parameters": {
    "x": 1089,
    "y": 123
  }
}
```
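
Because the request above sets `IncrementalOutput = true`, the reply arrives as a sequence of text fragments rather than one complete document, so concatenate all chunks before parsing. A minimal Python sketch of that accumulation (the `chunks` list is an illustrative stand-in for the stream the SDK returns):

```python
import json

# Illustrative incremental chunks, standing in for the real stream.
chunks = ['{"thought": "Click the Chrome icon.", ',
          '"action": "CLICK", ',
          '"parameters": {"x": 1089, "y": 123}}']

buffer = "".join(chunks)        # incremental output: concatenate first...
reply = json.loads(buffer)      # ...then parse the JSON exactly once
print(reply["parameters"])      # {'x': 1089, 'y': 123}
```

Parsing a partial buffer mid-stream would raise a JSON error, so wait for the stream to finish before decoding.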

You then carry out the operation the model returned (here, clicking a position on the screen) yourself, and send back the next screenshot and intent.

## Speech Synthesis

Create a speech synthesis session via `dashScopeClient.CreateSpeechSynthesizerSocketSessionAsync()`.
