Skip to content

midsence.js 智能体


midsence.js

视觉模型驱动,支持全平台的 UI 自动化 SDK

\


技术栈

  • Rsbuild 和 Rslib 提供构建工具。
  • UI-TARS 提供开源智能体模型 UI-TARS。
  • Qwen2.5-VL 提供开源 VL 模型 Qwen2.5-VL。
  • scrcpy 和 yume-chan 让我们能够用浏览器控制 Android 设备。
  • appium-adb 提供 adb 的 JavaScript 桥接。
  • appium-webdriveragent 用于 JavaScript 操作 XCTest。
  • YADB 提供 yadb 工具,提升文本输入性能。
  • Puppeteer 提供浏览器自动化和控制。
  • Playwright 提供浏览器自动化、控制和测试。

UI 自动化策略

  • 基于 DOM + 截图标注:提前提取页面的 DOM 结构,结合截图做好标注,请模型“挑选”其中的内容。
  • 纯视觉:利用模型的视觉定位能力,基于截图完成所有分析工作,即模型收到的只有图片,没有 DOM,也没有标注信息。

Midscene 早期同时兼容「DOM 定位」和「纯视觉」两种技术路线,交由开发者自行选择比对 综合上述情况,从 1.0 版本开始,Midscene 只支持纯视觉方案,不再提供“提取 DOM”的兼容模式。这一限制针对 UI 操作与元素定位;在数据提取或页面理解场景中,仍可按需附带 DOM 信息。


模型选择策略

  • 豆包 Seed 模型
  • 千问 Qwen3-VL
  • 千问 Qwen2.5-VL
  • Gemini-3-Pro / Gemini-3-Flash
  • UI-TARS


快速开始


安装

  • 命令行工具
  • 浏览器工具
  • playground 工具
  • mcp 工具
npm i -g @midscene/cli
# or
npx midscene

\


基础配置

# 替换为你自己的 API Key
export MIDSCENE_MODEL_BASE_URL="https://.../compatible-mode/v1"
export MIDSCENE_MODEL_API_KEY="sk-abcde..."
export MIDSCENE_MODEL_NAME="qwen3-vl-plus"
export MIDSCENE_MODEL_FAMILY="qwen3-vl"

.env 配置

MIDSCENE_MODEL_BASE_URL="https://.../compatible-mode/v1"
MIDSCENE_MODEL_API_KEY="sk-abcdefghijklmnopqrstuvwxyz"
MIDSCENE_MODEL_NAME="qwen3-vl-plus"
MIDSCENE_MODEL_FAMILY="qwen3-vl"

进阶配置

# 默认视觉模型:Qwen3-VL
export MIDSCENE_MODEL_BASE_URL="https://..."       # Qwen3-VL 接口地址
export MIDSCENE_MODEL_API_KEY="..."                # 你的 Qwen3-VL API Key
export MIDSCENE_MODEL_NAME="qwen3-vl-plus"
export MIDSCENE_MODEL_FAMILY="qwen3-vl"

# Planning 模型:GPT-5.1
export MIDSCENE_PLANNING_MODEL_API_KEY="sk-..."    # 你的 GPT-5.1 API Key
export MIDSCENE_PLANNING_MODEL_BASE_URL="https://..."
export MIDSCENE_PLANNING_MODEL_NAME="gpt-5.1"

# Insight 模型:GPT-5.1
export MIDSCENE_INSIGHT_MODEL_API_KEY="sk-..."     # 你的 GPT-5.1 API Key
export MIDSCENE_INSIGHT_MODEL_BASE_URL="https://..."
export MIDSCENE_INSIGHT_MODEL_NAME="gpt-5.1"

浏览器集成

  • 安装 chrome 扩展
  • 配置模型
  • 操作


Web SDK 方式

  • puppeter
  • playwright
  • chrome bridge model

Android

Midscene 可以驱动 adb 工具来实现 Android 自动化。 由于适配了视觉模型方案,整个自动化过程可以适配任意的 App 技术栈,无论是 Native、Flutter 还是 React Native 构建的 App 或小程序都能使用。开发者只需面向最终效果调试 UI 自动化脚本即可。

  • 支持使用 Playground 进行零代码试用。
  • 支持 JavaScript SDK。
  • 支持使用 YAML 格式的自动化脚本和命令行工具。
  • 支持生成 HTML 报告来回放所有操作路径。

Android Playground

npx --yes @midscene/android-playground


iOS Playground

npx --yes @midscene/ios-playground


自动化工具

web:
  url: https://www.bing.com

tasks:
  - name: 搜索天气
    flow:
      - ai: 搜索 "今日天气"
      - sleep: 3000
      - aiAssert: 结果显示天气信息
npm i -g @midscene/cli
midscene ./bing-search.yaml
midscene ./bing-search.yaml --headed
midscene ./bing-search.yaml --keep-window


# or
npx midscene ./bing-search.yaml

Android 自动化

android:
  # launch: https://www.bing.com
  deviceId: s4ey59 # device id 可以在 adb 命令行中通过 `adb devices` 命令获取

tasks:
  - name: 搜索天气
    flow:
      - ai: 打开浏览器并访问 bing.com
      - ai: 搜索 "今日天气"
      - sleep: 3000
      - aiAssert: 结果显示天气信息

iOS 自动化

ios:
  # launch: com.apple.mobilesafari
  wdaPort: 8100

tasks:
  - name: 搜索天气
    flow:
      - ai: 打开浏览器并访问 bing.com
      - ai: 搜索 "今日天气"
      - sleep: 3000
      - aiAssert: 结果显示天气信息

提示词工程分析

{
  "model": "qwen3-vl",
  "messages": [
    {
      "role": "system",
      "content": "\nTarget: User will give you an instruction, some screenshots and previous logs indicating what have been done. Your task is to plan the next one action according to current situation to accomplish the instruction.\n\nPlease tell what the next one action is (or null if no action should be done) to do the tasks the instruction requires. \n\n## Rules\n\n- Don't give extra actions or plans beyond the instruction. For example, don't try to submit the form if the instruction is only to fill something.\n- Give just the next ONE action you should do\n- Consider the current screenshot and give the action that is most likely to accomplish the instruction. For example, if the next step is to click a button but it's not visible in the screenshot, you should try to find it first instead of give a click action.\n- Make sure the previous actions are completed successfully before performing the next step\n- If there are some error messages reported by the previous actions, don't give up, try parse a new action to recover. If the error persists for more than 5 times, you should think this is an error and set the \"error\" field to the error message.\n- If there is nothing to do but waiting, set the \"sleep\" field to the positive waiting time in milliseconds and null for the \"action\" field.\n- Assertions are also important steps. When getting the assertion instruction, a solid conclusion is required. You should explicitly state your conclusion by calling the \"Print_Assert_Result\" action.\n\n## Supporting actions\n- Tap, Tap the element\n  - type: \"Tap\"\n  - param:\n    - locate: {bbox: [number, number, number, number], prompt: string } // 2d bounding box as [xmin, ymin, xmax, ymax] // The element to be tapped\n- DoubleClick, Double click the element\n  - type: \"DoubleClick\"\n  - param:\n    - locate: {bbox: [number, number, number, number], prompt: string } // 2d bounding box as [xmin, ymin, xmax, ymax] // The element to be double clicked\n- Input, Input text into the input field\n  - type: \"Input\"\n  - param:\n    - value: string // The text to input. Provide the final content for replace/append modes, or an empty string when using clear mode to remove existing text.\n    - autoDismissKeyboard?: boolean // If true, the keyboard will be dismissed after the input is completed. Do not set it unless the user asks you to do so.\n    - mode?: enum('replace', 'clear', 'append') // Input mode: \"replace\" (default) - clear the field and input the value; \"append\" - append the value to existing content; \"clear\" - clear the field without inputting new text.\n    - locate?: {bbox: [number, number, number, number], prompt: string } // 2d bounding box as [xmin, ymin, xmax, ymax] // The input field to be filled\n- Scroll, Scroll the page or an element. The direction to scroll, the scroll type, and the distance to scroll. The distance is the number of pixels to scroll. If not specified, use `down` direction, `once` scroll type, and `null` distance.\n  - type: \"Scroll\"\n  - param:\n    - scrollType?: enum('singleAction', 'scrollToBottom', 'scrollToTop', 'scrollToRight', 'scrollToLeft') // The scroll behavior: \"singleAction\" for a single scroll action, \"scrollToBottom\" for scrolling to the bottom, \"scrollToTop\" for scrolling to the top, \"scrollToRight\" for scrolling to the right, \"scrollToLeft\" for scrolling to the left\n    - direction?: enum('down', 'up', 'right', 'left') // The direction to scroll. Only effective when scrollType is \"singleAction\".\n    - distance?: number // The distance in pixels to scroll\n    - locate?: {bbox: [number, number, number, number], prompt: string } // 2d bounding box as [xmin, ymin, xmax, ymax] // Describe the target element to be scrolled on, like \"the table\" or \"the list\" or \"the content area\" or \"the scrollable area\". Do NOT provide a general intent like \"scroll to find some element\"\n- DragAndDrop, Drag and drop (hold the mouse or finger down and move the mouse) \n  - type: \"DragAndDrop\"\n  - param:\n    - from: {bbox: [number, number, number, number], prompt: string } // 2d bounding box as [xmin, ymin, xmax, ymax] // The position to be dragged\n    - to: {bbox: [number, number, number, number], prompt: string } // 2d bounding box as [xmin, ymin, xmax, ymax] // The position to be dropped\n- KeyboardPress, Press a key or key combination, like \"Enter\", \"Tab\", \"Escape\", or \"Control+A\", \"Shift+Enter\". Do not use this to type text.\n  - type: \"KeyboardPress\"\n  - param:\n    - locate?: {bbox: [number, number, number, number], prompt: string } // 2d bounding box as [xmin, ymin, xmax, ymax] // The element to be clicked before pressing the key\n    - keyName: string // The key to be pressed. Use '+' for key combinations, e.g., 'Control+A', 'Shift+Enter'\n- AndroidLongPress, Trigger a long press on the screen at specified coordinates on Android devices\n  - type: \"AndroidLongPress\"\n  - param:\n    - duration?: number // The duration of the long press in milliseconds\n    - locate: {bbox: [number, number, number, number], prompt: string } // 2d bounding box as [xmin, ymin, xmax, ymax] // The element to be long pressed\n- AndroidPull, Trigger pull down to refresh or pull up actions\n  - type: \"AndroidPull\"\n  - param:\n    - direction: enum('up', 'down') // The direction to pull\n    - distance?: number // The distance to pull (in pixels)\n    - duration?: number // The duration of the pull (in milliseconds)\n    - locate?: {bbox: [number, number, number, number], prompt: string } // 2d bounding box as [xmin, ymin, xmax, ymax] // The element to start the pull from (optional)\n- ClearInput, the position of the placeholder or text content in the target input field. If there is no content, locate the center of the input field.\n  - type: \"ClearInput\"\n  - param:\n    - locate: {bbox: [number, number, number, number], prompt: string } // 2d bounding box as [xmin, ymin, xmax, ymax] // The input field to be cleared\n- RunAdbShell, Execute ADB shell command on Android device\n  - type: \"RunAdbShell\"\n  - param: string // ADB shell command to execute (pass the value directly, not as an object)\n- Launch, Launch an Android app or URL\n  - type: \"Launch\"\n  - param: string // App package name or URL to launch (pass the value directly, not as an object)\n- AndroidBackButton, Trigger the system \"back\" operation on Android devices\n  - type: \"AndroidBackButton\"\n- AndroidHomeButton, Trigger the system \"home\" operation on Android devices\n  - type: \"AndroidHomeButton\"\n- AndroidRecentAppsButton, Trigger the system \"recent apps\" operation on Android devices\n  - type: \"AndroidRecentAppsButton\"\n- Print_Assert_Result, Print the result of the assertion\n  - type: \"Print_Assert_Result\"\n  - param:\n    - condition: string // The condition of the assertion\n    - thought: string // The thought of the assertion, like \"I can see there are A, B, C elements on the page, which means ... , so the assertion is true\"\n    - result: boolean // The result of the assertion, true or false\n\n\n## About the `log` field (preamble message)\n\nThe `log` field is a brief preamble message to the user explaining what you’re about to do. It should follow these principles and examples:\n\n- **Use the same language as the user's instruction**\n- **Keep it concise**: be no more than 1-2 sentences, focused on immediate, tangible next steps. (8–12 words or Chinese characters for quick updates).\n- **Build on prior context**: if this is not the first action to be done, use the preamble message to connect the dots with what’s been done so far and create a sense of momentum and clarity for the user to understand your next actions.\n- **Keep your tone light, friendly and curious**: add small touches of personality in preambles feel collaborative and engaging.\n\n**Examples:**\n- \"Click the login button\"\n- \"Scroll to find the 'Yes' button in popup\"\n- \"Previous actions failed to find the 'Yes' button, i will try again\"\n- \"Go back to find the login button\"\n\n\n## Return format\n\nReturn in JSON format:\n{\n  \"log\": string, // a brief preamble to the user explaining what you’re about to do\n  \"error\"?: string, // Error messages about unexpected situations, if any. Only think it is an error when the situation is not foreseeable according to the instruction. Use the same language as the user's instruction.\n  \"more_actions_needed_by_instruction\": boolean, // Consider if there is still more action(s) to do after the action in \"Log\" is done, according to the instruction. If so, set this field to true. Otherwise, set it to false.\n  \"action\": \n    {\n      \"type\": string, // the type of the action\n      \"param\"?: { // The parameter of the action, if any\n         // k-v style parameter fields\n      }, \n    } | null,\n  ,\n  \"sleep\"?: number, // The sleep time after the action, in milliseconds.\n}\n\nFor example, if the instruction is to login and the form has already been filled, this is a valid return value:\n\n{\n  \"log\": \"Click the login button\",\n  \"more_actions_needed_by_instruction\": false,\n  \"action\": {\n    \"type\": \"Tap\",\n    \"param\": {\n      \"locate\": { \n        \"prompt\": \"The login button\", \"bbox\": [100, 200, 300, 400]\n      }\n    }\n  }\n}\n"
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "<user_instruction>打开浏览器并访问 bing.com</user_instruction>"
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "this is the latest screenshot"
        },
        {
          "type": "text",
          "text": "(image ignored due to size optimization)"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "{\n  \"log\": \"Tap on the Chrome browser icon to open it\",\n  \"more_actions_needed_by_instruction\": true,\n  \"action\": {\n    \"type\": \"Tap\",\n    \"param\": {\n      \"locate\": {\n        \"bbox\": [538, 809, 657, 857],\n        \"prompt\": \"Chrome browser icon\"\n      }\n    }\n  }\n}"
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "I have finished the action previously planned.. The last screenshot is attached. Please going on according to the instruction."
        },
        {
          "type": "text",
          "text": "(image ignored due to size optimization)"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "{\n  \"log\": \"Wait for Chrome browser to fully load before navigating\",\n  \"more_actions_needed_by_instruction\": true,\n  \"action\": null,\n  \"sleep\": 2000\n}"
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "I have finished the action previously planned.. The last screenshot is attached. Please going on according to the instruction."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMUFRUVDA8XGBYUGBIUFRT",
            "detail": "high"
          }
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "{\n  \"log\": \"Tap 'Use without an account' to proceed to Chrome homepage\",\n  \"more_actions_needed_by_instruction\": true,\n  \"action\": {\n    \"type\": \"Tap\",\n    \"param\": {\n      \"locate\": {\n        \"bbox\": [295, 867, 702, 885],\n        \"prompt\": \"Use without an account button\"\n      }\n    }\n  }\n}"
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "I have finished the action previously planned.. The last screenshot is attached. Please going on according to the instruction."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,/9j/2wBDAAMCAgMCAgMDAwMEAwMEBQgFBQQEBQoHBwYIDAoMDAsKCwsNDhIQDQ4RDgsLEBYQERMUFRUVDA8XGBYUGBIUFRT",
            "detail": "high"
          }
        }
      ]
    }
  ],
  "temperature": 0,
  "stream": false
}

MCP 支持

npx -y @midscene/web-bridge-mcp
npx -y @midscene/android-mcp
npx -y @midscene/ios-mcp
{
  "mcpServers": {
    "midscene-web": {
      "command": "npx",
      "args": ["-y", "@midscene/web-bridge-mcp"],
      "env": {
        "MIDSCENE_MODEL_BASE_URL": "替换为你的模型服务地址",
        "MIDSCENE_MODEL_API_KEY": "替换为你的 API Key",
        "MIDSCENE_MODEL_NAME": "替换为你的模型名称",
        "MIDSCENE_MODEL_FAMILY": "替换为你的模型系列",
        "MCP_SERVER_REQUEST_TIMEOUT": "600000"
      }
    }
  }
}

MCP 工具清单

\

npx @modelcontextprotocol/inspector