Skip to content

Browser Use 浏览器自动化 Agent

霍格沃兹测试开发学社

我们给大家推荐一款支持结构化识别的智能体。


Browser Use

以纯文本形式自动执行浏览器任务。 Browser Use 是一个非常好用的 AI 自动化工具, 可以实现用人类语言自动化操作浏览器。

\

这也是学社测试和 review 代码后认为最好的 web agent 之一。

Enable AI to control your browser


Agent 架构

  • 客户端:命令行 代码风格 云服务
  • 智能体:大模型与工具清单
  • 大模型:ollama openai 等
  • 浏览器:playwright cdp
  • 工具:内置工具 自定义工具

uml diagram


安装

uv venv --python 3.12

source .venv/bin/activate

uv pip install browser-use
uvx playwright install chromium --with-deps

快速开始

from browser_use import Agent, ChatOpenAI
from dotenv import load_dotenv
import asyncio

load_dotenv()

async def main():
    llm = ChatOpenAI(model="gpt-4.1-mini")
    task = "打开https://ceshiren.com 进入搜索 进入高级搜索 搜索python 打开第一条搜索结果的链接,返回界面标题,断言标题中有python"
    agent = Agent(task=task, llm=llm)
    await agent.run()

if __name__ == "__main__":
    asyncio.run(main())

这是一份 browser use 框架的使用示例。 它提供了 Agent 类,进行初始化。 第一个参数是你的任务 task, 第二个参数是你使用的大模型。 直接执行即可,用起来还是非常简单的。


可控配置

  • 配置文件 OPENAI_API_KEY
  • 环境变量 OPENAI_BASE_URL
  • 参数 ChatOpenAI(model="gpt-4.1-mini", base_url=...)

\

# .env

OPENAI_API_KEY=...
ANONYMIZED_TELEMETRY=false

使用案例

import asyncio
import os
import sys

import pytest
from browser_use import Agent, Browser
from browser_use.llm.openai.chat import ChatOpenAI
from browser_use.tools.service import Tools


# 某些版本可能会下载额外的工具,偶尔需要代理
# os.environ['https_proxy'] = 'http://127.0.0.1:3129'

async def main(task):
    llm = ChatOpenAI(model="gpt-4.1-mini", base_url=os.getenv('OPENAI_BASE_URL'))
    tools = Tools(exclude_actions=['search'])
    browser = Browser(headless=False)

    agent = Agent(
        task=task,
        llm=llm,
        browser=browser,
        tools=tools,
        use_vision=False,
    )
    result = await agent.run()
    print(result.model_dump_json(indent=2))


@pytest.mark.parametrize(
    'case',
    [
        "打开ceshiren.com 进入搜索 进入高级搜索 搜索python",
        "打开ceshiren.com 进入搜索 进入高级搜索 搜索python",
        "打开ceshiren.com 进入搜索 进入高级搜索 搜索python",
        "打开ceshiren.com 进入搜索 进入高级搜索 搜索python",
        "打开ceshiren.com 进入搜索 进入高级搜索 搜索python",

    ]
)
def test_hogwarts(case):
    asyncio.run(main(case))


if __name__ == '__main__':
    asyncio.run(main(sys.argv[1]))

Browser Use Cli

$ pip install browser-use[cli]

$ browser-use --help

Usage: browser-use [OPTIONS] COMMAND [ARGS]...

  Browser Use - AI Agent for Web Automation

  Run without arguments to start the interactive TUI.

Options:
  --version                 Print version and exit
  --model TEXT              Model to use (e.g., gpt-5-mini, claude-4-sonnet,
                            gemini-2.5-flash)
  --debug                   Enable verbose startup logging
  --headless                Run browser in headless mode
  --window-width INTEGER    Browser window width
  --window-height INTEGER   Browser window height
  --user-data-dir TEXT      Path to Chrome user data directory (e.g.
                            ~/Library/Application Support/Google/Chrome)
  --profile-directory TEXT  Chrome profile directory name (e.g. "Default",
                            "Profile 1")
  --cdp-url TEXT            Connect to existing Chrome via CDP URL (e.g.
                            http://localhost:9222)
  --proxy-url TEXT          Proxy server for Chromium traffic (e.g.
                            http://host:8080 or socks5://host:1080)
  --no-proxy TEXT           Comma-separated hosts to bypass proxy (e.g.
                            localhost,127.0.0.1,*.internal)
  --proxy-username TEXT     Proxy auth username
  --proxy-password TEXT     Proxy auth password
  -p, --prompt TEXT         Run a single task without the TUI (headless mode)
  --mcp                     Run as MCP server (exposes JSON RPC via
                            stdin/stdout)
  --help                    Show this message and exit.

Commands:
  auth  Authenticate with Browser Use Cloud to sync your runs

browser-use --model gpt-4.1-mini -p '打开ceshiren.com  进入搜索 进入高级搜索 搜索ai 测试开发'

Browser Use Web-UI

除了比较成熟的框架外,官方提供了一个比较简单的 UI 界面,可以辅助操作,适合新人入手。可以通过 UI 界面配置 Agent 与大模型。不过这个项目可用度和定制性并不高,仅供参考。

这是 browser use webui 的基本界面。 你可以通过这个界面配置浏览器的配置,配置大模型,并执行任务。 也可以查看执行结果与每次结果的录制数据。


源代码安装

# Clone the repository
git clone https://github.com/browser-use/web-ui.git
cd web-ui

# Copy and configure environment variables
cp .env.example .env
# Edit .env with your preferred text editor and add your API keys

python webui.py --ip 127.0.0.1 --port 7788

这是使用源代码启动的方式,git clone 项目,进入目录后 copy 对应的配置文件,然后直接启动。


docker compose 方式启动

# Clone the repository
git clone https://github.com/browser-use/web-ui.git
cd web-ui

# Copy and configure environment variables
cp .env.example .env
# Edit .env with your preferred text editor and add your API keys

## docker方式启动
# Build and start the container with default settings (browser closes after AI tasks)
docker compose up --build

# Or run with persistent browser (browser stays open between AI tasks)
CHROME_PERSISTENT_SESSION=true docker compose up --build

这是使用 docker 启动的方式,在项目的根目录下有对应的 docker compose 的配置文件。 使用 docker compose up 启动即可


Run Agent

在运行界面可以输入自己的任务并执行,执行后还可以在结果里查看运行记录。底层使用的是 gradio 框架实现的。感兴趣的同学可以自行探索。


hogwarts-browser-use

  • 增加命令行启动支持
  • 去掉 google 搜索
  • 支持命令行参数配置大模型

因为 browser use 是一个代码框架,没有提供一些便捷的工具封装, 再加上 google 搜索的问题,导致用起来会比较麻烦。 为了让霍格沃兹测试开发学社的小伙伴们更方便的使用。 我们做了一个封装版,可以支持纯命令行调用,从而让大家可以轻松的使用。 它还支持通过命令行参数进行大模型的配置。 相关代码可以从学员论坛节点里找到。


命令行用法

# 依赖python 3.11以上版本
hogwarts-browser-use 打开ceshiren.com 进入搜索 点击高级搜索 搜索python
hogwarts-browser-use -m gpt-4o-mini 打开ceshiren.com 进入搜索 点击高级搜索 搜索python
hogwarts-browser-use -m mistral 打开ceshiren.com 进入搜索 点击高级搜索 搜索python
hogwarts-browser-use -m qwen2.5 打开ceshiren.com 进入搜索 点击高级搜索 搜索python

这是这个工具的基本用法,详情可参考官网文档。


Agent


Agent 是核心 Api 入口

from browser_use import Agent, ChatOpenAI

agent = Agent(
    task="Search for latest news about AI",
    llm=ChatOpenAI(model="gpt-4.1-mini"),
)

async def main():
    history = await agent.run(max_steps=100)

参数配置

class Agent(Generic[Context, AgentStructuredOutput]):
    @time_execution_sync('--init')
    def __init__(
        self,
        task: str,
        llm: BaseChatModel | None = None,
        # Optional parameters
        browser_profile: BrowserProfile | None = None,
        browser_session: BrowserSession | None = None,
        browser: Browser | None = None,  # Alias for browser_session
        tools: Tools[Context] | None = None,
        controller: Tools[Context] | None = None,  # Alias for tools
        # Initial agent run parameters
        sensitive_data: dict[str, str | dict[str, str]] | None = None,
        initial_actions: list[dict[str, dict[str, Any]]] | None = None,
        # Cloud Callbacks
        register_new_step_callback: (
            Callable[['BrowserStateSummary', 'AgentOutput', int], None]  # Sync callback
            | Callable[['BrowserStateSummary', 'AgentOutput', int], Awaitable[None]]  # Async callback
            | None
        ) = None,
        register_done_callback: (
            Callable[['AgentHistoryList'], Awaitable[None]]  # Async Callback
            | Callable[['AgentHistoryList'], None]  # Sync Callback
            | None
        ) = None,
        register_external_agent_status_raise_error_callback: Callable[[], Awaitable[bool]] | None = None,
        register_should_stop_callback: Callable[[], Awaitable[bool]] | None = None,
        # Agent settings
        output_model_schema: type[AgentStructuredOutput] | None = None,
        use_vision: bool | Literal['auto'] = 'auto',
        save_conversation_path: str | Path | None = None,
        save_conversation_path_encoding: str | None = 'utf-8',
        max_failures: int = 3,
        override_system_message: str | None = None,
        extend_system_message: str | None = None,
        generate_gif: bool | str = False,
        available_file_paths: list[str] | None = None,
        include_attributes: list[str] | None = None,
        max_actions_per_step: int = 10,
        use_thinking: bool = True,
        flash_mode: bool = False,
        max_history_items: int | None = None,
        page_extraction_llm: BaseChatModel | None = None,
        injected_agent_state: AgentState | None = None,
        source: str | None = None,
        file_system_path: str | None = None,
        task_id: str | None = None,
        calculate_cost: bool = False,
        display_files_in_done_text: bool = True,
        include_tool_call_examples: bool = False,
        vision_detail_level: Literal['auto', 'low', 'high'] = 'auto',
        llm_timeout: int | None = None,
        step_timeout: int = 120,
        directly_open_url: bool = True,
        include_recent_events: bool = False,
        sample_images: list[ContentPartTextParam | ContentPartImageParam] | None = None,
        final_response_after_failure: bool = True,
        _url_shortening_limit: int = 25,
        **kwargs,
    ): ...

支持模型

llm = ChatOpenAI(
    model="o3",
)

llm = ChatOllama(model="llama3.1:8b")


api_key = os.getenv('MODELSCOPE_API_KEY')
base_url = 'https://api-inference.modelscope.cn/v1/'

llm = ChatOpenAI(model='Qwen/Qwen2.5-VL-72B-Instruct', api_key=api_key, base_url=base_url)

与 LangChain 集成

from langchain_openai import ChatOpenAI

from browser_use import Agent
from .chat import ChatLangchain


async def main():
    """Basic example using ChatLangchain with OpenAI through LangChain."""

    # Create a LangChain model (OpenAI)
    langchain_model = ChatOpenAI(
        model='gpt-4.1-mini',
        temperature=0.1,
    )

    # Wrap it with ChatLangchain to make it compatible with browser-use
    llm = ChatLangchain(chat=langchain_model)


agent = Agent(
    task="Go to google.com and search for 'browser automation with Python'",
    llm=llm,
)

history = await agent.run()

print(history.history)

Browser


浏览器应用

from browser_use import Agent, Browser, ChatOpenAI

browser = Browser(
    headless=False,  # Show browser window
    window_size={'width': 1000, 'height': 700},  # Set window size
)

agent = Agent(
    task='Search for Browser Use',
    browser=browser,
    llm=ChatOpenAI(model='gpt-4.1-mini'),
)


async def main():
    await agent.run()

浏览器配置参数

def __init__(
        self,
        # Core configuration
        id: str | None = None,
        cdp_url: str | None = None,
        is_local: bool = False,
        browser_profile: BrowserProfile | None = None,
        # BrowserProfile fields that can be passed directly
        # From BrowserConnectArgs
        headers: dict[str, str] | None = None,
        # From BrowserLaunchArgs
        env: dict[str, str | float | bool] | None = None,
        executable_path: str | Path | None = None,
        headless: bool | None = None,
        args: list[str] | None = None,
        ignore_default_args: list[str] | Literal[True] | None = None,
        channel: str | None = None,
        chromium_sandbox: bool | None = None,
        devtools: bool | None = None,
        downloads_path: str | Path | None = None,
        traces_dir: str | Path | None = None,
        # From BrowserContextArgs
        accept_downloads: bool | None = None,
        permissions: list[str] | None = None,
        user_agent: str | None = None,
        screen: dict | None = None,
        viewport: dict | None = None,
        no_viewport: bool | None = None,
        device_scale_factor: float | None = None,
        record_har_content: str | None = None,
        record_har_mode: str | None = None,
        record_har_path: str | Path | None = None,
        record_video_dir: str | Path | None = None,
        record_video_framerate: int | None = None,
        record_video_size: dict | None = None,
        # From BrowserLaunchPersistentContextArgs
        user_data_dir: str | Path | None = None,
        # From BrowserNewContextArgs
        storage_state: str | Path | dict[str, Any] | None = None,
        # BrowserProfile specific fields
        use_cloud: bool | None = None,
        cloud_browser: bool | None = None,  # Backward compatibility alias
        disable_security: bool | None = None,
        deterministic_rendering: bool | None = None,
        allowed_domains: list[str] | None = None,
        keep_alive: bool | None = None,
        proxy: ProxySettings | None = None,
        enable_default_extensions: bool | None = None,
        window_size: dict | None = None,
        window_position: dict | None = None,
        minimum_wait_page_load_time: float | None = None,
        wait_for_network_idle_page_load_time: float | None = None,
        wait_between_actions: float | None = None,
        filter_highlight_ids: bool | None = None,
        auto_download_pdfs: bool | None = None,
        profile_directory: str | None = None,
        cookie_whitelist_domains: list[str] | None = None,
        # DOM extraction layer configuration
        cross_origin_iframes: bool | None = None,
        highlight_elements: bool | None = None,
        dom_highlight_elements: bool | None = None,
        paint_order_filtering: bool | None = None,
        # Iframe processing limits
        max_iframes: int | None = None,
        max_iframe_depth: int | None = None,
    ): ...

Tools


工具调用体系

  • function calling
  • tool call
  • tool call result
  • final answer

openai tool call


工具自定义

from browser_use import Tools, ActionResult, Browser

tools = Tools()

@tools.action('Ask human for help with a question')
def ask_human(question: str, browser: Browser) -> ActionResult:
    answer = input(f'{question} > ')
    return f'The human responded with: {answer}'

agent = Agent(
    task='Ask human for help',
    llm=llm,
    tools=tools,
)

工具响应

@tools.action('My tool')
def my_tool() -> str:
    return "Task completed successfully"

@tools.action('Advanced tool')
def advanced_tool() -> ActionResult:
    return ActionResult(
        extracted_content="Main result",
        long_term_memory="Remember this info",
        error="Something went wrong",
        is_done=True,
        success=True,
        attachments=["file.pdf"],
    )

基于 CDP 的浏览器自动化框架 Actor


Actor 架构

因为 Playwright 的稳定性和性能问题原因,Browser Use 开发了一个新的自动化框架。 基于 CDP 协议,具有直接和完整的 CDP 控制和精确的元素交互。

graph TD
    A[Browser] --> B[Page]
    B --> C[Element]
    B --> D[Mouse]
    B --> E[AI Features]
    C --> F[DOM Interactions]
    D --> G[Coordinate Operations]
    E --> H[LLM Integration]

{.bg-white}


基本自动化

from browser_use import Browser

browser = Browser()
await browser.start()

# Create pages
page = await browser.new_page()  # Blank tab
page = await browser.new_page("https://example.com")  # With URL

# Get all pages
pages = await browser.get_pages()
current = await browser.get_current_page()

# Close page
await browser.close_page(page)
await browser.stop()

元素操作

page = await browser.new_page('https://github.com')

# CSS selectors (immediate return)
elements = await page.get_elements_by_css_selector("input[type='text']")
buttons = await page.get_elements_by_css_selector("button.submit")

# Element actions
await elements[0].click()
await elements[0].fill("Hello World")
await elements[0].hover()

# Page actions
await page.press("Enter")
screenshot = await page.screenshot()

与 LLM 结合

from browser_use.llm.openai import ChatOpenAI
from pydantic import BaseModel

llm = ChatOpenAI(api_key="your-api-key")

# Find elements using natural language
button = await page.get_element_by_prompt("login button", llm=llm)
await button.click()

# Extract structured data
class ProductInfo(BaseModel):
    name: str
    price: float

product = await page.extract_content(
    "Extract product name and price",
    ProductInfo,
    llm=llm
)