Alibaba Launches Page Agent for In-Page Web Control
Alibaba has introduced Page Agent, an open-source JavaScript library that enables natural language control of web interfaces from within the webpage itself. Traditional browser automation tools such as Playwright, Puppeteer, and Selenium drive the browser from an external process, typically through WebDriver or the Chrome DevTools Protocol, and agents built on them often rely on screenshots. Page Agent instead runs as plain JavaScript embedded in the page, which lets it read the live Document Object Model (DOM) as text and simulate user actions from the inside.
The core innovation of Page Agent is a technique called "DOM dehydration." This process compresses the webpage's DOM into a FlatDomTree, enabling smaller, more efficient text-based language models to precisely interpret and act upon the page's structure. The agent is model-agnostic, meaning it can connect to any large language model (LLM) via an OpenAI-compatible endpoint. Only text is transmitted to the LLM, making a robust text model sufficient for its operation. The project is licensed under the MIT license and is TypeScript-first, building upon the browser-use framework for its DOM processing capabilities.
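To make the idea concrete, here is a minimal sketch of flattening a page into indexed text and wrapping it in an OpenAI-style chat request. It is not Page Agent's actual FlatDomTree format or API: the `SimpleNode` shape, the `dehydrate`, `toPromptText`, and `buildChatRequest` names, the list of interactive tags, and the prompt wording are all assumptions made for illustration. The sketch uses a plain object tree rather than the real DOM so it runs outside a browser.

```typescript
// Hypothetical, simplified node shape standing in for real DOM elements.
interface SimpleNode {
  tag: string;
  text?: string;
  attrs?: Record<string, string>;
  children?: SimpleNode[];
}

// One line of the flattened page: an index the model can refer back to.
interface FlatEntry {
  index: number;
  tag: string;
  label: string;
}

// Assumed set of tags treated as interactive in this sketch.
const INTERACTIVE_TAGS = new Set(["a", "button", "input", "select", "textarea"]);

// Gather the visible text of a node and its descendants, whitespace-normalized.
function collectText(node: SimpleNode): string {
  const own = node.text ?? "";
  const nested = (node.children ?? []).map(collectText).join(" ");
  return `${own} ${nested}`.replace(/\s+/g, " ").trim();
}

// Walk the tree depth-first and keep only interactive elements.
function dehydrate(root: SimpleNode): FlatEntry[] {
  const entries: FlatEntry[] = [];
  const visit = (node: SimpleNode): void => {
    if (INTERACTIVE_TAGS.has(node.tag)) {
      const label =
        collectText(node) || node.attrs?.["placeholder"] || node.attrs?.["name"] || "";
      entries.push({ index: entries.length, tag: node.tag, label });
      return; // the element's children are already folded into its label
    }
    (node.children ?? []).forEach(visit);
  };
  visit(root);
  return entries;
}

// Render the flat list as compact text for a text-only model.
function toPromptText(entries: FlatEntry[]): string {
  return entries.map((e) => `[${e.index}] <${e.tag}> ${e.label}`).join("\n");
}

// Build (but do not send) a chat-completions request for an OpenAI-compatible endpoint.
function buildChatRequest(baseUrl: string, model: string, pageText: string, command: string) {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    body: {
      model,
      messages: [
        { role: "system", content: "You control a web page. Reply with one action and an element index." },
        { role: "user", content: `Page elements:\n${pageText}\n\nCommand: ${command}` },
      ],
    },
  };
}
```

The point of the sketch is the size reduction: headings, wrappers, and styling never reach the model, only a short indexed list of things it can act on.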
Page Agent functions as a client-side library that developers can integrate into their web applications. Once embedded, users can issue commands in natural language, and the agent will identify and interact with elements like buttons and form fields directly within the page. A significant advantage is that it inherits the user's existing cookies, session data, and authentication, eliminating the need for a separate backend infrastructure. Furthermore, it respects the web application's existing UI validation and security rules.
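The interaction step can be pictured roughly as follows. This is a sketch under stated assumptions, not Page Agent's real interface: the `AgentAction` shape, the `Actionable` interface, and the `executeAction` function are hypothetical names, and the registry of indexed elements is assumed to have been built when the page was flattened. In a browser, real `HTMLElement` objects would satisfy `Actionable`; plain objects are enough to run the sketch elsewhere.

```typescript
// Minimal surface this sketch needs from an element (hypothetical interface).
interface Actionable {
  click(): void;
  value?: string; // present on text-like fields only
}

// Hypothetical action format a model might return for a command.
type AgentAction =
  | { type: "click"; index: number }
  | { type: "input"; index: number; text: string };

// Look up the element by index and perform the action from inside the page.
// Returns a short status string that could be fed back to the model.
function executeAction(registry: Map<number, Actionable>, action: AgentAction): string {
  const el = registry.get(action.index);
  if (!el) {
    return `error: no element at index ${action.index}`;
  }
  if (action.type === "click") {
    el.click();
    return `clicked [${action.index}]`;
  }
  if (el.value === undefined) {
    return `error: element [${action.index}] is not a text field`;
  }
  // In a real page you would also dispatch an "input" event here so that
  // framework listeners notice the change; omitted to keep the sketch small.
  el.value = action.text;
  return `typed into [${action.index}]`;
}
```

Because the action goes through the element itself, the page's own handlers and validation run as they would for a human user, which is the behavior the article describes.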
While Page Agent offers a novel approach to in-page automation, the developers note certain limitations. They identify two constraints: safety is enforced only at the prompt level, and the agent's scope is confined to single-page applications. Server-side validation therefore remains recommended for sensitive actions. The tool is best suited to copilots and form-filling features in applications that developers own, rather than to interacting with external or restricted websites.
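As an illustration of that recommendation, the sketch below shows a server-side check that does not trust anything the in-page agent or the UI has already verified. Everything here is hypothetical: the `TransferRequest` payload, its field names, the account format, and the amount limit are invented for the example and have nothing to do with Page Agent itself.

```typescript
// Hypothetical payload for a sensitive action; field names are illustrative only.
interface TransferRequest {
  toAccount: string;
  amountCents: number;
}

type ValidationResult =
  | { ok: true; value: TransferRequest }
  | { ok: false; error: string };

// Assumed per-request limit for this example.
const MAX_AMOUNT_CENTS = 100000;

// Re-validate the request body on the server, regardless of what the page did.
function validateTransfer(input: unknown): ValidationResult {
  if (typeof input !== "object" || input === null) {
    return { ok: false, error: "body must be an object" };
  }
  const { toAccount, amountCents } = input as Record<string, unknown>;
  if (typeof toAccount !== "string" || !/^[A-Z0-9]{8,20}$/.test(toAccount)) {
    return { ok: false, error: "invalid account" };
  }
  if (
    typeof amountCents !== "number" ||
    !Number.isInteger(amountCents) ||
    amountCents <= 0 ||
    amountCents > MAX_AMOUNT_CENTS
  ) {
    return { ok: false, error: "invalid amount" };
  }
  return { ok: true, value: { toAccount, amountCents } };
}
```

A prompt can be manipulated and client-side checks can be bypassed, so a check of this kind sits behind the endpoint and applies equally to requests triggered by a person or by the agent.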
Original source — read the full reporting at the publisher:
Read on MarkTechPost