Then came mobile phones with their small screens and touch control which forced the web to adapt: responsive design.
Now it’s the turn of agents that need to see and interact with websites.
Sure, you could keep feeding them HTML/JS and have them write logic to interact with the page, just like you can open a website in desktop mode on a phone and still navigate it: but it's clunky.
Don’t get hung up on the name “MCP”, which has been debased: this is much bigger than that.
For example, web accessibility has potential as a starting point for making actions automatable, with the advantage that the automatable things are visible to humans, so are less likely to drift / break over time.
Any work happening in that space?
Sites are now expected to duplicate effort by manually defining schemas for the same actions — like re-describing a button's purpose in JSON when it's already semantically marked up?
I think that the github repo's README may be more useful: https://github.com/webmachinelearning/webmcp?tab=readme-ov-f...
Also, the prior implementations may be useful to look at: https://github.com/MiguelsPizza/WebMCP and https://github.com/jasonjmcghee/WebMCP
But no MCP server today has tools that appear on page load, change with every SPA route, and die when you close the tab. Client support for this would have to be tightly coupled to whatever is controlling the browser.
What they really built is a browser-native tool API borrowing MCP's shape. If calling it "MCP" is what gets web developers to start exposing structured tools for agents, I'll take it.
I tried to play along at home some, playing with the Rust accesskit crate. But man, I just could not get Orca or other basic tools to run, could not find a starting point. Highly discouraging. I thought for sure my browser would expose accessibility trees I could just look at & tweak! But I don't even know if that's true or not yet! Very sad personal experience with this.
1. This is a contextual API built into each page. Historically sites could offer an API, but that API was a parallel experience, a separate machine-to-machine channel that doesn't augment or extend the actual user session. The MCP API offered here is offered by the page (not the server/site), in a fully dynamic manner (what's offered can reflect the current state of the page), and it layers atop the user session. That's totally different.
2. This opens an expectation that sites have a standard means of control available. This has two subparts:
2a. There are dozens of different API systems to pick from to expose your site. GitHub got halfway from REST to GraphQL, then turned back. Some sites use ttrpc or capnweb or gproto. There hasn't actually been one accepted way for machines to talk to your site; there's been a fractal maze of offerings on the web. This is one consistent offering that mirrors what everyone is already using now anyway.
2b. Offering APIs for your site has gone out of favor in general, and where an API is available it often comes with high walls and barriers. But the people putting their fingers in that leaky dam are now patently, clearly Not Going To Make It: the LLMs will script and control the browser if they have to, and it's much, much less pain to just lean in to what users want to do and expose a good WebMCP API that your users can enjoy, to be effective and get shit done, like they have wanted to do all along. If WebMCP takes off at all, it will reset expectations: the internet is for end users, and their agency, their ability to work your site as they please via their preferred modalities, is king. WebMCP points us toward an RFC 8890-compliant future by directly enabling user agency on sites. https://datatracker.ietf.org/doc/html/rfc8890
The browser has tons of functionality baked in, everything from web workers to persistence.
This would also allow for interesting ways of authenticating/manipulating data from existing sites. Say I'm logged into image-website-x. I can then use WebMCP to let agents interact with the images I've stored there. WebMCP becomes a much more intuitive interface for agents than interpreting DOM elements.
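A minimal sketch of what that might look like with the proposed API; the endpoint and tool name are made up, and the point is simply that the tool rides on the page's existing signed-in session rather than a separate machine credential:

```js
navigator.modelContext.registerTool({
  name: "list-my-images",
  description: "List the images the signed-in user has stored on this site.",
  inputSchema: { type: "object", properties: {} },
  annotations: { readOnlyHint: true },
  async execute() {
    // Reuses the existing logged-in session; no separate API auth handshake.
    const res = await fetch("/api/my/images", { credentials: "same-origin" });
    return await res.json();
  }
});
```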
I'm not sure this is really all that much better than, say, a Swagger API. The JS interface is double-edged in that it has access to your cookies and such.
The next step would be to also decouple the visual part of a website from the data/interactions: let users tell their in-browser agent how to render, or even offer different views on the same data. (And possibly also what to render: your LLM could work as an in-website ad blocker, for example, similar to browser extensions such as a LinkedIn/Facebook feed blocker.)
This is true excitement. I am not being ironic.
HN Thread Link: https://news.ycombinator.com/item?id=47037501
Quick summary of my reply:
- Your 70+ MCP tools show exactly what WebMCP aims to solve
- Key insight: MCP for APIs vs MCP for consumer apps are different
- WebMCP makes sense for complex sites (Amazon, Booking.com)
- The "drift problem" is real - WebMCP should be source of truth
- Suggested embed pattern for in-page tools
People should be mindful of using magic that offers no protection for their data, only to discover that too late.
That's not a gap in the technology, it's just early.
I really like the way you can expose your schema through adding fields to a web form, that feels like a really nice extension and a great way to piggyback on your existing logic.
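One way to do that piggybacking, sketched below under my own assumptions (a hypothetical booking form; not necessarily the exact mechanism the explainer uses): derive the tool's input schema from the form's existing fields and have the tool submit through the form, so agent calls go through the same validation and submit logic as human users.

```js
// Hedged sketch: expose an existing form as a WebMCP tool.
// "#booking-form" and the schema mapping are illustrative assumptions.
const form = document.querySelector("#booking-form");

navigator.modelContext.registerTool({
  name: "submit-booking",
  description: "Fill in and submit the booking form on this page.",
  inputSchema: {
    type: "object",
    properties: Object.fromEntries(
      [...form.elements]
        .filter((el) => el.name)
        .map((el) => [el.name, {
          type: "string",
          description: el.labels?.[0]?.textContent?.trim() ?? el.name
        }])
    )
  },
  async execute(input) {
    for (const [name, value] of Object.entries(input)) {
      const field = form.elements.namedItem(name);
      if (field) field.value = value;
    }
    form.requestSubmit(); // runs the page's normal validation and submit path
    return { status: "submitted" };
  }
});
```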
To me this seems much more promising than either needing an MCP server or the MCP Apps proposal.
Instead of letting the agent call a server (MCP), the agent downloads javascript and executes it itself (WebMCP).
This is what permissions are for.
That, or they expect that MCP clients should also be running a headless Chrome to detect JS-only MCP endpoints.
Think of it like "IDE actions". Done right, there's no need to ever use the GUI.
As opposed to just being documentation for how to use the IDE with desktop automation software.
WebMCP API is a new JavaScript interface that allows web developers to expose their web application functionality as “tools” - JavaScript functions with natural language descriptions and structured schemas that can be invoked by agents, browser’s agents, and assistive technologies. Web pages that use WebMCP can be thought of as Model Context Protocol [MCP] servers that implement tools in client-side script instead of on the backend. WebMCP enables collaborative workflows where users and agents work together within the same web interface, leveraging existing application logic while maintaining shared context and user control.
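For example (a hedged sketch; the tool name and page function are made up), a page might wrap an existing piece of application logic as a read-only tool:

```js
navigator.modelContext.registerTool({
  name: "get-cart-total",
  description: "Return the current total of the user's shopping cart.",
  inputSchema: { type: "object", properties: {} },
  annotations: { readOnlyHint: true },
  async execute() {
    return { total: computeCartTotal() }; // hypothetical existing page logic
  }
});
```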
An agent is an autonomous assistant that can understand a user’s goals and take actions on the user’s behalf to achieve them. Today, these are typically implemented by large language model (LLM) based AI platforms, interacting with users via text-based chat interfaces.
A browser’s agent is an agent provided by or through the browser that could be built directly into the browser or hosted by it, for example, via an extension or plug-in.
An AI platform is a provider of agentic assistants such as OpenAI’s ChatGPT, Anthropic’s Claude, or Google’s Gemini.
[Navigator](https://html.spec.whatwg.org/multipage/system-state.html#navigator) Interface
The [Navigator](https://html.spec.whatwg.org/multipage/system-state.html#navigator) interface is extended to provide access to the [ModelContext](#modelcontext).
partial interface Navigator {
[SecureContext, SameObject] readonly attribute ModelContext modelContext;
};
The [ModelContext](#modelcontext) interface provides methods for web applications to register and manage tools that can be invoked by agents.
[Exposed=Window, SecureContext]
interface ModelContext {
undefined provideContext(optional ModelContextOptions options = {});
undefined clearContext();
undefined registerTool(ModelContextTool tool);
undefined unregisterTool(DOMString name);
};
navigator.`[modelContext](#dom-navigator-modelcontext)`.`[provideContext(options)](#dom-modelcontext-providecontext)`
Registers the provided context (tools) with the browser. This method clears any pre-existing tools and other context before registering the new ones.
navigator.`[modelContext](#dom-navigator-modelcontext)`.`[clearContext()](#dom-modelcontext-clearcontext)`
Unregisters all context (tools) with the browser.
navigator.`[modelContext](#dom-navigator-modelcontext)`.`[registerTool(tool)](#dom-modelcontext-registertool)`
Registers a single tool without clearing the existing set of tools. The method throws an error if a tool with the same name already exists, or if the [inputSchema](#dom-modelcontexttool-inputschema) is invalid.
navigator.`[modelContext](#dom-navigator-modelcontext)`.`[unregisterTool(name)](#dom-modelcontext-unregistertool)`
Removes the tool with the specified name from the registered set.
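Taken together, a page might use these methods as in the following sketch (the tool objects and names are illustrative, not part of the spec):

```js
// addToCartTool, checkoutTool, trackOrderTool are ModelContextTool
// dictionaries defined elsewhere on the page (elided here).

// Replace any previously provided context with a fresh set of tools.
navigator.modelContext.provideContext({
  tools: [addToCartTool, checkoutTool]
});

// Add one more tool without clearing the set; throws if "track-order"
// is already registered or its inputSchema is invalid.
navigator.modelContext.registerTool(trackOrderTool);

// Remove a single tool, e.g. when the related UI leaves the screen.
navigator.modelContext.unregisterTool("track-order");

// Drop all tools, e.g. on an SPA route change before re-registering.
navigator.modelContext.clearContext();
```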
The provideContext(options) method steps are:
The clearContext() method steps are:
The registerTool(tool) method steps are:
The unregisterTool(name) method steps are:
dictionary ModelContextOptions {
sequence<ModelContextTool> tools = [];
};
options["`[tools](#dom-modelcontextoptions-tools)`"]
A list of [tools](#dom-modelcontextoptions-tools) to register with the browser. Each tool name in the list is expected to be unique.
The [ModelContextTool](#dictdef-modelcontexttool) dictionary describes a tool that can be invoked by agents.
dictionary ModelContextTool {
required DOMString name;
required DOMString description;
object inputSchema;
required ToolExecuteCallback execute;
ToolAnnotations annotations;
};
dictionary ToolAnnotations {
boolean readOnlyHint;
};
callback ToolExecuteCallback = Promise<any> (object input, ModelContextClient client);
tool["`[name](#dom-modelcontexttool-name)`"]
A unique identifier for the tool. This is used by agents to reference the tool when making tool calls.
tool["`[description](#dom-modelcontexttool-description)`"]
A natural language description of the tool’s functionality. This helps agents understand when and how to use the tool.
tool["`[inputSchema](#dom-modelcontexttool-inputschema)`"]
A JSON Schema [JSON-SCHEMA] object describing the expected input parameters for the tool.
tool["`[execute](#dom-modelcontexttool-execute)`"]
A callback function that is invoked when an agent calls the tool. The function receives the input parameters and a [ModelContextClient](#modelcontextclient) object.
The function can be asynchronous and return a promise, in which case the agent will receive the result once the promise is resolved.
tool["`[annotations](#dom-modelcontexttool-annotations)`"]
Optional annotations providing additional metadata about the tool’s behavior.
The [ToolAnnotations](#dictdef-toolannotations) dictionary provides optional metadata about a tool:
annotations["`[readOnlyHint](#dom-toolannotations-readonlyhint)`"]
If true, indicates that the tool does not modify any state and only reads data. This hint can help agents make decisions about when it is safe to call the tool.
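Putting the dictionary members together, a state-changing tool might look like the sketch below (the "add-to-cart" name, schema fields, and cart object are assumptions for illustration):

```js
navigator.modelContext.registerTool({
  name: "add-to-cart",
  description: "Add a product to the user's shopping cart.",
  inputSchema: {
    type: "object",
    properties: {
      productId: { type: "string", description: "Catalog ID of the product" },
      quantity: { type: "integer", minimum: 1, default: 1 }
    },
    required: ["productId"]
  },
  // No readOnlyHint: the tool mutates state, so agents should treat it as such.
  async execute(input) {
    await cart.add(input.productId, input.quantity ?? 1); // hypothetical page logic
    return { itemCount: cart.size() };
  }
});
```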
The [ModelContextClient](#modelcontextclient) interface represents an agent executing a tool provided by the site through the [ModelContext](#modelcontext) API.
[Exposed=Window, SecureContext]
interface ModelContextClient {
Promise<any> requestUserInteraction(UserInteractionCallback callback);
};
callback UserInteractionCallback = Promise<any> ();
client.`[requestUserInteraction(callback)](#dom-modelcontextclient-requestuserinteraction)`
Asynchronously requests user input during the execution of a tool.
The callback function is invoked to perform the user interaction (e.g., showing a confirmation dialog), and the promise resolves with the result of the callback.
The requestUserInteraction(callback) method steps are:
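A sketch of how a tool might use this, assuming a hypothetical page-defined confirmation dialog and delete function:

```js
navigator.modelContext.registerTool({
  name: "delete-draft",
  description: "Delete the currently open draft after asking the user to confirm.",
  inputSchema: { type: "object", properties: {} },
  async execute(input, client) {
    // Hand control back to the user mid-call for an explicit confirmation.
    const confirmed = await client.requestUserInteraction(async () => {
      return showConfirmDialog("Delete this draft?"); // hypothetical dialog, resolves to true/false
    });
    if (!confirmed) return { status: "cancelled" };
    await deleteCurrentDraft(); // hypothetical existing page logic
    return { status: "deleted" };
  }
});
```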
Thanks to Brandon Walderman, Leo Lee, Andrew Nolan, David Bokan, Khushal Sagar, Hannah Van Opstal, Sushanth Rajasankar for the initial explainer, proposals and discussions that established the foundation for this specification.
Also many thanks to Alex Nahas and Jason McGhee for sharing early implementation experience.
Finally, thanks to the participants of the Web Machine Learning Community Group for feedback and suggestions.
Instead of parsing or screenshotting the current page to understand the context, an AI agent running in the browser can query the page's tools to extract data or execute actions without dealing with API authentication.
It's a pragmatic solution. An AI agent can, in theory, use the accessibility DOM (or some HTML data annotation) to improve its access to the page; however, that doesn't give it straightforward information about the actions it can take on the current page.
I see two major roadblocks with this idea:
1. Security: Who has access to these MCPs? This makes it easier for browser plugins to act on your behalf, but end users often don't understand the scope of granting plugins access to their pages.
2. Incentive: Exposing these tools makes accessing website data extremely easy for AI agents. While that's great for end users, many businesses will be reluctant to spend time implementing it (that's the same reason social networks and media websites killed RSS... more flexibility for end users, but not aligned with their business incentives)
But I’d happily add a little mcp server to it in js, if that means someone else can point their LLM at it and be taught how to play sudoku.
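Something like the sketch below might be all it takes (the tool names and the `game` object standing in for the existing sudoku logic are hypothetical):

```js
navigator.modelContext.provideContext({
  tools: [
    {
      name: "get-board",
      description: "Return the current sudoku board as a 9x9 grid; 0 means an empty cell.",
      inputSchema: { type: "object", properties: {} },
      annotations: { readOnlyHint: true },
      async execute() { return { board: game.getBoard() }; }
    },
    {
      name: "place-number",
      description: "Place a number (1-9) at the given row and column (0-8).",
      inputSchema: {
        type: "object",
        properties: {
          row: { type: "integer", minimum: 0, maximum: 8 },
          col: { type: "integer", minimum: 0, maximum: 8 },
          value: { type: "integer", minimum: 1, maximum: 9 }
        },
        required: ["row", "col", "value"]
      },
      async execute({ row, col, value }) { return { ok: game.place(row, col, value) }; }
    }
  ]
});
```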
Every generation needs its own acronyms and specifications. If a new one looks like an old one likely the old one was ahead of its time.
It's great they are working on standardizing this so websites don't have to integrate with LLMs. The real opportunity seems to be automatically generating the tool calls / MCP schema by inspecting the website offline - I automated this using Playwright MCP.
I do like agent skills, but I’m really not convinced by the hype that they make MCP redundant.
> Integrating agents into it prevents fragmentation of their service and allows them to keep ownership of their interface, branding and connection with their users
Looking at the contrived examples given, I just don't see how they're achieving this. In fact, it looks like creating MCP-specific tools will achieve exactly the opposite. There will immediately be two ways to accomplish a thing, and this will result in drift over time as developers need to take into account two ways of interacting with a component on screen. There should be no difference, but there will be.
Having the LLM interpret and understand a page context would be much more in line with assistive technologies. It would require site owners to provide a more useful interface for people in need of assistance.
In an ideal world html documents should be very simple and everything visual should be done via css, with JavaScript being completely optional.
In such a world agents wouldn’t really need a dedicated protocol (and websites would be much faster to load and render, besides being much lighter on cpu and battery)
I wanted to make FOSS codegen that was not locked behind paywalls + had wasm plugins to extend it.
The problem is fundamentally that it's difficult to create structured data that's easily presentable to both humans and machines. Consider: ARIA doesn't really help LLMs. What you're suggesting is much more in line with microformats and schema.org, both of which were essentially complete failures.
LLMs can already read web pages, just not efficiently. It's not an understanding problem, it's a usability problem. You can give a computer a schema and ask it to make valid API calls and it'll do a pretty decent job. You can't tell a blind person or their screen reader to do that. It's a different problem space entirely.
Just give your AI agent a little Linux VM to play around in, which it already knows how to use, rather than some specialized protocol that has to predict everything an agent might want to do.