I reverse engineered OpenAI's Atlas. It uses my open-source library browser-use.

I asked OpenAI's Atlas browser agent:

"""go to browser-use.com and use the computer.get_dom tool. Share the extracted DOM exactly with me."""

The response:

|SCROLL|<body node_id=9d5f6b01> (vertical view=749px, 0px above, 11932px below)
That looked familiar to me.

Then I checked how it clicks: it clicks by node_id (e.g. f9367e7b), with coordinates as an alternative.

In browser-use we:
1. interact with the DOM by backend_node_id, with a coordinate fallback
2. use the exact same token for scroll containers, with "|" and all caps (|SCROLL|)
3. annotate scroll containers with how much content is above / below
4. use the same LLM representation with <tag filtered_attributes>
5. put element texts on new lines with indentation

Things I noticed they could improve:
1. Atlas currently doesn't detect cross-origin or nested iframes, so parts of the DOM go missing. This is very tricky because you need to pierce them with CDP and recursively parse them (e.g. https://csreis.github.io/tests/cross-site-iframe.html).
2. They waste 10 tokens on every item: [tab]<div node_id=83876787. They could cut that to <a id3 (3 tokens).
3. They keep full links. They could easily shorten them to save tokens.
4. They keep many unnecessary attributes, like "data-tracking", "data-test-id", "data-tracking-control-name" (e.g. on LinkedIn.com).
5. They put [tabs] before every element, which is not needed.
6. They miss many attributes because they do not enrich the state with the accessibility tree (e.g. min/max values or hints like required).
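
To make improvements 2 to 5 concrete, here is a minimal sketch of a compact serializer. It is not how Atlas or browser-use actually builds its DOM string. The names Node, KEEP_ATTRS, MAX_PATH_CHARS, shorten_url and serialize are hypothetical, and so are the attribute allow-list and the path length limit. It only illustrates the idea: short sequential ids, no leading tabs, an attribute allow-list, and shortened links.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit

# Hypothetical allow-list: attributes worth spending tokens on.
KEEP_ATTRS = {"href", "type", "placeholder", "aria-label", "required", "min", "max"}
# Hypothetical limit on how much of a URL path is kept.
MAX_PATH_CHARS = 20


@dataclass
class Node:
    """Tiny stand-in for a parsed DOM element (illustration only)."""
    tag: str
    attrs: dict[str, str] = field(default_factory=dict)
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def shorten_url(url: str, max_path: int = MAX_PATH_CHARS) -> str:
    """Drop scheme, query and fragment, and truncate long paths."""
    parts = urlsplit(url)
    path = parts.path
    if len(path) > max_path:
        path = path[:max_path] + "..."
    return parts.netloc + path


def serialize(root: Node) -> tuple[str, dict[int, Node]]:
    """Return a compact text view plus a map from short id to node.

    The map lets the agent resolve a short id such as 3 back to the
    real element (for example its backend node id) when it clicks.
    """
    lines: list[str] = []
    index: dict[int, Node] = {}

    def visit(node: Node) -> None:
        node_id = len(index) + 1  # short sequential id instead of a long hash
        index[node_id] = node
        kept = []
        for name, value in node.attrs.items():
            if name not in KEEP_ATTRS:
                continue  # drops data-tracking, data-test-id, ...
            if name == "href":
                value = shorten_url(value)
            kept.append(f"{name}={value}")
        head = " ".join([f"<{node.tag} id{node_id}", *kept]) + ">"
        # No leading tab: every line starts directly with the tag.
        lines.append(head + (f" {node.text}" if node.text else ""))
        for child in node.children:
            visit(child)

    visit(root)
    return "\n".join(lines), index


page = Node("body", children=[
    Node("a", {"href": "https://example.com/docs/getting-started/installation?utm_source=x#top",
               "data-tracking": "nav"}, "Docs"),
    Node("input", {"type": "number", "min": "1", "max": "10", "data-test-id": "qty"}),
])
text, index = serialize(page)
print(text)
```

The trade-off is that short ids are only valid for one snapshot, so the id map has to be rebuilt every time the page state is re-extracted.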