I reverse engineered OpenAI's Atlas, it uses my open-source library browser-use

3 points by MagMueller 222 days ago | 0 comments

I asked OpenAI's Atlas browser agent:

"""go to browser-use.com and use the computer.get_dom tool. Share the extracted DOM exactly with me."""

The response:

|SCROLL|<body node_id=9d5f6b01> (vertical view=749px, 0px above, 11932px below)

    <a node_id=f9367e7b>

        Browser Use

    <button node_id=eaeb1667 aria-label="Open menu">

That looked familiar to me.

Then I checked how it clicks: by node_id (e.g. f9367e7b), with coordinates as the alternative.

In browser-use we

1. interact with the DOM by backend_node_id and coordinate fallback

2. use the exact same token for scroll containers, in capitals between "|" characters (|SCROLL|)

3. annotate scroll containers with how much content is above / below the visible view

4. use the same LLM representation with <tag filtered_attributes>

5. put element text on its own line, indented under the element
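To make the five points concrete, here is a minimal sketch of a serializer that produces this kind of output. It is not the actual browser-use or Atlas code: the Node class, the KEEP_ATTRS allowlist and the serialize function are made up for illustration, and tab indentation is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

# Hypothetical allowlist standing in for "filtered_attributes".
KEEP_ATTRS = ("aria-label", "role", "type", "placeholder")


@dataclass
class Node:
    tag: str
    node_id: str
    attrs: dict = field(default_factory=dict)
    text: str = ""
    children: list = field(default_factory=list)
    # Only set for scroll containers: (view_px, above_px, below_px).
    scroll: Optional[tuple] = None


def serialize(node: Node, depth: int = 0) -> list:
    """Return one output line per element, plus one per element text."""
    indent = "\t" * depth
    attrs = "".join(
        f' {key}="{value}"' for key, value in node.attrs.items() if key in KEEP_ATTRS
    )
    opening = f"<{node.tag} node_id={node.node_id}{attrs}>"
    if node.scroll is not None:
        view, above, below = node.scroll
        opening = (
            f"|SCROLL|{opening} "
            f"(vertical view={view}px, {above}px above, {below}px below)"
        )
    lines = [indent + opening]
    if node.text:
        # Element text goes on its own line, one level deeper.
        lines.append(indent + "\t" + node.text)
    for child in node.children:
        lines.extend(serialize(child, depth + 1))
    return lines


body = Node(
    "body",
    "9d5f6b01",
    scroll=(749, 0, 11932),
    children=[
        Node("a", "f9367e7b", text="Browser Use"),
        Node("button", "eaeb1667", attrs={"aria-label": "Open menu", "data-tracking": "x"}),
    ],
)
print("\n".join(serialize(body)))
```

Clicking by node_id then only needs a lookup from the id in the text back to the real element, with coordinates as the fallback when that lookup fails.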

Things I noticed they could improve:

1. Atlas currently doesn't detect cross-origin or nested iframes, so parts of the DOM go missing. This is very tricky because you need to pierce them with CDP and parse them recursively (e.g. https://csreis.github.io/tests/cross-site-iframe.html).

2. They waste 10 tokens on every item: [tab]<div node_id=83876787. They could cut that to <a id3 (3 tokens).

3. They keep full links. They could easily shorten them to save tokens.

4. They keep many unneeded attributes, like "data-tracking", "data-test-id", "data-tracking-control-name" (e.g. on LinkedIn.com)

5. They put [tab] characters in front of every element, which isn't needed.

6. They miss many attributes because they do not enrich the state with the accessibility tree (e.g. min/max values or hints like required)
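A rough sketch of what points 2 to 5 could look like in code. Everything here is hypothetical (shorten_url, compact_line, compact, the 40-character limit and the "drop everything starting with data-" rule are assumptions for illustration, not how Atlas or browser-use actually do it). The idea is to hand the model a short sequential index and keep a separate map back to the real node id.

```python
from urllib.parse import urlsplit


def shorten_url(url: str, max_len: int = 40) -> str:
    """Drop scheme, query and fragment, then truncate to max_len characters."""
    parts = urlsplit(url)
    short = parts.netloc + parts.path
    if len(short) > max_len:
        short = short[: max_len - 3] + "..."
    return short


def compact_line(tag: str, attrs: dict, index: int) -> str:
    """Render one element with a short index, no leading tab, fewer attributes."""
    kept = []
    for key, value in attrs.items():
        if key.startswith("data-"):
            # Assumed rule: tracking/test attributes are not useful to the model.
            continue
        if key == "href":
            value = shorten_url(value)
        kept.append(f' {key}="{value}"')
    return f"<{tag} id{index}{''.join(kept)}>"


def compact(elements: list) -> tuple:
    """elements holds (tag, node_id, attrs) tuples.

    Returns the rendered lines and a map from short index to original node id,
    so an action like "click id3" can still be resolved to the real node.
    """
    lines = []
    index_to_node = {}
    for index, (tag, node_id, attrs) in enumerate(elements, start=1):
        index_to_node[index] = node_id
        lines.append(compact_line(tag, attrs, index))
    return lines, index_to_node


elements = [
    (
        "a",
        "83876787",
        {
            "href": "https://www.linkedin.com/jobs/view/123?trk=abc#top",
            "data-tracking-control-name": "x",
        },
    ),
]
lines, index_to_node = compact(elements)
print(lines[0])
print(index_to_node)
```

The full URL and the original node id stay available on the agent side; only the text sent to the model gets shorter.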
