Web Bench: a new way to compare AI browser agents

Web Bench: a new way to compare AI browser agents (blog.skyvern.com)

34 points by suchintan 1 year ago | 9 comments

helsinki 1 year ago

Does anyone use Skyvern to build their websites? I’m wondering how I might benefit from using an agentic browser workflow instead of a Playwright MCP server for building a web UI.

neveroddoreven 1 year ago

I had no idea WebVoyager only spanned 15 websites lol... the 452 figure you have still seems a little low though. Do you have plans to expand it? It seems like you'd want as many sites as possible to improve the real-world accuracy of agents, given the long-tail nature of website traffic.

suchintan 1 year ago

We definitely plan to expand it. I want to get to ~10,000 for a reasonable benchmark.

15 blew my mind -- it's too easy to overfit to a dataset that small.

vasusen 1 year ago

Thank you so much for creating this, folks! A browser navigation agent is a key part of our AI QA setup at Donobu (https://donobu.com/). We found the WebVoyager benchmarks severely lacking for complex e2e test cases like logged-in dashboards, onboarding forms, etc.

While the extraction/2FA flows aren't super relevant to us, this saves us the time of building our own set of benchmarks. Really appreciate it, and we hope we can contribute to making this a really large set.

suchintan 1 year ago

That would be amazing!!

gitmagic 1 year ago

Would love to see how Nelly [0] performs on this benchmark.

[0] https://nelly.is

suchintan 1 year ago

Very cool. The benchmark can be found here if you want to take a look at it: https://github.com/Halluminate/WebBench

pants2 1 year ago

Great work! Big fan of Skyvern.

Looking forward to the benchmarks on Claude 4 (and o3 CUA when that's released).

wm2 1 year ago

super cool!