Web Bench: a new way to compare AI browser agents (blog.skyvern.com)
15 blew my mind -- it's too easy to overfit that dataset
While the extraction/2FA flows aren't super relevant to us, this saves us the time of building our own set of benchmarks. Really appreciate it, and hope we can contribute to making this a really large set.
[0] https://nelly.is
Looking forward to the benchmarks on Claude 4 (and o3 CUA when that's released)