Benchmarking the continuous improvement of language agents in deployment | Dark Hacker News