This matches my experience as well, as a one-person team on a desktop app with thousands of unit tests and hundreds of Playwright e2e tests. I had a number of flaky tests that Claude was choosing on its own to isolate when running the suite, which was concerning. The breakthrough for me was using the superpowers debugging skill and setting a focused goal: fix the one test that was failing most often. The cause turned out to be a race condition that I'd never have found on my own, and Claude then went on to find the dozen or so similar issues elsewhere in the codebase. No e2e failures now. This is a very satisfying use of an AI agent for me.