The first thing you’d probably try is to generate random sequences of pushes and pops, run a lot of tests, and wait for something to break. Using both operations feels like the most thorough way to test: the best coverage and the highest chance of finding a bug. But is it?
If you pick push and pop at random, half and half, you’ll need about 370,000 tests before you ever hit the overflow. That number isn’t a mistake. Pushes and pops cancel each other out, like a random walk: the stack goes up a bit, then down, then up again. Getting to 33 items means pushes have to get 33 ahead of pops, which is like waiting for a long lucky streak of coin flips. It almost never happens.
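To get a feel for how rare that is, here is a minimal sketch that computes the exact chance that one random test ever reaches the overflow depth. It rests on assumptions the text doesn’t spell out: the test length `num_ops` is a made-up parameter, a pop on an empty stack is treated as a no-op, and the function name is only illustrative. If one test overflows with probability p, you should expect to run roughly 1/p tests before you see the first overflow.

```python
def overflow_probability(num_ops: int, overflow_depth: int = 33,
                         p_push: float = 0.5) -> float:
    """Exact chance that a random push/pop sequence ever reaches overflow_depth.

    Assumptions (not from the original text): each test runs exactly
    num_ops operations, and popping an empty stack does nothing.
    """
    # dist[d] = probability the stack holds d items and has not overflowed yet
    dist = [0.0] * overflow_depth
    dist[0] = 1.0
    overflowed = 0.0

    for _ in range(num_ops):
        nxt = [0.0] * overflow_depth
        for depth, prob in enumerate(dist):
            if prob == 0.0:
                continue
            # Push: either the stack grows, or this push is the overflow.
            if depth + 1 == overflow_depth:
                overflowed += prob * p_push
            else:
                nxt[depth + 1] += prob * p_push
            # Pop: the stack shrinks, or stays empty if it already was.
            nxt[max(depth - 1, 0)] += prob * (1.0 - p_push)
        dist = nxt

    return overflowed


if __name__ == "__main__":
    for n in (33, 50, 100):  # hypothetical test lengths
        p = overflow_probability(n)
        print(f"{n} ops: p = {p:.3e}, about {1 / p:,.0f} tests per overflow")
```

The answer depends heavily on the test length you assume, so treat the printed figures as a way to explore the random walk, not as a reproduction of the number quoted above.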
Now try something different. Before each test, pick a random non-empty subset of the API. With two operations, you get three cases: both push and pop, just push, or just pop. In a third of your tests, you’ll only use push, so every operation grows the stack. The bug shows up right away.
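The subset trick is small enough to sketch in a few lines. This is an illustration under assumptions, not a reference implementation: the stack is modeled as a bare depth counter that overflows on the 33rd item, a pop on an empty stack is a no-op, and the names `nonempty_subsets`, `run_test`, and `swarm_hit_rate` are invented for this example.

```python
import random
from itertools import combinations

CAPACITY = 32  # assumption: the 33rd push is the one that overflows


def nonempty_subsets(ops):
    """Every non-empty subset of the API, smallest subsets first."""
    return [c for r in range(1, len(ops) + 1) for c in combinations(ops, r)]


def run_test(enabled, num_ops, rng):
    """Run one random test using only the enabled operations.

    Returns True if the test triggers the overflow.
    """
    depth = 0
    for _ in range(num_ops):
        op = rng.choice(enabled)
        if op == "push":
            depth += 1
            if depth > CAPACITY:
                return True
        elif depth > 0:  # pop; an empty stack is left alone
            depth -= 1
    return False


def swarm_hit_rate(num_tests, num_ops, seed=0):
    """Fraction of tests that overflow when each test picks its own subset."""
    rng = random.Random(seed)
    subsets = nonempty_subsets(("push", "pop"))
    hits = sum(run_test(rng.choice(subsets), num_ops, rng)
               for _ in range(num_tests))
    return hits / num_tests


if __name__ == "__main__":
    print(nonempty_subsets(("push", "pop")))
    print(swarm_hit_rate(num_tests=3000, num_ops=50))
```

With two operations the subset list has three entries, and each test draws one of them uniformly, so about a third of the tests are push-only. Any of those that runs for at least 33 operations overflows.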
If you leave out features at random when you generate tests, you find the stack bug about a third of the time, because every push-only test that runs for at least 33 operations will overflow. Before, the chance was almost zero. You get those push-only tests just by picking random subsets. You didn’t have to guess that pop was the problem. Doing less actually finds more.