“Expect tests” make test-writing feel like a REPL session (blog.janestreet.com)

138 points by jsomers 3 years ago | 92 comments

mabbo 3 years ago |

> But think: everything in those describe blocks had to be written by hand.

It also had to be thought about by the developer. Someone had to say "I want the code to do this under these conditions".

If your tests can be autogenerated then they aren't verifying expected behaviour, they're just locking in your implementation such that it can't change later. They are saying "hey look everyone, I got my coverage metric to 100% (despite any bugs I may have)."

codetrotter 3 years ago | |

One of the projects at a place where I have worked was set up so that when you ran the tests it automatically and silently updated the values that were expected. Completely bonkers because the first time I was contributing to the project I prepared the tests first and then started the implementation, and then while I was working on it I ran the tests which at this point should fail because I hadn’t finished writing the code but instead all tests passed. Because helpfully the test setup overwrote the expected values that I had prepared in my new tests, with the bad data. Yeah great, very helpful >:(

Oh yeah and the whole test setup was also way too tied to the implementation rather than verifying behaviour. Complete trash the whole thing.

mabbo 3 years ago | | |

I keep rereading this hoping I'm misunderstanding.

That is cargo cult level behaviour. They know that software with lots of tests tends to have few bugs, so let's automatically have lots of tests!

I just hope whatever you were building wasn't critical to human lives.

https://en.m.wikipedia.org/wiki/Cargo_cult

travisjungroth 3 years ago | | |

Would it do this just the first time? It’s still bad it was doing this silently, but it’s pretty common to test web APIs in a similar way manually. Make a request, check the response you get back looks right (important step) and then save it as the expected value.

Edit: or after reading the article, like in the article.
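
For illustration, the "record once, review, then compare" workflow can be sketched in a few lines. This is only a sketch: `check_snapshot` is a hypothetical helper, not part of any library mentioned in this thread, and the key assumption is that an existing expectation is never overwritten unless the caller asks for it explicitly.

```python
import tempfile
from pathlib import Path


def check_snapshot(path: Path, actual: str, update: bool = False) -> str:
    """Compare `actual` with the snapshot stored at `path`.

    Hypothetical helper: returns "recorded", "matched", or "updated",
    and raises AssertionError on a mismatch unless `update` is True.
    """
    if not path.exists():
        # First run: record the value so a human can review it before committing.
        path.write_text(actual)
        return "recorded"
    expected = path.read_text()
    if expected == actual:
        return "matched"
    if update:
        # Overwriting is an explicit, opt-in action, never a silent default.
        path.write_text(actual)
        return "updated"
    raise AssertionError(f"snapshot mismatch: expected {expected!r}, got {actual!r}")


def demo() -> list[str]:
    """Run the workflow in a temporary directory and return the outcomes."""
    outcomes = []
    with tempfile.TemporaryDirectory() as tmp:
        snap = Path(tmp) / "response.snap"
        outcomes.append(check_snapshot(snap, '{"status": "ok"}'))
        outcomes.append(check_snapshot(snap, '{"status": "ok"}'))
        try:
            check_snapshot(snap, '{"status": "error"}')
        except AssertionError:
            outcomes.append("failed")
        outcomes.append(check_snapshot(snap, '{"status": "error"}', update=True))
    return outcomes


print(demo())
```

With this shape, a hand-written expectation that disagrees with unfinished code makes the test fail instead of being replaced.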

teeray 3 years ago | | |

I can somewhat understand, because this is kind of the goal of property based testing—the actual values themselves matter so little to the test that you’re willing to subject those inputs to randomness.

That said, this doesn’t sound like a very good way to pull that off because the developer has no control over that randomness (where it’s needed greatly).

superb-owl 3 years ago | | |

So long as the diffs get reviewed and checked in, this is a great form of testing called "regression testing". It doesn't replace unit testing, but it can be super valuable.

WastingMyTime89 3 years ago | |

You are missing the point entirely. It’s actually discussed at length in the article btw if you had bothered reading it.

Regression tests are extremely useful because you don’t want working code to get broken but they are tedious to write. What the author is describing is pretty much how everyone does it if you want anything moderately complex in the test, you just run and then copy-paste. Having something do it for you in a frictionless way is a huge win.

Plus the way the framework works you can still test expected behaviours before writing the code if that’s what you actually want.

carry_bit 3 years ago | |

Think of it as manual testing where your work is captured so it can be run later in an automated fashion. There are many problems where verifying the answer is easier than coming up with the answer.

Asserting formatted output can also be really useful. A picture might be worth a thousand words, but when it comes to tests it can save you a thousand asserts. Writing those thousand asserts separately also would be so tedious that in practice you'd probably not write them all, leaving part of your output uncovered by tests.

When I wrote a LALR parser generator for fun, I added some code to print out a nicely formatted parsing table with debugging information. Besides being useful for debugging, it let me write simple yet powerful tests: I would feed the generator a grammar and then assert on the formatted parsing table. That made it easy to verify that I was asserting the right thing, and let me assert everything in one go.
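
A small sketch of the "assert on the formatted table" idea, using a toy state-transition table rather than a real LALR parsing table. The names `format_transitions`, `DFA` and `EXPECTED` are made up for this example.

```python
def format_transitions(table: dict[str, dict[str, str]]) -> str:
    """Render a state-transition table as aligned plain text, one state per row."""
    symbols = sorted({sym for row in table.values() for sym in row})
    header = ["state"] + symbols
    rows = [header] + [
        [state] + [table[state].get(sym, "-") for sym in symbols]
        for state in sorted(table)
    ]
    widths = [max(len(r[i]) for r in rows) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(r, widths)).rstrip()
        for r in rows
    )


# Toy automaton (hypothetical): missing transitions are rendered as "-".
DFA = {
    "start": {"a": "seen_a", "b": "start"},
    "seen_a": {"a": "seen_a", "b": "done"},
    "done": {},
}

EXPECTED = """\
state   a       b
done    -       -
seen_a  seen_a  done
start   seen_a  start"""

# One assertion covers every cell of the table at once.
assert format_transitions(DFA) == EXPECTED
print(format_transitions(DFA))
```

A single wrong transition shows up as a one-line difference in the rendered table, which is usually easier to eyeball than a failing assert on one cell.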

tantalor 3 years ago | |

> locking in your implementation such that it can't change later

That's the whole point of tests. All tests do that.

This protects against later code changes that change behavior (output or side effects) unintentionally.

When you intend to change behavior then you need to change the tests too.

mabbo 3 years ago | | |

I disagree.

Tests should define what the expectations are. If a change does not impact those expectations, then it should be allowed and not break any tests.

Locking your code such that all future changes require updating old tests tells me that your tests are just your code written a second time, with no thought about what the code's requirements are.

schwartzworld 3 years ago | | |

Implementation !== Behavior. You want to test the behavior, not the implementation. I'd expect tests to change when behavior changes, but if you reimplement the same behavior, the tests should pass when you're done.

IshKebab 3 years ago | |

Yeah in their Fibonacci example if it printed out 510 instead of 610 you'd still have a bug and think you had tested it. Especially confusing for future people who will assume it works because there are passing tests!

angio 3 years ago | | |

The title mentions writing tests as if they are repl sessions because you're supposed to iterate until you have the correct result.

pydry 3 years ago | |

For Fibonacci (or indeed the result of most mathematical calculations) it makes no sense but I use this kind of thing all the time where the expected output is, for example, a templated string like an error message.

There are plenty of kinds of test outputs where rewriting the test and eyeballing the result is quicker, easier and ultimately better.

polio 3 years ago | | |

It makes sense in scenarios where it's easier to verify a provided solution than it is to create one.

User23 3 years ago | |

If you’re autogenerating your tests from a specification and not an implementation then it can potentially be useful.

khuey 3 years ago | |

In many contexts there's value in ensuring the behavior doesn't change without being noticed. You're just moving the developer thinking about the expected behavior from when the test is written to when the test fails.

mannykannot 3 years ago | |

See the related memes "code never lies", "the code is the contract" and “when I use a word, it means just what I choose it to mean — neither more nor less."

avgcorrection 3 years ago |

> I think you’re supposed to write some nonsense, like assert fibonacci(15) == 8, then when the test says “WRONG! Expected 8, got 610”, you’re supposed to copy and paste the 610 from your terminal buffer into your editor.

> This is insane!

The sane approach is presumably either to expand the call tree and verify all the unique subsolutions, or to do every step with a calculator if you can’t expand the call tree.

> The %expect block starts out blank precisely because you don’t know what to expect. You let the computer figure it out for you. In our setup, you don’t just get a build failure telling you that you want 610 instead of a blank string. You get a diff showing you the exact change you’d need to make to your file to make this test pass; and with a keybinding you can “accept” that diff. The Emacs buffer you’re in will literally be overwritten in place with the new contents [1]:

Oh okay. The non-insane approach is to do the first thing but Emacs copies the result on your behalf.
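
The "blank expectation, then a diff you can accept" step quoted above can be approximated with Python's standard `difflib`. This is a rough sketch only, not how the OCaml tooling works internally; `expect_diff` is a hypothetical helper.

```python
import difflib


def fibonacci(n: int) -> int:
    """Iterative Fibonacci with fibonacci(1) == fibonacci(2) == 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def expect_diff(expected: str, actual: str) -> list[str]:
    """Return unified-diff lines turning `expected` into `actual` (empty if equal)."""
    return list(
        difflib.unified_diff(
            expected.splitlines(),
            actual.splitlines(),
            fromfile="expected",
            tofile="actual",
            lineterm="",
        )
    )


# The expectation starts out blank; the diff shows what "accepting" would write.
for line in expect_diff("", str(fibonacci(15))):
    print(line)
```

Once the expected text matches the output, the diff is empty and the test passes; any later change in output reappears as a diff to review.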

eru 3 years ago | |

Well, the non-insane thing is to do property-based testing. Instead of testing only a handful of examples.

c-cube 3 years ago | | |

They also do that, the post refers to their Quickcheck library. But how do you property test the Fibonacci function? There isn't much to say about it...

postalrat 3 years ago | | |

I prefer "code it twice and hope you get it right once" testing.

Complex systems use that system everywhere. Why aren't we doing it for our code?

thaumasiotes 3 years ago | |

Yes, I have difficulty understanding the point of a test-writing system that relies on your explicit assumption that whatever the code already does is correct.

What are you testing? Why?

MichaelBurge 3 years ago | | |

A regression test is checking causality: Changes in new code, updating dependencies, updating the OS the software is running on, updating shared libraries, porting the code to a new platform, etc. aren't supposed to change the test results.

"I may not know what cos(x) means, but whatever it is shouldn't depend on what OS version I'm running"

sesm 3 years ago | | |

This looks similar to snapshot testing in UI, where you save an output of UI components and test system notifies you when the output changes. This can be useful to detect changes in components that you didn’t intend to change.

gavinray 3 years ago | |

Lol I came here to post this but you beat me.

CGamesPlay 3 years ago |

Snapshot testing is great, and I wish more test frameworks included first-class support for them. This means that they can auto update with a flag, and can be stored either in the source inline or in an external file (both modes have different use cases). Note that doc tests can also be a form of this, e.g. Python's doctest.
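
As a concrete case of inline expectations, here is a minimal sketch with Python's standard-library `doctest` module. The `slugify` function and `run_doctests` wrapper are invented for the example; the expected output lives in the docstring, next to the call that produces it.

```python
import doctest


def slugify(title: str) -> str:
    """Lowercase a title and join its words with hyphens.

    >>> slugify("Expect Tests Feel Like a REPL")
    'expect-tests-feel-like-a-repl'
    >>> slugify("  Snapshot   testing  ")
    'snapshot-testing'
    """
    return "-".join(title.lower().split())


def run_doctests() -> doctest.TestResults:
    """Run the examples in slugify's docstring; returns (failed, attempted)."""
    test = doctest.DocTestParser().get_doctest(
        slugify.__doc__, {"slugify": slugify}, "slugify", None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


print(run_doctests())
```

Unlike the expect-test workflow, doctest has no built-in way to rewrite the expected output; updating it is a manual edit.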

"Expect tests" seems like a bad name, since that covers all tests.

chii 3 years ago | |

I find that snapshot testing gets overused in JavaScript: mistakes can creep in easily, and if the snapshot is big and in a separate file, code review can miss it.

I much prefer property based testing over expectation based testing. You have to explicitly think about what properties hold true about the thing you're writing.

For example, fib(N+1) = fib(N) + fib(N), so this property can be tested for all N; primitive generators can easily generate the data, and good composition framework can easily generate complex data from primitive data.

Of course, you have to have a property you can specify easily. Otherwise, it'd be exactly the same as expectation based testing.
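
A minimal sketch of such a property check, using only the standard library's `random` module rather than a property-testing framework. It assumes the intended recurrence is fib(N+1) = fib(N) + fib(N-1), and the function names are made up for the example.

```python
import random


def fib(n: int) -> int:
    """Iterative Fibonacci with fib(0) == 0 and fib(1) == 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def check_recurrence(trials: int = 200, seed: int = 0) -> bool:
    """Check fib(n + 1) == fib(n) + fib(n - 1) on randomly chosen n >= 1."""
    rng = random.Random(seed)  # fixed seed so a failure is reproducible
    for _ in range(trials):
        n = rng.randint(1, 300)
        if fib(n + 1) != fib(n) + fib(n - 1):
            return False
    return True


print(check_recurrence())
```

A dedicated library such as Hypothesis would add input shrinking and better generators; the loop above only shows the shape of the property.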

IanCal 3 years ago | | |

Every single time I've introduced property based testing, even as a simple example, I've discovered a bug in either the code or the spec.

I've found a bug in a Haskell program about fib generation - your test would work (if fixed for the subtractions) but incorrectly as there was an overflow in the addition. A basic property of "fib(n+1) > fib(n)" for n>1 finds this.

I like this type of testing as it asks you to more generally consider what guarantees your code is making about its operation.

Edit - your example is a good one and necessary, I just wanted to add a bit extra as I really like property based testing
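
To make the overflow point concrete, here is a sketch that simulates signed 64-bit wraparound in Python (whose integers do not overflow on their own). `wrap_int64` and `fib64` are stand-ins invented for this example, not the Haskell program referred to above.

```python
def wrap_int64(x: int) -> int:
    """Reduce x to a signed 64-bit value, mimicking fixed-width overflow."""
    return (x + 2**63) % 2**64 - 2**63


def fib64(n: int) -> int:
    """Fibonacci computed with wrapping 64-bit addition."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, wrap_int64(a + b)
    return a


def recurrence_holds(n: int) -> bool:
    # Uses the same wrapping addition, so it cannot notice the overflow.
    return fib64(n + 1) == wrap_int64(fib64(n) + fib64(n - 1))


def strictly_increasing(n: int) -> bool:
    return fib64(n + 1) > fib64(n)


def first_violation(prop, start: int, stop: int):
    """Return the first n in [start, stop) where prop(n) is False, else None."""
    for n in range(start, stop):
        if not prop(n):
            return n
    return None


print(first_violation(recurrence_holds, 2, 200))
print(first_violation(strictly_increasing, 2, 200))
```

The recurrence property passes everywhere because both sides wrap identically, while the ordering property fails as soon as a value no longer fits in 64 bits.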

sesm 3 years ago | | |

Snapshot testing works well for component systems, especially with storybook. There is a service called Chromatic that lets you diff component changes visually using storybook output.

tantalor 3 years ago | |

> update with a flag

Yes this is the right level of automation, not whatever this article is going on about with the editor integration. Yuck.

thedufer 3 years ago | | |

The open source use pattern for expect tests in OCaml (via dune) is exactly as you describe (see https://dune.readthedocs.io/en/stable/tests.html) - you run the tests with `--auto-promote` to tell it to update. The editor integration is a very simple keybinding on top of more generic tooling.

ElliotH 3 years ago |

I wonder if this has the same downsides as golden and screenshot type tests, where you end up over-asserting resulting in tests that break for unrelated changes?

Obviously that’s a risk for hand-written tests too, but it’s easier (today… who knows what Copilot-like systems will offer soon!) for a human to reason about what’s relevant.

potatoyogurt 3 years ago | |

Yes, that is definitely a downside for these tests. The worst is when the text of some exception is printed and it includes line numbers. It does still require some discipline to think about what you're printing and avoid output that will be very noisy. This problem is mitigated quite a bit by the ease of accepting changes when these tests fail for obviously nonsense reasons though (just hit a couple buttons in an emacs buffer).
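
One common way to cut that noise is to scrub volatile details before comparing. A minimal sketch, assuming the noisy part looks like `line <digits>` as in a Python traceback; `scrub_line_numbers` is a hypothetical helper.

```python
import re


def scrub_line_numbers(text: str) -> str:
    """Replace 'line <digits>' with a stable placeholder before snapshotting."""
    return re.sub(r"\bline \d+\b", "line <N>", text)


raw = 'File "parser.py", line 42, in parse\nValueError: unexpected token'
print(scrub_line_numbers(raw))
```

The trade-off is that anything scrubbed is no longer tested, so the pattern should stay as narrow as possible.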

avgcorrection 3 years ago | |

Q: Why does this test assert the value X?

A: The value X was revealed to me by ChatGPT.

scotty79 3 years ago |

Doesn't this approach make you update results of failing tests wholesale and possibly miss where a new result of some test is actually wrong?

https://docs.rs/expect-test/latest/expect_test/

eru 3 years ago | |

At Google the nickname for these kinds of tests was 'change detector tests'.

imajoredinecon 3 years ago | | |

Yeah, the OP's counterargument is that you can filter down what goes into the test output. But at that point it seems not too different qualitatively from the traditional bottom-up approach where you just write assertions yourself, except that the framework does the job of populating the assertions' expected values.

mannykannot 3 years ago | | |

If you are saying this approach would tend to produce a lot of change-detector tests, then that is an issue, but I think scotty79 is making a different point: this approach would seem to make it easy to overlook any regressions that the latest change has created.

wmanley 3 years ago |

Was discussed recently here: https://news.ycombinator.com/item?id=34350749

vdm 3 years ago |

A similar approach with pytest and pdb https://simonwillison.net/2020/Feb/11/cheating-at-unit-tests...

This does get me writing tests sooner.

foobarbecue 3 years ago | |

I do this with pytest-regtest's --regtest-reset command.

BoppreH 3 years ago |

Some years ago I wrote a Python function, "replace_me"[1], that edits the caller's source code. You can use it for code generation, inserting comments, generating fixed random seeds, etc.

And one more use case I found was exactly what TFA describes, but even easier:

   import replace_me
   replace_me.test(1+1)

Once executed, it evaluates the argument and becomes an assertion:

   import replace_me
   replace_me.test(1+1, 2)

I never actually used it for anything important, but it comes back to my mind once in a while.

[1]: https://github.com/boppreh/replace_me

theptip 3 years ago |

I tend to think that tests should be carefully crafted for readability just like normal code. The “content of a REPL” is unlikely to be well thought out enough to preserve meaningful invariants while remaining supple in the direction of likely changes. Perhaps in the hands of very good engineers this tool is net positive, but I shudder at giving junior engineers a tool that encourages less structure in tests.

A good set of fixture/helper functions should let you write really short and expressive tests (or tabular parametrized tests, if you prefer) which seems to me to resolve most of the pain points the author is complaining about.

One big advantage I do see with this approach is it seems to be a very compact rendering of a table of outputs; in Python+pytest+PyCharm if I run a 10-example parametrized test, I have to click through to see each failure individually. Perhaps there is a UX learning here that just rendering the raw errors into the code beside the test matrix could help visualize results faster.
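
A rough sketch of that idea in plain Python (not pytest's own parametrization): run every row of a table and render each result next to its row. `run_table`, `clamp` and `CASES` are invented for the example.

```python
def run_table(func, cases):
    """Evaluate func on every (args, expected) row and report each row's status."""
    report = []
    for args, expected in cases:
        actual = func(*args)
        status = "ok" if actual == expected else f"FAIL (got {actual!r})"
        report.append(f"{func.__name__}{args!r} == {expected!r}: {status}")
    return report


def clamp(x, low, high):
    """Restrict x to the closed interval [low, high]."""
    return max(low, min(x, high))


CASES = [
    ((5, 0, 10), 5),
    ((-3, 0, 10), 0),
    ((42, 0, 10), 10),
    ((7, 0, 10), 8),  # deliberately wrong expectation to show a failing row
]

for line in run_table(clamp, CASES):
    print(line)
```

All rows are evaluated even after a failure, so the whole matrix is visible in one pass.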

As an aside, I have recently been enjoying the “write an ascii representation as your test assert” mode of testing, it can give a different way of intuiting what is going on.

gleb 3 years ago |

Similar idea in Elixir, where the library itself handles the interactive bits: https://github.com/assert-value/assert_value_elixir

adrianmonk 3 years ago |

I think this would suffer from the same problem as partial self-driving cars: it's human nature for vigilance to falter if it doesn't feel like you're the sole/primary one in control.

Of course, you can say "I won't let myself do that", but working against human nature is not a formula for success. If my back hurts, I can tell myself I'm just going to go lie down on the bed for 10 minutes but not take a nap, but then 30 minutes later I wake up feeling groggy.

evrimoztamur 3 years ago |

Here's an older post from 2015 (also from Jane Street) explaining the same process https://blog.janestreet.com/testing-with-expectations/, but at the infancy of the method. It looks like they heavily polished it!

I like the approach, and I was indeed copy-pasting the result from my console...

bmitc 3 years ago |

I don’t really understand this. How is this different from just writing the code and just assuming that you got it correct, and then locking in a potentially wrong implementation?

> What does fibonacci(15) equal? If you already know, terrific—but what are you meant to do if you don’t?

Who does that? How do you know 610 is correct? That’s just assuming your implementation is right from the get go. For such a function, I’d independently calculate it, using some method I trust (maybe Wolfram Alpha). I’d do this for a handful of examples, trying to cover base and extreme cases. And then I’d do property testing if I really wanted good coverage. Further, this expect test library seems to just smoothen the experience of copying what the function returns into a test.

This whole “expect test” business seems to rely on the developer looking at what the function returns for a given input, evaluating if it’s correct or not and then locking that in as “this is what this function is supposed to do”. That seems backwards and no different from how one implements functions in the first place, so I don’t know what is actually being tested.

The entire point of testing is saying “this is what this function should do” and not “this is what the function did and thus that’s what it should always do”.
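
The "independently calculate it" step can itself be automated by cross-checking two implementations that share no code. A sketch under that assumption: the fast-doubling identities F(2k) = F(k)(2F(k+1) - F(k)) and F(2k+1) = F(k)^2 + F(k+1)^2 serve as the second method here, and the function names are made up.

```python
def fib_iterative(n: int) -> int:
    """Straightforward loop: the implementation under test."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_doubling(n: int) -> int:
    """Independent implementation based on the fast-doubling identities."""

    def pair(k: int) -> tuple[int, int]:
        # Returns (F(k), F(k + 1)).
        if k == 0:
            return (0, 1)
        a, b = pair(k // 2)      # a = F(m), b = F(m + 1) with m = k // 2
        c = a * (2 * b - a)      # F(2m)
        d = a * a + b * b        # F(2m + 1)
        return (d, c + d) if k % 2 else (c, d)

    return pair(n)[0]


def implementations_agree(limit: int = 200) -> bool:
    """True if both implementations return the same value for every n < limit."""
    return all(fib_iterative(n) == fib_doubling(n) for n in range(limit))


print(implementations_agree(), fib_doubling(15))
```

Agreement between two implementations is evidence rather than proof; anchoring a few values against an external source is still worthwhile.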

angio 3 years ago | |

You're supposed to use it as a repl, so you start with a test for `fib(1) = 1`, then `fib(2)` and so on. Once you're confident of your implementation, you use quickcheck to test general properties of the system.

Similarly if you find a bug in the live system, you add a test for that and the initial output will be wrong. Then you fix your code until it prints the correct value and commit that so any regression will be caught.

CJefferson 3 years ago |

I work with a language where all tests are expect tests (GAP). The biggest problem is you can basically never change how built-in types are printed, as you'll break all tests in every program. For example, someone wanted to improve how plurals are printed, but that would break every test.

arcturus17 3 years ago |

Is there anything like this in Python or C#? I have worked with OCaml extensively in coursework, but there’s no chance I’ll be using it in prod any time soon and I’d love toying with this approach in my working languages.

wildcow 3 years ago | |

In C# https://theramis.github.io/Snapper/#/pages/quickstart. I actually think there are others as well. But this is what I am trying to scratch to see if I can use it somehow.

anuragsoni 3 years ago | |

For python there is https://github.com/ezyang/expecttest which is modeled after the OCaml expect test library.

drothlis 3 years ago | |

https://approvaltests.com/