“Expect tests” make test-writing feel like a REPL session (blog.janestreet.com)
It also had to be thought about by the developer. Someone had to say "I want the code to do this under these conditions".
If your tests can be autogenerated then they aren't verifying expected behaviour, they're just locking in your implementation such that it can't change later. They are saying "hey look everyone, I got my coverage metric to 100% (despite any bugs I may have)."
Oh yeah and the whole test setup was also way too tied to the implementation rather than verifying behaviour. Complete trash the whole thing.
That is cargo cult level behaviour. They know that software with lots of tests tends to have few bugs, so let's automatically have lots of tests!
I just hope whatever you were building wasn't critical to human lives.
Edit: or after reading the article, like in the article.
That said, this doesn’t sound like a very good way to pull that off because the developer has no control over that randomness (where it’s needed greatly).
Regression tests are extremely useful because you don’t want working code to get broken, but they are tedious to write. What the author is describing is pretty much how everyone does it: if you want anything moderately complex in the test, you just run it and then copy-paste. Having something do it for you in a frictionless way is a huge win.
Plus the way the framework works you can still test expected behaviours before writing the code if that’s what you actually want.
Asserting formatted output can also be really useful. A picture might be worth a thousand words, but when it comes to tests it can save you a thousand asserts. Writing those thousand asserts separately also would be so tedious that in practice you'd probably not write them all, leaving part of your output uncovered by tests.
When I wrote a LALR parser generator for fun, I added some code to print out a nicely formatted parsing table with debugging information. Besides being useful for debugging, it let me write simple yet powerful tests: I would feed the generator a grammar and then assert on the formatted parsing table. That made it easy to verify that I was asserting the right thing, and let me assert everything in one go.
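A minimal sketch of that style of test in Python. The `format_table` helper and the table contents are hypothetical and made up for illustration; they are not taken from any real parser generator. The point is that one comparison against a formatted string stands in for many field-by-field asserts.

```python
def format_table(rows, headers):
    """Render rows as a fixed-width text table (hypothetical helper)."""
    cells = [list(headers)] + [[str(c) for c in row] for row in rows]
    widths = [max(len(r[i]) for r in cells) for i in range(len(headers))]
    lines = [
        " | ".join(c.ljust(w) for c, w in zip(r, widths)).rstrip()
        for r in cells
    ]
    # Separator row goes directly under the header.
    lines.insert(1, "-+-".join("-" * w for w in widths))
    return "\n".join(lines)


# Made-up parse-table fragment; states and actions are illustrative only.
ROWS = [(0, "shift 2", "error"), (1, "reduce E", "accept")]

EXPECTED = """\
state | id       | $
------+----------+-------
0     | shift 2  | error
1     | reduce E | accept"""

actual = format_table(ROWS, ["state", "id", "$"])
assert actual == EXPECTED
```

If the table ever changes, the failing comparison shows the whole picture at once, which is what makes the expected value easy to eyeball.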
That's the whole point of tests. All tests do that.
This protects against later code changes that change behavior (output or side effects) unintentionally.
When you intend to change behavior then you need to change the tests too.
Tests should define what the expectations are. If a change does not impact those expectations, then it should be allowed and not break any tests.
Locking your code such that all future changes require updating old tests tells me that your tests are just your code written a second time, with no thought about what the code's requirements are.
There are plenty of kinds of test outputs where rewriting the test and eyeballing the result is quicker, easier and ultimately better.
> This is insane!
The sane approach is presumably either to expand the call tree and verify all the unique subsolutions, or to do every step with a calculator if you can’t expand the call tree.
> The %expect block starts out blank precisely because you don’t know what to expect. You let the computer figure it out for you. In our setup, you don’t just get a build failure telling you that you want 610 instead of a blank string. You get a diff showing you the exact change you’d need to make to your file to make this test pass; and with a keybinding you can “accept” that diff. The Emacs buffer you’re in will literally be overwritten in place with the new contents [1]:
Oh okay. The non-insane approach is to do the first thing but Emacs copies the result on your behalf.
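A rough Python imitation of the workflow in the quoted passage, for readers who have not seen it. This is only a sketch: the `expect` helper is hypothetical, it is not the ppx_expect library or its Emacs integration, and it never rewrites the source file. It just produces the diff a human would review before accepting.

```python
import difflib


def expect(actual: str, expected: str, label: str = "test") -> str:
    """Return "" when actual matches expected, else a unified diff to review."""
    if actual == expected:
        return ""
    diff = difflib.unified_diff(
        expected.splitlines(),
        actual.splitlines(),
        fromfile=f"{label} (expected)",
        tofile=f"{label} (actual)",
        lineterm="",
    )
    return "\n".join(diff)


def fibonacci(n: int) -> int:
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


# Start with a blank expectation, as in the article, and read the diff.
print(expect(str(fibonacci(15)), ""))

# Once a human has checked the value, the expectation is filled in.
assert expect(str(fibonacci(15)), "610") == ""
```

The step the thread argues about is the one the code cannot do: deciding whether the value in the diff is actually right before it is accepted.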
Complex systems use that system everywhere. Why aren't we doing it for our code?
What are you testing? Why?
"I may not know what cos(x) means, but whatever it is shouldn't depend on what OS version I'm running"
"Expect tests" seems like a bad name, since that covers all tests.
I much prefer property based testing over expectation based testing. You have to explicitly think about what properties hold true about the thing you're writing.
For example, fib(N+1) = fib(N) + fib(N-1), so this property can be tested for all N; primitive generators can easily generate the data, and a good composition framework can easily generate complex data from primitive data.
Of course, you have to have a property you can specify easily. Otherwise, it'd be exactly the same as expectation based testing.
I've found a bug in a Haskell program about fib generation - your test would work (if fixed for the subtractions) but incorrectly as there was an overflow in the addition. A basic property of "fib(n+1) > fib(n)" for n>1 finds this.
I like this type of testing as it asks you to more generally consider what guarantees your code is making about its operation.
Edit - your example is a good one and necessary, I just wanted to add a bit extra as I really like property based testing
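A small Python sketch of the point about overflow. Python integers never overflow, so the `wrap_int64` helper below is an assumption added purely to simulate fixed-width 64-bit addition; it is not the Haskell program mentioned above. Plain loops are used instead of a property-testing library to keep the example self-contained.

```python
def wrap_int64(x: int) -> int:
    """Simulate two's-complement 64-bit signed overflow."""
    return (x + 2**63) % 2**64 - 2**63


def fib_int64(n: int) -> int:
    """Iterative Fibonacci whose additions wrap like a 64-bit integer."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, wrap_int64(a + b)
    return a


def first_violation(limit: int):
    """Return the first n > 1 where fib(n+1) > fib(n) fails, or None."""
    for n in range(2, limit):
        if not fib_int64(n + 1) > fib_int64(n):
            return n
    return None


# The recurrence still holds after overflow: both sides wrap the same way.
assert all(
    fib_int64(n + 2) == wrap_int64(fib_int64(n + 1) + fib_int64(n))
    for n in range(120)
)

# The ordering property does not hold, so it exposes the overflow.
print(first_violation(120))
```

The recurrence check passes on the buggy code because it re-uses the same wrapping addition, while the ordering property is independent of how the value was computed.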
Yes, this is the right level of automation, not whatever this article is going on about with the editor integration. Yuck.
Obviously that’s a risk for hand written tests too but it’s easier (today… who knows what copilot like systems will offer soon!) for a human to reason about what’s relevant.
A: The value X was revealed to me by ChatGPT.
This does get me writing tests sooner.
And one more use case I found was exactly what TFA describes, but even easier:
import replace_me
replace_me.test(1+1)
Once executed, it evaluates the argument and becomes an assertion:
import replace_me
replace_me.test(1+1, 2)
I never actually used it for anything important, but it comes back to my mind once in a while.
A good set of fixture/helper functions should let you write really short and expressive tests (or tabular parametrized tests, if you prefer) which seems to me to resolve most of the pain points the author is complaining about.
One big advantage I do see with this approach is it seems to be a very compact rendering of a table of outputs; in Python+pytest+PyCharm if I run a 10-example parametrized test, I have to click through to see each failure individually. Perhaps there is a UX learning here that just rendering the raw errors into the code beside the test matrix could help visualize results faster.
As an aside, I have recently been enjoying the “write an ascii representation as your test assert” mode of testing, it can give a different way of intuiting what is going on.
Of course, you can say "I won't let myself do that", but working against human nature is not a formula for success. If my back hurts, I can tell myself I'm just going to go lie down on the bed for 10 minutes but not take a nap, but then 30 minutes later I wake up feeling groggy.
I like the approach, and I was indeed copy-pasting the result from my console...
> What does fibonacci(15) equal? If you already know, terrific—but what are you meant to do if you don’t?
> I think you’re supposed to write some nonsense, like assert fibonacci(15) == 8, then when the test says “WRONG! Expected 8, got 610”, you’re supposed to copy and paste the 610 from your terminal buffer into your editor.
Who does that? How do you know 610 is correct? That’s just assuming your implementation is right from the get go. For such a function, I’d independently calculate it, using some method I trust (maybe Wolfram Alpha). I’d do this for a handful of examples, trying to cover base and extreme cases. And then I’d do property testing if I really wanted good coverage. Further, this expect test library seems to just smoothen the experience of copying what the function returns into a test.
This whole “expect test” business seems to rely on the developer looking at what the function returns for a given input, evaluating if it’s correct or not and then locking that in as “this is what this function is supposed to do”. That seems backwards and no different from how one implements functions in the first place, so I don’t know what is actually being tested.
The entire point of testing is saying “this is what this function should do” and not “this is what the function did and thus that’s what it should always do”.
Similarly if you find a bug in the live system, you add a test for that and the initial output will be wrong. Then you fix your code until it prints the correct value and commit that so any regression will be caught.
One person's "cargo cult behavior" is another person's "best practices". :P
My favorite example is automatically generated documentation. The kind that merely repeats the name of the method, the names and types of arguments, and the type of return value. The ironic part is that this is later used as evidence that all documentation is useless. Uhm, how about documenting the methods where something is not obvious, and leaving the obvious ones (getters, setters) alone? But then the documentation coverage checker would return a number smaller than 100% and someone would freak out...
This is just one of many examples, of course.
Like "give review feedback that this code isn't doing the right thing" -> "change the test to make it pass, not change the code to make it work". And it wasn't really a small case where you could plausibly do that and still understand what you were trying to do.
Coincidentally that was a few weeks after I saw a comment here on HN about someone who hired someone from Facebook, and the guy would change the tests so he could push to production, rather than fixing the bug that the tests pointed out ...
So yes it happens.
Can't blame him, he moved fast and broke things /s
I remember once using some in-house software which, for god knows what reason, could not log its errors back to the IT department. Instead, they relied on users to call up IT, or email them with the error. To make it more fun for users, each error message contained a humorous haiku.
Chaos reigns within.
Reflect, repent, and reboot.
Order shall return.
Edit: Just found this from 2001 https://www.gnu.org/fun/jokes/error-haiku.en.html
And my experience with haiku error messages at work was 01 or 02.
But you don't always have an oracle. So other properties still make sense.
As a simple example: if you code up a quantum mechanics simulator, that's hard, and I wouldn't be able to code up an oracle for you straight away. But I can tell you that you probably want to check that things like momentum and energy are conserved.
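To make the conservation idea concrete without a quantum simulator, here is a much simpler classical stand-in in Python: a one-dimensional elastic collision. The function names are made up for this sketch. The test never says what the outgoing velocities should be; it only checks that momentum and kinetic energy are unchanged for many random inputs.

```python
import random


def elastic_collision(m1, v1, m2, v2):
    """Velocities after a 1-D elastic collision (standard textbook formulas)."""
    total = m1 + m2
    u1 = ((m1 - m2) * v1 + 2 * m2 * v2) / total
    u2 = ((m2 - m1) * v2 + 2 * m1 * v1) / total
    return u1, u2


def momentum(m1, v1, m2, v2):
    return m1 * v1 + m2 * v2


def kinetic_energy(m1, v1, m2, v2):
    return 0.5 * m1 * v1**2 + 0.5 * m2 * v2**2


rng = random.Random(0)  # fixed seed so any failure is reproducible
for _ in range(1000):
    m1, m2 = rng.uniform(0.1, 10), rng.uniform(0.1, 10)
    v1, v2 = rng.uniform(-5, 5), rng.uniform(-5, 5)
    u1, u2 = elastic_collision(m1, v1, m2, v2)
    # No oracle for u1 and u2; only the conserved quantities are checked.
    assert abs(momentum(m1, u1, m2, u2) - momentum(m1, v1, m2, v2)) < 1e-9
    assert abs(
        kinetic_energy(m1, u1, m2, u2) - kinetic_energy(m1, v1, m2, v2)
    ) < 1e-9
```

A sign error in either formula would break at least one of the two invariants, even though no expected output was ever written down.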
What the commenter just described is tautology testing: whatever result of the computation I get is what I expected.
Cosine is a terrible example to use for that idea. It's pretty likely to change, for certain x, in similar circumstances to your examples of "when test results should never change".
In either case, if the behavior is to change, it should change as an informed decision and not because nobody noticed.
When you're working on developing a random utility function (real example!), it's easy to say "come on, it's no big deal to return DECIMAL(14, 4) instead of DECIMAL(12, 3)". It feels like they're basically the same, updating the test is make-work, and the guidelines saying you must document it as a breaking change are pointless annoyances. It's hard, requiring substantial amounts of knowledge and expertise, to recognize that this change will cause a production outage because the schema of a customer's view is no longer write-compatible with their existing data.
This suggests that there are so many changes to tests that it's just become background noise.
There's a much higher chance of detecting bugs that give plausible output if you aren't given the opportunity to say "eh looks plausible I won't bother double checking it".
I personally already use a similar cycle to expect-test when I write tests. A great place to start when writing test assertions is the debug output, just like this thing uses. Then you convert the output into assertions after you have thought through which parts are right or wrong. Just like you can do with expect-test, but without the automation. If you don't know whether the output is right or not, just add an assert(false, "hmm, not sure about this") aka todo!() and voilà, your test fails and future you can be prompted to check over it again.
Sometimes the output is obviously wrong, but you still don't know what the right output is. (At this point you know you're doing useful work!) The remedy is the same. Just make the test fail somehow.
Then what's the point of this methodology? It requires you to write tests and also blindly accept that your program is correct.
Maybe they should just rename it to "plausibility tests" or similar because that's what they're really testing. And while that does have some value, I think most of the value is negated by the fact that it sounds like they are properly vetted tests which they are not.
So a more appropriate name would help a lot. I still think it's a bad idea though.
For example, you start with the inputs and you apply the first layer of transformations, then check that what it does makes sense. Then maybe you refactor it out into its own function and add the generated test for it. Then you move on to the next step and so on until you have the final result.
- Aha, an expect test!
- Oh, you mean a snapshot test!
- This here is akin to UI testing framework X where the test framework can compare an expected screenshot of the UI to a screenshot of the actual UI!
The last one basically requires automation if you want anyone to make use of it. The regression testing automation described in the OP is a nice-to-have, not a so-good-that-it-gets-a-new-name.
It is non-decreasing monotonic. fib(n) <= fib(n+1)
It is increasing monotonic after 1. fib(n) < fib(n+1)
Its domain and codomain are non-negative integers.
fib(n) + fib(n+1) == fib(n+2) Notice this is like the recursive solution except going the other way (addition not subtraction) and is missing the base case.
a, b, c = fib(n), fib(n+1), fib(n+2)
assert abs(c / b - phi) < abs(b / a - phi)
You can also test that the sequence is increasing like `fib(n+2) > fib(n+1) > fib(n)`.
Unit testing elegant functions has no value.
(fib is often used as an example. But you asked how to test it.)
This is just an example of course but elegant functions might need to be tested.
Eg if you coded up an O(n) version of the Fibonacci calculation, you can check against the naive recursive one (or if you are feeling confident, you can check against the O(log n) solution via repeated squaring of matrices.)
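A sketch of that oracle approach in Python. The function names are made up for illustration, and the O(log n) version uses the fast-doubling identities rather than explicit matrix squaring; the two are equivalent, but this is a swapped-in technique, not the one named above.

```python
def fib_naive(n: int) -> int:
    """Exponential-time reference implementation, used only as an oracle."""
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)


def fib_linear(n: int) -> int:
    """O(n) iterative version: the code under test."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_fast_doubling(n: int) -> int:
    """O(log n) version via the doubling identities."""
    def pair(k):  # returns (F(k), F(k+1))
        if k == 0:
            return 0, 1
        a, b = pair(k // 2)
        c = a * (2 * b - a)   # F(2m)
        d = a * a + b * b     # F(2m+1)
        return (d, c + d) if k % 2 else (c, d)
    return pair(n)[0]


# The slow oracle is only affordable for small n.
for n in range(20):
    assert fib_linear(n) == fib_naive(n)

# Once the linear version is trusted, it can serve as the oracle for larger n.
for n in range(500):
    assert fib_fast_doubling(n) == fib_linear(n)
```

Chaining oracles like this keeps each comparison cheap: the obviously correct version vouches for the linear one, which then vouches for the clever one.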
let fib2 n =
    let sq5 = sqrt(5.0)
    ((1.0 + sq5)/2.0)**(float n)/sq5
    |> round |> int
These are fine for property-based testing, so long as you restrict yourself to the range in which you have a correct value. But at that point, you might as well just hard-code the first 93 fibonacci numbers (the most that will fit in a uint64_t) and be done with it.
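A quick way to find that range, sketched in Python under the assumption that the closed form is evaluated in double precision as in the snippet above. The names `fib_exact` and `fib_binet` are made up for this example, and the exact index where the two diverge is left for the script to report, since it can depend on the platform's floating-point `pow`.

```python
import math


def fib_exact(n: int) -> int:
    """Exact Fibonacci using Python's arbitrary-precision integers."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_binet(n: int) -> int:
    """Closed form in double precision, mirroring the snippet above."""
    sq5 = math.sqrt(5.0)
    return round(((1.0 + sq5) / 2.0) ** n / sq5)


# Find where floating-point rounding first disagrees with exact arithmetic.
first_bad = next(n for n in range(200) if fib_binet(n) != fib_exact(n))
print("closed form first diverges at n =", first_bad)

# F(93) is the largest Fibonacci number below 2**64.
assert fib_exact(93) < 2**64 <= fib_exact(94)
```

Below the reported index the closed form can act as an oracle; above it, the exact integer version (or a hard-coded table) is the only trustworthy reference.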
No. You can say no. Just don’t accept it. You’re a human and it asks. Even if you do accept it you can modify it because you have eyes and a keyboard and it’s written right there where you wrote your test.
See https://github.com/rust-analyzer/expect-test for a demo gif of the rust version.
Yes you can except...
> You’re a human
Precisely. You're a human. Humans are lazy and bad at manually checking things are correct, especially if there's an "eh it's probably fine" option.
This is extremely well studied: https://en.wikipedia.org/wiki/Vigilance_(psychology)
As I said before, it's probably better than nothing in that it will help you detect obviously implausible results. But it really needs to be labelled as such otherwise people will assume that these are properly curated "golden" tests.
Expect-testing is a good tradeoff in the short term (time to create tests) and in the long term (quality and size of test suites produced). The evidence for that is that there are pieces of software that need so many tests for their range of functionality, that you cannot test them any other way than in this style. I am talking about testing orders of magnitude more stuff than you could do manually. A great example is the Rust compiler UI test suite (https://github.com/rust-lang/rust/tree/master/tests/ui). It doesn't have to be that your tests have large amounts of noise, like compiler UI tests do. You can make focused and noise-free tests using this method, as the original post examined. The main thing is that writing the tests faster results in bigger test suites and more opportunity to look at the same code on different inputs. I would rather have two dozen tests that required me to look at their output, than three tests that made me think thoroughly about every single assertion. It's just a better use of your time. The rewards are compounded by the massively reduced cost of maintaining the test suite. The tests update themselves when the code does.
Overall, yes you have identified the negative part of the tradeoff. But you seem to have missed every single one of the benefits.
"Regression test" means something else, at least at the companies I've worked at: It means a test that was written after a defect was found in production, to ensure that the same defect doesn't happen again (that the fix doesn't "regress"). It can be a manual test or an automated test. https://en.wikipedia.org/wiki/Regression_testing
One reason to call bug-fix tests “regression tests” (and only those kinds of tests) is that someone might regress the code base through a merge conflict (maybe they effectively undo a commit?). So that’s one argument I suppose.
I'm not particularly wedded to any of these terms, I'm just pointing out that "regression testing" has an established meaning, and it isn't snapshot testing (outside of certain industries, at least). I do find it amusing that one implementation of snapshot testing (https://pypi.org/project/pytest-regtest/) links to https://en.wikipedia.org/wiki/Regression_testing but that article doesn't describe snapshot testing at all! Maybe the article changed? Oh well, language changes too. ¯\_(ツ)_/¯