Randomise your testing

19th January 2018

If you can randomise your testing, you should. Instead of trying to come up with exhaustive test cases by hand, it is better to randomly generate test cases in order to exercise more of your code.

Why does randomised testing help?

Imagine a function to search through a string, looking for a regular expression. Can you write down a set of strings and expressions that are sufficient to be confident it is bug-free? How long would that take?

Okay, now write a function to generate a random string. With that and some of the techniques below, you can test your regular expression matcher to the point where you can be confident it is perfect.

Randomised testing depends on being able to generate a test case and then check your code against it. The trick is finding a way to check your code without having to write a test for each random input.

Ways to do randomised testing

Generate and verify: the simplest way to do anything is not to do it at all. Instead, use an existing, trusted piece of code to check that a new piece of code behaves the same. Give the same random inputs to the old code and the new code, and check the answers match in every case.

This is useful in a variety of cases. If you are re-implementing something, perhaps to move to a new platform or language, or to make a more optimised version, this can be a powerful technique.

In our regular expression example, you would generate random strings and regular expressions, and check that both implementations report the same set of matches. One caveat: restrict the size of the alphabet used so that there actually are some matches, otherwise you will mostly be exercising the non-matching path.
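
As a concrete sketch in Python: the names random_string, random_pattern and new_findall below are my own hypothetical stand-ins (new_findall is the matcher you are testing), with the standard library's re module playing the trusted implementation.

    import random
    import re

    def random_string(rng, alphabet="ab", max_len=10):
        # A small alphabet keeps the match rate up.
        return "".join(rng.choice(alphabet) for _ in range(rng.randint(0, max_len)))

    def random_pattern(rng, alphabet="ab", max_pieces=5):
        # Deliberately restricted regex syntax: literals, '.', and an optional '*'.
        pieces = []
        for _ in range(rng.randint(1, max_pieces)):
            piece = rng.choice(list(alphabet) + ["."])
            if rng.random() < 0.3:
                piece += "*"
            pieces.append(piece)
        return "".join(pieces)

    def compare_matchers(new_findall, runs=10_000, seed=0):
        rng = random.Random(seed)                  # fixed seed => repeatable failures
        for _ in range(runs):
            pattern, text = random_pattern(rng), random_string(rng)
            expected = re.findall(pattern, text)   # trusted implementation
            actual = new_findall(pattern, text)    # code under test (hypothetical)
            assert actual == expected, (pattern, text, actual, expected)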

Equality != ==

In all of these techniques, we are checking different outputs are equal. Depending on the problem, there may be different possible ways of representing each answer, so you might need to normalise before checking for equality.
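
For instance, two implementations might return the same matches in a different order. A tiny sketch of one possible normalisation (sorting is just one option; the right normalisation depends on your problem):

    def normalise(matches):
        # Impose a canonical order before comparing, rather than
        # relying on list equality.
        return sorted(matches)

    assert normalise(["b", "a", "a"]) == normalise(["a", "a", "b"])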

Round-trip: if you are writing two functions that are inverses of each other, then you can compose them and check that the composition does nothing. So, if you had a function CamelCase to turn snake_case into CamelCase, and another, snake_case, to turn CamelCase back into snake_case, then you could feed the output of one into the other and check that snake_case(CamelCase(snake_input)) == snake_input.

You could also do the test the other way, and check that CamelCase(snake_case(CamelInput)) == CamelInput.
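
A minimal sketch of the first round-trip check, assuming hypothetical converters to_camel and to_snake standing in for the CamelCase and snake_case functions above:

    import random
    import string

    def random_snake(rng, max_words=4, max_word_len=6):
        # e.g. "foo_bar_baz"; non-empty lowercase words keep the conversion unambiguous.
        words = ["".join(rng.choice(string.ascii_lowercase)
                         for _ in range(rng.randint(1, max_word_len)))
                 for _ in range(rng.randint(1, max_words))]
        return "_".join(words)

    def test_round_trip(to_camel, to_snake, runs=10_000, seed=0):
        rng = random.Random(seed)
        for _ in range(runs):
            snake_input = random_snake(rng)
            # Composing the two converters should give back exactly what we started with.
            assert to_snake(to_camel(snake_input)) == snake_input, snake_input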

Check the intermediate

Sometimes, a round-trip is insufficient to check your code: it may not find all errors, particularly for malformed inputs or very simple functions (a pair of identity functions round-trips perfectly, after all). In those cases, also check that the intermediate value looks the way you expect.
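
Continuing with the same hypothetical to_camel converter, an intermediate check might look like this:

    def check_intermediate(to_camel, snake_input):
        camel = to_camel(snake_input)
        # The round-trip alone can't tell a real converter from the identity
        # function, so also check the intermediate value really is CamelCase:
        # no underscores left behind.
        assert "_" not in camel, camel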

Generate and test: generate the expected result alongside the test case. This way you don't need another implementation, but you can be sure you are covering all the cases you can generate. Generate and test has a few fiddly bits, in particular: making sure that there is only one correct output (or that you can list, or at least cheaply check, all of the correct outputs); and making sure you don't accidentally create test cases that lack the properties you expect (as we'll see later).

Fuzz: generate random but valid (or not) inputs and check no errors occur (or only expected errors). This doesn't completely check the function you're testing, but still has value, particularly for security-sensitive functions or ones with gnarly memory handling.
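
A minimal sketch of such a fuzz loop, assuming a hypothetical parse function that is allowed to reject bad input with ValueError but should never fail in any other way:

    import random

    def fuzz(parse, runs=100_000, seed=0):
        rng = random.Random(seed)
        for _ in range(runs):
            data = bytes(rng.randrange(256) for _ in range(rng.randint(0, 64)))
            try:
                parse(data)
            except ValueError:
                pass                    # rejecting bad input is expected
            except Exception as err:    # anything else is a bug worth looking at
                raise AssertionError(f"seed={seed}, input={data!r}") from err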

Generally, the higher the quality of the test cases you generate, the more subtle the bugs you will find. By quality, I mean how realistic they are compared to the inputs the function is designed to handle. So, a parser for a programming language is going to expose a lot more of its behaviour if you feed in strings made up of tokens for that language than completely random garbage. Mostly syntactically correct programs will give even better results. See https://blog.regehr.org/archives/1039 for more thoughts on levels of test cases.

Another option is to take a set of valid inputs and then permute them by changing values or adding or removing blocks. See https://lcamtuf.blogspot.co.uk/2014/08/binary-fuzzing-strategies-what-works.html for a good set of permutations to try (plus a more general fuzzing tool).
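
A sketch of that mutation idea, assuming you already have a corpus of valid inputs as bytes; the bit flip, block deletion, and block duplication below are just a few of the strategies described in the post above:

    import random

    def mutate(data: bytes, rng: random.Random) -> bytes:
        data = bytearray(data)
        choice = rng.randrange(3)
        if choice == 0 and data:                 # flip a single bit
            i = rng.randrange(len(data))
            data[i] ^= 1 << rng.randrange(8)
        elif choice == 1 and len(data) > 2:      # delete a random block
            start = rng.randrange(len(data))
            del data[start:rng.randrange(start, len(data) + 1)]
        else:                                    # duplicate a random prefix at a random point
            start = rng.randrange(len(data) + 1)
            data[start:start] = data[:rng.randrange(len(data) + 1)]
        return bytes(data)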

You'll want your test cases to be repeatable, either using a fixed random seed, or printing the seed used, so that if there is an unexpected error, you can reproduce it.
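
One way to arrange that in Python is to derive all randomness from a single seed that you print (and can pass back in), a sketch:

    import random
    import sys
    import time

    seed = int(sys.argv[1]) if len(sys.argv) > 1 else int(time.time())
    print(f"random seed: {seed}")    # paste this back in to reproduce a failure
    rng = random.Random(seed)        # draw all randomness from rng, not the global module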

American Fuzzy Lop

Coverage-guided fuzzing was made popular by American Fuzzy Lop (AFL). AFL operates on compiled code and instruments the branches within it to identify inputs that trigger novel behaviour. Because it can tell which changes to the input are interesting, it can fuzz effectively without a problem-specific test case generator.

Use AFL if you are looking for bugs in compiled programs.

Examples

First up, let's consider the knapsack problem¹. We pass a function a set of numbers and a target, and it returns a subset of the numbers that add up to the target, or an error if this can't be done. We don't have a knapsack solver already, and we don't have a pair of reversible functions, and we aren't too worried about errors, so we'll need to use the generate and test strategy.

We can make a solvable input easily enough: pick a random sum, split it into a set of numbers, and then add some other numbers to that set. Say the sum is 100. We split it into {10, 40, 50} and then add {20, 30, 95} to get {10, 20, 30, 40, 50, 95}. One problem with this approach is that there might be more than one solution: in this case, {10, 20, 30, 40} and {20, 30, 50} are also valid. We also need some inputs with no solution, but how can we be sure a randomly chosen set doesn't contain a subset with a particular sum?

Let's rethink. The knapsack problem is an NP-complete problem. This means that checking a solution is valid is relatively easy, even if finding the solution is hard. So we can generate solvable sets and random sets, test our code on them, and check the results easily by making sure the returned subset is drawn from the set and sums to the target number. We should also check that for every solvable set, some solution is found. You can check out the code demonstrating this here. I've skipped over ensuring the random numbers are repeatable, since I print out enough details about any errors to reproduce them.
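
Here is my own sketch of that strategy (not the code linked above), assuming a hypothetical solve(numbers, target) that returns a subset summing to target, or None if it decides there isn't one:

    import random

    def solvable_case(rng, target_max=100, extra=3):
        # Build an input that definitely has a solution: split a random target
        # into parts, then mix in some extra numbers.
        target = rng.randint(1, target_max)
        parts, remaining = [], target
        while remaining > 0:
            part = rng.randint(1, remaining)
            parts.append(part)
            remaining -= part
        return parts + [rng.randint(1, target_max) for _ in range(extra)], target

    def check_solver(solve, runs=10_000, seed=0):
        rng = random.Random(seed)
        for _ in range(runs):
            if rng.random() < 0.5:
                numbers, target = solvable_case(rng)
                must_solve = True
            else:
                numbers = [rng.randint(1, 100) for _ in range(rng.randint(0, 8))]
                target = rng.randint(1, 100)
                must_solve = False                  # a solution may or may not exist
            result = solve(numbers, target)
            if result is None:
                # A solvable case must never come back unsolved.
                assert not must_solve, (numbers, target)
            else:
                # Verifying a claimed solution is cheap: it must be drawn from the
                # input (respecting multiplicity) and sum to the target.
                pool = list(numbers)
                for x in result:
                    assert x in pool, (numbers, target, result)
                    pool.remove(x)
                assert sum(result) == target, (numbers, target, result)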

Another good example is testing undo. If you have a state you can serialise or store a reference to, and you can simulate arbitrary actions, then fuzz testing can give you confidence your undo is correct.

Simply carry out an arbitrary number of actions, recording the state after each one, then undo a number of them, and check the serialised/stored state matches the state you recorded at that earlier point in the history. You can also check that redo works correctly by doing a number of redo actions and again checking you arrive back at the corresponding later state.
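
A sketch of that check, assuming a hypothetical document object with apply(action), undo(), and snapshot() methods, plus a random_action generator; redo can be checked the same way by replaying forward through the same list of snapshots:

    import random

    def check_undo(make_document, random_action, runs=1_000, seed=0):
        rng = random.Random(seed)
        for _ in range(runs):
            doc = make_document()
            history = [doc.snapshot()]              # state before any actions
            for _ in range(rng.randint(1, 20)):
                doc.apply(random_action(rng))
                history.append(doc.snapshot())      # state after each action
            undos = rng.randint(1, len(history) - 1)
            for _ in range(undos):
                doc.undo()
            # After undoing `undos` actions we should be back at that earlier state.
            assert doc.snapshot() == history[-1 - undos]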

Conclusions

Hopefully I've convinced you that complex functions can be tested easily, and with better coverage, using randomised methods. You've seen a number of methods for doing this testing and when to apply each one: generate and verify if you have an existing, trusted implementation to compare against; generate and test if you can be sure of the right answers (or can cheaply check an answer) for the cases you generate; round-tripping to test pairs of inverse functions or undo/redo; and fuzzing to look for badly handled errors.

Quickcheck

I would have to hand in my Computer Science card if I didn't mention QuickCheck. QuickCheck is a Haskell library (with ports for many other languages) that gives you a simple set of tools for constructing randomised property tests like the ones described above. It can generate instances of all the built-in types automatically, and you can also define custom generators. You then define properties that the functions you want to test should satisfy, and QuickCheck runs each property on random inputs repeatedly until a certain number of runs pass or a counter-example is found.

Footnotes

¹ Today I learned that the knapsack problem has items with a weight and a value, and you're trying to maximise the value given a weight limit, whereas the subset sum problem is trying to find a subset of a set with a particular sum. The subset sum problem is equivalent to a knapsack problem where you set each item's weight equal to its value. Anyway, we'll stick to the simpler subset sum problem.