Bitcoin Core Testing
Speakers: Greg Maxwell
Date: September 23, 2018
Transcript By: Michael Folkson
Tags: Bitcoin core, Testing
Bitcoin Core testing
I believe slower would potentially result in less testing and not likely result in more at this point.
If we had an issue that newly introduced features were turning out to frequently have serious bugs that are discovered shortly after shipping there might be a case that it would improve the situation to delay improvements more before putting them into critical operation… but I think we’ve been relatively free of such issues. The kind of issues that just will be found with a bit more time are almost all already found prior to release.
Quadrupling the amount of (useful) testing would be great. But the way to get there may be via more speed, not less. You might drive safer if you drove a bit slower, but below some speed, you become more likely to just fall asleep while driving.
Imagine a bridge construction crew with generally good safety practices that has a rare fatal accident. Some government bureaucrat swings in and says “you’re constructing too fast: it would be okay to construct slower, fill out these 1001 forms in triplicate for each action you take to prevent more problems”. In some kind of theoretical world the extra checks would help, or at least not hurt. But for most work there is a set of optimal paces where the best work is done. Fast enough to keep people’s minds maximally engaged, slow enough that everything runs smoothly and all necessary precautions can be taken. We wouldn’t be to surprised to see that hypothetical crew’s accident rate go up after a change to increase overhead in the name of making things more safe: either efforts that actually improve safety get diverted to safety theatre, or otherwise some people just tune out, assume the procedure is responsible for safety instead of themselves, and ultimately make more errors.
So I think rather, it would be good to say that things should be safer even if it would be slower. This is probably what you were thinking, but it’s important to recognize that “just do things slower” itself will not make things safer when already the effort isn’t suffering from momentary oopses.
Directing more effort into testing has been a long term challenge for us, in part because the art and science of testing is no less difficult than any other aspect of the system’s engineering. Testing involves particular skills and aptitudes that not everyone has.
What I’ve found is that when you ask people who aren’t skilled at testing to write more tests when they generally write are rigid, narrow scope, known response tests. So what they do is imagine the typical input to a function (or subsystem), feed that into it it, and then make the test check for exactly the result the existing function produces. This sort of test is of very low value: It doesn’t test extremal conditions, especially not the ones the developer(s) hadn’t thought of (which are where the bugs are most likely to be), it doesn’t check logical properties so it doesn’t give us any reason to think that the answer the function is currently getting is right, and it will trigger a failure if the behaviour changes even if the change is benign, but only for the tested input(s). They also usually only test the smallest component (e.g. a unit test instead of a system test), and so they’ll can’t catch issues arising from interactions. These sorts of tests can be worse than no test in several ways: they falsely make the component look tested when it isn’t, and they create tests that spew false positives as soon as you change something which both discourages improvements and encourages people to blindly update tests and miss true issues. A test that alerts on any change at all can be appropriate for normative consensus code which must not change behaviour at all, but good tests for that need to test boundary conditions and random inputs too. That kind of test isn’t very appropriate for things that are okay to change.
In this case, our existing practices (even those of two years ago) would have been at least minimially adequate to have prevented the bug. But they weren’t applied evenly enough. My cursory analysis of the issue suggests that there was a three component failure: The people who reviewed the change had been pre-primed by looking at the original introduction of the duplicate tests which had a strong proof that the test was redundant. Unfortunately, later changes had made it non-redundant apparently with realizing it. People who wouldn’t have been snowed by it (e.g. Suhas never saw the change at all, and since he wasn’t around for PR443 he probably wouldn’t have easily believed the test was redundant) just happened to miss that the change happened, and review of the change got distracted by minutia which might have diminished its effectiveness. Github, unfortunately doesn’t provide good tools to help track review coverage. So this is an area where we could implement some improved process that made sure that the good things we do are done more uniformly. Doing so probably won’t make anything slower. Similarly, a more systematic effort to assure that all functionality has good tests would go a long way: New things in Bitcoin tend to be tested pretty well, but day zero functionality that never had tests to begin with isn’t always.
It takes time to foster a culture where really good testing happens, especially because really good testing is not the norm in the wider world. Many OSS and commercial projects hardly have any tests at all, and many that do hardly have good ones. (Of course, many also have good tests too… it’s just far from universal.) We’ve come a long way in Bitcoin– which originally had no tests at all, and for a long time only had ‘unit tests’ that were nearly useless (almost entirely examples of that kind of bad testing). Realistically it’ll continue to be slow going especially since “redesign everything to make it easier to test well” isn’t a reasonable option, but it will continue to improve. This issue will provide a nice opportunity to nudge people’s focus a bit more in that direction.
I think we can see the negative the effect of “go slower” in libsecp256k1. We’ve done a lot of very innovative things with regard to testing in that sub-project, including designing the software from day one to be amenable to much stronger testing and had some very good results from doing so. But the testing doesn’t replace review, and as a result the pace in the project has become very slow– with nice improvements sitting in PRs for years. Slow pace results in slow pace, and so less new review and testing happen too. … and also fewer new developments that would also improve security get completed.
The relative inaccessibility of multisig and hardware wallets are probably examples where conservative development in the Bitcoin project have meaningfully reduced user security.
It’s also important to keep in mind that in Bitcoin performance is a security consideration too. If nodes don’t keep well ahead of the presented load, the result is that they get shut off or never started up by users, larger miners get outsized returns, miners centralize, attackers partition the network by resource exhausting nodes. Perhaps in an alternative universe where it in 2013 when it was discovered that prior to 0.8 nodes would randomly reject blocks under ~>500k if the block size were permanently limited to 350k we could have operated as if performance weren’t critical to the survival of the system, but that it’s what happened. So, “make the software triple check everything and run really slowly” isn’t much of an option either especially when you consider that the checks themselves can have bugs.