
Discussing Pre-0.21.0 Bitcoin Core Vulnerability Disclosures

Speakers: Niklas Gögge, Gloria Zhao

Date: July 11, 2024

Transcript By: kouloumos via tstbtc v1.0.0 --needs-review

Tags: Bitcoin core

Media: https://podcasters.spotify.com/pod/show/bitcoinbrink/episodes/Discussing-Pre-0-21-0-Bitcoin-Core-Vulnerability-Disclosures-e2ltlkr


Introductions and motivation for disclosures

Speaker 0: 00:00:00

Hello, Niklas.

Speaker 1: 00:00:01

Hi, Gloria.

Speaker 0: 00:00:02

Cool. We are here to talk about some old security vulnerabilities that were disclosed last week for Bitcoin Core versions before 0.21.0, which have been end of life for a while. This is going to be quite technical, but we’ll try to make it accessible. So if you’re a Bitcoiner who’s not a developer, you’ll hear a lot of interesting concepts, and of course, if you are a developer, you’ll get a little bit more out of it. I think this might also be interesting for software engineers in general, even if you’re not a Bitcoiner, because there are some really fun examples: an innocuous integer overflow that leads to fundamentally not being able to do the main thing this application is supposed to do, or not deduplicating your nested loops, or forgetting to remove something from a set, meaning you stall the program for several minutes. So it’s pretty fun. I recommend you keep listening. Let’s do intros. I’ll go first. I’m Gloria, one of the maintainers of Bitcoin Core. I work mostly on peer-to-peer and mempool related stuff, and I’m sponsored by Brink.

Speaker 1: 00:01:22

Yeah, I’m Niklas. I also contribute to Bitcoin Core. I mostly do fuzzing and security engineering type work trying to find bugs and prevent bugs being introduced. And I’m also sponsored by Brink.

Speaker 0: 00:01:36

Cool. Do you want to give us maybe some short background on kind of why we’re doing this, these disclosures and what the thinking process behind some of this was?

Speaker 1: 00:01:47

Yeah, so a while ago, we sort of realized that we weren’t doing an amazing job at disclosing bugs that have happened in the past. And that sort of has, at least we think, led to a perception that Bitcoin Core doesn’t really have any bugs, which is definitely not true. So we set out to improve the situation and went through all the old bugs, or at least the ones we could find, to try and come up with a policy and a way to disclose them going forward.

Speaker 0: 00:02:16

I guess in general, having this kind of security policy helps encourage security researchers to look at the project, knowing that there is a proper disclosure process and that they’ll be credited accordingly, right?

Speaker 1: 00:02:34

Yeah, yeah. So maybe if we do this, we’ll attract some external security researchers. At the very least, it can act as a bit of a guideline for them, for what they can expect. Obviously, we can’t force anyone to follow these rules, but it’s sort of our recommendation, I guess.

Speaker 0: 00:02:52

Cool. And hopefully, the goal of this podcast is to kind of maybe explore why these bugs happen, what we did right to find them, and fix them, and now report them, and maybe noodle on some things that we can be doing more of, like fuzzing. OK, so why don’t we dive in?

Absolute value of a signed integer leads to rejection of all blocks

Speaker 0: 00:03:17

OK, I think a good place to start is the adjusted time one: netsplit due to malicious peer-to-peer messages by the first 200 peers. I think that’s a good place to start because it’s probably the oldest bug; adjusted time was introduced by Satoshi. So, just to give a bit of background: each block in the blockchain has a timestamp, and this is a very important piece of data that we use, for example, to calculate difficulty adjustments, because it tells us approximately how long it took to mine this block after the last one. Now, there’s no consensus on the network as to what time it is, but you still need to make sure that these timestamps reflect what’s happening in the real world. And because there’s no inherent data in the chain that could verify for us what time it is in the real world, we just have this local block acceptance logic that says: if you receive a block with a timestamp more than two hours in the future, you’re going to drop it, and you can reconsider it later once it’s only off by a few minutes or so. What this effectively means is that a miner couldn’t just say, oh, it took me five years to mine this block, set a bogus timestamp, and get everybody to accept it. So this relation between real-world time and block timestamps is an important part of the network.

But like we alluded to, there’s no consensus on time. Different computers on a distributed network might have slightly different ideas of what the current time is; small differences are entirely normal, and of course, if there’s something wrong with your computer, you might have a weird idea of what time it is. What Satoshi introduced was this concept of adjusted time: the first 200 peers you complete the version handshake with each tell you their idea of the current time, and on your node you compute an offset, the difference between your time and their time, which can be positive or negative depending on whether they’re ahead of or behind you. We take the median of the first 200 offsets and add it to our current time to get the adjusted time, and that’s the time we use for this block acceptance rule, where we won’t allow timestamps more than two hours into the future. This is to help you modulate for differences on the network. It’s a cool idea, but there are also safeguards just in case everybody looks completely different from you. In that case, we’d probably want to tell the user: hey, maybe the problem is you, or maybe the first 200 peers you connected to are trying to attack you. That’s the scenario we’re thinking about here. So before incorporating a sample into the median for adjusted time, we would calculate the absolute value of the time offset (the difference between our time and their time), and if that was greater than, I think it was 70 minutes, we would not incorporate that sample into the median. This is to protect us from, say, a malicious peer sending us a totally bogus time to mess with our adjusted time.

But the absolute value function doesn’t work on the minimum value of a signed integer type. For the computer science students: an n-bit signed integer, as we know, stores the values -2^(n-1) through 2^(n-1) - 1. Those are the allowed values. An 8-bit integer stores -128 to 127, and obviously the absolute value of -128 is 128, which doesn’t fit in that 8-bit type. I think the undefined behavior sanitizer essentially caught this bug; it flagged: why are you taking the absolute value of the minimum value of int64? That’s UB. I think it was practicalswift who saw this and then traced out the implications of where it’s applied. Basically, if somebody sends us a timestamp so far behind us that our time offset is the minimum value, a hugely negative number, then the absolute value is, well, UB, but what typically happens is that the result is still that hugely negative number. That’s always going to be less than whatever maximum we set, because it’s the minimum value the integer can store, so the condition would always succeed, and we would incorporate this extremely negative number into the adjusted time. So what this all means, the actual implication, is that if the first 200 peers send us these super low time values, we end up with a really low adjusted time because we incorporated all of their offsets, and then we would reject all the blocks coming out today, because we’re comparing them against an adjusted time that’s way back in the past while the real timestamps are, hopefully, somewhat accurate. You’d just fall off the chain. You wouldn’t be able to download any of the Bitcoin blocks. You wouldn’t be able to Bitcoin. All because we applied absolute value to this number, which on the surface makes total sense: you take the offset, you take its absolute value, because you want it to be within a certain bound. Anyway, let’s go back. The fix was to stop using this abs64 function and to use something else, I guess.
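To make the bug concrete, here is a minimal sketch of the pattern being described. This is an assumed shape, not the actual Bitcoin Core code; abs64 stands in for the old helper, and the 70-minute limit comes from the discussion above:

```cpp
#include <cstdint>
#include <iostream>

// Negating INT64_MIN is undefined behavior: the true absolute value (2^63)
// doesn't fit in int64_t. In practice the result typically wraps back to
// INT64_MIN, which is still negative.
static int64_t abs64(int64_t n) { return n >= 0 ? n : -n; }

int main() {
    const int64_t max_offset = 70 * 60; // 70 minutes, in seconds
    const int64_t offset = INT64_MIN;   // attacker-chosen time offset
    // The sanity check that was supposed to reject absurd offsets: because
    // the "absolute value" is hugely negative, the check passes.
    if (abs64(offset) < max_offset) {
        std::cout << "bogus offset accepted into the median: " << offset << '\n';
    }
    return 0;
}
```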

Speaker 1: 00:10:10

What we do now is compare directly against the negative bound and the positive bound instead of using the abs function. So you check that the offset is greater than the negative maximum and smaller than the positive maximum; you just use the bounds directly. Yeah. I don’t have the code in front of me.
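The shape of that fix might look like the following sketch (hypothetical constant name; the exact patch isn’t quoted here):

```cpp
#include <cstdint>

constexpr int64_t MAX_TIME_OFFSET = 70 * 60; // 70 minutes, in seconds

bool OffsetWithinLimit(int64_t offset) {
    // No abs() involved, so INT64_MIN is handled correctly: it fails the
    // first comparison and the sample is simply rejected.
    return offset > -MAX_TIME_OFFSET && offset < MAX_TIME_OFFSET;
}
```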

Speaker 0: 00:10:33

Right, yeah. And this was fixed, what was it, several years ago, for 0.21. It only affects version 0.20.1 and before, so it was fixed in the version after that. But since then, earlier this year actually, your PR removed adjusted time from this logic. Nowadays we just use our local time, but we keep the logic where we look at our peers’ times (not all of them, the first N), and we’ll warn the user if their clock seems to be way off.

Speaker 1: 00:11:21

I wanted to note that this is somewhat hard to exploit, because as the attacker, you need to be among the first 200 peers connecting to the victim. So if your node has been online for a while, you’re basically safe from this, even if you’re still running this old version.

Speaker 0: 00:11:39

Yeah, so don’t start on this version.

Speaker 1: 00:11:44

Yeah. And then also, it looks like you need to somewhat guess what your victim’s time is, so that you can hit the offset exactly, right?

Speaker 0: 00:11:56

But I guess you have more than one try. Yeah. Well, you need the median to work out, right?

Speaker 1: 00:12:08

Yeah.

Speaker 0: 00:12:09

You don’t have to be all of them, for sure, but you’d need to be quite a few of them.

Speaker 1: 00:12:14

But if you get the offset, let’s assume you get the offset right all the time, then you still need to connect like a hundred times, so that the negative samples dominate, because the list will be sorted and then you take the median.

Speaker 0: 00:12:30

Right. But you’re not totally guessing, right? In the first version handshake you do, they’ll tell you what their time is, and then you can just go from there. So maybe you guess on the first try, but you knew what their time was at that point, so you’re like: okay, it’s N seconds from now, just send them that plus N.

Speaker 1: 00:12:57

But yeah. Yeah. Overall, it’s just tricky to exploit. That’s why we ranked it as medium and not high.

Speaker 0: 00:13:06

Yeah, yeah, yeah. By the way, the purpose of this podcast is not to scare anybody. You’re good as long as you don’t start a new node today with version 0.20, which I think is quite reasonable.

Speaker 1: 00:13:22

Yeah. Also, if you would have been attacked, you could have just restarted your node and you’re fine because the two-hour rule is not a consensus rule. So if your time adjusts back to normal, you start accepting these blocks.

Speaker 0: 00:13:37

Yeah, you’re not going to forever reject the blocks. It’s like a temporary current block acceptance rule. Okay, should we move on to the next one?

Speaker 1: 00:13:49

Yeah.

Too many misbehaving peers leads to DoS

Speaker 1: 00:13:50

So the next issue here is a CPU and memory DoS issue, also on the P2P message processing side. I’ll give some background on the concept of a ban list, and I hope I won’t confuse anyone, because the meaning of the ban list has changed throughout the years. It’s also a pretty old concept, so I wasn’t even around when it was first introduced. But initially, what it was, is: if one of your peers misbehaves, you ban them. That means you disconnect them and you won’t allow them to reconnect to you. So you take their IP, put it on a list (the ban list), and essentially blacklist them from ever connecting to you again. This was later relaxed to no longer be a strict ban: you still disconnect for misbehavior, and you remember the IP, but you allow them to reconnect, although they’ll be preferred for eviction. That’s the state the code was in when the bug was discovered.

There are really two separate things here. One is that the ban list didn’t have a bound on its size. So if you connect from lots of different IPs and misbehave, the victim will just store all of these IPs in the ban list, and if you’re using something like IPv6, getting lots of IPs is pretty easy. The second thing is the handling of getaddr messages. Getaddr is basically a way for your peers to query your database of addresses of nodes on the network, and in response to a getaddr message, you send a list of addresses. For each address in our response, we would query the ban list to check whether that address was banned, because we don’t want to relay banned addresses. But that ban list query was linear: it iterated over the entire ban list to see if the address was banned. Combined with the first thing, the list being unbounded in size, you get quadratic behavior, because for each address in your response, you go through every address in the ban list, and that just makes you really slow. Yeah, and that’s pretty much it. Your node would just stall if this were exploited.
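A sketch of that quadratic pattern, with the fix shape alongside. This is hypothetical code, not the actual Bitcoin Core implementation:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Buggy shape: membership test is a linear scan over an unbounded list.
bool IsBannedLinear(const std::vector<std::string>& banlist, const std::string& addr) {
    for (const auto& banned : banlist) { // O(|banlist|) per lookup
        if (banned == addr) return true;
    }
    return false;
}

// Answering one getaddr costs O(|addresses| * |banlist|): quadratic once an
// attacker has grown the ban list with throwaway IPs (easy with IPv6).
std::vector<std::string> GetAddrResponse(const std::vector<std::string>& addresses,
                                         const std::vector<std::string>& banlist) {
    std::vector<std::string> response;
    for (const auto& addr : addresses) {
        if (!IsBannedLinear(banlist, addr)) response.push_back(addr);
    }
    return response;
}

// Fix shape: bound the structure's size and use average O(1) lookups.
bool IsBannedFast(const std::unordered_set<std::string>& banset, const std::string& addr) {
    return banset.count(addr) > 0;
}
```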

Speaker 0: 00:16:13

Seems like a pretty basic bug. Well, for the longest time there were so many data structures that were just unbounded, and a lot of the security work has just been fixing that.

Speaker 1: 00:16:30

I think this is a pretty typical type of bug, where you’ve written some code that handles untrusted input, but you haven’t necessarily thought about it, or you didn’t have an adversarial mindset when you were writing it,

Speaker 0: 00:16:43

right?

Speaker 1: 00:16:43

So you didn’t really think about what happens when a bunch of randos connect to you and send you a bunch of garbage.

Speaker 0: 00:16:48

Yeah, I think it’s very typical of, basically, Satoshi and many others wrote this proof of concept that was amazing, and, not to knock on them, but it was a proof of concept. It’s also a really security-critical piece of software, at least nowadays, now that the stakes are so much higher: the price of Bitcoin is not zero, but tens of thousands of dollars. Along the way, Bitcoin wasn’t shut down, rewritten, and redeployed as a non-proof-of-concept version; it was just edited over time. And over time, you’re like, oh, oops, we didn’t bound this data structure. Cool, let’s fix that. So did you want to – I guess you already gave background on the current banning, or not-banning. We don’t do that anymore, but we still have manual bans, right? Like you can still manually provide a list. Sorry, go ahead.

Speaker 1: 00:17:46

Yeah, so I talked about the ban list evolving into not being strict bans. But because it was still called the ban list, there came another change that renamed the non-strict banning to discouragement, and kept the concept of a ban list as a manual ban list, where you can configure your node to actually ban IPs or ranges of IPs. That’s still what we have today. I’m pretty sure that PR was also the one that fixed this issue, by removing the unbounded growth of the ban list and getting rid of the quadratic behavior.

Speaker 0: 00:18:26

I guess we can also talk about the idea of banning and how our philosophy around it has evolved over the years. Because it comes from this adversarial mindset about network peers: we don’t know who they are, and they might misbehave, for example by sending us tons of garbage data to validate that never ends up being useful. So we should have some way of protecting ourselves from that. But banning is, A, not very effective, because like you said, you can just create lots of nodes from different IP addresses to skirt any one ban. And it can also be pretty dangerous, right? We might have unintentional banning behavior that leads to network splits. Say we’re a little bit too aggressive with the ban hammer and accidentally have a bug that applies it to something completely normal; maybe it boils down to different timestamps, or to some difference in your mempool policy, or something like that. If we have a bug where we punish for behavior that honest peers will sometimes accidentally exhibit, because of some race condition, for example, then this can be really dangerous, because the banning behavior has suddenly caused the network to split, and in addition to that, the nodes won’t reconnect to each other. That’s potentially really dangerous, and not very effective at preventing the malicious behavior in the first place. So that’s kind of why we don’t ban too much, but we definitely discourage. We’ll be like, oh, this guy’s wasting my time, I’m probably going to evict them if I have more useful peers. Yeah, especially automatic banning.

Speaker 1: 00:20:29

Yeah. The manual banning can still be useful: if you notice lots of misbehavior from some subnet, you can just ban the entire range of IPs that are doing it, whereas the automatic one could pretty much only ban specific IPs.

Speaker 0: 00:20:49

And that’s very good.

Speaker 1: 00:20:50

I mean, it could technically also ban ranges, but that would be very dangerous.

Speaker 0: 00:20:53

Yeah.

Speaker 1: 00:20:54

And probably easy to manipulate, so you wouldn’t want that.

Speaker 0: 00:20:56

Right, and then any bug that we have is even worse, because as an attacker you could get a whole range banned. That could be pretty bad. At least that’s not the bug that we’re talking about here. Cool. Should we move on?

Speaker 1: 00:21:14

Yes.

Speaker 0: 00:21:15

Okay.

Nested loop without deduplication leads to stalling

Speaker 0: 00:21:17

My turn. So this one’s the orphan processing one, called CPU DoS/stalling due to malicious peer-to-peer message. A bit of background: we have the UTXO-based model, so transactions have inputs and outputs, and each input refers to an output of a previous transaction (a UTXO) in order to spend whatever amount of Bitcoin is in that output. This means that to validate any transaction, you need its chain of ancestor transactions, or, as long as you have the UTXOs it spends, you’re good; but you can’t validate transactions out of order. I don’t know if I explained that amazingly. So you can have what’s called an orphan transaction, where you aren’t aware of the UTXOs it’s spending: you try to look up the inputs of that transaction and you turn up empty, either because the inputs don’t exist at all or, what’s pretty common, because it spends a parent transaction that you haven’t seen yet. Maybe it’s a CPFP, who knows? So as part of our peer-to-peer net processing logic, we have an orphan pool, which we’ve had for, I think, at least 12 years now. I’m not sure if it was exactly Satoshi, but it’s definitely Satoshi-era. We store these transactions there to validate later, say when we do receive the parent. This is a bandwidth optimization, and also, in the past, we wouldn’t have been very proactive about re-requesting the missing parent, and we might just never hear about this transaction again, so we might as well hold on to it for a little bit. Now, the net processing logic in version 0.17.2 and before was: every time we accept a new transaction to the mempool, for each output of this transaction, find all of the orphans in the orphan pool that spend that output and process each of those orphans. Looking at it a few days ago, it’s immediately obvious that it forgot to deduplicate orphans that had already been processed, and it also wasn’t bounding how many orphans were processed at a time. So the worst-case behavior here is that the accepted transaction has a huge number of outputs, like thousands, and every orphan in the orphan pool spends every single output of that transaction. The orphan pool, I think, was bounded at the time to 100 transactions, and still is, although I think 10 years ago it wasn’t bounded at all and they fixed that. Anyway, you could have several thousand times 100 iterations of this loop, because you’d, A, be processing the same orphan over and over again, and B, they’re all spending every single output. Of course, these aren’t valid orphans; they don’t need to be. But it does cause your node to just stall in this loop for a really long time. And so the fix was to fix those two things that I mentioned. One: instead of processing orphans inside this loop, we first build a work queue of the orphans we want to process, and that’s a set, so it’s deduplicated in case an orphan spends multiple outputs of this transaction. Two: it’s interruptible, meaning we’ll process one and then come back to the work queue later. And so that fixed the stalling bug. Since then, we’ve changed a few other things about orphan processing. For example, okay, so the orphan and the parent might be provided by different peers.

So we separate those, because we’re trying to protect each peer’s message processing time. I’m not explaining this very well, but maybe the most important point is: when we first receive the parent, we actually process zero orphans; we just build the work queue. Then, when we come to the message processing loop for the peer that provided an orphan, we’ll process one, then stop, then come back to it later, process one more, stop, et cetera. So we’re never processing more than one orphan at a time, and we’re never processing an orphan on a different peer’s time, if that makes sense.
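A sketch of the two fixes described above, with hypothetical structure rather than the actual net_processing code: build a deduplicated, set-backed work queue instead of re-processing an orphan once per spent output, and drain it one orphan per pass so processing is interruptible:

```cpp
#include <cstdint>
#include <deque>
#include <map>
#include <set>
#include <utility>
#include <vector>

using TxId = uint64_t;                       // stand-in for a real txid type
using Outpoint = std::pair<TxId, uint32_t>;  // (txid, output index)

// For each outpoint, the orphans that spend it.
using OrphanIndex = std::map<Outpoint, std::vector<TxId>>;

// Called once when a new transaction is accepted to the mempool.
std::deque<TxId> BuildOrphanWorkQueue(TxId accepted, uint32_t n_outputs,
                                      const OrphanIndex& index) {
    std::set<TxId> seen;   // dedup: an orphan spending many outputs is queued once
    std::deque<TxId> queue;
    for (uint32_t i = 0; i < n_outputs; ++i) {
        auto it = index.find({accepted, i});
        if (it == index.end()) continue;
        for (TxId orphan : it->second) {
            if (seen.insert(orphan).second) queue.push_back(orphan);
        }
    }
    return queue;
}

// Called once per message-processing pass: handle at most one orphan, then
// yield so other peers' messages (e.g. a new block) get a turn.
bool ProcessOneOrphan(std::deque<TxId>& queue) {
    if (queue.empty()) return false;
    TxId orphan = queue.front();
    queue.pop_front();
    (void)orphan;          // validate the orphan here
    return !queue.empty(); // true if more work remains for a later pass
}
```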

Speaker 1: 00:26:17

Yeah. I guess the important part here is that we will process messages from other peers in between processing the orphan work queue, so that one peer can’t cause you to just stall and not process messages from other peers.

Speaker 0: 00:26:32

Right, yeah. In this terrible orphan stalling loop, you’d be stuck for minutes. And if another peer sends you, say, a block, which is definitely more important than processing an orphan transaction, you’d still be stuck. You’d never think: maybe I should look at some other messages that people have sent me. You’d just be in uninterruptible orphan hell.

Speaker 1: 00:27:04

Yeah, and maybe to some people, stalling for minutes doesn’t sound all that bad. But if you’re not able to process blocks on that time frame, if it takes you longer than 10 minutes to process a block, essentially, then the whole network won’t be able to converge, and that’s obviously really bad. So stalling for minutes is something we definitely want to avoid.

Speaker 0: 00:27:26

Right. Yeah. So that’s bad. Should we move on to the next one?

Speaker 1: 00:27:33

Yeah, sure.

Vulnerability in dependency leads to potential RCE

Speaker 1: 00:27:34

So the next one is a potential remote code execution due to a bug in miniUPnPc. To give some background: nodes that accept incoming connections on the Bitcoin network are somewhat of a scarce resource, and one reason for that is that accepting connections is not entirely trivial. For example, if you’re running your node at home, it’s sitting in your local network, and by default it won’t be exposed to the public internet. Something you can do is configure manual port forwarding. Usually your router will support that, but it requires you to log into your router interface and set up the configuration, so for non-technical users, it isn’t really something most people would do. So in Bitcoin Core we have, at the moment, two different ways of automatically configuring port forwarding. One of these is the UPnP protocol, and miniUPnPc is a library that implements that protocol. If this is enabled, your node will automatically talk to your router and set up the port forwarding so that your node is publicly reachable. And the bug here was that with a malicious router (say you’re on a public network, not at home) and your node connected to this malicious UPnP server, the client code essentially had a bug that might enable remote code execution.

Speaker 0: 00:29:12

Yeah. Upstream had a CVE, upstream being miniUPnPc.

Speaker 1: 00:29:20

Right. So there was a CVE, and then Vlad was triggered by that finding to go and look for more bugs, and he found another stack-overflow-type bug. I think it was the combination of the two that enabled the remote code execution. He had a proof of concept for exploiting it, but not inside of Bitcoin Core as far as I’m aware; it was a proof of concept against a simple dummy miniUPnP client. So we don’t technically know if this was exploitable in Bitcoin Core, but it could have been the case.

Speaker 0: 00:30:05

Do you want to explain RCE to people who aren’t aware?

Speaker 1: 00:30:10

Yeah, sure. Remote code execution is probably the worst type of bug you can have. It allows the attacker to execute code on your machine that is not supposed to be executed. Usually that involves sending the code to the victim and then getting the victim to execute it, and with a bug like this, you can basically get the machine to do whatever you want. I think in this case, since it’s only exposed on the local network, it’s not as bad. But if we had an RCE bug in the public-facing P2P code, that would be the worst bug that could exist, because you could do essentially whatever you want.

Speaker 0: 00:30:52

Right. But this is a lot more involved than that. This is pretty difficult.

Speaker 1: 00:30:58

Yeah. These types of bugs are always difficult to exploit; there are tons of mitigations at all layers. But if you’re smart enough and you find a bug, you can usually find a way. Or not always, but there are lots of tricks to try to get it to work. So if we find something like a stack overflow, we should fix it right away. That would be like the worst.

Speaker 0: 00:31:25

This is 0.11 and before. And when was that? Quite a few years ago, at least seven years ago, I want to say.

Speaker 1: 00:31:36

Yeah, seven, eight years.

Speaker 0: 00:31:38

Yeah, so don’t worry, guys. Well, I felt like this one was worth highlighting as a really good reason to watch dependencies really closely and be really conservative about them. Because this was, again, a vulnerability in one of our dependencies, not in the Bitcoin Core code. But we should really be treating dependency code as our own code, at least holding it to the same bar for security and scrutiny in code review.

Speaker 1: 00:32:16

Yeah. So kudos to Vlad for actually going and testing these dependencies.

Speaker 0: 00:32:23

Yeah, and I guess you talked about this in the beginning, but obviously you only want to have a dependency like this for a very good reason. And there is a pretty good reason to assist users in being able to accept inbound connections, because every connection on the network needs an outbound side and an inbound side. So if nobody’s accepting inbounds, or if it’s disproportionately harder to accept inbound connections than to make outbound connections, then it’s going to be hard to have a network as well connected as it is.

Speaker 1: 00:33:05

Yeah, you can definitely make an argument for this dependency to exist. For this one specifically, though, it would be nicer if it were just a better dependency: it’s a whole C library and it’s just barely maintained, so it’s not really surprising that it has bugs like this. But you can make an argument that we should assist users in doing this stuff. So, I mean, the fix was basically that we disabled it by default (prior to the fix, this behavior was on by default), and we pulled in the fixes from upstream.

Speaker 0: 00:33:45

Yeah, yeah. I guess that led to those changes.

Speaker 1: 00:33:47

But we still have the dependency today. We do.

Speaker 0: 00:33:50

Yeah. Okay. So those are the four main ones that we wanted to cover, unless you want to add anything? No? Okay. I feel like maybe we should also do the send and receive buffer one, and just skim over the other ones, spending 30 seconds on each.

Large memory allocation in peer receiver buffer and send buffer

Speaker 0: 00:34:17

Yeah. There’s the memory DoS due to malicious peer-to-peer messages from many peers, version 0.10 and before. And I think that pairs nicely with the CPU DoS due to malicious peer-to-peer message. Hold on, where is the other one? I think it’s 0.19 and before. Basically, one of them is about the send buffer and the other about the receive buffer: for each connection, you have a buffer of messages that you want to send to the peer, and a buffer of messages they’ve sent you that you haven’t processed yet. Hopefully, these are tightly bounded in memory. But both of these vulnerabilities came down to bounds that were either missing, very large, or very easy to blow past, so a single buffer could grow to tens of megabytes. And of course, if you have 100 peers, that blows up to gigabytes, and not everybody has that much memory to spare for a few message queues.
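A sketch of the kind of bound that addresses this, with assumed names and budgets rather than the actual Bitcoin Core code: apply backpressure when a peer’s receive queue is full, and disconnect a peer whose send queue won’t drain:

```cpp
#include <cstddef>
#include <deque>
#include <vector>

struct Peer {
    std::deque<std::vector<unsigned char>> recv_queue; // received, unprocessed messages
    std::deque<std::vector<unsigned char>> send_queue; // serialized, awaiting the socket
    bool pause_recv = false;
    bool should_disconnect = false;
};

constexpr size_t MAX_RECV_BYTES = 5'000'000; // assumed per-peer budgets
constexpr size_t MAX_SEND_BYTES = 1'000'000;

size_t QueueBytes(const std::deque<std::vector<unsigned char>>& q) {
    size_t total = 0;
    for (const auto& msg : q) total += msg.size();
    return total;
}

void EnforceBufferLimits(Peer& peer) {
    // Backpressure: stop reading from this peer's socket until we've
    // processed what we already have, so memory stays bounded per peer.
    peer.pause_recv = QueueBytes(peer.recv_queue) > MAX_RECV_BYTES;
    // If the peer won't drain what we're sending, drop the connection rather
    // than queueing ever more data for it.
    if (QueueBytes(peer.send_queue) > MAX_SEND_BYTES) peer.should_disconnect = true;
}
```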

Speaker 1: 00:35:34

Yeah.

Speaker 0: 00:35:35

Yeah. So that’s like the gist of it.

Payment request fetch causes mysterious crashing

Speaker 0: 00:35:38

And then we had, oh, do you want to do like the BIP 72 one?

Speaker 1: 00:35:44

Oh, yeah. So I didn’t go too deep on all the BIPs involved here, but essentially there’s BIP 21, which is a URI scheme for Bitcoin payments. Then there’s BIP 70, which defines a protocol between customers and merchants to make the payment flow easier. And then there’s BIP 72, which is an extension to BIP 21 to support BIP 70. The extension basically adds a parameter, named "r", pointing to a URL, which is supposed to be the BIP 70 merchant endpoint. And the bug was that Qt, the GUI, would just download from that URL. If that URL pointed to a large file, the GUI would just download that file in the background, keeping all of it in memory, and eventually…

Speaker 0: 00:36:51

Oh, as in out of memory?

Speaker 1: 00:36:52

Yeah, yeah. It would just run out of memory and crash. And the fact that it happens in the background rather than crashing immediately makes it worse: if you tried to make a payment and your node crashed right away, you’d likely notice it. But if you try to make the payment and nothing happens, and then an hour later it crashes, you maybe wouldn’t make the connection. That would be hard to debug as well. Yeah, exactly. Sorry, go ahead. But yeah, the fix was that we removed BIP 70 support entirely. I think that was a really great decision, because BIP 70 also involved protocol buffers, which are just really weird to have in Bitcoin Core, I think. So overall it’s good that we got rid of it.
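For context, the generic guard against this class of bug, capping how much you’ll buffer from an untrusted URL, might look like this sketch (hypothetical constant and function, not the removed Qt payment-request code):

```cpp
#include <cstddef>
#include <vector>

constexpr size_t MAX_PAYMENT_REQUEST_SIZE = 50'000; // assumed cap, in bytes

// Append a downloaded chunk to the in-memory buffer; refuse to grow past the
// cap instead of letting a malicious URL exhaust memory in the background.
bool AppendChunk(std::vector<unsigned char>& buf,
                 const unsigned char* data, size_t len) {
    if (buf.size() + len > MAX_PAYMENT_REQUEST_SIZE) return false; // abort download
    buf.insert(buf.end(), data, data + len);
    return true;
}
```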

Speaker 0: 00:37:37

Nice. Cool.

Misordered logic permits download of blocks bypassing checkpoints

Speaker 0: 00:37:38

Then to go over the low-difficulty headers one, memory DoS using low-difficulty headers, 0.14.3 and before: I feel like it’s more interesting to talk about the general problem. Essentially, we have block download: we have to download a few hundred thousand blocks when we start a node. But before we download all that data, we want to make sure it corresponds to the most-work chain, because you can imagine that if you just blindly, optimistically let everybody send you all these blocks and assumed they were the most-work chain, it might actually go nowhere. So a long, long time ago, I can’t remember, maybe 0.10 or something, we started doing headers-first sync, where we first download the full chain of headers. A block header is just 80 bytes, including the Merkle root, the previous block hash, et cetera. That’s enough for you to know what the block hash is and to verify the total accumulated work on the chain of headers you have. But even with that, you can imagine that if you weren’t verifying the proof of work on those headers, or if you were too optimistic in allowing people to send you headers, you could still run out of space, because you’d be letting people spam you with headers without fully verifying them. So the bug here was that we were allowing people to send headers chains that forked off prior to the first checkpoint. Checkpoints (I forgot to explain checkpoints) are hard-coded values in Bitcoin Core that tell you, basically, what the network’s main chain is. And they’re not telling you what the chain was yesterday, because that would essentially be an additional consensus rule. I mean, they are additional consensus rules. Basically, Bitcoin Core has a little bit of hard-coding as to what, essentially, the first 100,000 blocks are. The last checkpoint is a long time ago. But basically, when it’s downloading…

Speaker 1: 00:40:12

I think the last checkpoint is around 300,000 or something.

Speaker 0: 00:40:18

So a really, really long time ago, basically.

Speaker 1: 00:40:21

It’s pretty old, yeah.

Speaker 0: 00:40:22

Yeah. So you can debate philosophically whether or not this is an additional consensus rule. I don’t think it is in any meaningful way, but that’s not what we’re here to talk about. Basically, the bug was that Bitcoin Core wasn’t following the checkpoints properly. There was a bug where, I think, you could fork off block one to send headers, and Bitcoin Core would, like I was mentioning before, really optimistically download all of these and store them. And if you made enough of these, you could run out of memory storing these headers chains. That’s fixed. And today we have even more safeguards: we won’t even store all the headers on first download. We download headers twice, where the first time we only store, is it one bit of information per 48 headers, or per some number of headers? And we make sure that this definitely leads to a really worky chain, a lot-of-work chain. Then we actually download the headers, and then we download the blocks. So, yeah, it’s nice that proof of work offers this kind of innate anti-denial-of-service protection. But even so, the logic around how you download and store this information needs to protect you from people who also have some work they can do and who can send you alternate chains. OK. Anything else we should cover? I think there’s a couple more.
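A sketch of the anti-DoS idea here: verify the proof of work and require a minimum amount of accumulated work before permanently storing a headers chain. This uses toy arithmetic and hypothetical names; real code uses 256-bit integers (arith_uint256 in Bitcoin Core):

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

struct Header { uint32_t nBits; /* plus prev hash, merkle root, time, nonce */ };

// Decode the compact nBits target and return the expected work for one
// header, roughly 2^256 / (target + 1). long double is enough for a sketch.
long double WorkFromBits(uint32_t nBits) {
    const int exponent = nBits >> 24;
    const uint32_t mantissa = nBits & 0x007fffff;
    if (mantissa == 0) return 0.0L;
    const long double target = std::ldexp((long double)mantissa, 8 * (exponent - 3));
    return std::ldexp(1.0L, 256) / (target + 1.0L);
}

// Only commit a candidate headers chain to memory if its total claimed work
// clears a hardcoded minimum; cheap low-difficulty spam never gets stored.
bool HasEnoughWork(const std::vector<Header>& chain, long double min_chain_work) {
    long double total = 0.0L;
    for (const Header& h : chain) total += WorkFromBits(h.nBits); // PoW itself is checked elsewhere
    return total >= min_chain_work;
}
```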

Lessons learned from these disclosures

Speaker 0: 00:42:18

Or we could just go into discussing how we feel about these bugs.

Speaker 1: 00:42:26

Yeah, that sounds good. The ones we left out are all pretty similar in terms of the type of bug, so it’s probably not super interesting to go into the rest.

Speaker 0: 00:42:36

OK, yeah. I guess almost all of them are net processing bugs. So that’s the first theme, I think, that we can discuss: so much of this is in the net processing logic, which is super poorly tested, even today.

Speaker 1: 00:42:57

Well, I wouldn’t say super poorly tested, but there is room for improvement, for sure.

Speaker 0: 00:43:05

I think there’s just an asymmetry between what the security model of the net processing logic is and how well we test it. Okay, I’m not trying to bash anyone, because net processing logic is so difficult to test; it’s buried behind all this stuff. The reason we had almost no unit tests for it, for example, is that in order to meaningfully test any of this logic, you’d have to start a whole chain, hook everything up, and create real proof-of-work blocks to send to the net processing stack. We didn’t have it all modularized and mocked. Whereas today, we have quite a few of the net processing modules tucked away in their own place, and we can just send them, not fully formed blocks on the network stack, but we can just say: hey, let’s say you receive this hash and this hash and this hash, for example.

Speaker 1: 00:44:15

Right, and the orphanage, for example, is one of these things that we modularized into its own thing, and the transaction request logic is another. That’s sort of a theme, where we take bits of the logic out of this huge file, net_processing.cpp, and put them into their own modules. And then we can actually test them.

Speaker 0: 00:44:36

Right, which is an important part of a lot of the work being done today as well. Which ones were found by fuzzing? Quite a few, right? Or maybe, I think most of them were actually found through static code analysis. But the adjusted time one was fuzzing, right?

Speaker 1: 00:44:58

Yeah. I’m not sure any of the others were found through fuzzing.

Speaker 0: 00:45:02

Maybe the future ones.

Speaker 1: 00:45:06

Yeah, but some of these are just so easy. If you look at the code, and especially if you look at the code with the mindset of being an adversary, some of these you would just spot immediately.

Speaker 0: 00:45:19

Yeah, I guess.

Speaker 1: 00:45:20

I’m sure if you looked at the OG version, you’d just see them immediately. Absolutely. You don’t need fuzzing to find these bugs.

Speaker 0: 00:45:33

Yeah. You can kind of just look at each loop and ask: what’s the maximum number of iterations of this loop? And probably construct examples.

Speaker 1: 00:45:42

And also, some of these could be tricky to find with fuzzing. But I think most of them are just super easy to spot by reading the code.

Speaker 0: 00:45:54

But that will change as we get to disclosures closer and closer to today.

Speaker 1: 00:46:00

Yeah. This should be the largest batch, because it includes many versions. And in general, the frequency of bugs has hopefully been decreasing, and it does seem like that is the case.

Speaker 0: 00:46:17

Yeah, but also, these aren’t even all the bugs, right? When I was looking at the orphan stuff, there was a PR, I think 900-something, that added a bound to the orphan pool, and there was another PR around that time that was like, improve denial-of-service protection in P2P, or something like that. And they didn’t file disclosures for those, even though they were live on the network, just because the stakes were so much lower back then. So this doesn’t include every problematic piece of code that’s ever existed in Bitcoin Core.

Speaker 1: 00:46:50

There are a lot that are already public. This was specifically focused on the ones that we could find that were not public already. I’m sure there are some that we just forgot about, some that were fixed in some PR and we just didn’t keep track of.

Speaker 0: 00:47:11

Are there like secrets that people left?

Speaker 1: 00:47:12

Yeah, lots of ones that are already public. Yeah, exactly.

Speaker 0: 00:47:17

Yeah. Okay, I guess I was asked to include a section where we talk about how many of these are, quote-unquote, real bugs, like Satoshi’s bugs, and how many were introduced by, quote-unquote, devs meddling with the code. On one hand, there’s a perception that Bitcoin Core is so robust that it has no bugs, that it’s perfect. And on the other hand: oh, Satoshi wrote perfect code, and development today is probably doing more harm than good, because it introduces bugs in addition to fixing them. Obviously, we’re biased, and we have more information than a lot of Twitter people do. But these bug fixes are part of thousands of bug fixes and tens of thousands of commits of code changes that have not introduced new bugs. I don’t know, I felt like adjusted time was a very good example. Not us trying to punch at Satoshi, but it’s kind of the purest example of a bug fix that I could find, and it happens to be in this batch of disclosures. But quite a few of these are just: you add security things, and then the security bar increases.

Speaker 1: 00:48:58

Yeah, and also, writing bug-free code is not really a problem that we’ve been able to solve yet. So…

Speaker 0: 00:49:08

We as in humanity?

Speaker 1: 00:49:10

Humanity, yes. Yeah, any dev who writes code will have bugs at some point. Satoshi had lots of bugs; we write lots of code; there are always bugs. The only thing you can really do is your best at trying to avoid them, and testing and review are pretty much the only tools we have for that. And I would just reiterate what you said: changes needed to be made to the original codebase, and I would argue we still need to make changes to the current codebase, at the very least to support new operating systems, hardware, whatever. Any change comes with the risk of introducing a bug. So, I don’t know, I wouldn’t bash Satoshi or us for having bugs. But I would say we should learn the lessons from these bugs and try to avoid them in the future. That’s pretty much the only thing we can do, I think.

Speaker 0: 00:50:08

Yeah, and I think also, code that you write today that’s bug-free today will have a bug at some point in the future. Also, these disclosures are just the bugs that made it into released versions. Today, I feel like we catch way more bugs in review. Just the other day, for example, HL found a bug that had been merged into master, I think maybe a few weeks ago, a deadlock, and we merged a fix for it a few days ago. So it did exist on master, but it never actually made it into a release. So, you know, we’re catching bugs all the time. Y’all just aren’t hearing about them.

Speaker 1: 00:50:53

Probably not all of them, but we do our best.

Speaker 0: 00:50:57

Yeah, I mean, yeah. So anyway, I think we can wrap up. Thanks for listening. Let us know if you have any questions you want us to discuss in future disclosure podcasts. Bye-bye.

Speaker 1: 00:51:14

Bye-bye.