LND 0.6 Beta Deep Dive
Speakers: Conner Fromknecht
Date: May 2, 2019
Transcript By: Michael Folkson
Media: https://www.youtube.com/watch?v=XkyPw2_l6YE
Intro
lnd 0.6 was about seven months in the making. There are more things than we could fit into this talk alone, so I cherrypicked some of the highlights: things that will have some real world impact on users of the network, people who are using the API, or just generally users of lnd. I’ll also cover the high level impact on the network going forward and how we plan to scale up for the next 100K channels that come onto the network.
Overview
lnd 0.5 was released in September and back at that time the network had 9,000 channels and 1,600 nodes. Since then we’ve quadrupled in channels, tripled in nodes and more than 10x’d the capacity of the network, even with the channel size constraint of 0.16 BTC. There are a few likely reasons for that growth. One, all of the major implementations are way more stable than they were when we launched them last spring, which leads to increased confidence in the safety of funds. Another is that the Casa Lightning node was announced two weeks before 0.5 shipped. The success of that delivery brought a whole new class of users onto the network who were able to plug-and-play and get going right away. There was also LN-BIG, who maintains a very well capitalized portion of the network. All of this is generally increasing awareness of Lightning. You all being here is a testament to that. All in all it has been a successful seven months. We’ve also learnt a lot in that time. lnd 0.5 held up for the most part, but if you are running a node you’ll know there were some crazy things happening earlier this year as the network grew. I’ll dive into a couple of those, see what the issues were and how we solved them, and also cover new features that were added to 0.6.
Static Channel Backups
The first flagship feature of 0.6 is static channel backups. This is something Laolu worked on quite a bit. More or less, static channel backups produce an encrypted file that contains the information you need to derive keys and reconstruct any channel scripts or channel parameters. This file lives on disk next to your node. Typically you’ll watch it with fsnotify, see when it gets updated, and then sling it to S3, Dropbox, whatever you like. Some of the cool things are, one, it is encrypted using a key derived from your seed, so no one else can decrypt it unless they also have the seed. The file gets updated whenever a channel is opened or closed. The reason it is called a static channel backup is that it only backs up the channel parameters, not the states of the channel. When you create a channel you don’t know the HTLCs that you are going to forward, you don’t know the preimages of the payment hashes that might come. You back up all the channel parameters that allow you to sweep outputs from that channel back to your wallet in case of total disaster when you need to recover the funds. Besides the key derivation and CSV constraints, the other thing it contains is the addresses you need to reach your peers. That’s because when you restore from a static channel backup you’ll enter a DLP (data loss protection) protocol where you connect out to those peers and tell them “Hey, I need you to close this channel, I don’t think we can continue.” The remote party will force close and you’ll be able to sweep your funds from their commitment transaction back to a wallet derived from your seed. You can take those funds and start a new wallet if you’d like. It doesn’t allow you to recover state and continue operation as if you had never lost it. That doesn’t exist because it has some potential flaws that we tried to avoid with this design. This was a pretty major step in terms of increasing the safety of people’s funds. There have been people who have lost data and lost some money. This is a first step towards a safe recovery path we can encourage people to use, and towards increasing the safety of people’s funds within the network.
https://github.com/lightningnetwork/lnd/pull/2313
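To make the “watch the file and sling it somewhere” idea concrete, here is a minimal sketch in Go using the fsnotify library to watch channel.backup and copy it to a staging path whenever it changes. The backup path shown is the usual default for a mainnet node but may differ on your setup, and since lnd may swap the file atomically it can be more robust to watch the containing directory; treat this as an illustration rather than a hardened backup tool.

```go
package main

import (
	"io"
	"log"
	"os"

	"github.com/fsnotify/fsnotify"
)

// copyBackup copies the current channel.backup to a staging path from which a
// separate process could upload it to S3, Dropbox, etc.
func copyBackup(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()

	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()

	_, err = io.Copy(out, in)
	return err
}

func main() {
	// Usual default location for a mainnet node; adjust for your setup.
	const backupFile = "/home/user/.lnd/data/chain/bitcoin/mainnet/channel.backup"

	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()

	// lnd may replace the file atomically, so watching the containing
	// directory is more robust in practice; the file is watched here for
	// brevity.
	if err := watcher.Add(backupFile); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case event := <-watcher.Events:
			// The backup is rewritten whenever a channel is opened or
			// closed, so re-stage it on every write.
			if event.Op&fsnotify.Write != 0 {
				if err := copyBackup(backupFile, "/tmp/channel.backup"); err != nil {
					log.Println("copy failed:", err)
				}
			}
		case err := <-watcher.Errors:
			log.Println("watcher error:", err)
		}
	}
}
```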
Static Channel Backups (continued)
There are five new RPCs that come with 0.6: ExportChannelBackup, ExportAllChannelBackups, VerifyChanBackup, SubscribeChannelBackups and RestoreChannelBackups. The first one allows you to create a backup for a single channel. If you have a channel that is very valuable or a channel that has some specific constraints on how you want to back it up, you can target that specific channel and distribute its backup through different avenues. ExportAllChannelBackups does everything all at once. VerifyChanBackup allows you to take a backup created by either of the first two, submit it back and ask “Is this valid? Can I decrypt it?”. SubscribeChannelBackups is an alternative to the fsnotify approach we discussed earlier: instead of watching the file for modifications you can subscribe and receive the backups over the RPC. The data contained in them is essentially equivalent, it is up to you how you’d like to use it. Finally, there’s RestoreChannelBackups which takes any of the backups produced by these RPCs and initiates the recovery protocol: contacting peers, recovering funds and sending them back to your wallet. Laolu was also kind enough to bless us with a recovery doc, an incredibly thorough document covering all the recovery procedures available in lnd: what your seed can and cannot do for you, how you use static channel backups, and a host of things like that. I suggest checking it out if you run a lnd node. It is good to be aware of the things that you are and are not protected against. Hopefully the list of things not covered will keep shrinking.
https://github.com/lightningnetwork/lnd/blob/master/docs/recovery.md
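As a rough illustration of how the export and verify RPCs fit together, here is a sketch in Go against the generated lnrpc client. Connection setup (TLS cert, macaroon) is omitted, and the message names (ChanBackupExportRequest, ChanBackupSnapshot) are taken from the lnrpc proto as I understand it; check them against the lnd version you run.

```go
package lndbackup

import (
	"context"
	"log"

	"github.com/lightningnetwork/lnd/lnrpc"
)

// BackupAndVerify pulls a fresh backup of every channel over RPC and asks lnd
// to verify that the blob is decryptable before it gets shipped off-site.
func BackupAndVerify(ctx context.Context, client lnrpc.LightningClient) error {
	snapshot, err := client.ExportAllChannelBackups(
		ctx, &lnrpc.ChanBackupExportRequest{},
	)
	if err != nil {
		return err
	}

	// VerifyChanBackup round-trips the snapshot through lnd so we know the
	// backup we are about to store can actually be used for recovery.
	if _, err := client.VerifyChanBackup(ctx, snapshot); err != nil {
		return err
	}

	log.Println("channel backup snapshot exported and verified")
	return nil
}
```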
New Channel Status Manager
That was one of the flagship features. I’m going to go into a lot of the gossip improvements because this was a major concern of ours beginning in March as the network topped 40K channels. Gossip is probably one of the heaviest systems of the Lightning Network right now. There’s a lot of traffic, a lot of nodes, a lot of verification. It is more or less the Lightning Network’s mempool in some sense. As the numbers grew it became ever more costly to maintain, verify and keep up with it all. What wasn’t helping was a bug in 0.5. Let me describe the problem first. Channels can be enabled or disabled via a bit set in your ChannelUpdate. You change that by flipping the bit or other parameters in the ChannelUpdate and broadcasting it to the network. There was a bug in prior versions that would cause that bit to flip very fast. If anyone was watching the gossip traffic come in you’d see enable, disable, enable, disable. That’s a lot of verification, signature checking, database operations. In 0.6 we rewrote that from scratch. We designed a better state machine that dampens the effects of peers flapping. You won’t send an update exactly when every peer flips on and off; it will dampen all of those effects and try to converge on some steady state. It also gives us a greater ability to test it. We have a lot of unit tests around this so we’re pretty confident that this will dampen this oscillation on the network. Part of the reason those nodes were flapping was the gossip traffic itself. They would get overwhelmed doing gossip syncing with their peers and they’d disconnect. And then they’d reconnect. And then they’d do the whole thing over and over again. A lot of these improvements are very much intertwined; they all collectively improve stability. This is one of the ones where if you now… gossip traffic you’d see far fewer of these flipping updates. That fortunately has now been fixed.
https://github.com/lightningnetwork/lnd/pull/2411
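As a conceptual sketch of what “dampening” means here (this is illustrative Go, not lnd’s actual state machine): a disable update is only broadcast after a peer has stayed offline for a full grace period, and a reconnect within that window produces no gossip at all.

```go
package main

import (
	"log"
	"sync"
	"time"
)

// statusManager is an illustrative dampener: it defers "channel disabled"
// updates and cancels them if the peer comes back quickly, so a flapping peer
// no longer turns into a stream of enable/disable ChannelUpdates.
type statusManager struct {
	mu            sync.Mutex
	disableTimers map[string]*time.Timer // keyed by a hypothetical channel ID
	gracePeriod   time.Duration
	broadcast     func(chanID string, disabled bool) // signs and gossips a ChannelUpdate
}

// peerOffline is called when the peer behind chanID disconnects. The disable
// is deferred for gracePeriod, so a quick reconnect sends nothing.
func (m *statusManager) peerOffline(chanID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.disableTimers[chanID]; ok {
		return // already counting down
	}
	m.disableTimers[chanID] = time.AfterFunc(m.gracePeriod, func() {
		m.broadcast(chanID, true)
	})
}

// peerOnline is called when the peer reconnects. If the deferred disable never
// fired it is simply cancelled; if it did fire, the channel is re-enabled.
func (m *statusManager) peerOnline(chanID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if t, ok := m.disableTimers[chanID]; ok {
		cancelled := t.Stop()
		delete(m.disableTimers, chanID)
		if cancelled {
			return // flap absorbed, no gossip sent at all
		}
	}
	m.broadcast(chanID, false)
}

func main() {
	m := &statusManager{
		disableTimers: make(map[string]*time.Timer),
		gracePeriod:   20 * time.Minute,
		broadcast: func(chanID string, disabled bool) {
			log.Printf("broadcast update for %s, disabled=%v", chanID, disabled)
		},
	}

	// A quick flap: offline then online within the grace period is absorbed
	// and no ChannelUpdate is ever broadcast.
	m.peerOffline("chan-1")
	m.peerOnline("chan-1")
}
```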
Zombie Channels Overview
Zombie channels. How many people know what zombie channels are? A fair number. lnd will prune a zombie channel after two weeks, which is the recommended parameter in BOLT 7. Does c-lightning prune after two weeks? I don’t know. Basically a zombie channel is one for which you have not received a new update with a timestamp within two weeks. If I have a channel but no update has been signed and broadcast by the node on either side in two weeks, it is considered a zombie. We assume that a node has gone permanently offline, it’s stale, it can’t reconnect or something. For those reasons we used to delete them from disk. There were a couple of issues with that. One is that not all the nodes in the network prune at the same time. They might determine that a channel is a zombie at different times than others. Not all the nodes may apply the same policy: a new implementation might choose to do this after a week, some may do it after three. It’s really up to them. The biggest issue of all is that after I delete all data about this channel, how do I know it was a zombie? If I have no information about this channel, how do I know that I deleted it in the past? As a result, you can imagine a bunch of people all pruning at different times, syncing up with each other and no one knowing which zombie channels they’d deleted. You end up with this effect of the zombie channels sloshing around the network. I would delete it. Then a minute later someone would go “Oh, that’s a new channel. I’ll get it and then I’ll broadcast that to all my peers” right after he prunes it, and then I’ll send it to him. The process keeps going. You end up with droves of them, it was like the whitewalkers. You would watch it: every hour you would get hammered with all these things. That eats up a tonne of resources, a lot of bandwidth, CPU and database operations. This was a huge issue that we saw. It was probably one of the major things that sped up the 0.6 release. We had a couple of milestones planned but we decided that we should fix this up because the network stability was really suffering as a result.
Persistent Zombie Index
Here are some of the things we did to address it. Number one, we started persisting which channels we deemed zombies. This allows us to delete most of the data, probably 800 or 900 bytes per channel at a guess. The one thing we do keep is the pubkeys of the nodes involved. The reason is that if you see a future channel announcement that resurrects the zombie into a live channel, you have to know how to validate that channel. We keep the pubkeys around just for that sake. Other than that we’re able to delete virtually all data. Combined with the short channel ID it is about 74 bytes per zombie. At 3,300 zombies on the network that puts it around 250 kilobytes on disk, which is pretty minimal, but it pretty much prevents this whole whitewalker problem. The improvement here is that now I can check whether I have historically deemed a channel a zombie. I can immediately reject it if my peers send it to me. Maybe the other person doesn’t have the persistent zombie index; if they send it to me I can immediately see that I’ve already dealt with it and discard it. When syncing with other peers who send me the channels they know about, I can avoid requesting them: I can say I already know that’s a zombie, don’t request it. And vice versa. If someone is syncing from me I can use this index to say “I’m not going to give that zombie channel to you because I don’t want to burden you with something that is probably a zombie. If you get it from someone else that’s fine but you’re not going to get it from me.” All in all there are two PRs at the bottom that you can check out if you’re interested. From what I’ve seen this has mostly solved the issue. Once a majority of the nodes upgrade to 0.6 this issue for the most part ceases to exist. That was a pretty major win and one of the things that we really wanted to ship out to the network.
https://github.com/lightningnetwork/lnd/pull/2777/
https://github.com/lightningnetwork/lnd/pull/2893
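Here is a small sketch in Go of the idea, with an in-memory map standing in for lnd’s persisted index: all we keep per pruned channel is the short channel ID plus the two node pubkeys, which is enough to reject a resurrected announcement cheaply or to validate it if we actually want to bring the channel back.

```go
package main

import "fmt"

// zombieEntry is roughly the 74 bytes kept per pruned channel: the 8-byte
// short channel ID (the map key below) plus the two 33-byte node pubkeys,
// which are needed to validate a later announcement that resurrects the
// channel.
type zombieEntry struct {
	node1, node2 [33]byte
}

// zombieIndex stands in for lnd's persisted index; a real implementation
// writes these entries to the channel database rather than holding a map.
type zombieIndex map[uint64]zombieEntry

// markZombie records a pruned channel so it can be rejected cheaply later.
func (z zombieIndex) markZombie(scid uint64, node1, node2 [33]byte) {
	z[scid] = zombieEntry{node1: node1, node2: node2}
}

// isZombie lets the gossiper discard an incoming announcement, or drop the
// channel from a sync reply, without touching the database or verifying any
// signatures.
func (z zombieIndex) isZombie(scid uint64) bool {
	_, ok := z[scid]
	return ok
}

func main() {
	idx := make(zombieIndex)
	idx.markZombie(624234567890, [33]byte{0x02}, [33]byte{0x03})
	fmt.Println(idx.isZombie(624234567890)) // true: reject without validation
}
```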
Gossip Sync Manager
Another related change, which was a lot of work that Wilmer put in, is the gossip sync manager. With 0.5 you would come up and do this gossip sync protocol with all your peers as soon as you’d connected. Consider restarting your node when you have 200 peers, or 1,000 peers like Rompert… maybe the zombies got him. You basically come up and immediately start syncing with 200 or 1,000 peers all trying to give you the exact same data, and a lot of people had issues where they… on restarts, or they have to block ports so that you can restart with a sufficient portion of your peers, then the other ones. There are some issues there. What we decided to do with 0.6 was be more selective about who we do syncs with and be more intelligent about it. Now when you come up you will only do gossip syncs with three peers at a time. Actually that’s incorrect. You will actively listen for new updates at tip. Think of this like a Bitcoin mempool: you’re going to listen for new updates in the mempool from only three peers at a time. As soon as you come up you will try to do a historical sync with someone. You’ll basically pick one peer, do a full sync with them, and then every twenty minutes you’ll spot check that with your other peers. This is mostly going to be a no-op: you’ve already synced all the channels, you might find one or two, but for the most part it will be pretty bandwidth efficient. That prevents a lot of the redundant bandwidth you see on start up. It is more intelligent in that sense. The three peers at a time greatly reduces your bandwidth. I can tell by looking at my peers who has updated and who has not. If you’re in this room… The prior version of lnd had a really deterministic bandwidth profile; if you looked at them all in a list you could see who had updated and who had not. Now it is a little bit more sporadic because there’s some rotation and randomization in there. That rotation gives you eventual discovery of all the channels from all your peers. My node, which has about 55 channels or so, saw a 95% reduction in bandwidth and that was before anyone else updated. I saw a 95% reduction in incoming because most of what this optimizes is how much I request from my peers. As soon as everyone, or a majority, updated to 0.6 you get similar savings outbound, people requesting from you. If you updated to 0.6 you probably saw similar improvements. If you were a large routing node like Y’alls or something you’d see even more.
https://github.com/lightningnetwork/lnd/pull/2740
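To illustrate the policy rather than lnd’s internals, here is a toy sketch in Go: a small fixed number of peers act as active gossip syncers at any one time, the set rotates so every peer is eventually consulted, and a single historical sync runs at startup. The figure of three active syncers comes from the talk and I believe it is configurable in lnd; the names and structure below are purely illustrative.

```go
package main

import (
	"fmt"
	"math/rand"
)

// activeSyncers is the small number of peers we stream live gossip from at any
// one time; three is the default described in the talk.
const activeSyncers = 3

// pickActiveSyncers chooses which peers act as active gossip syncers until the
// next rotation (roughly every twenty minutes per the talk). Every other peer
// stays passive, which is where most of the bandwidth savings come from.
func pickActiveSyncers(peers []string) []string {
	shuffled := append([]string(nil), peers...)
	rand.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	if len(shuffled) > activeSyncers {
		shuffled = shuffled[:activeSyncers]
	}
	return shuffled
}

func main() {
	peers := []string{"peerA", "peerB", "peerC", "peerD", "peerE"}

	// At startup: one full historical sync against a single peer, then the
	// active set is rotated periodically; later spot checks are usually
	// no-ops because the graph is already known.
	fmt.Println("historical sync with:", peers[rand.Intn(len(peers))])
	fmt.Println("active gossip syncers:", pickActiveSyncers(peers))
}
```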
Sweeper
Another cool feature. This was the work of Joost, who is one of our developers from the Netherlands. The Sweeper is a new batching engine in lnd. This is the future of how we do our transaction management and transaction batching internally. We used to use a subsystem called the UtxoNursery, which was responsible for raising its children, timelocked outputs, until they were mature enough to sweep. We’ve ditched that in favor of this more generic batching abstraction. More or less what it does is, for any channels closing at a given time, it asks them “Hey, do you need any sweeps?”. If there are sweep requests it will take those outputs, build a transaction, sign it and broadcast it. It will continue to do this for all the channels that have active sweep requests at the time, and it will sweep all the outputs together. Let’s say I have an output that only has 20,000 sats in it but the fee rate is 50 sats/byte. That output might not be able to pay for itself: it costs too much to spend it versus what I’m going to get in return. The Sweeper will actually compute the yield of each input and determine if the input is even able to pay for itself. There’s a metric you can tune there to say “I’d only like to sweep it if I get a third of the output back” or something like that. That’s pretty cool. It will basically postpone the ones that can’t pay for themselves and literally are not worth sweeping until fee rates come down and you can get them in. That was a cool feature. In the future we will probably move to a design where almost all transaction flows, into channels, out of channels, onto chain, all that sort of stuff, go through the Sweeper, primarily because it is the central batching engine. It can do things like “I want to sweep this channel but instead of sweeping it back to the wallet I’m just going to go create a channel.” You skip an onchain transaction going into the wallet and then back out to the channel. You save fees, all sorts of stuff. Those sorts of shortcuts can be taken in a lot of different places. All in all we’re pretty excited by the future direction of that. For any of those who have been running a node long enough to remember what a strategic continue is, those should be gone as a result of this. One of the things that we found is that the UtxoNursery was too rigid in its initial design; it wasn’t tolerant to fees spiking, things like that. There would be an issue on start up where you literally could not start lnd, and your options were: don’t start lnd, or add this one line of code called a continue that would skip the error and proceed as if nothing happened. That broke a lot of the assumptions in the UtxoNursery, like the order in which things happened and which things it had processed and which it hadn’t. But when the alternative was not starting your node until it got fixed, and this was 0.4 while the fix only came in 0.6, that was your only option. I think people made a T shirt that said “strategic continue” on it. It was one of the earlier jokes in the lnd Slack. Thank you Rompert.
https://github.com/lightningnetwork/lnd/pull/1960
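A sketch of the yield calculation described above, in Go (illustrative numbers, not lnd’s actual sweeper code): an input is only worth including in the batched sweep if its value exceeds the fee it adds to the transaction, optionally by some tunable fraction like the one third mentioned in the talk.

```go
package main

import "fmt"

// sweepInput is a simplified view of an input the sweeper could add to a
// batched sweep transaction.
type sweepInput struct {
	ValueSats  int64
	SpendVSize int64 // vbytes this input adds to the sweep transaction
}

// positiveYield keeps only the inputs that can pay for themselves at the given
// fee rate, optionally requiring that some fraction of the value survives as
// yield. Everything else is postponed until fee rates come down.
func positiveYield(inputs []sweepInput, feeRateSatPerVByte int64, minYieldFrac float64) []sweepInput {
	var keep []sweepInput
	for _, in := range inputs {
		fee := in.SpendVSize * feeRateSatPerVByte
		yield := in.ValueSats - fee
		if yield <= 0 {
			continue // costs more to spend than it returns
		}
		if float64(yield) < minYieldFrac*float64(in.ValueSats) {
			continue // not worth it yet at this fee rate
		}
		keep = append(keep, in)
	}
	return keep
}

func main() {
	inputs := []sweepInput{
		{ValueSats: 5_000, SpendVSize: 150},   // 7,500 sat fee at 50 sat/vB: negative yield, postponed
		{ValueSats: 500_000, SpendVSize: 150}, // easily pays for itself, gets swept
	}
	fmt.Println(positiveYield(inputs, 50, 1.0/3.0))
}
```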
New Primary RPCs
There are two new primary RPCs added to lnd’s main RPC server that you might be interested in. One is ListUnspent, which Adam Gibson added for us. It mimics bitcoind’s listunspent command: it shows you all the UTXOs in your wallet, their confirmation depth, the addresses and how much is in each one. I find this pretty helpful because I have a lnd top command; I run it and can always see the confirmations updating, things like that.
https://github.com/lightningnetwork/lnd/pull/1984
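A minimal sketch of using ListUnspent from Go, in the spirit of the “lnd top” usage mentioned above. Connection setup (TLS cert, macaroon) is omitted and the field names come from the lnrpc proto as I recall it; verify them against the lnd version you run.

```go
package lndutxos

import (
	"context"
	"fmt"
	"math"

	"github.com/lightningnetwork/lnd/lnrpc"
)

// PrintUtxos polls ListUnspent and prints every wallet UTXO with its
// confirmation depth, roughly what an "lnd top" style tool would do.
func PrintUtxos(ctx context.Context, client lnrpc.LightningClient) error {
	resp, err := client.ListUnspent(ctx, &lnrpc.ListUnspentRequest{
		MinConfs: 0,
		MaxConfs: math.MaxInt32, // include everything the wallet knows about
	})
	if err != nil {
		return err
	}
	for _, u := range resp.Utxos {
		fmt.Printf("%-45s %12d sats %6d confs\n",
			u.Address, u.AmountSat, u.Confirmations)
	}
	return nil
}
```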
SubscribeChannels was added by Valentine. This is a useful RPC. It gives you notifications on certain channel events, like when a channel becomes active or inactive. Active means the peer connects, we exchange channel reestablish and the channel can be used for actively forwarding payments. The opposite is when they disconnect and that is no longer the case. It also notifies you when a channel closes. In the future it will also do channel openings, or does it already? It does all of them, everything. This is especially useful in developing the Lightning App. The Lightning App prior to this would sit there and poll ListChannels and basically render a page of all your channel statuses. Polling like that is not the best: it leads to a lot of overhead, CPU, deserialization, stuff like that. Instead you have lnd tell you when these events happen and you can update the UI responsively. We found this incredibly helpful. It makes your UI snappier and more responsive.
https://github.com/lightningnetwork/lnd/pull/1988
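Here is a sketch of replacing the polling loop with the subscription, in Go. The stream RPC is exposed in the lnrpc proto as SubscribeChannelEvents (the talk calls it SubscribeChannels); connection setup is omitted and names should be checked against the lnd version you run.

```go
package lndevents

import (
	"context"
	"log"

	"github.com/lightningnetwork/lnd/lnrpc"
)

// WatchChannelEvents replaces a ListChannels polling loop: open one stream and
// react as channels become active, inactive, opened or closed.
func WatchChannelEvents(ctx context.Context, client lnrpc.LightningClient) error {
	stream, err := client.SubscribeChannelEvents(
		ctx, &lnrpc.ChannelEventSubscription{},
	)
	if err != nil {
		return err
	}
	for {
		update, err := stream.Recv()
		if err != nil {
			return err
		}
		// Each update carries one event; in a real app you would switch on
		// the event type and refresh only the affected part of the UI.
		log.Printf("channel event: %v", update)
	}
}
```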
New RPC Sub-Servers
In addition, there are some new products that we worked on. Were there any in 0.5? We’ve been doing a lot of under the hood work. There are six new optional sub-servers that you can compile into lnd with what we call build tags. If anyone has done lncli -l there’s a huge list of things. The main goal here was to start breaking that up. Right now it is one monolithic profile where everything intertwines, and really these things can be packaged up in much more modular units. It also allows us to be a little more ambitious in terms of refactoring and redesigning, to be a little more experimental, and to allow breaking changes to happen while we work out the kinks of the specific RPC commands and semantics. I’ll go through briefly what they all are. The AutopilotRPC you can use to manage specific aspects of Autopilot. Prior to this you had to restart lnd to turn Autopilot on and off. Autopilot is the automated channel opening subsystem within lnd: it will look at the graph and various heuristics and decide where it wants to put funds. This allows you to toggle that on and off. It’s useful to have a button in the app where you hit on and off and you don’t need to kill and restart the app for that to take effect. There are also some cool things there. You can provide your own heuristics. If I have some scoring system for the nodes that I connect to, I can weight them and Autopilot will take in those scores, and when it goes to create new channels it will weight them accordingly; those scores become the probability with which it connects to those nodes. That’s really useful if you have your own scoring preference or metrics you’ve been gathering on specific nodes or channels: you can use them to inform where your automated service opens channels for you. The ChainRPC is also really useful. It exposes one of the internal subsystems of lnd, the ChainNotifier. This allows you to be notified on block events. Whenever a new block comes in you can ask for confirmations or spends of transaction IDs or outputs. You can also do this based on scripts. The reason for that is that Neutrino matching is all done via scripts, so you can basically do outpoints, scripts, txids, pretty much anything you want; it will all be handled for you depending on how you like to have your application set up. (There’s a short sketch of using the ChainNotifier at the end of this section.) There is the InvoicesRPC which allows you to create and modify invoices. This one has its own macaroon so you can say “I only want this person to be able to create invoices”, and with that macaroon that’s all you’re able to do. The RouterRPC is in charge of handling all the payments. You can request routes or facilitate payments, pay invoices, stuff like that. This is actually a really big ongoing effort; it is something that will continue to be refactored in 0.7. We’re working on fine-tuning what information is critical when you send payments and how you deliver error messages back to the user. It is a very difficult problem, so expect some more changes in the exact semantics of that RPC sub-server. I think in the long run it will be very powerful because with the current one in the main RPC server we’re locked in to people’s behaviors and what they expect of that RPC server. This gives us a lot of freedom to redesign it from scratch and make it an even better experience for people. I highly suggest playing around with it and giving feedback. The SignRPC gives you direct access to the signing keys of the wallet. You can give it a BIP 44 derivation path and it will sign a message for you.
It will give you direct signing access to a lot of things. This allows you to specifically craft transactions using the wallet over RPC and make onchain transactions if you wish. The WalletRPC has similar functionality but doesn’t expose the direct signing capabilities, if I’m not mistaken. You can send transactions, publish transactions, stuff like that. For example, Loop uses the Chain, Wallet, Sign and Invoices RPCs; it uses them all but Autopilot. Loop is a new service that we offer for trustlessly moving your balance on and off the Lightning Network. You can swap a balance in your Lightning channel to an onchain output. Soon we will have the reverse, where you can take an onchain output and push funds back into your channel. This allows you to actively maintain the balance in your channel. I can make a channel with my partner, put in 1 BTC, move that out onchain so I still have 1 BTC in my custody, but now I can actually receive 1 BTC back through my channel. If you’re a merchant that needs to facilitate a lot of volume in the incoming direction, you can establish these channel relationships with peers and use this to get inbound capacity without the other person having to put it up for you.
Q - You could request it and the service buyer is going to have to use the onchain escrow so if the user doesn’t carry out the second part of the operation?
A - In theory yes but we have a technique against that so don’t try.
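As promised above, a sketch of asking the ChainNotifier sub-server (compiled in with the chainrpc build tag) for a confirmation notification, in Go. Message and field names are taken from the chainrpc proto as I recall it and should be double-checked against your lnd version; connection setup is omitted.

```go
package lndchain

import (
	"context"
	"log"

	"github.com/lightningnetwork/lnd/lnrpc/chainrpc"
)

// WaitForConf asks the ChainNotifier sub-server to report when the given
// output script (or txid) reaches numConfs confirmations. The height hint
// tells lnd where to start scanning from.
func WaitForConf(ctx context.Context, client chainrpc.ChainNotifierClient,
	txid, pkScript []byte, numConfs, heightHint uint32) error {

	stream, err := client.RegisterConfirmationsNtfn(ctx, &chainrpc.ConfRequest{
		Txid:       txid,
		Script:     pkScript,
		NumConfs:   numConfs,
		HeightHint: heightHint,
	})
	if err != nil {
		return err
	}

	// This sketch only waits for the first event on the stream, the
	// confirmation itself; reorg updates can follow in a real application.
	event, err := stream.Recv()
	if err != nil {
		return err
	}
	log.Printf("confirmation event: %v", event)
	return nil
}
```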
Performance Improvements
I guess there are some final things. I only put “miscellaneous” because there’s too much to write about them individually. These were a lot of work that was ongoing for a couple of months before 0.6 was finally released. I’ll try to go through them a little bit to give you an overview of what they are. Going back to what we talked about with gossip, what an issue that was and the performance penalty there: a lot of the operations we were doing were hitting the database every single time. Have I seen this node? Scan the database. Is this a zombie channel? Check the database. When you have 200 peers, or the network grows any more than it is today, that is a lot of contention on the database just to do these very simple sanity checks. The reject cache lets you hold about 25 bytes per channel for the entire graph in memory, so most of those sanity checks (is this a new update? does this channel exist? is it a zombie?) can be answered from memory without touching the database. The channel cache lets you respond to other peers’ queries. When someone says “Hey, I want to do a huge sync”, you serve it from a cache, about 20MB in memory right now, of channel updates and channel announcements that are pre-deserialized. Prior to this you’d go to disk for every peer that requested them, deserialize them and then send them out. Now we can respond much faster because the majority of them are already in memory; they just need to be written out on the wire. If you restart a 0.6 node, a lot of the performance improvement on initial restart is due to these caches being able to swiftly filter out traffic from your peers that you don’t need and respond to the queries that they have. That is one of the major reasons you’ll see a pretty big improvement on start up. Batched preimage writes. This has to do with HTLC forwarding. If I forward a payment and the preimage comes back in the HTLC being settled, prior versions of lnd would write these serially to disk in a background task; this version of lnd batches them at the level of the commitment transaction. Whenever we write a commitment transaction to disk it takes all the preimages in memory and commits them at one point. This turns out to be a performance improvement, even with the extra write in the critical path, just due to the reduced contention from the smaller number of updates. You can batch 400 or 500 of them into one database transaction. Another major improvement was deprioritizing gossip traffic. I think Eclair already did this and c-lightning started to do it a while back; we may have been the last ones. There are two primary classes of traffic in the Lightning Network. One is the gossip traffic, all this endless mempool stuff that is floating around. The other is the very important, critical messages like “I want to make a channel with you”, “Here’s an HTLC”. There is also channel reestablish, because there is a deadline when I connect to someone: if I don’t give them a channel reestablish within a certain time the channel will be inactive for the duration of the connection. Those messages are very important. We now segregate these into two distinct queues to the peer. All important messages go through first and only once those are done do we send any gossip messages. You’ll notice that reconnecting with peers is more reliable and you’ll be able to reestablish channels even in the face of large gossip bursts, all that sort of stuff. That was a pretty big improvement. Unified sigpool.
Before, each channel had its own sigpool. A sigpool is a set of goroutines we keep around for signing and validating HTLCs. You only have maybe 8 CPUs at most, so it doesn’t really make sense to have 8 goroutines for every channel that you have. This version of lnd consolidates that down into a single sigpool that all channels share, since you can only do as much computation as you have CPU cores. The final one is the read and write pools. Those were a pretty big improvement in terms of conserving memory when allocating buffers and reading messages from a huge number of peers. Similar to the way the sigpool works, we have serialization and deserialization pools in lnd. You can have like a thousand peers and only use 500 kilobytes of memory. Prior versions of lnd would assign a 65 kilobyte buffer to each peer and connection, which was especially bad if you had flappy peers; this fixes all of those issues. There are even more improvements in 0.6.1 which will be coming out tomorrow, maybe.
https://github.com/lightningnetwork/lnd/pull/2847
https://github.com/lightningnetwork/lnd/pull/2501
https://github.com/lightningnetwork/lnd/pull/2690
https://github.com/lightningnetwork/lnd/pull/2328
https://github.com/lightningnetwork/lnd/pull/2474
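As a final illustration, here is a conceptual sketch in Go of the read/write pool idea (not lnd’s actual implementation): instead of pinning a 65 KB buffer to every peer connection for its whole lifetime, a shared sync.Pool hands buffers out only while a message is actually being serialized or parsed.

```go
package main

import (
	"fmt"
	"sync"
)

// maxMsgSize mirrors the largest possible Lightning wire message (65,535
// bytes), which is what prior versions allocated per peer up front.
const maxMsgSize = 65535

var bufPool = sync.Pool{
	New: func() interface{} {
		buf := make([]byte, maxMsgSize)
		return &buf
	},
}

// withBuffer borrows a buffer, runs fn, and returns the buffer to the pool.
// A thousand mostly idle peers then only pin as many buffers as there are
// messages actually in flight, instead of a thousand 65 KB allocations.
func withBuffer(fn func(buf []byte)) {
	bufPtr := bufPool.Get().(*[]byte)
	defer bufPool.Put(bufPtr)
	fn(*bufPtr)
}

func main() {
	withBuffer(func(buf []byte) {
		fmt.Println("borrowed a buffer of", len(buf), "bytes")
	})
}
```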
Q&A
Q - You said that the Sweeper tracks outputs until the fee rate is at a certain level and then it broadcasts transactions. Why don’t you just submit them at the specific fee rate and wait for them to confirm?
A - We could do that. Some of them are time sensitive. At some point you have to make those distinctions as well. It will basically say this was too slow. Part of the reason also is that at the moment it only tracks one single transaction at a time. There’s ongoing work to split that up and have multiple fee buckets and stuff like that. It will get more advanced as we continue. It is a good idea also to submit them and wait. There are certain things that have a higher time preference. You want to sweep breaches with a higher fee rate than a timelock that expires next week.
Q - … is quite a bit slower now. Especially if there’s nodes… on the network it might take you quite a bit longer to figure out state changes there. I’m trying to work around that and gauge uptime given the information that is available to me on the global network graph. How am I going to do something like that?
A - If you want to gauge uptime you should probably just connect to them. In lnd… the interval at which we broadcast batches of gossip information: we listen to all our peers and then, every 90 seconds now, we de-duplicate anything that came in and send the latest updates to any peers that want that information. That delta has become greater in 0.6, which means the dissemination of traffic will be a little bit slower than it was before. You probably shouldn’t rely on that either. Because of the reworking of the channel status manager, a node could go down and come back up and you wouldn’t get an announcement because the flap was dampened. I’d recommend just connecting to them directly.
Q - Hypothetically if I have an invoice and I try paying it with one node and it gets stuck for some reason. If I try to pay it with a second node is there potential…
A - Yes, and lnd will take both. That’s one thing we’re going to work on as we work towards developing AMP, Atomic Multi-Path payments. There are a couple of implementations of it. Also Sphinx send or spontaneous payments; in the process of doing that we will… protection. There’s a potential condition in lnd where if you pay it twice you could settle either one because it doesn’t have information about which one actually occurred. We take both to prevent that from happening. There’s also a privacy concern related to rejecting the duplicate. If I see a payment go through and then try to pay a bunch of hops with that payment hash and guess the payment amount, I can tell which hop it was intended for. Accepting both is also good from a privacy standpoint. There are certain use cases for merchants where they may not want to accept duplicates because they’d have to do refunds or things like that. Those need to be worked out. In the future you will probably have the option to disable it if you want.