Last night, during the rollout of the
2021.05.04.2 release including the code for
members of the community observed a long election including multiple proposed Consensus Groups where
only 15 of the 16 new members succeeded in their Distributed Key Generation (DKG). After several
hours of monitoring the Consensus Group, the Decentralized Wireless Alliance (DeWi) in conjunction
with several core developers ended the 450-block epoch with a Rescue Block, which later restored the
network to a functioning state. The network has been operating correctly since. All funds remained
safe through the entire process.
DKG is a process that ensures that no single Consensus Group member knows the entire key for an epoch; each member holds a share of a key that is generated per epoch for signing blocks. Generating a new distributed epoch key is a prerequisite for completing a new Consensus Group election and generating the associated reward transaction for the epoch that just ended. As part of the protocol, we ensure that every member of the proposed new group completes the DKG, thus requiring 16 of 16 signatures.
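The 16-of-16 requirement can be pictured with a minimal sketch. This is not the actual chain code; the `ProposedGroup` class, its methods, and the member names are hypothetical and exist only to show why 15 of 16 signatures is not enough for an election to complete.

```python
from dataclasses import dataclass, field


@dataclass
class ProposedGroup:
    """Hypothetical model of a proposed Consensus Group during an election."""
    members: frozenset
    signed: set = field(default_factory=set)

    def record_signature(self, member: str) -> None:
        # Ignore signatures from anyone outside the proposed group.
        if member in self.members:
            self.signed.add(member)

    def dkg_complete(self) -> bool:
        # Unanimity: every proposed member must have completed the DKG.
        return self.signed == set(self.members)

    def missing(self) -> set:
        # Members the election is still waiting on.
        return set(self.members) - self.signed


# Illustrative group of 16 members, of which only 15 sign.
group = ProposedGroup(members=frozenset(f"member-{i}" for i in range(16)))
for i in range(15):
    group.record_signature(f"member-{i}")

print(len(group.signed), "of", len(group.members), "signed")
print("election can complete:", group.dkg_complete())
print("still waiting on:", sorted(group.missing()))
```

Under this model a single offline or incompatible member blocks the whole election, which matches the 15-of-16 behaviour observed during the incident.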
After several hours of monitoring, it was clear that a new election was unlikely to complete. In situations like this, Rescue Blocks are used to move the blockchain forward by selecting a new Consensus Group. The first Rescue Block nominated a group member that subsequently went offline, so a second Rescue Block was issued with that member replaced. This action proved successful.
- No transactions were lost or rolled back and finality was always maintained.
- No assets were ever at risk.
- No IoT traffic utilizing the network was lost.
- Block times were reasonable throughout the event, except when the Rescue Blocks were issued.
- The majority of the existing Consensus Group was maintained when the new group was nominated.
- Rewards were not issued during the 450-block epoch.
- New blocks were delayed during the creation, signing, and deployment of the Rescue Blocks.
A Rescue Block is a block the network can use to nominate a new Consensus Group in the event that the current group is unable to fulfill its duties. A Rescue Block can also contain chain variable updates in case a chain variable change is what is causing a Consensus issue.
Throughout the Helium blockchain's 830,000+ blocks, Rescue Blocks have only been used on three prior occasions (11578, 22462 and 22508). In this instance, the issue was identified quickly, diagnosed correctly, and the community assembled quickly to take action and restore network health.
What triggered the long election?
The release of HIP-30 contains fundamental changes to the threshold cryptography underpinning the election and consensus logic. Despite several tests beforehand, including multiple trials on the Testnet, this release resulted in a long epoch on Mainnet.
This issue was exacerbated by a problem with the arm64 Docker build, which took several hours to identify and resolve in the middle of the night. As a consequence, Hotspots using a Docker image (notably DIY miners and Bobcat) lagged significantly on upgrades, resulting in a partial rollout of the crucial changes included in HIP-29 and HIP-30.
Several elections came extremely close to completion (15/16 signatures), but the community became concerned with the time it was taking. Although a majority of the group was running the new release, not all of the members had upgraded, and the new group members were unable to complete the DKG because they were offline and/or not on the updated threshold cryptography curve. Eventually it was decided that, while the arm64 Docker container was going out, that alone was not enough to solve the problem in a reasonable time.
With long elections, two issues cause concern: the size of the rewards (and the complexity of calculating them) and routers running out of state channels. The first issue led to a quick realization that, even if things were to eventually resolve, the resulting rewards block could have been so large that it would not be computable/absorbable on some of the older Hotspot hardware. Secondly, the number of state channels a router can run in an epoch is limited, so extremely long elections can lead to state channel exhaustion, which leaves routers unable to open more state channels and eventually disrupts IoT packet routing. Both of these concerns led to the decision to use a Rescue Block to resolve the situation.
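The state channel concern can be illustrated with a toy model. Everything here is an assumption made for illustration: the function names, the per-epoch limit of 5 channels, the 60-block channel lifetime, and the 30-block "short" epoch are hypothetical and are not the network's real chain variables. Only the 450-block epoch length comes from the incident itself.

```python
def blocks_until_exhaustion(max_channels_per_epoch: int, blocks_per_channel: int) -> int:
    """Number of blocks into an epoch that a router's channel budget can cover.

    Toy model: the router needs a fresh state channel every
    `blocks_per_channel` blocks and may open at most
    `max_channels_per_epoch` channels within a single epoch.
    """
    return max_channels_per_epoch * blocks_per_channel


def can_keep_routing(epoch_length: int, max_channels_per_epoch: int, blocks_per_channel: int) -> bool:
    """True if the channel budget covers the whole epoch."""
    return epoch_length <= blocks_until_exhaustion(max_channels_per_epoch, blocks_per_channel)


# Hypothetical values chosen only for illustration.
MAX_CHANNELS = 5
BLOCKS_PER_CHANNEL = 60

for epoch_length in (30, 450):
    ok = can_keep_routing(epoch_length, MAX_CHANNELS, BLOCKS_PER_CHANNEL)
    print(f"{epoch_length}-block epoch: {'ok' if ok else 'channels exhausted'}")
```

The point of the sketch is that a per-epoch budget that is comfortable for epochs of ordinary length can run out when an epoch stretches to hundreds of blocks.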
Lessons and Prevention
While this situation was somewhat unique, there are definitely some lessons to learn:
- The arm64 Docker image build needs to be more reliable (this work is already in progress)
- Rescue Blocks need to have some facility for including rewards in the future
- If a complicated upgrade must be done again (which is not foreseeable), a better schedule should be set up for it, along with a clear contingency plan
The validators branch that is planned to be deployed to Mainnet soon includes several improvements that will help with reliability in these kinds of cases, and several more improvements are in progress and will be incorporated into the final release of that feature.
Rewards and Total HNT Supply
Understandably, one may wonder what happened to the rewards during the 450 blocks of the Rescue Block exercise. The short answer is that they were never minted. As a result, the total supply of HNT, as described in HIP-20, will decrease by 52,000 HNT.
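As a quick sanity check, the two figures given for this epoch imply an average per-block amount. The snippet below uses only those two numbers; the 52,000 figure looks rounded, so the result should be read as approximate.

```python
UNMINTED_HNT = 52_000  # reduction in total supply stated for this incident
EPOCH_BLOCKS = 450     # length of the long epoch, in blocks

# Average amount of HNT that went unminted per block of the long epoch.
avg_per_block = UNMINTED_HNT / EPOCH_BLOCKS
print(f"{avg_per_block:.1f} HNT per block on average")  # prints "115.6 HNT per block on average"
```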
While the community is working on mitigating the effects of lapses like this, the nature of blockchains is that the past is immutable, and all participants were uniformly affected. Although some short-term pain was felt by participants in the network during the outage, this event should have a minimal long-term economic impact on the network.
We’d like to thank the Decentralized Wireless Alliance and several members of the community and core developers for working through the night to diagnose and resolve this issue quickly. Decentralized coordination is not easy. Swift action helped contain and resolve this issue in a timely manner.
While this event was evidence of growing pains, the network is back to operating properly, flowing IoT data, and growing at a rapid clip.