Chain Halt Postmortem

June 25, 2021 · 4 min read

The core developers prepared an emergency release 2021.06.25.3 to fix the Mainnet blockchain halt noticed around 4:00 PM PST, June 25th, 2021. This was a mandatory release for all blockchain participants on the network. Hotspot manufacturers who have not picked up this release should update their fleets. Please read further for more details.

What Happened?

The core developers noticed that blockchain had stopped making blocks and came to halt. A rundown of the timeline of events is as follows:

Timeline

Fri Jun 25 04:13:00 PM PDT 2021

The core developers receive an alert for block production halt.
Fri Jun 25 04:30:00 PM PDT 2021

A potential bug is identified. The issue seems to be related to prematurely garbage collected state channels.
Fri Jun 25 05:17:00 PM PDT 2021

State channel garbage collection fixed in blockchain-core#883 and miner tagged 2021.06.25.1.
Fri Jun 25 07:00:00 PM PDT 2021

Another issue in miner identified wherein HoneyBadgerBFT state was not getting set properly on restore. Fixed in miner#855 and tagged 2021.06.25.2.
Fri Jun 25 07:06:00 PM PDT 2021

Another state related bug identified in miner. Fixed in miner#856 and tagged 2021.06.25.3.
Fri Jun 25 08:21:00 PM PDT 2021

Team confers that 2021.06.25.3 is an eligible candidate for GA. miner tagged 2021.06.25.3_GA.
Fri Jun 25 08:30:00 PM PDT 2021

As the blockchain continues to be halted, core developers issue a rescue block 896680 to unstick the chain and continue to monitor. Rewards between 4 PM PST - 8:20 PM PST are never minted and chain variable max_open_sc is lowered from 10 to 5 to reduce potential size of the next election block containing reward transaction.
Sat Jun 26 02:00:00 AM PDT 2021

Block production still not back to normal but since there was a successful election and blocks are being produced slowly, the team decides to allow the chain to recover by itself overnight.
Sat Jun 26 05:00:00 AM PDT 2021

Final HoneyBadgerBFT deserialization issues fixed in miner#860 and miner tagged 2021.06.26.0_GA. Blockchain returns back to normal block time and election cycles.

Incident Summary

At some point prior to the Consensus Group election at that caused the halt, one of the State Channels that records packet transmission data was inappropriately deleted. This has never happened before. When the Consensus Group tried to calculate the rewards for the election epoch, all members ran into an error, crashed, and halted the chain. This is a straight forward halt that typically can be resolved with an OTA but the core developers had recently released a new image that contained numerous changes to consensus handling, and had changed chain variables related to how many state channels could be open at any one time. All three of these issues contributed to the extended recovery time and the need to issue a Rescue Block and replace the Consensus Group.

While we had a fix in place for the consensus group quickly, we were unable to move forward with it because of some bugs in some quite hard to test code which had been recently refactored. This was fixed in small increments until we found that while consensus members were able to boot, they could not construct the rewards for this epoch without running out of memory. At this point, we decided that, although we had the code in place to calculate these rewards, the risk was too high that this epoch was too complicated for the existing rewards code to calculate. The decision was made to omit the rewards for this epoch, intentionally, while also reverting the chain variable change to state channels so that rewards would no longer be risky to calculate.

After the rescue block, the chain proceeded and made blocks, but extremely slowly. At this point the team decided to go to sleep with alarms re-enabled and see how things were in the morning. Come the morning, an east coast team member discovered the bugs in the recent refactor and issued a new image to resolve them, restoring block times to normal.

Future

A chain halt is an undesired consequence of a major change in blockchain behavior and issues like these occur when there are big feature updates, the previous release included Validator and updated State Channel support and some missing code does eventually slip through the cracks. The core developers are committed to do regular maintenance updates and be vigilant for such issues in the future.

Some specific action items:

Conclusively determine the cause of the premature State Channel GC.
Improve tests around restore after upgrade if possible.
Refactor reward generation to be less memory intensive.

What Happened?​

Timeline​

Incident Summary​

Future​

What Happened?

Timeline

Incident Summary

Future