Skip to main content

Network Disruption Postmortem: 2022-08-08

· 4 min read

 

On Monday 2022-08-08 at approximately 19:34 PM UTC Validators stopped producing blocks at block 1,469,951 and a chain halt occurred. Normal chain operations resumed at approximately 08:00 AM UTC just over 7 hours later.

Throughout the disruption, devices continued to transfer data and operate normally. There were no risks to Hotspot owners or HNT holders, but they did experience a halt in transaction processing, for example, adding and transferring Hotspots, asserting location, and completing payment and burn transactions.

Cause

The latest chain variable activation revealed a previously unknown chain variable update hook bug which caused the chain to halt. This chain variable was intended to disable legacy Hotspot indexing operation in accordance with HIP 54, h3dex poc targeting.

Recovery

The team identified the issue and requested operators of Validators in the Consensus Group to restart.

However, although a restart resolved the issue during testing, on mainnet a restart did not resolve the issue and the core team issued an emergency Validator release, v1.13.5. This release disabled the variable update hook and included a new blessed snapshot, generated on a chain follower with the hook disabled.

Validators were requested to upgrade to the emergency release, and those in the Consensus Group were instructed to manually load the new snapshot.

Once the Consensus Group’s ledgers were in agreement, the block skip process allowed the chain to begin moving again, but with a lower number of transactions compared to normal operations

Resolution

Following Validator release v1.13.5, which addressed the variable update hook issue, release v1.1.158 for ETL maintainers and v1.1.70 for Blockchain Node operators were issued on 2022-08-09 at 5:38 AM and 5:49 AM UTC, respectively, along with a snapshot to synchronize users past the halt.

Certain Validators who updated after the chain resumed experienced ledger drift, which caused Validators to go offline and remain halted. The team decided to issue release v1.13.6 as a non-mandatory release for halted Validators. This enabled the chain to continue to progress more normally as Validators updated, aligned with the latest blocks based on the latest snapshot.

The average PoC rate before the halt was ~270 receipts per block. The post-halt average initially moved up to 210 receipts per block because of grpc blocking by certain Validators. Validators have unblocked grpc connections and PoC receipts have returned to normal operations.

The team provided an updated Validator release v1.13.7 which included a number of performance updates, but also included a fix to prevent the variable update hook issue from recurring.

Power of Community

We want to extend our deep appreciation to our community members that host Validators. Thank you for your attentiveness and willingness to troubleshoot issues that affect the entire network. Your efforts and support move the network forward.

Timeline of Key Announcements

Throughout the disruption, the team worked to keep users up to date on the status and required actions.

2022-08-08 19:34 PM UTC Discovery and investigation of chain halt that occurred at block height 1,469,951 as a result of activating the latest chain variable.

20:26 PM UTC Core team requests Validators in the Consensus Group to restart their Validators in order to resume block production.

22:45 PM UTC Chain halt again at block 1,469,988, continued exploration by the core team into causes and solutions.

2022-08-09 03:04 AM UTC Emergency release v1.13.5 sent to Validators which included a new snapshot that all Validators were required to sync to in order to realign to block 1,469,990.

05:09 AM UTC Block moving at slow rates (approx 10% of normal transaction rate). No reward block yet, team monitoring. Testing for ETL and Node releases.

05:49 AM UTC Release v1.1.158 for ETL maintainers and v1.1.70 for Blockchain Node operators issued along with corresponding snapshot to synchronize users past the chain halt.

08:02 AM UTC Chain progressing as expected, PoC rate to continue to drift back to normal rates (approx 270 receipts per block) over time.

18:31 PM UTC Ledger drift identified among Validators that updated after chain resumed. Core team issues release v1.13.6 for general availability containing a newer blessed snapshot for any Validators that are halted because of ledger drift.

2022-08-10 22:34 PM UTC Core team delivers beta release v1.13.7 for Validators. This release included a number of performance updates, but also included a fix to prevent the varhook issue from recurring.