Skip to main content

Network Disruption Postmortem: 2022-10-19

· 4 min read

 

On Wednesday 2022-10-19 at approximately 2:20 AM UTC Validators stopped producing blocks at block 1,575,935 and a chain halt occurred. Consequently, at 5:55 AM UTC a second chain halt occurred at block 1,575,955. Normal chain operations resumed approximately 5.5 hours later at 7:49 AM.

Throughout the disruption, Hotspots did not participate in Proof-of-Coverage. All funds were safe, but pending transactions were delayed until the halt was resolved. Pending Transactions include adding a Hotspot, changing a Hotspot Location, transferring a Hotspot, payments, and token burns. During the disruption, data transfer was largely unaffected by the chain halt.

Cause

The core team identified a root cause of slow blocks. The string of slow blocks triggered a latent bug in Hotspots to continually reach out to Validators for the latest block information and not back off when they received a block they perceived as stale.

Recovery

The temporary solution was to block gRPC port (default 8080) so Hotspots stopped making requests to Validators and allowed Validators to resume block production. Once block production resumed, Validators were requested to gradually reopen gRPC port(s) across their fleets.

There was a second halt after this that the team decided required a new Validator release. The problem was some nodes were still online, but unable to absorb blocks because they were running out of file handles. This led to several members of the Consensus Group unable to follow the chain. The Hotspot swarming issue is believed to be a secondary effect of the file-handle issue that initially caused slow blocks.

The core team then tagged mandatory release v1.15.6. This release included a patch to fix a latent bug in environments where multiple Validator instances may have been running on the same machine causing a resource exhaustion (due to too many open files on the server).

Validator operators then were asked to gradually re-enable the gRPC port (default 8080).

Resolution

In order to prevent this from happening again, we are implementing several changes. The first, released in v1.15.6, fixed what is believed to be the original bug.

Following that, we plan to make the system more resilient to gRPC storms both on the Hotspot and the Validator side.

Power of Community

Thank you again to our Validator community for working with us and acting swiftly to help resolve this issue. Your continued dedication and support has such a huge positive impact on the entire Helium community that does not go unnoticed.

Timeline of Key Announcements

Throughout the disruption, the team worked to keep users up to date on the status and required actions. All below times in UTC.

2022-10-19 2:20 AM Core team started to investigate the first chain halt at block height 1,575,935 with a plan to update the community as soon as they learned more.

3:00 AM Root cause was identified and the core team began to work with the Validator community to resume block production.

4:07 AM It was believed the root cause stemmed from Hotspots constantly reaching out to Validators was due to a latent bug triggered by a series of slow blocks. Validators were requested to close the communication channel that Hotspots use to get ledger information and challenges.

Once block production resumed, Validators were instructed to gradually reopen ports across their fleets to communicate with Hotspots once again.

5:55 AM Chain halted a second time at block height 1,575,955. The network hit the same issue that was cleared earlier and was most likely the root cause of the initial outage. The core team and Validator operators in the community worked together to investigate.

7:23 AM The core developers and the Validator community identified the root cause of chain halt (and the prior one) and a new Validator release, v1.15.6, was prepared. In the meantime, a hotfix was applied to certain Validators in the Consensus Group and block production resumed.

7:29 AM The core team tagged mandatory release v1.15.6. This release included a patch to fix a latent bug in environments where multiple validator instances may have been running on the same machine causing a resource exhaustion. Validator operators were requested to upgrade to v1.15.6 without delay as failure to upgrade could result in another chain halt.

7:32 AM Validators were notified to gradually re-enable gRPC port (default 8080) across their fleets.

7:49 AM The Validator release was issued and block production continued. All Validator operators were urged to follow upgrade instructions to v1.15.6.