On 9/16/2020 the team activated PoC v9. The activation didn't go as planned: it caused a complete blockchain halt for ~5 hours, followed by a slowdown of blocks for ~7 hours. Longer-term effects included several hotspots going out of sync for several hours after the chain resumed. Please read on for the details of what happened, what we did in the short term to fix it, and the steps we'll be taking to address these kinds of issues in the longer term.
What went wrong?
With the introduction of PoC v9, we require validation of poc_witnesses at the time of a poc_receipts_txn and a rewards_txn. Both these operations are relatively expensive to perform, especially the rewards transaction, which grows in size relative to the number of receipts in the election epoch. Note that the rewards transaction appears at consensus epoch change (ideally every 30 blocks) and gathers PoC details about the past poc_receipts_txns which occurred within the epoch.
In order to validate witnesses at rewards_txn time, we walk the chain backwards to find the poc_request transaction for each corresponding poc_receipt transaction to cross-check the reported witnesses and receipts. This is a tremendously costly operation which we did not consider during development as even a developer's slow laptop is much faster than the deployed Generation 1 Hotspots.
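To illustrate why the backward walk is expensive, here is a minimal Python sketch (the production code is not Python, and this is not the actual implementation). The `Block` type, the `poc_id` key linking a receipt to its request, and the function names are all hypothetical. The sketch only shows that every receipt triggers its own scan over blocks, so the total work at rewards time grows with both the number of receipts and the distance back to each request.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Toy block: a height plus a list of (txn_type, poc_id) tuples (hypothetical layout)."""
    height: int
    txns: list = field(default_factory=list)


def find_request_by_walking(chain, start_height, poc_id):
    """Walk backwards from start_height until the matching poc_request is found.

    Returns (height_of_request_or_None, number_of_blocks_visited).
    """
    visited = 0
    for height in range(start_height, -1, -1):
        visited += 1
        for txn_type, key in chain[height].txns:
            if txn_type == "poc_request" and key == poc_id:
                return height, visited
    return None, visited


def blocks_visited_for_rewards(chain, epoch_end, receipt_ids):
    """Total blocks scanned when every receipt in the epoch is validated by walking."""
    return sum(find_request_by_walking(chain, epoch_end, rid)[1] for rid in receipt_ids)


# Toy chain of 100 blocks with two requests placed at heights 70 and 85.
chain = [Block(h) for h in range(100)]
chain[70].txns.append(("poc_request", "a"))
chain[85].txns.append(("poc_request", "b"))

total_scanned = blocks_visited_for_rewards(chain, 99, ["a", "b"])
```

In this toy chain, validating two receipts from height 99 scans 30 blocks for the first and 15 for the second. On constrained hardware, with many receipts per epoch and real block deserialization in each step, that per-receipt scan is the part that dominates.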
Activation of PoC v9, Wed Sep 16 07:09:40 PM UTC 2020
- PoC v9 was activated in block 502316
PoC v9: Blockchain Halts, Wed Sep 16 07:09:40 PM UTC 2020 - Wed Sep 16 12:09:40 PM UTC 2020
- At block 502342, the blockchain came to a halt due to disagreement over the block that was produced by the consensus round.
- We were monitoring the situation closely and ensured that we did not have any latent ledger drift, which could cause ongoing problems in terms of consensus agreement.
- The team narrowed the root cause to two potential issues:
- A subsequent release with the above two fixes was GA-ed and deployed immediately
Continued Troubles: Extremely long election blocks, Wed Sep 16 12:09:40 PM UTC 2020
- Despite the emergency release, the team noticed that election blocks were taking close to 15 minutes to get accepted. This was unacceptable behavior and was leading to eventual network gossip collapse.
- In order to maintain ~60s block times, the team agreed to downgrade PoC v9 back to v8 while a more optimized fix for PoC v9 was being discussed internally.
Downgrading back to PoC v8, Thu Sep 17 02:34:45 AM UTC 2020
- After internal discussion and ensuring that downgrading PoC back to v8 was safe and backwards compatible, the team issued another chain variable transaction to revert to PoC v8.
- This was sufficient to get back to normal block and election times.
Addressing stuck hotspots, Thu Sep 17 06:15:00 PM UTC 2020
- We let the blockchain recover overnight on Sep 16th, 2020 after reverting to PoC v8. However, early in the morning of Sep 17th, the team noticed that we had ~1000 hotspots stuck at block
- In order to address that, we started working on an immediate GA release 2020.09.17.0 with a new blessed snapshot targeted at block height 503281.
- A relatively simple fix is to avoid walking the chain backwards altogether by attaching the poc_request_txn block hash to the poc_receipts_txn. This would, in theory, be a constant-time lookup at transaction validation time. Please refer to the following PRs for progress: helium/proto, helium/blockchain-core, helium/miner.
- Due to the nature of blockchains, protocol buffers, and immutable ledgers, we cannot reactivate PoC v9 with different and less problematic code. Instead, we will skip over PoC v9 altogether and jump straight to PoC v10.
- Finally, we plan to accelerate the setup of a testnet to catch such issues before they make it to production. We look forward to engaging with the community to participate in the development of this testnet.
- Once the team has confirmed that the fixes indeed work as intended, we shall deploy a beta release and apprise the community of eventual GA release.
- Current timeline is to activate PoC v10 on Monday, Sep 21st, 2020.
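The constant-time lookup proposed in the first item above can be sketched as follows, again in Python rather than the production language. Everything here is an assumption made for illustration: the `Ledger` class, the `block_hash` helper (a SHA-256 over a toy serialization), and the `request_block_hash` and `poc_id` field names are hypothetical and do not reflect the actual protobuf schema in the referenced PRs. The point is only that a receipt carrying the hash of the block containing its request can be validated with one index lookup instead of a chain walk.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Block:
    """Toy block: a height plus a list of (txn_type, poc_id) tuples (hypothetical layout)."""
    height: int
    txns: list = field(default_factory=list)


def block_hash(block):
    """Toy block hash: SHA-256 over a simple text serialization (assumption, not the real format)."""
    return hashlib.sha256(f"{block.height}:{block.txns}".encode()).hexdigest()


class Ledger:
    """Keeps a hash -> block index so a block can be fetched in O(1) on average."""

    def __init__(self):
        self.by_hash = {}

    def add(self, block):
        self.by_hash[block_hash(block)] = block


def validate_receipt(ledger, receipt):
    """Check that the block named by the receipt really contains the matching request."""
    block = ledger.by_hash.get(receipt["request_block_hash"])
    if block is None:
        return False  # unknown block hash: reject without walking the chain
    return ("poc_request", receipt["poc_id"]) in block.txns


# Usage: one block holding a request, and a receipt that points at it by hash.
ledger = Ledger()
request_block = Block(70, [("poc_request", "a")])
ledger.add(request_block)

good_receipt = {"poc_id": "a", "request_block_hash": block_hash(request_block)}
bad_receipt = {"poc_id": "a", "request_block_hash": "0" * 64}
```

Here the cost per receipt no longer depends on how far back the request sits in the chain. The remaining work is a scan of a single block's transactions.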