On 2023-04-18, a new Config Service release was deployed as part of the Oracle 1.0.0 release, which included a more rigorous request signature verification. This caused a protobuf mismatch with Packet Router and Helium Router.
As a result of that, there were two data transfer outages, separately occurring on 2023-04-18 and 2023-04-19.
- On 2023-04-18, the outage was triggered by the protobuf mismatch between the Packet Router and Config Service. An emergency Packet Router update 0.0.22 was deployed to reconcile the protobuf mismatch.
- On 2023-04-19, the outage was triggered by the protobuf mismatch between the Helium Router and Config Service, which caused the Helium Router hotspot location request rpc to Config Service to fail. As a result, devices cannot join since DevAddr allocation is based on hotspot location, and it created load & memory issue that leaked into other processes, which also led to degraded uplink & downlink. Emergency Oracle update and Console update 2.2.36 were deployed to reconcile the protobuf mismatch.
At its core, this incident was caused by miscommunication and oversight. In order to minimize the chance of exposing us and the Helium Network to preventable mistakes like this, the team has agreed upon the following area of improvements.
Teams cannot work in silos, especially when breaking changes can be introduced by each other. Config Service, Packet Router and LNS such as Helium Console can only form a strong triangle when they work together in harmony. Therefore, any change could be a breaking change. When in doubt, ask! Going forward, the Oracle team and Data Transfer team will perform a sanity check with each before prior to a release if it contains work that can impact each other.
Better Safety Net
While better communication certainly helps, ultimately there will be human errors. That’s why we also need to have another layer of safety net.
- First, before any of the three components (Config Service, Packet Router, and LNS) are rolled out, an end-to-end data transfer test must be carried out, verifying success to join, uplink and downlink for all device profiles supported by the LNS. The test shall be initiated by the team that tries to introduce the change, and assisted by the LNS team. There are also tests that can be automated into the Github Actions.
- Second, make sure alerts are in place. In the 2023-04-19 incident, if alerts were more exhaustive, our response time could have been faster. Specifically, we have grafana alerts for when join & uplink rate approaching 0. However, turns out grafana could categorize 0 as NONE, hence requiring a separate alarm that monitors for that state.