Root Cause:
iOFFICE's investigation points to a service degradation caused by one or more server nodes failing to resolve correctly through DNS.
Remediation:
iOFFICE received system alerts notifying both the Customer Support and Engineering teams that nodes were becoming unstable. Once the affected nodes were identified, Engineering began restart sequences to bring services back online. These restarts initially failed because a service had fallen out of sequence. A production change to the DNS configuration was required before the Engineering team could restart the services successfully.
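One way to keep a restart sequence from drifting out of order is to record service dependencies as data and derive the order from them. The following is a minimal sketch only: the service names and their dependencies are hypothetical placeholders, not the actual iOFFICE topology, and it assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def restart_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return a restart order in which every service follows its dependencies.

    depends_on maps each service to the set of services that must already be
    running before it starts. A circular dependency raises graphlib.CycleError.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical dependency map, for illustration only.
EXAMPLE_DEPENDENCIES: Dict[str, Set[str]] = {
    "application": {"leader-election"},
    "leader-election": {"dns-config"},
    "dns-config": set(),
}

if __name__ == "__main__":
    for step, service in enumerate(restart_order(EXAMPLE_DEPENDENCIES), start=1):
        print(f"{step}. restart {service}")
```

Keeping the dependency map in version control alongside the runbook would let the documented order be regenerated whenever a service is added or changed.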
Timeline:
8:15am CST: iOFFICE system alerts began. Engineering began investigating as Customer Support started to receive reports of site outages.
8:49am CST: Engineering attempted a full restart to restore services quickly. Not all services came back online.
8:50am CST: With customer reports continuing and internal testing confirming that all sites were affected, iOFFICE declared a Severity 1 incident. Engineering continued to investigate the cause of the restart failure and noticed that the nodes were not resolving correctly.
9:25am and 9:37am CST: Nodes were restarted so that services could pick up new DNS configurations. These restarts failed to bring in the required DNS configuration.
9:50am CST: Engineering investigation continued.
11:50am CST: Engineering turned its focus to the server component responsible for leader elections.
12:28pm CST: Engineering discovered that the cache did not match across all servers and cleared it.
12:45pm CST (approximately): Knowing that not all services were running, Engineering conducted another restart to bring them back into alignment.
1:15pm CST: It was decided that a configuration change was required and needed to be released.
1:50pm CST: The configuration changes were tested and deployed.
2:10pm CST: iOFFICE began a monitoring phase.
4:06pm CST: After a lengthy monitoring phase, iOFFICE declared an all clear.
Preventative Action and Analysis:
The downtime was prolonged by a lack of documentation covering the restart sequence and the configuration changes required to bring systems back online. iOFFICE Engineering is reviewing and enhancing our system restart documentation. We are also considering continued efforts to add appropriate monitoring and alerting for our systems, as well as an evaluation of the services we use.
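As an illustration of the kind of monitoring mentioned above, a periodic check could confirm that each node's hostname still resolves and flag any that do not. This is a minimal sketch, not iOFFICE's actual tooling: the hostnames are hypothetical, and the resolver is passed in as a parameter so the check can be exercised without network access.

```python
import socket
from typing import Callable, Iterable, List


def resolves(hostname: str, resolver: Callable = socket.getaddrinfo) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(resolver(hostname, None)) > 0
    except socket.gaierror:
        # Raised by getaddrinfo when the name cannot be resolved.
        return False


def unresolved_nodes(
    hostnames: Iterable[str], resolver: Callable = socket.getaddrinfo
) -> List[str]:
    """Return the hostnames that currently fail DNS resolution."""
    return [name for name in hostnames if not resolves(name, resolver)]


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    nodes = ["node-a.example.internal", "node-b.example.internal"]
    failing = unresolved_nodes(nodes)
    if failing:
        print("ALERT: nodes not resolving:", ", ".join(failing))
    else:
        print("All nodes resolve.")
```

A check like this could run on a schedule and feed the existing alerting channel, so that a resolution failure is visible before a restart is attempted.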