iOFFICE Root Cause Analysis (RCA)
Severity 1 Event: July 22, 2021
Note: All times listed in Central Daylight Time.
Background
On Thursday, July 22, 2021, iOFFICE experienced a critical Azure Virtual Machine failure that exposed gaps in the failover functionality of the application hosting framework. As a result, iOFFICE customers were unable to log into the desktop platform and the Hummingbird mobile application. iOFFICE Customer Support and Engineering identified the issue at 3:30 a.m. and, after multiple client reports and internal verification, logged it as a Severity 1 (S1) event at 5:18 a.m.
Investigation & Cause
On July 22, the iOFFICE Engineering team received an automated notification from the PagerDuty application monitoring tool indicating a performance issue. At 3:30 a.m., Support received the first customer report of difficulty logging in and accessing the iOFFICE platform. Support staff on duty were unable to recreate the access issue, which resulted in an improper assessment of the initial incident. As more customers reported similar login and access issues, Support reviewed again and was able to reproduce the issue consistently at 5 a.m. At 5:18 a.m., Engineering escalated and entered the S1 incident protocol.
At 5:47 a.m., DevOps identified that a master node in the high-availability cluster hosting iOFFICE's services had become unhealthy. The failure appeared to be at the hardware level or within the cloud-platform-managed operating system (OS). As of the date of this RCA, iOFFICE is still working with Microsoft to identify the source of the problem and to confirm whether it was hardware- or OS-related. At 6:03 a.m., iOFFICE updated the public system status page to reflect the S1 incident and began sharing hourly updates.
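For illustration only, the following is a minimal sketch of the kind of node-health check that surfaces an unhealthy master node, assuming a Kubernetes-style orchestrator and the official Python client; the RCA does not name the orchestration tool, and all names here are hypothetical.

    # Sketch: list cluster nodes and flag any whose Ready condition
    # is not "True" (assumes a Kubernetes-style cluster; hypothetical).
    from kubernetes import client, config

    config.load_kube_config()  # cluster credentials from local kubeconfig
    v1 = client.CoreV1Api()

    for node in v1.list_node().items:
        ready = next(
            (c.status for c in node.status.conditions if c.type == "Ready"),
            "Unknown",
        )
        if ready != "True":
            print(f"UNHEALTHY node: {node.metadata.name} (Ready={ready})")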
Starting at 6:13 a.m., Engineering also identified a low-storage issue, began restarting services, and assessed the impact of these two fixes on customers. At 11:21 a.m., DevOps removed the unhealthy node from the cluster, and at 11:27 a.m. reported that systems were beginning to come back online. Initial feedback from customers was that sites were experiencing slow performance. To address this, DevOps started more instances of iOFFICEConnect and moved the incident to monitoring at 11:49 a.m.
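A hedged sketch of those two remediation steps follows, again assuming a Kubernetes-style API; the node name, deployment name, and replica count are hypothetical, as the RCA does not specify them.

    # Sketch: cordon the unhealthy node so no new containers are
    # scheduled onto it, then scale out iOFFICEConnect to address the
    # reported slow performance. All names are hypothetical.
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()
    apps = client.AppsV1Api()

    # Equivalent of `kubectl cordon`: mark the node unschedulable.
    core.patch_node("master-node-1", {"spec": {"unschedulable": True}})

    # Add capacity by raising the replica count of the service.
    apps.patch_namespaced_deployment_scale(
        name="ioffice-connect",
        namespace="production",
        body={"spec": {"replicas": 6}},
    )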
At 12:23 p.m., as Support followed up, customers reported that emails and other services were still not working and that they were experiencing slow platform performance. These reports prompted iOFFICE to move the incident back into the investigation phase to assess the email notification problem. Engineering knew from the initial investigation that the current configuration of the container orchestration tool was not optimal and was still working to change the configurations identified as problematic. Once the issue reentered investigation, Engineering identified additional areas of the configuration requiring remediation to ensure that a change in the master node does not prevent the cluster from handling a server reboot or a total failure. These configuration issues are the root cause: they prevented automatic recovery within iOFFICE's clustered environment.
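As an example of the class of configuration gap described here, the sketch below (again assuming a Kubernetes-style API, not iOFFICE's actual tooling) flags workloads whose replicas all sit on a single node and therefore cannot survive that node's reboot or failure.

    # Sketch: detect single-node workloads, i.e. services that would
    # not recover automatically if their one node failed.
    from collections import defaultdict
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    nodes_by_app = defaultdict(set)
    for pod in v1.list_pod_for_all_namespaces().items:
        labels = pod.metadata.labels or {}
        app = labels.get("app")
        if app and pod.spec.node_name:
            nodes_by_app[app].add(pod.spec.node_name)

    for app, nodes in sorted(nodes_by_app.items()):
        if len(nodes) == 1:
            print(f"Single point of failure: '{app}' runs only on {nodes.pop()}")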
At 4:56 p.m., Engineering made configuration changes to redirect the docker containers away from the faulty master node that had been removed from the cluster. Engineering moved the incident back to the monitoring phase. Between 7:28 and 9:20 p.m., customer reports indicated that email notifications were partially restored for some modules and that the remaining performance issues centered on rendering maps. Support tested the map issue separately, and by 9:22 p.m. iOFFICE determined that the maps issue was unrelated to the outage; it was the result of decreased capacity in the maps service licensing, which was addressed immediately. At 10 p.m., Tier 2 Support staff were added to the night shift to monitor incoming reports and engage with Engineering in the ongoing investigation. At 1 a.m. on Friday, July 23, the incident was moved to an S2 level, with systems up and running and the reported issues limited to email notifications not going out. Per the S2 protocol, the team continued to post status page updates every four hours.
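One way to implement the 4:56 p.m. redirect is a NoExecute taint, which evicts running containers from the faulty node and lets the scheduler place them elsewhere. This is a hedged sketch only; the RCA does not state which mechanism was used, and the node name is hypothetical.

    # Sketch: taint the faulty node with NoExecute so existing
    # containers are evicted and rescheduled onto healthy nodes.
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    v1.patch_node(
        "master-node-1",  # hypothetical name of the faulty node
        {"spec": {"taints": [
            {"key": "node-failed", "value": "true", "effect": "NoExecute"}
        ]}},
    )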
Through the night, Engineering continued to investigate other remedies to address the email notification issues. At 6:58 a.m., Support reported that some users continued to experience errors loading maps and was able to replicate the issue internally. At 7:59 a.m., the incident was re-escalated to S1 status, as site stability was deemed intermittent, with email notifications and map loading the two main issues. At 9:56 a.m., DevOps identified an issue with the public DNS resolving to the removed master node. Engineering proceeded to restore the master node from a backup. At 10 a.m., Engineering declared the fixes complete, and Support moved to monitor, test, and validate with customers. By 11 a.m., all services appeared to be back online, with customers reporting stability and better performance. An additional hour of monitoring was deemed necessary as a precaution, given the length of the outage. At 12 p.m. on July 23, Support declared the S1 incident resolved.
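The 9:56 a.m. DNS finding can be verified with a resolution check like the one below; the hostname and the removed node's address are placeholders, not iOFFICE's actual records.

    # Sketch: confirm the public hostname no longer resolves to the
    # address of the removed master node. Hostname and IP are placeholders.
    import socket

    REMOVED_NODE_IP = "203.0.113.15"            # placeholder (documentation range)
    PUBLIC_HOSTNAME = "app.example-ioffice.com"  # placeholder hostname

    resolved = {info[4][0] for info in socket.getaddrinfo(PUBLIC_HOSTNAME, 443)}

    if REMOVED_NODE_IP in resolved:
        print("DNS still resolves to the removed node:", sorted(resolved))
    else:
        print("DNS record is clean:", sorted(resolved))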
Remediation & Preventative Action
Early remediation included Engineering and DevOps removing the unhealthy node from the cluster and failing over to the new master node. While troubleshooting the failure, iOFFICE discovered that the public DNS was resolving to the removed node, and configuration changes were made to the services that were resolving to the removed master node. These changes were only partially successful and left email notifications not fully back online. At that point, iOFFICE determined that a restore of the master node from a backup was required.
We identified two main reasons for the timeline of this event. An Azure-hosted server in our high-availability cluster suffered a failure at the hardware level or within the cloud-platform-managed OS. That failure, combined with the identified problematic configuration of the container orchestration tool, resulted in the inability to recover automatically. To prevent a recurrence, Engineering has scheduled a maintenance window on Saturday, August 7 to add a new server to the cluster with a restored image of the node that is now questionable. Currently, the maintenance window is set to start at 9 a.m. on August 7. Please refer to iOFFICE's Status page for the latest scheduling and more details about the impact of this window on users.
As an additional precaution during this event, Senior Tier 2 Support was brought in to monitor overnight on July 22 and July 23. iOFFICE is committed to engaging the incident response teams in extended monitoring until the reconfiguration on August 7 is complete. This means that iOFFICE will actively have additional resources in place for 24/7 monitoring and issue detection, in addition to the automated notification systems in both Engineering and Support and the regular coverage per our SLA terms. This will remain in place until August 8, 2021.