iOFFICE Root Cause Analysis (RCA)
Severity 1 Event: July 22, 2021
Note: All times listed in Central Daylight Time.
Background
On Thursday, July 22, 2021, iOFFICE experienced a critical Azure Virtual Machine failure that exposed gaps in the failover functionality of the application hosting framework. As a result, iOFFICE customers were unable to log into the desktop platform and the Hummingbird mobile application. iOFFICE Customer Support and Engineering identified the issue at 3:30 a.m. and, after multiple client reports and internal verification, logged it as a Severity 1 (S1) event at 5:18 a.m.
Investigation & Cause
On July 22, the iOFFICE Engineering team received an automated notification from the PagerDuty application monitoring tool indicating a performance issue. At 3:30 a.m., Support received the first customer report of difficulty logging in and accessing the iOFFICE platform. Support staff on duty were unable to recreate the access issue, which resulted in an improper assessment of the initial incident. As more customers reported similar login and access issues, Support reviewed again and was able to reproduce the issue consistently at 5 a.m. At 5:18 a.m., Engineering escalated and entered the S1 incident protocol.
At 5:47 a.m., DevOps identified that a master node in the high-availability cluster hosting iOFFICE's services had become unhealthy. The failure appeared to be at the hardware level or within the cloud-platform-managed operating system (OS). As of the date of this RCA, iOFFICE is still working with Microsoft to identify the source of the problem and to confirm whether it was hardware- or OS-related. At 6:03 a.m., iOFFICE updated the public system status page to reflect the S1 incident and began sharing hourly updates.
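For illustration only, the following is a minimal sketch of the kind of node-health check that surfaces an unhealthy master node, assuming a Kubernetes-style orchestrator and the official Python client; the RCA does not name the orchestration tool, and all names here are hypothetical.

    # Sketch: list cluster nodes and flag any whose Ready condition
    # is not "True" (assumes a Kubernetes-style cluster; hypothetical).
    from kubernetes import client, config

    config.load_kube_config()  # cluster credentials from local kubeconfig
    v1 = client.CoreV1Api()

    for node in v1.list_node().items:
        ready = next(
            (c.status for c in node.status.conditions if c.type == "Ready"),
            "Unknown",
        )
        if ready != "True":
            print(f"UNHEALTHY node: {node.metadata.name} (Ready={ready})")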
Starting at 6:13 a.m., Engineering also identified a low-storage issue, began restarting services, and assessed the impact of these two fixes on customers. At 11:21 a.m., DevOps removed the unhealthy node from the cluster, and at 11:27 a.m. reported that systems were beginning to come back online. Initial feedback from customers was that sites were experiencing slow performance. To address this, DevOps started more instances of iOFFICEConnect and moved the incident to monitoring at 11:49 a.m.
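A hedged sketch of those two remediation steps follows, again assuming a Kubernetes-style API; the node name, deployment name, and replica count are hypothetical, as the RCA does not specify them.

    # Sketch: cordon the unhealthy node so no new containers are
    # scheduled onto it, then scale out iOFFICEConnect to address the
    # reported slow performance. All names are hypothetical.
    from kubernetes import client, config

    config.load_kube_config()
    core = client.CoreV1Api()
    apps = client.AppsV1Api()

    # Equivalent of `kubectl cordon`: mark the node unschedulable.
    core.patch_node("master-node-1", {"spec": {"unschedulable": True}})

    # Add capacity by raising the replica count of the service.
    apps.patch_namespaced_deployment_scale(
        name="ioffice-connect",
        namespace="production",
        body={"spec": {"replicas": 6}},
    )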
At 12:23 p.m., as Support followed up, customers reported that emails and other services were still not working and that they were experiencing slow platform performance. These reports prompted iOFFICE to move the incident back into the investigation phase to assess the email notification problem. Engineering knew from the initial investigation that the current configuration of the container orchestration tool was not optimal and was still working to change the configurations identified as problematic. Once the issue reentered investigation, Engineering identified additional areas of the configuration requiring remediation to ensure that a change in the master node does not prevent the cluster from handling a server reboot or a total failure. These configuration issues are the root cause: they prevented automatic recovery within iOFFICE's clustered environment.
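As an example of the class of configuration gap described here, the sketch below (again assuming a Kubernetes-style API, not iOFFICE's actual tooling) flags workloads whose replicas all sit on a single node and therefore cannot survive that node's reboot or failure.

    # Sketch: detect single-node workloads, i.e. services that would
    # not recover automatically if their one node failed.
    from collections import defaultdict
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    nodes_by_app = defaultdict(set)
    for pod in v1.list_pod_for_all_namespaces().items:
        labels = pod.metadata.labels or {}
        app = labels.get("app")
        if app and pod.spec.node_name:
            nodes_by_app[app].add(pod.spec.node_name)

    for app, nodes in sorted(nodes_by_app.items()):
        if len(nodes) == 1:
            print(f"Single point of failure: '{app}' runs only on {nodes.pop()}")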
At 4:56 p.m., Engineering made configuration changes to redirect the docker containers away from the faulty master node that had been removed from the cluster. Engineering moved the incident back to the monitoring phase. Between 7:28 and 9:20 p.m., customer reports indicated that email notifications were partially restored for some modules and that the remaining performance issues centered on rendering maps. Support tested the map issue separately, and by 9:22 p.m. iOFFICE determined that the maps issue was unrelated to the outage; it was the result of decreased capacity in the maps service licensing, which was addressed immediately. At 10 p.m., Tier 2 Support staff were added to the night shift to monitor incoming reports and engage with Engineering in the ongoing investigation. At 1 a.m. on Friday, July 23, the incident was moved to an S2 level, with systems up and running and the reported issues limited to email notifications not going out. Per the S2 protocol, the team continued to post status page updates every four hours.
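One way to implement the 4:56 p.m. redirect is a NoExecute taint, which evicts running containers from the faulty node and lets the scheduler place them elsewhere. This is a hedged sketch only; the RCA does not state which mechanism was used, and the node name is hypothetical.

    # Sketch: taint the faulty node with NoExecute so existing
    # containers are evicted and rescheduled onto healthy nodes.
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    v1.patch_node(
        "master-node-1",  # hypothetical name of the faulty node
        {"spec": {"taints": [
            {"key": "node-failed", "value": "true", "effect": "NoExecute"}
        ]}},
    )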
Through the night, Engineering continued to investigate other remedies to address the email notification issues. At 6:58 a.m., Support reported that some users continued to experience errors loading maps and was able to replicate the issue internally. At 7:59 a.m., the incident was re-escalated to S1 status, as site stability was deemed intermittent, with email notifications and map loading the two main issues. At 9:56 a.m., DevOps identified an issue with the public DNS resolving to the removed master node. Engineering proceeded to restore the master node from a backup. At 10 a.m., Engineering declared the fixes complete, and Support moved to monitor, test, and validate with customers. By 11 a.m., all services appeared to be back online, with customers reporting stability and better performance. An additional hour of monitoring was deemed necessary as a precaution, given the length of the outage. At 12 p.m. on July 23, Support declared the S1 incident resolved.
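The 9:56 a.m. DNS finding can be verified with a resolution check like the one below; the hostname and the removed node's address are placeholders, not iOFFICE's actual records.

    # Sketch: confirm the public hostname no longer resolves to the
    # address of the removed master node. Hostname and IP are placeholders.
    import socket

    REMOVED_NODE_IP = "203.0.113.15"            # placeholder (documentation range)
    PUBLIC_HOSTNAME = "app.example-ioffice.com"  # placeholder hostname

    resolved = {info[4][0] for info in socket.getaddrinfo(PUBLIC_HOSTNAME, 443)}

    if REMOVED_NODE_IP in resolved:
        print("DNS still resolves to the removed node:", sorted(resolved))
    else:
        print("DNS record is clean:", sorted(resolved))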
Remediation & Preventative Action
Early remediation included Engineering and DevOps removing the unhealthy node from the cluster and failing over to the new master node. While troubleshooting the failure, iOFFICE discovered that the public DNS was resolving to the removed node, and configuration changes were made to the services that were resolving to the removed master node. These changes were only partially successful and left email notifications not fully back online. At that point, iOFFICE determined that a restore of the master node from a backup was required.
We identified two main reasons for the timeline of this event. An Azure-hosted server in our high-availability cluster suffered a failure at the hardware level or within the cloud-platform-managed OS. That failure, combined with the identified problematic configuration of the container orchestration tool, resulted in the inability to recover automatically. To prevent a recurrence, Engineering has scheduled a maintenance window on Saturday, August 7 to add a new server to the cluster with a restored image of the node that is now questionable. Currently, the maintenance window is set to start at 9 a.m. on August 7. Please refer to iOFFICE's Status page for the latest scheduling and more details about the impact of this window on users.
As an additional precaution during this event, Senior Tier 2 Support was brought in to monitor overnight on July 22 and July 23. iOFFICE is committed to engaging the incident response teams in extended monitoring until the reconfiguration on August 7 is complete. This means that iOFFICE will actively have additional resources in place for 24/7 monitoring and issue detection, in addition to the automated notification systems in both Engineering and Support and the regular coverage per our SLA terms. This will remain in place until August 8, 2021.