Production Sites Unable to Be Accessed

Incident Report for Eptura Workplace

Postmortem

iOFFICE Root Cause Analysis (RCA) – Severity 1 Event October 11th, 2021
On October 11th, 2021, iOFFICE had a server communication and configuration issue that prevented access to all iOFFICE desktop sites. Our server monitoring identified the issue at 10:34 AM CST. After an attempted rolling restart failed to bring services back online, it was logged as a Severity 1 event.
Root Cause
Our investigation found two issues that contributed to the length of the outage:

• A host running services critical to the application was unable to communicate with other service hosts. This prevented the overall application from functioning.
• An initial routing configuration load failed when service restarts were performed in an attempt to reestablish normal function of the application.
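The first issue, a host that cannot reach its peers, is the kind of condition a simple reachability probe can surface. The following is a minimal sketch only, not the tooling iOFFICE used: the function names `can_reach` and `unreachable_peers` are hypothetical, and it assumes that each peer service accepts TCP connections on a known port.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def unreachable_peers(peers, timeout: float = 2.0):
    """Return the (host, port) pairs from peers that could not be reached."""
    return [(host, port) for host, port in peers if not can_reach(host, port, timeout)]
```

Run from each service host against the list of hosts it depends on, a non-empty result would flag a host that is isolated from the rest of the system.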
Remediation
Normal function of the application was restored by moving all services to hosts that can communicate properly with the overall system, and by correcting a configuration error.
Timeline
All times are CST.
• 10:34 AM: Our monitoring tools began registering an issue with our Visitor module, and a health check was performed.
• 10:51 AM: Support and Engineering saw evidence that the system was experiencing more widespread issues.
• 10:54 AM: Engineering performed service restarts. These restarts introduced a configuration change that caused an additional problem with the gateway service responsible for all client communication; however, this was not immediately identifiable.
• 11:04 AM: A Severity 1 incident was logged as Engineering continued to investigate.
• 12:08 PM: Engineering replicated the gateway issue and identified the configuration changes needed.
• 2:18 PM: After review and testing were completed, the fixed configuration was released, restoring the gateway service to proper function.
• 2:41 PM: Engineering discovered that certain services critical to traffic routing were running on a host that could not communicate with other hosts in the system. The team began investigating the cause.
• 3:49 PM: Engineering moved the services from the bad host to another host, and the system returned to normal function. Engineering began post-incident monitoring.
• 3:58 PM: Engineering informed Support that the system had returned to normal function, and Support began customer checks.
• 4:15 PM: The internal all clear was given.
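The elapsed times implied by the timeline above can be worked out directly from its timestamps. The sketch below is illustrative arithmetic only; the helper name `minutes_between` is hypothetical, and it assumes all events fall on the same day and that the runtime uses an English locale for AM/PM parsing.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both given as same-day 'H:MM AM/PM' strings."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Detection (10:34 AM) to return to normal function (3:49 PM).
print(minutes_between("10:34 AM", "3:49 PM"))   # 315 minutes (5 h 15 min)

# Detection (10:34 AM) to internal all clear (4:15 PM).
print(minutes_between("10:34 AM", "4:15 PM"))   # 341 minutes (5 h 41 min)

# Service restarts (10:54 AM) to release of the fixed configuration (2:18 PM).
print(minutes_between("10:54 AM", "2:18 PM"))   # 204 minutes (3 h 24 min)
```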
Preventative Action and Analysis:
Engineering is investigating why a host was unable to communicate with other hosts on the system. We have also implemented additional testing requirements for configuration changes. Engineering has identified and tested an interim software and operating approach that will prevent a recurrence of the issue we experienced on 10/11/2021.
Engineering also continues to dedicate a large share of its resources to migrating the system to a more stable, maintainable, and observable infrastructure configuration.
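As an illustration of what a testing requirement for configuration changes might look like, the sketch below validates a routing configuration before it is loaded. Everything in it is an assumption made for the example: the report does not describe the actual configuration format, so the `routes`, `path`, and `upstream` keys and the `validate_routing_config` function are hypothetical.

```python
def validate_routing_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means the config passed."""
    routes = config.get("routes")
    if not isinstance(routes, list) or not routes:
        return ["config must contain a non-empty 'routes' list"]

    errors = []
    seen_paths = set()
    for i, route in enumerate(routes):
        if not isinstance(route, dict):
            errors.append(f"route {i}: must be a mapping")
            continue
        path = route.get("path")
        upstream = route.get("upstream")
        # Hypothetical rule: every route needs a unique path starting with '/'.
        if not isinstance(path, str) or not path.startswith("/"):
            errors.append(f"route {i}: 'path' must be a string starting with '/'")
        elif path in seen_paths:
            errors.append(f"route {i}: duplicate path {path!r}")
        else:
            seen_paths.add(path)
        # Hypothetical rule: every route needs a non-empty upstream target.
        if not isinstance(upstream, str) or not upstream:
            errors.append(f"route {i}: 'upstream' must be a non-empty string")
    return errors
```

A deployment step could refuse to restart a service when this function returns any errors, so that a bad configuration is caught before it reaches a gateway.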
A Follow-up Message on Recent Outages from the iOFFICE + SpaceIQ Executive Team
As stated in a previous communication from the iOFFICE + SpaceIQ CEO on October 1st, we have identified components of our infrastructure that cannot perform reliably long term and that contributed to the length of this outage and previous ones. iOFFICE + SpaceIQ has thus far taken three actions:
• Engaged highly experienced senior Thoma Bravo technologists to help diagnose our issues and help build an aggressive recovery plan
• Added 24 additional resources to replace those components as soon as possible
• Committed to finding ways to keep the current architecture running while the components are replaced
With the expanded team in place, we have started replacing the underperforming components with industry-standard software that is proven reliable. This work will be completed in weeks, not months.
We regret these service issues and expect to have the current issues permanently behind us within a few weeks.

Posted Oct 21, 2021 - 18:08 UTC

Resolved

We're now moving to a resolved state, as we have had multiple confirmations from our Support team as well as clients. We appreciate your patience while we worked through this issue. If you are still experiencing intermittent issues or have any concerns, please don't hesitate to reach out to the Support team or the Customer Success team.

Posted Oct 11, 2021 - 22:48 UTC

Monitoring

The Engineering team has identified the issue and put a fix in place. We are moving to a monitoring state while we confirm that the fix has restored access to the site. Please reach out to the Support team to confirm. We will post an update at 4:55 PM CST.

Posted Oct 11, 2021 - 21:00 UTC

Update

Our Engineering team is still investigating, and all available resources are dedicated to resolving this issue. Our sincerest apologies as we continue these efforts. We will update again in an hour, at 3:55 PM CST.

Posted Oct 11, 2021 - 19:57 UTC

Update

Our Engineering team is still investigating the issues at hand. We deeply apologize as we continue our efforts to find a resolution and get the site back online. We are still posting updates every hour, and our next is scheduled for 2:55 PM CST. Our Support team and Customer Success team are more than happy to assist with any questions or concerns you might have.

Posted Oct 11, 2021 - 18:58 UTC

Update

Our Engineering team is making progress on the issues at hand, and we are still in an investigation state. We will update again at 1:55 PM CST. Again, we appreciate your patience as we continue our efforts to get the site back online.

Posted Oct 11, 2021 - 17:57 UTC

Update

The Engineering team is still working diligently to find the source of the issue. We apologize for any inconvenience this may be causing. The next update will be at 12:55 PM CST.

Posted Oct 11, 2021 - 16:55 UTC

Investigating

Hello, as of 10:55 AM CST we are seeing that production sites cannot be accessed. Our Engineering team is actively looking into the cause, and we will post an update at 11:55 AM CST. We greatly appreciate your patience as we investigate the issues at hand.

Posted Oct 11, 2021 - 16:04 UTC

This incident affected: Eptura Workplace Modules (Space Module, Move Module, Reservation Module, Service Request Module, Asset Module, Inventory Module, Mail Module, Copy Module, Visitor Module, Insights Module).