Eptura Workplace Detailed Root Cause Analysis | July 10, 2024
S1 – Inability to Access Eptura Workplace
We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.
Description:
On July 10, 2024, a minor upgrade of our Kubernetes cluster in France triggered a temporary service interruption that affected access to Eptura Workplace for a small subset of our European customers. Our team promptly resolved the issue to minimize inconvenience and restore service reliability. We appreciate your understanding and patience while we continue to enhance our systems to serve you better.
Type of Event:
Outage
Services/Modules impacted:
Production Environments
Timeline: (Times are reported in MST)
2:44am: Customers reported being unable to access the system. Internal team members confirmed the issue and created an alert notifying all customers of the disruption via the Status Page.
4:37am: The Status Page was moved to the identified phase. The team began working on a resolution for customers based in Europe.
7:11am: The fix was implemented and the Status Page was moved to the monitoring phase for the next 2 hours.
9:21am: With customers confirming resolution and no further reports received, the Status Page was moved to the resolved phase.
Total Duration of Event:
6 Hours 37 Minutes
Root Cause:
The issue stemmed from certain outdated elements in the deployment scripts used for managing the cluster’s infrastructure. Specifically, the older template files were unable to effectively restore the infrastructure following the upgrade.
Remediation:
Our engineering team has updated and rigorously tested the template files so that the infrastructure can be restored correctly after an upgrade. This update resolved the issue. Additionally, we have refined our processes to schedule future upgrades, even minor ones, outside business hours. This timing aims to minimize any potential impact on our customers and improve overall service reliability.
Preventative Action:
To prevent this issue from occurring again, we have taken the following steps:
We sincerely apologize for any inconvenience caused and appreciate your understanding as we continue to improve our systems and processes to better serve you.