Site not accessible
Incident Report for Eptura Workplace
Postmortem

Eptura Workplace Detailed Root Cause Analysis | July 10, 2024

S1 – Inability to Access Eptura Workplace 

 

We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident. 

 

Description: 

On July 10, 2024, a minor upgrade of our Kubernetes cluster in France triggered a temporary service interruption that affected a small subset of our European customers’ access to Eptura Workplace. Our team resolved the issue the same day. We appreciate your understanding and patience while we continue to enhance our systems to serve you better.

 

Type of Event: 

Outage 

 

Services/Modules impacted: 

Production Environments  

 

Timeline: (Times are reported in Mountain Time, MDT, UTC-6)

2:44am: Customers report being unable to access the system. Internal team members confirm the issue and create an alert that notifies all customers of the disruption via the Status Page.

4:37am: The Status Page is moved to the Identified phase. The team begins working on a resolution for customers based in Europe.

7:11am: The fix is implemented and the Status Page is moved to the Monitoring phase for the next 2 hours.

9:21am: With customers confirming resolution and no further reports received, the Status Page is moved to the Resolved phase.

 

Total Duration of Event: 

6 Hours 37 Minutes 
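
As a cross-check, this duration can be reproduced from the UTC timestamps on the Status Page updates for this incident (08:44 UTC for the Investigating update and 15:21 UTC for the Resolved update). The following is a minimal Python sketch; the names `FIRST_UPDATE`, `RESOLVED`, and `format_duration` are illustrative assumptions and not part of any Eptura tooling.

```python
from datetime import datetime, timezone

# Status Page timestamps for this incident, taken from the update log (UTC).
FIRST_UPDATE = datetime(2024, 7, 10, 8, 44, tzinfo=timezone.utc)  # Investigating
RESOLVED = datetime(2024, 7, 10, 15, 21, tzinfo=timezone.utc)     # Resolved


def format_duration(start, end):
    """Return the elapsed time between two datetimes as 'H Hours M Minutes'."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} Hours {minutes} Minutes"


print(format_duration(FIRST_UPDATE, RESOLVED))
```

Working from the UTC timestamps avoids any ambiguity about which local time zone the timeline uses.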

 

Root Cause:  

The issue stemmed from certain outdated elements in the deployment scripts used for managing the cluster’s infrastructure. Specifically, the older template files were unable to effectively restore the infrastructure following the upgrade.  

 

Remediation: 

Our engineering team has updated and tested the template files so that they can restore the infrastructure correctly after an upgrade. This update resolved the issue. Additionally, we have refined our processes to schedule future upgrades, even minor ones, during off-business hours to minimize any potential impact on our customers.

 

Preventative Action:  

To prevent this issue from occurring again, we have taken the following steps: 

  1. Updated and validated all template files to ensure proper functionality. 
  2. Established a new protocol where all upgrades, including minor ones, will be performed during off-business hours to avoid service disruptions. 

 

We sincerely apologize for any inconvenience caused and appreciate your understanding as we continue to improve our systems and processes to better serve you.

Posted Aug 21, 2024 - 14:22 UTC

Resolved
As we have not seen further service disruptions after the fix was implemented, we have moved to the Resolved Phase. A detailed RCA will be posted in 10 business days. Please stay subscribed to the page to receive the post automatically.
Posted Jul 10, 2024 - 15:21 UTC
Monitoring
A fix has been implemented. We are moving into the Monitoring Phase for the next 2 hours.
Posted Jul 10, 2024 - 13:11 UTC
Identified
The issue with Eptura Workplace login functionality has been identified and a fix is being implemented. We will post another update at 7:30 AM CST.
Posted Jul 10, 2024 - 10:37 UTC
Investigating
We are currently investigating an issue with Eptura Workplace. We will update you when we have more information.
Posted Jul 10, 2024 - 08:44 UTC
This incident affected: System Status.