Eptura Workplace detailed Root Cause Analysis | 09/10/2024
S2 - Reservations - floor plan and calendar views not loading existing data and certain workflows are failing
We are truly grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.
Description:
On September 10th, our Engineering team identified a Severity 2 issue with the Reservations module, affecting data loading and workflows. This was tied to our September 5th release of recurring reservations. Our dedicated team rolled back the release, fixed the coding bug, and redeployed successfully. Additionally, our Operations team increased Redis memory to handle higher demands.
Type of Event:
Functionality Issue
Services/Modules impacted:
Production/ Reservation Modules
Timeline:
09/10/2024 (Reported MDT)
- 12:30 PM: After multiple reports that users could not check in to reservations and that floor plan views were not displaying correctly in space availability and the Hummingbird application, the engineering and product teams raised an S2 incident. All customers were informed via the status page that the issue was under investigation.
- 1:44 PM: The engineering team identified the root cause of the disruption as the release deployed on September 5, 2024. All customers were notified via the status page and told that the situation was being closely monitored.
- 11:11 PM: The rollback of the release is in progress.
09/11/2024 (Reported MDT)
- 3:21 AM: The status page was updated from Identified to Monitoring, and customers were informed that the rollback had been completed. Monitoring continued throughout the day.
- 4:05 AM: Detailed testing of the application showed that the floor plan and calendar views were working again, but errors still occurred when checking in to reservations. The team continued to investigate this issue.
- 11:06 AM: A communication was sent to customers describing the rollback, what had been resolved, and what was still being worked on: Following the update on September 5, we encountered some performance issues with the Reservations feature and the Hummingbird app. To ensure service continuity, we have reverted to the previous software version. This action has successfully resolved errors related to Reservation floor plans and connectivity issues with the Hummingbird app. Our team is committed to fully resolving the check-in errors and restoring optimal functionality to the Reservations feature. We are making steady progress and will continue to keep you informed with the latest updates as we enhance the system.
- 3:22 PM: Customers were informed that the check-in errors previously identified have been successfully resolved without the need for a code release. Monitoring will continue into the next day to ensure stability.
09/12/2024 (Reported MDT)
- 12:01 PM: Customers have been informed that check-in errors previously identified have been largely resolved. Most functionality has been restored. However, we are aware of some residual issues with the Hummingbird Application that may still be affecting a few users.
- 4:23 PM: Some users were still experiencing issues accessing the Hummingbird application. Monitoring continued through 9/16/2024.
9/16/2024 (Reported MDT)
- 9:04 AM: The status page was updated from Monitoring back to Investigating. Customers were informed that our engineering team had worked diligently through the weekend and was continuing to investigate the intermittent accessibility issue with the Hummingbird application.
- 1:14 PM: The status page was updated from Investigating to Identified, and customers were informed that the engineering team was working on a resolution for the intermittent accessibility issue with the Hummingbird application. The engineering team continued to work on a resolution through 9/18.
9/18/2024 (Reported MDT)
- 8:20 AM: The engineering team developed a fix, which was going through QA testing. The team planned to release a hotfix for this issue on 9/23/2024.
9/19/2024 (Reported MDT)
- 10:34 AM: The hotfix date in the status page message was changed from 9/23/2024 to 9/24/2024.
9/25/2024 (Reported MDT)
- 9:02 AM: Customers were informed of the following as the hotfix was not deployed: To ensure we deliver the best experience, our product and QA teams are thoroughly testing the hotfix initially planned for September 24, 2024. They've identified an issue that requires additional attention, so we're taking extra time to make sure everything is perfect. Our teams are working diligently to resolve this and will share an update as soon as possible. We appreciate your understanding and patience.
9/27/2024 (Reported MDT)
- 11:32 AM: The status page remained in the Identified phase. Customers were informed that the hotfix would be deployed on Thursday, 10/3/2024 at 10 PM CDT.
10/04/2024 (Reported MDT)
- 2:30 PM: The status page was updated from Identified to Monitoring. After the hotfix was deployed, customers began confirming the resolution, and monitoring continued.
- 4:36 PM: The status page was marked as Resolved, and customers were made aware of an issue that was discovered during the hotfix. The communication to customers explained the impact of this discovered issue, what could be expected, and when a resolution was anticipated.
Total Duration of Event:
24 Days 4 Hours 6 Minutes
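The elapsed time can be checked directly from the timeline. The short sketch below assumes the event runs from the first timeline entry (09/10/2024, 12:30 PM, when the S2 incident was raised) to the last (10/04/2024, 4:36 PM, when the status page was marked as resolved), and that both times are in MDT as reported.

```python
from datetime import datetime

# Assumed boundaries, taken from the first and last timeline entries (both MDT).
start = datetime(2024, 9, 10, 12, 30)  # S2 incident raised
end = datetime(2024, 10, 4, 16, 36)    # status page marked as resolved

elapsed = end - start

# timedelta stores whole days separately from the leftover seconds.
hours, remainder = divmod(elapsed.seconds, 3600)
minutes = remainder // 60

summary = f"{elapsed.days} Days {hours} Hours {minutes} Minutes"
print(summary)
```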
Root Cause:
An issue in Eptura Workplace led to production errors, originating from a release deployed on September 5, 2024. The investigation revealed three primary root causes: a coding error where a variable was declared both locally and globally, a QA environment that did not accurately replicate production data and systems, and a Redis instance that encountered an "out of memory" error due to insufficient capacity and lack of monitoring. These findings provide valuable insights for enhancing our processes and ensuring a more robust system moving forward.
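To illustrate the first root cause, the sketch below shows how declaring the same name both globally and locally can silently break shared state. It is a generic Python illustration only: the actual language, variable, and code involved in the incident are not described in this report, and every name here (reservation_cache, load_reservations_buggy, load_reservations_fixed) is hypothetical.

```python
# Hypothetical illustration; none of these names come from the Eptura codebase.
reservation_cache = {}  # module-level (global) state shared across calls


def load_reservations_buggy(rows):
    # BUG: assigning to the name creates a new local variable that shadows
    # the global one, so the shared cache is never updated.
    reservation_cache = {row["id"]: row for row in rows}
    return len(reservation_cache)


def load_reservations_fixed(rows):
    # Fix: mutate the shared dictionary in place instead of rebinding a
    # local name, so other code reading the global sees the new data.
    reservation_cache.update({row["id"]: row for row in rows})
    return len(reservation_cache)


rows = [{"id": 1, "space": "A-101"}, {"id": 2, "space": "B-204"}]

load_reservations_buggy(rows)
print(len(reservation_cache))  # global cache is still empty after the buggy call

load_reservations_fixed(rows)
print(len(reservation_cache))  # global cache now holds both reservations
```

Both functions return the same count, which is why this kind of defect can pass a test that only checks the return value while views that read the shared state show no data.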
Remediation:
To swiftly address the recent incident, our Engineering team rolled back the affected release, identified and fixed the coding bug, and successfully tested and deployed the fixed release to production. Additionally, our Operations team increased the memory allocation for Redis to handle higher capacity demands.
Prevention:
To prevent future incidents, our Engineering team will address technical debt by reviewing and refactoring similar coding issues. Our Quality Assurance team will allocate more time for comprehensive testing, enhance test cases to better simulate production environments, and collaborate with Operations to improve the QA environment. Additionally, our Operations team will implement enhanced monitoring and alerting for Redis memory capacity to proactively address potential issues. These proactive measures will ensure a more reliable experience for our users.
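As an illustration of what Redis memory monitoring could look like, the sketch below classifies memory usage from the fields that the Redis INFO memory command reports (used_memory and maxmemory). It is not the monitoring that the Operations team will deploy: the function name redis_memory_status and the 80% and 95% thresholds are assumptions chosen for the example.

```python
def redis_memory_status(info, warn_ratio=0.80, critical_ratio=0.95):
    """Classify Redis memory usage from an INFO-memory style dictionary.

    The thresholds are illustrative assumptions, not Eptura's settings.
    """
    used = int(info["used_memory"])
    limit = int(info.get("maxmemory", 0))

    if limit == 0:
        # maxmemory = 0 means Redis itself enforces no limit, so a ratio
        # cannot be computed; the host's memory would need to be watched.
        return "no-limit"

    ratio = used / limit
    if ratio >= critical_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


# Example with made-up numbers: 900 MB used out of a 1 GB limit.
sample = {"used_memory": 900 * 1024 * 1024, "maxmemory": 1024 * 1024 * 1024}
print(redis_memory_status(sample))
```

With the redis-py client, a dictionary of this shape can be obtained from a live instance with redis.Redis(...).info("memory"). A scheduled job could then raise an alert whenever the status is anything other than "ok", well before the instance reaches an "out of memory" condition.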