Aerial Direct Reason for Outage Report
Outage -
Keevio Desktop Major Service Outage (IP Cortex, supplier of Keevio to Aerial Direct)
Date -
24th to 29th May 2024
Author -
Kim-eleas Smith, Head of Service Delivery, Aerial Direct
Date of Report -
3rd June 2024
RFO Outline & Impact
The IP Cortex Keevio Desktop App stopped playing audio on inbound and outbound calls from around 0845 on 24th May 2024. 1018 Customers may have been affected.
Aerial Direct Ticket Reference T20240524.0096
Timeline from Friday 24th May 2024
Multiple reports of no audio on inbound and outbound calls using Keevio Desktop. Major Service Outage ticket raised by Aerial. IP Cortex engaged to look at the issue. Aerial assesses the impact on its Keevio Desktop users. Aerial and IP Cortex Engineers start working together to establish the root cause and to find a fix.
0850
0900
Major Service Outage update 1 sent to 1018 affected Customers.
1100
Major Service Outage update 2 sent to 1018 affected Customers.
1215
Troubleshooting continues. Engineers on both sides continue to try and find the root cause of the issue.
Major Service Outage update 3 sent to 1018 affected Customers.
1500
Aerial is advised that other Channel Partners of IP Cortex are also experiencing issues with Keevio Desktop. This gives confidence that Aerial kit is not at fault. The Aerial Direct Engineers suggest the STUN server is at fault. Investigations begin. Aerial asks IPC to check that their STUN server details and functions are sound.
1500
Troubleshooting continues. Engineers on both sides continue to try and find the root cause of the issue.
Keevio Mobile App is offered as a workaround. This is offered on a free-of-charge (FOC) basis.
Major Service Outage update 4 sent to 1018 affected Customers. This update advises that we are continuing to work on the issue over the weekend.
1730
Aerial and IPC Engineers continue work on the issue.
1730 - 2100
Timeline from Saturday 25th May 2024
Aerial and IPC Engineers continue work on the suggestion that STUN is at fault.
0900
Major Service Outage update 5 sent to 1018 affected Customers.
1020
Aerial and IPC Engineers continue work on the suggestion that STUN is at fault, taking a deep dive on packet captures. Tried spinning up a brand-new PBX, using a new license, to rule out license issues and confirmed the same behaviour.
1020 - 1930
Timeline from Sunday 26th May 2024
Aerial and IPC Engineers continue work on the suggestion that STUN is at fault. Discovered that an Aerial engineer had working speech from his home IP address, but another engineer did not. Testing was seemingly inconsistent based on this, although no Aerial engineers had working speech from Aerial’s office. Thoroughly documented and compared working calls to non-working calls but still unable to isolate the root cause.
0900 - 1830
Major Service Outage update 6 sent to 1018 affected Customers.
1830
Timeline from Monday 27th May 2024
Aerial and IPC Engineers continue work on the suggestion that STUN is at fault. Attempted spinning up a media proxy inside Aerial’s DC but couldn’t get this working. Tried using a third-party STUN server but no change. Fully proven that all traffic between the client and the PBX was reaching the correct IPs, but that traffic from PBX to client wasn’t passing through the NAT gateway and that the PBX was rejecting requests from the client.
0900 - 1830
Major Service Outage update 7 sent to 1018 affected Customers.
1500
Timeline from Tuesday 28th May 2024
Major Service Outage update 8 sent to 1018 affected Customers.
0920
Aerial and IPC engineers continue troubleshooting using Aerial’s test PBX.
Timeline from Wednesday 29th May 2024
Major Service Outage update 9 sent to 1018 affected Customers.
0900
Aerial and IPC engineers continued troubleshooting. Retraced our steps for some of the testing that had been carried out with the test PBX and were able to get working speech by disabling Secure TURN. This suggested a certificate issue with the IPC ICE server. IP Cortex tracked down the certificate in question and replaced it with a valid cert. Discovered a bug when testing for failed certs using the Windows port of Nmap, where it doesn’t always return cert info. Tested with the Linux version of Nmap, which successfully returned cert info. Testing confirmed that replacing the certificate resolved the issue.
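A certificate fault of the kind found here can also be surfaced with a short script rather than Nmap. The following is a minimal sketch and not the method the engineers used. It assumes the ICE/TURN endpoint accepts TLS directly on a known port. The function names `classify_handshake_error` and `probe_tls` are illustrative, and the host and port in the usage note are placeholders, not IP Cortex server details.

```python
import socket
import ssl


def classify_handshake_error(exc: Exception) -> str:
    """Turn a connection or handshake exception into a short verdict."""
    if isinstance(exc, ssl.SSLCertVerificationError):
        # verify_message is set by the ssl module when verification fails
        # (for example "certificate has expired"); fall back to str().
        reason = getattr(exc, "verify_message", None) or str(exc)
        return f"certificate rejected: {reason}"
    if isinstance(exc, ssl.SSLError):
        return f"TLS error: {exc}"
    if isinstance(exc, OSError):
        # Covers refused connections, DNS failures and timeouts.
        return f"connection error: {exc}"
    return f"unexpected error: {exc}"


def probe_tls(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt a verified TLS handshake and report the outcome as text."""
    context = ssl.create_default_context()  # verifies chain, dates, hostname
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
                return f"certificate accepted, notAfter={cert.get('notAfter')}"
    except Exception as exc:
        return classify_handshake_error(exc)
```

A call such as `probe_tls("turn.example.com", 5349)` (placeholder host; 5349 is the conventional TURN-over-TLS port, and the real port is an assumption) returns a one-line verdict that behaves the same on Windows and Linux.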
0900 - 1400
Major Service Outage Resolved update 10 sent to 1018 affected Customers.
1410
Root Cause and Prevention
Root Cause from IP Cortex
1. The one-way audio issue on Keevio Desktop clients at Aerial Direct was caused by expired SSL certificates on the IP Cortex STUN servers.
2. The notification of the pending certificate expiry was sent to the previous engineering manager, but was never actioned.
3. Delayed Identification: The troubleshooting process took 6 days to identify the expired SSL certificates as the root cause, indicating potential gaps in the troubleshooting procedures or tools used.
Prevention Methods from IPC
1. Set up comprehensive monitoring and alerting systems for all critical services, including STUN servers.
2. Implement robust SSL certificate management and monitoring, along with improved troubleshooting protocols, to help prevent similar incidents in the future.
3. Develop and document standardised troubleshooting protocols that include checking SSL certificates as a potential issue.
4. Conduct regular training sessions for the support team to stay updated on best practices and new tools for efficient troubleshooting.
Aerial Direct Conclusion
Aerial supports the implementation of the above measures so that this issue is not repeated. Our Engineers assisted in finding the root cause and feel that it could have been found sooner had the above methods been in place at IP Cortex.
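The certificate monitoring named in the prevention methods can be reduced to a simple rule: compare each certificate's expiry date with today's date and raise an alert well before it lapses. The following is a minimal sketch of that rule, not IP Cortex's actual tooling. The function names and the 30-day warning threshold are assumptions, and the `not_after` string is expected in the format Python's `ssl` module reports for a peer certificate (for example `Jun  1 12:00:00 2025 GMT`).

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and the certificate's notAfter timestamp."""
    # ssl.cert_time_to_seconds parses the "Mon DD HH:MM:SS YYYY GMT" form.
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def expiry_status(not_after: str, now: datetime, warn_days: int = 30) -> str:
    """Classify a certificate as OK, WARN (expiring soon) or EXPIRED."""
    remaining = days_until_expiry(not_after, now)
    if remaining < 0:
        return "EXPIRED"
    if remaining <= warn_days:
        return "WARN"  # hypothetical threshold; tune to the renewal process
    return "OK"
```

Run daily against every critical endpoint, with WARN and EXPIRED results routed to a shared mailbox or on-call rota rather than a named individual, a check like this avoids depending on one person acting on a single expiry notice.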