Communications System (811) Outage – Incident Report

Incident Number: 20202104 

Incident Type: NETWORK: Saturation of the CenturyLink IP backbone slowed VPN connections, causing phone and internet failure.

Description of Incident:
  • At 8:15 AM MST on April 21st, 2020, the Colorado 811 IT Department notified Administrative staff of intermittent outages related to our FortiClient VPN. At that time, contact center agents were unable to take phone calls to 811 and experienced one-way audio. Communications Administrator posted outage notifications on social media shortly after. 
  • At 8:18 AM MST, Systems Administrator created ticket 18572578 on the service provider portal. Trouble ticket stated that Colorado 811 employees were unable to stay connected to their VPN dialer and subsequently unable to take VoIP phone calls. 
  • At 8:27 AM MST, Systems Administrator escalated ticket 18572578 from TCAM to Level 1 and called the service provider management support line for further assistance. 
  • At 9:11 AM MST, the CO811 IT Department requested an update from the service provider, further stating that our employees were unable to connect to their VPN and unable to take VoIP phone calls.
  • At 9:33 AM MST, the IT Department escalated ticket 18572578 from Level 1 to Level 2. Service provider noted the contact for Level 3 escalation as Kirsten. IT staff continued to provide VPN logs and other information to service provider support via the customer portal.
  • At 9:52 AM MST, Director of Operations gave instruction to post service outage message on the Colorado 811 phone system. IT staff worked with Director of Member Relations to record an outage message specific to the problem. 
  • At 10:32 AM MST, service provider support responded to ticket 18572578, stating that they had tested our Managed Services with the CO811 Systems Administrator and had observed latency on the service provider’s managed router.
  • At 10:45 AM MST, Communications Administrator updated the Colorado 811 website with a pop-up directing excavators to our Online Services to process locate requests.
  • At 12:50 PM MST, the following update was posted to ticket 18572578 by the service provider: 
    • 04/21/2020 18:16:49 GMT – IP NOC reports that at the first fiber cut location, efforts to pump out the secondary manhole remain ongoing at this time. Once the manhole is fully accessible, teams will assess the amount of fiber slack. At the secondary fiber cut location approximately 1200 feet of slack in the fiber span was located and has been pulled into a nearby hand hold. Preparations to the fiber span are underway at this time. 
    • 04/21/2020 17:59:17 GMT – Field teams have pulled slack to a hand hold approximately 1200 feet away. The team is proofing conduits in the event new cables will need to be run.
    • 04/21/2020 16:58:13 GMT – IP NOC reports teams at the first cut site continue working to pull slack which is being slowed by having to pump out a manhole. Prep work is in progress on one side of the cable. Teams at the second cut site have identified the damage caused by construction excavation. A repair team is on site assessing the situation to begin repairs.
    • 04/21/2020 16:10:55 GMT – IP NOC reports that services are impacted by multiple fiber cuts that are impacting the CenturyLink network. Field Operations and all necessary repair personnel have arrived at the first fiber cut location and are currently working to pull in the slack on the fiber span. Approximately 3,000 feet of new fiber will be required to be pulled in.
    • 04/21/2020 16:07:33 GMT – IP NOC reports that services are impacted by multiple fiber cuts that are impacting the CenturyLink network. Field Operations and all necessary repair personnel have arrived at the first fiber cut location and are currently working to pull in the slack on the fiber span. Approximately 3,000 feet of fiber will be required to be pulled in.
    • 04/21/2020 15:34:02 GMT – On April 21 at 13:55 GMT, CenturyLink identified a service impact in multiple markets impacting IP services. As this network fault is impacting multiple customers, the event has increased visibility with CenturyLink leadership. As such, client trouble tickets associated with this fault have been automatically escalated to higher priority.
      The NOC is engaged and investigating in order to isolate the cause. Please be advised that updates for this event will be relayed at a minimum of hourly unless otherwise noted. The information conveyed hereafter is associated with live troubleshooting efforts, and details may evolve as the discovery process evolves through to service resolution, ticket closure, or post-incident review.
      Next update by: 2020-04-21 19:20 GMT
  • At 1:09 PM MST, service provider posted an update stating that they were pulling new fiber to the damaged area to address the issue.
  • At 2:30 PM MST, service provider stated that splicing was in progress at the second cut location and that some alarms were clearing. 
  • At 3:48 PM MST, service provider updated ticket 18572578 to reflect ongoing work on both fiber cut locations.
  • At 3:50 PM MST, Colorado 811 IT staff successfully routed calls to a back-up telephone number with the help of a second service provider. A recording at this number directed excavators to Colorado 811 online services.
  • At 5:37 PM MST, service provider stated that splicing of a 96-count cable in Milwaukee, WI had been completed, and that a permanent fix of that line, with a corresponding planned outage, would come at a later date.
  • At 7:45 PM MST, a member of IT staff manually rebooted a router associated with service provider’s managed services at the instruction of service provider support.
  • At 9:28 PM MST, Colorado 811 circuits continued to display a high amount of latency over the service provider’s managed services. A sub-ticket, 18572578-2, was opened to address the latency and assigned to the Managed Services team.
  • At 3:45 AM MST on 04/22/2020, the Colorado 811 IT Department determined that services had been restored to a level that could support VoIP phone traffic. Director of Information Technology gave the directive to switch phone services back to our primary service provider.
  • At 4:00 AM MST on 04/22/2020, phone system services were restored to our primary service provider and outage recordings were removed from the Colorado 811 phone system.
  • At 6:10 AM MST on 04/22/2020, Communications Administrator removed website outage pop-ups and updated social media to reflect the restoration of services.
  • At 11:08 AM MST on 04/23/2020, service provider sent Colorado 811 a Reason for Outage (RFO) document detailing the cause and summary of the outage that impacted managed services.
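
For reference, the elapsed times between key events in the timeline above can be worked out with a short script. This is a minimal sketch: the event labels are illustrative names chosen here, not terms used by Colorado 811, and the timestamps are treated as plain local times exactly as reported.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps copied from the timeline (local time as reported).
# The dictionary keys are hypothetical labels for this example only.
events = {
    "outage_reported": datetime.strptime("2020-04-21 08:15", FMT),
    "backup_number_live": datetime.strptime("2020-04-21 15:50", FMT),
    "primary_restored": datetime.strptime("2020-04-22 04:00", FMT),
}


def elapsed(start: str, end: str):
    """Return the time between two named events as a timedelta."""
    return events[end] - events[start]


time_to_backup = elapsed("outage_reported", "backup_number_live")
time_to_restore = elapsed("outage_reported", "primary_restored")

print(f"Outage reported to back-up number live: {time_to_backup}")
print(f"Outage reported to primary service restored: {time_to_restore}")
```

Under these assumptions, calls were rerouted to the back-up number 7 hours 35 minutes after the outage was first reported, and primary phone service was restored after 19 hours 45 minutes.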

Operational Impact

Internal Impact: 719 calls entered the system during the outage, 6 of which were answered. The longest call held for 3:51:44, and the average hold time was 1:38:02. A total of 4,195 tickets were processed, 15% below forecast. Outbound calls were made from cell phones to return all emergency queue calls that held.

External Impact: Excavators were able to process tickets through all online platforms. Of the tickets processed, 84% (3,566) came through an online platform.
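
As a quick cross-check, the ratios behind the impact figures can be recomputed from the reported counts. This is a sketch only: the variable names are illustrative, and the forecast is back-calculated on the assumption that "15% below forecast" means tickets processed = forecast × 0.85. By this arithmetic, 3,566 of 4,195 works out to roughly 85%, slightly above the 84% quoted, which may reflect rounding or a different denominator.

```python
# Counts as reported in the Operational Impact section.
calls_entered = 719
calls_answered = 6
tickets_processed = 4195
tickets_online = 3566

# Share of inbound calls that were answered during the outage.
answer_rate = calls_answered / calls_entered

# Share of processed tickets that came through an online platform.
online_share = tickets_online / tickets_processed

# Assumption: "15% below forecast" means processed = forecast * (1 - 0.15).
implied_forecast = tickets_processed / (1 - 0.15)

print(f"Call answer rate: {answer_rate:.1%}")
print(f"Online ticket share: {online_share:.1%}")
print(f"Implied ticket forecast: {implied_forecast:.0f}")
```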

Response – Cause and Resolution 

Incident Resolution Date: 04/22/2020 

Root Cause: Two concurrent fiber cuts, in Fort Worth, TX and Milwaukee, WI, caused extreme saturation of the CenturyLink IP backbone, which led clients to experience latency.
