
Reduce mean time to repair with AI-powered root cause analysis
As new service touchpoints are added to today’s networks, complexity continues to grow, particularly with the blending of next-generation and legacy network architectures. As a result, network operations teams are seeing a growing number of “alarm storms” generated by all these systems and services. These alarm storms not only extend the time needed to evaluate issues but also make it more challenging to balance the resources used to investigate issues and manage networks.
To maintain quality of experience (QoE) in the face of such dynamic and challenging situations, next-generation networks must be able to anticipate changing conditions and adapt immediately.
Enter the AI era
Recent advancements in artificial intelligence (AI) and machine learning (ML) enable a new, more agile approach to managing and optimizing networks, taming network complexity, and helping to maintain service assurance. With the ability to leverage both conventional AI and generative AI, network operations teams can increase reliance on intelligent automation, streamlining network management and troubleshooting.
Network operators must optimize network operations for improved reliability and performance while simultaneously improving efficiencies and reducing costs, all at a time when the competitive landscape is exceptionally challenging. These costs include service level agreement (SLA) fines, OpEx spent on investigating issues, customer churn and, more indirectly, reputational damage. As such, AI-driven, intelligent automation processes will form the basis of next-generation network operations, minimizing manual and repetitive operations, predicting network performance, and speeding time to resolution.
A perfect storm
Root cause analysis (RCA) troubleshooting can be difficult even when trying to identify the source of only one or two events. The amount of network data grows proportionally as networks densify and services multiply, with data sets reaching a petabyte or more in size. Massive alarm storms build up until the sheer scale becomes overwhelming, with multiple monitoring points generating thousands of alarms for events of varying type and severity across the network.
Network performance and profitability can be drastically affected when network operations teams experience a rapid escalation in the number of alarm storms. When thousands of alarms or more are generated within a short period of time, investigating issues takes longer, and it becomes more difficult to balance the resources needed for troubleshooting and network management.
In addition, as the number of incidents rises and the average time to resolution increases, alarm storms can incur millions of dollars in direct costs, due to SLA fines and OpEx spent on investigating issues. These costs are further compounded by the high cost of customer churn resulting from poor customer satisfaction, as well as profits lost due to brand reputation damage from unreliable service.
As a result of these growing alarm storms, tracing the root cause of an incident can now take hundreds of staff hours. Indeed, it is no longer feasible for network operations teams to manually isolate and pinpoint issues in multi-vendor, multi-generational, and multi-domain networks. Managing and troubleshooting these complex next-generation networks requires significantly greater adoption of powerful AI and ML processes across the full breadth of network operations.
Calming the storm
The relationship between the network and its network elements (NEs) is continually changing within the dynamic environment of today’s networks. However, legacy element management systems (EMS) or controllers use static rules defined by the equipment vendor or operator to categorize events. Updating these rules and event policies requires time-consuming manual intervention, an approach that is no longer sufficient to manage NEs in multi-vendor networks. Furthermore, useful analytical data can be removed when network operations teams perform alarm filtering and correlation, leading to misdiagnosis and increasing mean time to repair (MTTR).
As the use of network automation increases, an AI-powered classification system based on neural network modeling can quickly correlate event information to find and map relationships within network fault data. Once a fault has been classified, operations teams can then leverage GenAI tools with relevant knowledge databases to determine how to resolve the issue.
This type of accelerated RCA allows network engineers to identify and generate corrective configurations and procedures based on the most relevant data sources, such as past failure information and service reports. This capability is particularly valuable in disaggregated networks where the relationships among NEs and devices are highly complex.
With the combined power of neural networks and large language model (LLM) technologies, faults can be rapidly located and classified using predictive intelligence, reducing time to resolution and saving operating expenses. Furthermore, as operations teams rate the accuracy of predictions, remediation actions and configurations are then fed back into a continuous learning model to enable improved accuracy over time for greater network reliability and performance.
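To make the idea of correlating alarms into related groups more concrete, the following is a minimal sketch in Python. It is not the neural network approach described in this article. Instead, it substitutes a much simpler rule: alarms are grouped when they are raised on the same or directly linked nodes within a short time window. The `Alarm` fields, the node names, the `links` topology, and the 60-second window are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    node: str         # network element that raised the alarm (hypothetical name)
    timestamp: float  # seconds since an arbitrary starting point
    kind: str         # alarm type label (hypothetical)


def group_alarms(alarms, links, window=60.0):
    """Group alarms on the same or directly linked nodes within `window` seconds.

    A simple rule-based stand-in for learned correlation, using union-find.
    """
    adjacent = {frozenset(link) for link in links}
    parent = list(range(len(alarms)))

    def find(i):
        # Follow parent pointers to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, a in enumerate(alarms):
        for j in range(i + 1, len(alarms)):
            b = alarms[j]
            related = a.node == b.node or frozenset((a.node, b.node)) in adjacent
            if related and abs(a.timestamp - b.timestamp) <= window:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i, alarm in enumerate(alarms):
        groups.setdefault(find(i), []).append(alarm)
    return list(groups.values())


# Hypothetical alarm storm: three related alarms and one unrelated alarm.
alarms = [
    Alarm("roadm-1", 0.0, "LOS"),
    Alarm("roadm-2", 5.0, "LOS"),
    Alarm("router-9", 12.0, "LINK_DOWN"),
    Alarm("router-3", 4000.0, "HIGH_TEMP"),
]
links = [("roadm-1", "roadm-2"), ("roadm-2", "router-9")]

groups = group_alarms(alarms, links)
for group in groups:
    print([f"{a.node}:{a.kind}" for a in group])
```

In this toy data, the first three alarms collapse into one group because they occur close together on connected nodes, while the late alarm on an unconnected node stays on its own. A static rule like this is exactly the kind of logic that has to be maintained by hand, which is the limitation a learned classification model is meant to address.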
Real-world RCA
A key aspect of this highly accurate and fast RCA is the ability of the neural network to understand the relationship between nodes and build groups of like behavior. To that end, we tested the Fujitsu Virtuora® AX Accelerated RCA (aRCA) application with real-world network data to determine how well the app could automatically discover the relationship between behavior groups and distinguish between noise and meaningful alarm groups.
In this specific example, the Virtuora AX aRCA app evaluated 6,060 raised alarms and successfully classified them into 47 unique alarms, resulting in an alarm suppression rate greater than 99 percent and a 94 percent success rate in identifying events. Additionally, a plain language recommendation system was able to provide the correct root cause among the top five recommendations in 41 of the 47 accurately identified events for an 87 percent accuracy rate in the initial run. This accuracy rate will continue to improve over time with more data and root cause confirmation from network operations teams.
In this example, the Fujitsu Virtuora AX aRCA application detected an event on the first day and identified the root cause of the disruption within minutes. In contrast, an alternative event tool did not recognize the outage until 29 hours later. Because that tool identified no root cause from the previous day’s events, the problem was ignored. AI-powered RCA thus reduced root cause analysis time from days to minutes, significantly speeding MTTR.
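The alarm figures reported for this test can be checked with simple arithmetic. The short Python sketch below uses only the counts given above. It assumes that the suppression rate is the share of raised alarms that did not remain as unique alarms, which is an interpretation and not a definition stated in this article. The 94 percent event identification rate is not recomputed, because the underlying event count is not given.

```python
# Counts reported for the test described above.
raised_alarms = 6060   # alarms raised in the data set
unique_alarms = 47     # unique alarms after classification
top5_hits = 41         # cases with the correct root cause in the top five

# Assumed definition: fraction of raised alarms that were suppressed.
suppression_rate = 1 - unique_alarms / raised_alarms

# Share of the 47 cases with the correct root cause among the top five.
top5_accuracy = top5_hits / unique_alarms

print(f"Alarm suppression rate: {suppression_rate:.2%}")       # 99.22%
print(f"Top-five root cause accuracy: {top5_accuracy:.1%}")    # 87.2%
```

Under that assumption, both results agree with the reported values of greater than 99 percent suppression and 87 percent accuracy.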
Welcome to tomorrow
Network operations managers face mounting pressures and challenges due to demanding SLAs, strict network and service key performance indicators (KPIs), increasing multi-vendor network complexity, dynamic service requirements, and burgeoning data sets. Engineers can no longer afford to take a reactive approach to network conditions, faults, and maintenance. Powerful automation tools are needed to help operations teams make decisions rapidly and accurately, allowing staff to focus on more strategic tasks. Intelligent automation and AI/ML can be invaluable for predicting network performance and driving operational efficiencies in complex problem resolution, helping to streamline and simplify network operations. With these efficiency gains, network operators can increase performance and reliability to prevent network outages, while reducing operating costs and maintaining QoE for an outstanding customer experience and the best possible return on investment.