The telecom industry is in the midst of a massive shift to new service offerings enabled by 5G and edge computing technologies. With this digital transformation, networks and network services are becoming increasingly complex: RAN, Core and Transport are only a few of the network’s many layers and integrated components. Today’s telecom engineers are expected to handle, manage, optimize, monitor and troubleshoot multi-technology and multi-vendor networks. The biggest challenge is balancing the innovation that pushes for new technologies, layers and nodes with the need to provide robust, high quality products and services 24/7, 365 days a year.
For telecoms (communication service providers, or CSPs) and other verticals running extremely complex systems, fully autonomous monitoring technologies are the holy grail. As monitoring and alerting platforms mature, there is a growing expectation that they will move from anomaly detection to full remediation without a human in the loop. This is not run-of-the-mill industry buzz. Over the last five years, telecom network monitoring has evolved to the point where autonomous remediation (aka “the action phase”) is the logical next step, and it is likely to become a dominant feature for leading CSPs. But to get there, robust machine learning capabilities are key.
Scale, accuracy, speed
Machine learning is already making a difference in the network monitoring space. To ensure availability and reliability and deliver more business value, CSPs need to stay on top of hundreds of metrics. But with the ongoing growth in operational complexity, effectively managing and monitoring connections, devices, radio networks, current and legacy core networks, services, and transport and IT operations is becoming a formidable challenge. Static network monitoring gives rise to billions of alarms with a very high rate of false positives, since it’s based on manual thresholding for a system that is too complex and volatile to adhere to predetermined states. Worse, static monitoring leads to late detection of service degradation and incidents. Even after detection, which often occurs only once the incident has already impacted customers, there is no context to go on for expedited resolution.
Compared to manual, dashboard-based monitoring systems, ML delivers unprecedented scale, accuracy and speed. It gives today’s telecom engineers a practical way to handle, manage, optimize, monitor and troubleshoot multi-technology and multi-vendor networks. It also lets CSPs move from reactive problem solving to proactive monitoring, learning what is happening across their networks before minor issues escalate into bigger problems.
In the network operations context, every network generates millions of time series that measure all aspects of its behavior. Anomalies in these series can signal or cause service degradations and system-wide outages or incidents. Discovering these anomalies and identifying the technical root cause in order to fix incidents is therefore a key objective of network operations. Autonomous anomaly detection minimizes the time spent looking for issues, leaving more time to focus on resolution.
From detection to remediation
AI enables the transformation of traditional network and service operations towards automation and intelligent operations through three crucial steps that can only be achieved by applying cutting edge machine learning: anomaly detection, correlations and root cause analysis, and, finally – remediation.
Anomaly detection. In the first stage, ML enables real time monitoring of 100% of the network data from connections, devices, radio networks, current and legacy core networks, services, transport, IT operations and any other source. Leading monitoring platforms feature fully autonomous baselining that also accounts for different seasonalities and constantly and optimally adapts to change. By monitoring the full scope of data using adaptable algorithms that take seasonality, trends and other behavioral variabilities into account, anomalies are detected faster and false alarms are reduced to a minimum.
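To make the idea of seasonality-aware baselining concrete, here is a minimal sketch in Python. It is not how any particular monitoring platform works; it simply assumes a metric with a fixed, known seasonal period and compares each new value with the median of earlier values from the same point in the cycle. The function name, the threshold of 5 and the sample traffic figures are all illustrative assumptions.

```python
from statistics import median


def seasonal_anomalies(values, period, threshold=5.0, min_history=3):
    """Return indices whose value deviates strongly from its seasonal baseline.

    The baseline for slot (i % period) is the median of earlier values seen in
    that slot; the spread is the median absolute deviation (MAD).
    """
    history = {}      # seasonal slot -> earlier values observed in that slot
    anomalies = []
    for i, v in enumerate(values):
        slot = history.setdefault(i % period, [])
        if len(slot) >= min_history:
            base = median(slot)
            mad = median(abs(x - base) for x in slot)
            # 1.4826 rescales MAD to roughly one standard deviation for
            # normally distributed data; the floor avoids division by zero.
            scale = max(1.4826 * mad, 1e-9)
            if abs(v - base) / scale > threshold:
                anomalies.append(i)
                continue  # keep the anomaly out of the learned baseline
        slot.append(v)
    return anomalies


# Hypothetical metric with a 4-step daily cycle and small natural variation.
pattern = [10, 20, 30, 20]
traffic = [pattern[i % 4] + (i // 4) % 3 - 1 for i in range(24)]
traffic[18] = 60  # injected incident during the usual peak

print(seasonal_anomalies(traffic, period=4))           # [18]
print([i for i, v in enumerate(traffic) if v > 25])    # static threshold
```

On this toy data, the static threshold of 25 fires on every daily peak (six alarms, five of them false), while the seasonal baseline flags only the injected spike. A production system would also need to learn the period itself and cope with trends and multiple overlapping seasonalities, which this sketch does not attempt.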
Correlations and root cause analysis. One of ML’s superpowers is its ability to correlate across billions of metrics. When such a technology is unleashed on data that has been freed from its silos, it autonomously correlates related events and glitches across multi-technology (3G/4G/5G) and multi-vendor networks. These correlations provide the full context of what is happening, enabling teams to swiftly get to the root cause of every issue for the fastest possible remediation.
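As a rough illustration of what correlating events across domains can look like, the sketch below groups anomalies from different network layers into a single incident when they start close together in time. This is a deliberate simplification: it uses time proximity only, whereas real platforms would typically also draw on topology and metric similarity. The `Anomaly` fields, the metric names and the five-minute window are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anomaly:
    metric: str   # hypothetical metric name
    domain: str   # e.g. "RAN", "Core", "Transport"
    start: int    # start time in seconds


def group_into_incidents(anomalies, max_gap=300):
    """Group anomalies whose start times chain together within max_gap seconds."""
    incidents = []
    for a in sorted(anomalies, key=lambda a: a.start):
        if incidents and a.start - incidents[-1][-1].start <= max_gap:
            incidents[-1].append(a)   # close enough to the previous anomaly
        else:
            incidents.append([a])     # gap too large: start a new incident
    return incidents


events = [
    Anomaly("cell_drop_rate", "RAN", 200),
    Anomaly("link_packet_loss", "Transport", 0),
    Anomaly("session_setup_failures", "Core", 120),
    Anomaly("cell_drop_rate", "RAN", 5000),
]

for incident in group_into_incidents(events):
    print([f"{a.domain}:{a.metric}" for a in incident])
```

Within each group the anomalies are ordered by start time, so the first entry (here the transport packet loss) is a natural place to begin a root cause investigation, though being earliest does not prove causation.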
Remediation. By autonomously pinpointing network anomalies and mapping the relations between them, ML-based monitoring is paving the way for autonomous remediation. These automated, closed-loop processes are often discussed under the banner of automated ITSM (IT service management) or “self-driving ITOM” (IT operations management). Currently, they can be observed in low-level tasks, such as scripts that “bounce the server” or “open a ticket”, and these automation scripts still require a human in the loop. However, the technological roadmap is leading towards automation rule mapping and a fully automated ML remediation engine. In this scenario, the ML-based system will go through phases 1 and 2 (anomaly detection, then correlations and root cause analysis), recommend an action based on previous incidents, execute the action through the remediation engine, and fine-tune its operations through a closed feedback loop, steadily improving its responses.
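The recommend, execute and learn cycle described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not a description of any vendor’s remediation engine: the class name, the action names and the smoothed success-rate scoring are all hypothetical, and the optional `approve` callback stands in for today’s human in the loop.

```python
from collections import defaultdict


class RemediationEngine:
    """Toy closed loop: recommend an action, execute it, learn from the result."""

    def __init__(self, actions, approve=None):
        self.actions = actions   # action name -> callable(incident) -> resolved?
        self.approve = approve   # optional human-in-the-loop callback
        # (root_cause, action) -> [successes, attempts] from previous incidents
        self.stats = defaultdict(lambda: [0, 0])

    def recommend(self, root_cause):
        def score(name):
            ok, tried = self.stats[(root_cause, name)]
            return (ok + 1) / (tried + 2)   # smoothed historical success rate
        return max(self.actions, key=score)

    def handle(self, root_cause, incident):
        action = self.recommend(root_cause)
        if self.approve is not None and not self.approve(root_cause, action):
            return action, None             # operator declined; nothing executed
        resolved = bool(self.actions[action](incident))
        record = self.stats[(root_cause, action)]
        record[0] += resolved               # feedback: did the action work?
        record[1] += 1
        return action, resolved


# Hypothetical actions; real ones would call automation scripts or APIs.
engine = RemediationEngine({
    "restart_service": lambda incident: False,   # pretend this never helps
    "open_ticket": lambda incident: True,
})

print(engine.handle("link_packet_loss", {"id": 1}))
print(engine.handle("link_packet_loss", {"id": 2}))
```

The first call tries `restart_service`, which fails; the feedback lowers its score, so the second call recommends `open_ticket` instead. Passing an `approve` function keeps an operator in control of execution, and removing it corresponds to the fully autonomous scenario the roadmap points towards.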
Moving forward
Only these three ML-based monitoring tiers can provide CSPs with robust anomaly detection and remediation that ensures reliability, availability and a seamless customer experience. The “action” phase of monitoring is still in its infancy and is lacking in most solutions. However, since this is the direction the domain is heading, it’s a good idea to ask vendors where they stand on automated actions. Autonomous remediation is predicted to become a dominant feature for leading platforms; in the meantime, it’s crucial to verify that the platform is ML-based and can effectively communicate granular data and insights both to IT stakeholders and to the other IT systems that can be used in the remediation phase.