According to recent Capgemini research, fewer than half (48%) of consumers feel that the connectivity services they have today adequately meet their remote needs. Still, many CSPs openly admit that they often hear about service issues via social media and sites like Downdetector. And with fixed/mobile convergence, a negative home broadband experience now has the potential to cause churn among CSPs’ mobile customers too. Customer experience is a critical factor in building and maintaining a competitive advantage: customer experience and network infrastructure teams need to know immediately when customer experience starts to degrade, and why.
To solve this, CSPs are constantly upgrading their infrastructure: building up their core, pulling more fiber, upgrading IT infrastructure, and introducing new technologies like Fixed Wireless Access (FWA) as a back-up for when their fixed network is down. But this is just part of the solution. Maintaining good quality of service requires CSPs to have full visibility into what is happening across their network and to monitor critical network and customer experience KPIs in real time. Only by autonomously monitoring the network at scale can CSPs achieve early visibility of potential issues before they result in service failure and escalate into brand-damaging incidents.
Monitoring fault alarms vs. monitoring performance data
While CSPs already have many monitoring solutions in place, most are focused on monitoring events and alarms, whether created manually or automatically, and not on performance data. That’s because, historically, real-time monitoring and analysis of performance KPIs has been humanly impossible given the sheer number of metrics and KPIs involved. But fault management is a very noisy environment, in which it’s exceedingly difficult to prioritize which alerts to deal with first. Fault alarms, by nature, don’t necessarily point to issues impacting the customer experience. Performance monitoring, on the other hand, mirrors the customer experience, and using correlations enables NOC teams to tie adverse experiences back to their root cause. So while performance monitoring empowers early, proactive resolution, fault alarms are by nature reactive: NOC/SOC teams are alerted to incidents that have already happened, and then work to understand and later solve the related problems. Aiming to improve performance and customer experience based on fault alarms alone is futile: customers will always experience the outage or service degradation before the CSP has a chance to remediate.
There is, however, a way to get to the incidents before customers do. The advent of next-generation AI/ML platforms that are fully autonomous and can scale to billions of metrics in real time now enables autonomous correlation of performance KPIs across the different network and business domains (application performance, network performance, etc.). By harnessing performance data that is constantly generated by the network and monitoring it at a granular level, CSPs can detect incidents earlier, sometimes hours or even days before a fault alarm is generated for the same incident. That is because, very often, service degradation leaves no trace in fault alarm systems but is readily apparent in performance data.
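To make the idea of correlating anomalies across domains more concrete, here is a minimal Python sketch. It is not the algorithm used by any particular platform: it simply assumes that anomalies have already been detected per metric, and it groups anomalies that occur close together in time into a single candidate incident. The function name `group_anomalies`, the `max_gap` parameter, and the metric names are all hypothetical, chosen for illustration only.

```python
def group_anomalies(anomalies, max_gap=300):
    """Group (timestamp_seconds, metric_name) anomalies into candidate incidents.

    Two anomalies land in the same incident when the later one starts no more
    than `max_gap` seconds after the most recent anomaly already in that
    incident. Time proximity is a crude stand-in for real correlation.
    """
    incidents = []
    for ts, metric in sorted(anomalies):
        if incidents and ts - incidents[-1]["end"] <= max_gap:
            # Close enough in time: extend the current incident.
            incidents[-1]["end"] = ts
            incidents[-1]["metrics"].append(metric)
        else:
            # Too far from the previous anomaly: start a new incident.
            incidents.append({"start": ts, "end": ts, "metrics": [metric]})
    return incidents


# Hypothetical anomalies from different domains (timestamps in seconds).
anomalies = [
    (1120, "http.downlink_throughput"),
    (1000, "dns.query_latency"),
    (1200, "app.session_failures"),
    (9000, "core.cpu_load"),
]

incidents = group_anomalies(anomalies)
for incident in incidents:
    print(incident["start"], incident["end"], incident["metrics"])
```

With this sample input, the three anomalies that fall within a few minutes of each other are reported as one incident spanning DNS, HTTP, and application metrics, while the much later CPU anomaly is kept separate. A production system would need far more than time proximity (topology, seasonality, metric similarity), so treat this purely as an illustration of the grouping step.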
A case in point: a leading CSP recently experienced emerging issues caused by DNS failures. Such failures can’t be detected using fault alarms, since the DNS itself wasn’t faulted, and most of these severe cases don’t originate with faulty equipment. However, the failure was detected early through the monitoring of performance and telemetry data, including uplink and downlink traffic (TCP, HTTP, IP Charged, etc.) across DNS, provider gateways, applications, social media, websites, mobile core elements, and DNS queries.
In another case, a large CSP was struggling with rerouting incidents that weren’t showing up as alerts. When calls are rerouted to a different location, no fault alarm is generated if outbound calls to a certain location cannot be made. Customers will experience a service problem, but the NOC team will be in the dark until customer service calls start coming in. Performance data, however, will quickly expose the problem: a decrease in calls to the problematic location will immediately flag it. Correlating this drop with other changes in usage data enables teams not only to learn about the problem sooner but also to quickly understand its root cause, leading to faster remediation. By relying on real-time, granular performance data, NOC teams are able to dramatically cut time to detection and remediation, going from reactive to proactive monitoring and resolution.
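The call-volume drop in this example can be illustrated with a small Python sketch. This is an assumption-laden toy, not the CSP’s or any vendor’s actual method: it compares each new sample against a rolling baseline of recent normal samples and flags values that fall far below it. The function name `detect_drops`, its parameters, and the sample data are hypothetical.

```python
from collections import deque
from statistics import mean, stdev


def detect_drops(counts, window=6, threshold=3.0, min_sigma=1.0):
    """Return indices of samples that fall far below a rolling baseline.

    A sample is flagged when it is more than `threshold` standard deviations
    below the mean of the last `window` non-flagged samples. `min_sigma`
    avoids a zero-width band when the baseline is perfectly flat.
    """
    baseline = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(counts):
        if len(baseline) == window:
            mu = mean(baseline)
            sigma = max(stdev(baseline), min_sigma)
            if value < mu - threshold * sigma:
                alerts.append(i)
                continue  # keep anomalous samples out of the baseline
        baseline.append(value)
    return alerts


# Hypothetical per-interval counts of completed calls to one location.
calls_to_location = [120, 118, 125, 122, 119, 121, 123, 40, 35]
print(detect_drops(calls_to_location))  # -> [7, 8]
```

Flagged samples are deliberately excluded from the baseline so that an ongoing outage does not get absorbed into what counts as normal. Real call traffic is strongly seasonal (time of day, day of week), so a fixed rolling window like this one would need to be replaced by a seasonality-aware baseline in practice.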
In other words, fault monitoring does not monitor customer experience or performance, but rather the technology or infrastructure itself. Therefore, it does not provide any information about how an issue is actually impacting the end-user service experience, which is critical for providing the service level that customers now expect. Such timely information can only be derived by monitoring performance and telemetry data, where customer impact manifests itself directly and in real time. Since service impact is often not caused by faulty equipment, problems that are apparent in performance analytics will either not show up as faults at all or will manifest as faults only later.
Proactive monitoring with Anodot
Anodot’s autonomous AI-based network monitoring platform is the brain on top of the OSS tools and network components. Anodot ingests data from 100% of the network’s data sources and monitors granular performance and telemetry data in real time. Anodot’s patented algorithms correlate across network layers and types, over billions of data points, to provide early detection of service degradation across the entire telco ecosystem. Stakeholders receive Anodot’s alerts in real time, together with the relevant anomaly and event correlation, for the fastest root cause detection and resolution.
CSPs use Anodot to proactively monitor service experience in their networks, prevent and mitigate outages and service degradation, save costs and drive operational efficiencies, provide a better customer experience, and learn more about what is happening across their networks, so that they can deliver stellar service and create a definitive competitive edge.