Organizations track KPIs and metrics from all aspects of their business, often from millions or even billions of distinct sources. Data analytics is used to make sense of all this collected data and to draw conclusions about what is happening within the systems being measured. Correlation analysis is a key function within data analytics.
What is Correlation Analysis?
Correlation analysis is the process of discovering the relationships among data metrics by looking at patterns in the data. Finding relationships between disparate events and patterns can reveal a common thread, an underlying cause of occurrences that, on a surface level, may appear unrelated and unexplainable.
A high correlation points to a strong relationship between the two metrics, while a low correlation means that the metrics are weakly related. A positive correlation result means that both metrics increase in relation to each other, while a negative correlation means that as one metric increases, the other decreases.
Put simply, correlation analysis measures how closely changes in one variable track changes in another. When two metrics are highly and positively correlated, and one of them increases, you can expect the other one to increase as well; with a strong negative correlation, you would expect it to decrease.
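To make this concrete, here is a minimal sketch of computing a Pearson correlation coefficient in Python with NumPy, using made-up values for two hypothetical metrics:

```python
import numpy as np

# Hypothetical daily metrics: page views and sales over the same ten days
page_views = np.array([120, 150, 170, 200, 210, 250, 270, 300, 320, 350])
sales      = np.array([ 10,  14,  15,  19,  20,  24,  25,  29,  31,  34])

# Pearson correlation coefficient: ranges from -1 to +1
r = np.corrcoef(page_views, sales)[0, 1]
print(f"correlation: {r:.3f}")  # close to +1 -> strong positive relationship
```

A result near +1 or -1 indicates a strong relationship; a result near 0 indicates a weak one.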
Why Is Correlation Analysis Important?
Just as you wouldn’t evaluate a person’s behavior in a vacuum, you shouldn’t analyze metric performance in isolation. How metrics influence and relate to one another is incredibly important to data analytics, and has many useful applications in business. For example:
Marketing professionals use correlation analysis to evaluate the efficiency of a campaign by monitoring and testing customers’ reactions to different marketing tactics. In this way, they can better understand and serve their customers.
Financial planners assess the correlation of an individual stock to an index such as the S&P 500 to determine if adding the stock to an investment portfolio might decrease the unsystematic risk of the portfolio.
Technical support teams can reduce alert fatigue by filtering irrelevant anomalies (based on the correlation) and grouping correlated anomalies into a single alert. Alert fatigue is a pain point many organizations face today – getting hundreds, even thousands of separate alerts from multiple systems, when many of them stem from the same incident.
For data scientists and those tasked with monitoring data, correlation analysis is incredibly valuable when used for root cause analysis, subsequently reducing time to detection (TTD) and time to remediation (TTR). Two unusual events or anomalies happening at the same time or rate can help to pinpoint an underlying cause of a problem. The organization will incur a lower cost from a problem if it can be understood and fixed sooner rather than later.
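As a toy illustration of this last point, the sketch below pairs up anomalies on two metrics that occur close together in time. The timestamps and the 60-second window are made-up values for illustration:

```python
# Hypothetical anomaly timestamps (epoch seconds) on two separate metrics
latency_anomalies = [1000, 5000, 9000]
error_anomalies   = [1010, 7200, 9020]

# Treat anomalies that fall within a 60-second window as co-occurring
WINDOW = 60
pairs = [(a, b)
         for a in latency_anomalies
         for b in error_anomalies
         if abs(a - b) <= WINDOW]
print(pairs)  # [(1000, 1010), (9000, 9020)] -> candidates for a shared root cause
```

Anomalies that consistently co-occur like this are natural starting points for a root cause investigation.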
How Does Correlation Analysis Relate to Business Monitoring?
Business monitoring is the process of collecting, monitoring, and analyzing data from business functions to gauge performance and to support decision making. Anomaly detection is a supplementary process for identifying when a business process is experiencing an unexpected change.
As organizations become more data-driven, they find themselves unable to scale their analytics capabilities without the help of automation. When an organization has thousands of metrics (or more), analyzing individual metrics can obscure key insights.
A faster method is to use machine learning-based correlation analysis in order to group related metrics together. In this way, when a metric becomes anomalous, all the related events and metrics that are also anomalous are grouped together in a single incident, saving teams from searching through dashboards to find these relationships themselves.
Let’s say that an eCommerce company has an unexpected drop in product sales. Using correlation analysis, the company sees the sales drop is tied to a spike in payment errors with PayPal. The fact that these two clearly related anomalies happened simultaneously is a good indication to start investigating the PayPal API.
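A minimal sketch of this grouping idea, assuming synthetic metric streams and an illustrative correlation threshold of 0.8 (the metric names mirror the eCommerce example; a production system would use more robust similarity measures and clustering):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
t = np.arange(500)
signal = np.sin(2 * np.pi * t / 50)           # shared underlying pattern

# Synthetic metric streams (rows = time, columns = metrics)
metrics = pd.DataFrame({
    "sales":          signal + rng.normal(scale=0.3, size=500),
    "paypal_errors": -signal + rng.normal(scale=0.3, size=500),
    "cpu_load":       rng.normal(size=500),   # unrelated metric
})

corr = metrics.corr()                          # pairwise Pearson correlations

# Greedily group metrics whose |correlation| clears the threshold
THRESHOLD = 0.8
groups, seen = [], set()
for name in corr.columns:
    if name in seen:
        continue
    related = [m for m in corr.columns if abs(corr.loc[name, m]) >= THRESHOLD]
    seen.update(related)
    groups.append(related)

print(groups)  # e.g. [['sales', 'paypal_errors'], ['cpu_load']]
```

Here sales and paypal_errors move together (inversely) and land in one group, while the unrelated cpu_load stays on its own, so a single incident can carry both related anomalies.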
Considerations and Challenges of Using Correlation Analysis
Correlation is not the same as causation. It’s possible that two events are correlated but neither one is the cause of the other. Suppose you are driving a car and the engine temperature warning light comes on and you hear a strange noise in the engine.
The anomalies are related but what is the root cause? The noise isn’t the cause and neither is the overheating. They’re just symptoms of an underlying problem that can point to the cause. A mechanic might look at those symptoms occurring together and suspect an oil leak as the cause of the problem.
As this example illustrates, even in day to day life, we resort to correlations, finding commonalities and relationships between symptoms so we can find the root cause. In business monitoring, coupling anomaly detection with automated correlation analysis can help get to the root cause of incidents, but there are challenges in implementing and training such systems.
One challenge is that an incident and its symptoms may manifest in different areas of the business that operate in silos. One side of the business may have no visibility into what is affected elsewhere in the company. But correlating the events is critical for root cause analysis. For example, the roaming customers of a Tier 1 telco in Southeast Asia were using far less data than usual. This anomaly was correlated with an increase in DNS failures on the network.
The issue with the DNS server prevented some roaming customers from connecting to the telco’s network. The relationship between the two metrics is not an obvious one, since the DNS metric is measured in a completely different area of the network. Without correlating the two, the telco’s Network Operations Center would have had a hard time understanding that the DNS failures were causing the roaming incident, prolonging customers’ connection issues while traveling.
A second challenge is the ability to analyze millions and billions of metrics across the business. One technique used to scale up correlation analysis is Locality Sensitive Hashing (LSH), an algorithmic method of hashing similar input items into the same buckets with high probability. It speeds up clustering and “nearest neighbor” search techniques in machine learning. LSH is often used in image and video search and in other areas where there is a need to search across massive amounts of data.
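A minimal sketch of one common formulation, random-hyperplane LSH for cosine similarity (the vector size, plane count, and noise level are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

def bucket_keys(vectors, n_planes=16):
    """Random-hyperplane LSH: vectors pointing in similar directions
    (high cosine similarity) land in the same bucket with high probability."""
    planes = rng.normal(size=(n_planes, vectors.shape[1]))  # random hyperplanes
    bits = (vectors @ planes.T) > 0     # which side of each hyperplane
    weights = 1 << np.arange(n_planes)  # pack the sign bits into an integer key
    return bits @ weights

# Three 64-dimensional "metric fingerprints": a and b are near-duplicates
a = rng.normal(size=64)
b = a + rng.normal(scale=0.02, size=64)
c = rng.normal(size=64)

print(bucket_keys(np.stack([a, b, c])))  # a and b typically share a key; c differs
```

Because only items that collide in the same bucket need to be compared directly, this sidesteps the all-pairs comparison that makes naive correlation analysis intractable at scale.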
A third challenge is to keep from correlating metrics that aren’t actually related. These are known as spurious correlations. Common techniques for correlation analysis produce many spurious correlations and should be avoided for purposes of root cause investigation. For instance, suppose a gaming company has multiple games in the market. Their performance metrics may at times look similar, especially as gamers tend to play at the same times. Linear correlation would flag the games as strongly related, but an incident in one game is often unrelated to an incident in the other.
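This failure mode is easy to reproduce: two completely independent trending series will often show a large linear correlation on their raw values, which disappears once the trend is removed. A sketch with synthetic random walks:

```python
import numpy as np

rng = np.random.default_rng(7)

# Two independent random walks, e.g. daily active players of two games
game_a = rng.normal(size=1000).cumsum()
game_b = rng.normal(size=1000).cumsum()

# Correlation of the raw (trending) series is often large -- spuriously
print(f"raw:         {np.corrcoef(game_a, game_b)[0, 1]:+.2f}")

# Correlation of the day-over-day changes is near zero -- truly unrelated
print(f"differenced: {np.corrcoef(np.diff(game_a), np.diff(game_b))[0, 1]:+.2f}")
```

Working with changes (or residuals after removing trend and seasonality) rather than raw values is one standard guard against spurious correlations.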
Identifying the relationships among data metrics has many practical applications in business monitoring. Correlation analysis can help reveal the root cause of a problem and vastly reduce the time to remediation. And by using these relationships to group related anomalies and events, teams grapple with fewer false positives and can address incidents faster.