Anodot Resources Page 27


Blog Post 5 min read

Integrating Time Series Correlation to Accelerate Root Cause Analysis

In any platform of sufficient complexity, multiple anomalies are likely to occur. In many organizations, NOC operators triage anomalies based on their severity. There are internal, non-customer-facing issues that might affect only a small part of your workforce, and one-time issues that affect only a handful of customers. Both types of issues get ticketed and sent to low-level support. Meanwhile, systemic issues that affect a large number of users and customers are handled directly by developers. Any issue that affects a large number of customers – causing dissatisfaction, preventing them from completing orders, or blocking your service entirely – should be treated urgently. We cite this figure often, but when downtime costs $5,600 per minute, seconds matter in the struggle against unplanned outages.

In the rush to solve bigger and more costly outages, developers and engineers will most likely ignore the issues being handled by frontline support, and in most cases they're right to do so. In some instances, however, seemingly unrelated issues are linked to broad outages, and studying them leads to a faster resolution. It's nearly impossible for a human observer to say which of these smaller anomalies warns of a broader outage. Only an AI that monitors thousands of metrics at once can say definitively which pairs of anomalies are actually related.

Correlating Anomalies: How Does it Work?

To give you a sense of how valuable it is to correlate anomalies, let's look at an example from our own archives. Anodot regularly collects about 230 million metrics every week. Out of those, we find 260,000 anomalies. After correlation, however, only about 4,700 grouped anomalies remain – a reduction of roughly 98 percent in overall alerts, making it much easier to get to the bottom of each anomaly without being overwhelmed by issues competing for your attention.

Finding correlated anomalies requires scale, speed and accuracy. Only machine learning can correlate anomalies in time to produce useful results.

How do you identify which metrics are related, and which are not? Take a look at the metrics below. As you can see in the green graph, all except one of them look the same, which means they're probably related. These are patterns that people can recognize at a glance, but it's not easy to perform similar feats with machine learning. For example, many of the methods used to compare time series involve linear correlation. That method is sensitive to seasonality, however, which means you need to remove seasonality from the data before processing. Even when seasonality has been removed, linear correlation can falsely group unrelated metrics if the two time series happen to be similar for a short amount of time.

Pattern matching is a better solution. Using a dictionary of common patterns within your metrics, an algorithm can apply a pattern-matching solution to identify which metrics are related. You'll likely have to build your own pattern-matching engine, but this can be done using deep learning. It's important to note that this method only finds similarities in metrics while they're behaving normally. Metrics that experience anomalies at the same time might be related, but the best way to be sure is to find multiple instances of synchronized anomalies. We do this by transforming each metric into what's known as a 'binary sparse vector' and then hashing the results. When the hashes are compared, similar anomalies stand out.
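To make that last step concrete, here is a minimal sketch under simplifying assumptions: each metric's anomalies are reduced to a binary sparse vector over time buckets (represented here as a set of bucket indices), and metrics whose anomalies repeatedly co-occur are grouped by Jaccard similarity. The metric names, data and threshold are hypothetical, and a production system would likely hash the vectors (for example with MinHash) rather than compare every pair directly.

```python
# Minimal sketch: correlate metrics by when their anomalies occur.
# Assumption: each metric's anomalies are given as a set of time-bucket
# indices (e.g., 5-minute buckets). Names and threshold are illustrative.
from itertools import combinations

anomaly_buckets = {
    "checkout_errors":  {12, 13, 14, 97, 98, 210},
    "api_400_rate":     {12, 13, 14, 97, 99, 210},
    "newsletter_opens": {40, 88, 155},
}

def jaccard(a: set, b: set) -> float:
    """Similarity of two binary sparse vectors stored as sets of 1-positions."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def correlated_pairs(buckets: dict, threshold: float = 0.5):
    """Yield metric pairs whose anomalies repeatedly co-occur."""
    for m1, m2 in combinations(buckets, 2):
        score = jaccard(buckets[m1], buckets[m2])
        if score >= threshold:
            yield m1, m2, score

for m1, m2, score in correlated_pairs(anomaly_buckets):
    print(f"{m1} <-> {m2}: similarity {score:.2f}")
```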
By matching similar metrics and clustering similar anomalies, it becomes that much easier to understand when two anomalies are related, perform root cause analysis and mitigate the problem.

Practical Applications of Anomaly Correlation

What does anomaly correlation look like when it's applied in the real world? Imagine that you're a software developer with a premium mobile application. The premium version of the application costs around $100 – much more expensive than most apps – so the purchase volume is fairly low. That's why it takes around an hour for the anomaly detection system to see a drop in the number of completed purchases from a specific bucket of Android users. Because every purchase is so valuable to the organization, it's important to nail down the root cause as soon as possible.

Since this is a real company, we can reveal that the organization did quickly find the cause, and was able to recoup $200,000 in sales that would have otherwise been lost. The key was that their anomaly detection software noticed two things. First, at the same time the number of sales began to drop, there was a spike in HTTP 400 errors indicating bad requests. In addition, the errors began at the same time as an API update. This allowed the developers to trace the anomaly down to its root – a developer who had made "source" a required field.

Other companies haven't been so lucky. During a massive sell-off in stocks on August 12, 2019, funds owned by the Vanguard Group appeared to take an accelerated nosedive, losing up to 56 percent of their value. As investors reeled, the company reassured them that the losses were a glitch caused by an error at the NYSE. Correlation would have helped triangulate the root cause faster in this instance. Although stocks do go down in the normal course of events, a faster drop than normal would have caused an alert. At the same time, the system would have noticed that information from the NYSE was no longer forthcoming. Analysts could have quickly connected the dots using these alerts and warned their clients before they panicked about their portfolios.

If you have a major issue that needs all hands on deck and a minor issue that can be handled by the front desk, you'll likely devote your resources to one and not the other. When anomaly detection correlates these issues, it creates the opportunity to restore customer satisfaction and marshal your support resources much more effectively.
Documents 1 min read

4 KPIs to Measure Cloud Efficiency

Blog Post 15 min read

Real-Time Analytics for Time Series

What is real-time analytics for time series, and what do you do with it? Let's start with simple definitions. Time series data is largely what it sounds like – a stream of numerical data representing events that happen in sequence. One can analyze this data for any number of use cases, but here we will focus on two: forecasting and anomaly detection.

First, you can use time series data to extrapolate the future. Given the sequence "two, four, six, eight," you can probably predict that the next numbers will be "ten, twelve, fourteen" and so on. That's the heart of time-series forecasting – numbers come in, and the patterns they make inform a forecast that predicts which numbers will come next.

You can use time series data to detect anomalies as well. This isn't very different from forecasting; anomaly detection forecasts the future and then looks for divergences. A simple example would be if you were to receive the sequence of data above. Based on what you know about the sequence, you'd expect it to keep increasing by twos. But what if it started increasing by prime numbers instead? Those primes would represent an anomaly.

When businesses receive accurate forecasts and can reliably detect anomalies, they're able to make more informed decisions and prevent application errors that drain their revenue. When they can do this in real time, the advantages become even sharper.

Time Series Data for Business and Operational Monitoring

Businesses are operating increasingly complicated customer-facing and user-facing application environments. Most of these environments aren't written in-house – they're purchased from vendors and partners or assembled from open-source libraries. In addition, development processes emphasize continuous releases of new code. All of this fosters an environment where breakages are both common and difficult to fix. In an environment that serves millions of customers, even an outage that lasts just a few minutes can have ripple effects that tarnish the reputation of your organization.

It's best to catch outages before they happen. Fortunately, most outages are preceded by warning signs – and it's possible to catch them. Companies can mitigate anomalies before they turn into outages by using real-time analytics for time series data. To do this, they need to understand three things.

Normal Behavior of Time Series Data

How does data look when it's not undergoing an anomaly? If you understand this, then anomalies become easy to flag. At first glance, however, most data doesn't look anything like "normal" – even when it's behaving normally. For example, look at the chart below. It shows several kinds of data patterns, and none of them is presenting an anomaly as pictured. In the normal course of events, the signal from a metric can swing wildly.

Instead of understanding this chart in terms of data points, it's easier to consider the points as shapes with patterns. Each series makes a recognizable shape – a sawtooth, a square wave, a sine wave, and so on. When considering anomalies at scale, you can assign each metric a shape that matches its normal behavior. When the metric deviates from its shape, it's experiencing an anomaly.

We should point out that this is a high-level explanation, and that there's more than one way to understand the normal behavior of data.
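One simple way to operationalize "deviates from its expected behavior" is the forecast-and-compare idea described above: predict the next values from the pattern so far, then flag observations that diverge. Below is a toy sketch of that idea using the "two, four, six, eight" sequence; the linear model and the tolerance band are illustrative stand-ins for a real forecasting model.

```python
# Toy sketch: anomaly detection as "forecast, then measure divergence".
# The sequence, tolerance and step count are illustrative, not a real model.
import numpy as np

history = np.array([2.0, 4.0, 6.0, 8.0])          # observed so far
observed_next = np.array([10.0, 12.0, 17.0])       # new points arriving

def forecast_next(series: np.ndarray, steps: int) -> np.ndarray:
    """Extrapolate a linear trend fitted to the history."""
    x = np.arange(len(series))
    slope, intercept = np.polyfit(x, series, deg=1)
    future_x = np.arange(len(series), len(series) + steps)
    return slope * future_x + intercept

expected = forecast_next(history, steps=len(observed_next))
tolerance = 1.5                                     # arbitrary band width
for exp, obs in zip(expected, observed_next):
    status = "ANOMALY" if abs(obs - exp) > tolerance else "ok"
    print(f"expected {exp:.1f}, observed {obs:.1f} -> {status}")
```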
Both this and any other approach to spotting anomalous behavior are subject to two caveats.

First, data can exhibit seasonality. The best example here is a store on a weekday: it's relatively empty from 9AM to 5PM, because everyone else is at work, but it becomes crowded afterwards. The increase in shoppers doesn't mean that the store has suddenly become wildly successful – it's just a reflection in the data of how individuals behave. Analysts need to detect and compensate for seasonality before they can understand anomalies in a time series.

Second, data patterns can change abruptly on their own. Continuing the example above, consider the difference between weekdays and weekends. All of a sudden, shoppers can come into the store any time they want, because most of them aren't working on Saturday and Sunday. The signal from this metric completely changes for two days out of every week.

To fully understand anomalies within a set of time series metrics, data scientists need to understand what data looks like when it's normal, while filtering out dramatic changes that can look anomalous but aren't. To find these anomalies in real time, however, data scientists also need to understand the behavior of anomalies themselves.

Understanding Anomalous Behavior

Let's look back at our pattern-matching example. When we reduce time-series data to a series of patterns, it becomes easy to see anomalies where they occur. For example, it's easy to see that the pattern below contains several spikes that are well above the normal variance for the metric being recorded.

Not every anomaly is significant – and some anomalies vary in significance based on the metric being measured. Anomalies are categorized by their duration and their deviation. Duration refers to how long the anomaly lasts, and deviation refers to how far the anomaly departs from the normal behavior of the metric. The tallest spike in the figure above has an extremely large deviation. It also has a relatively long duration compared with the other two spikes. The three largest spikes in the time series are probably worth investigating no matter which metric is being measured.

On the other hand, there are several smaller spikes of short duration. Are these concerning? To a certain extent, the answer depends on what you're measuring. Any anomaly that touches a core metric like revenue, for example, is probably worth investigating. For other metrics – website visitors from regions where you don't do a lot of business, for instance – you most likely need to focus only on anomalies that exceed a certain threshold. There's an exception to this, however: a small anomaly in a less important metric might point to a large anomaly in a mission-critical metric.
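As an illustration of how duration and deviation can be scored once seasonality is accounted for, here is a hedged sketch that builds a simple hour-of-week baseline, flags points that sit far from it, and reports each anomalous stretch with its duration and peak deviation. The synthetic data, baseline method and three-sigma cutoff are illustrative assumptions, not a description of any particular product's model.

```python
# Hedged sketch: score anomalies by deviation and duration against a
# seasonal baseline. The baseline is a simple hour-of-week median;
# window sizes and thresholds are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
hours = np.arange(24 * 7 * 4)                   # four weeks of hourly data
seasonal = 100 + 40 * np.sin(2 * np.pi * (hours % 24) / 24)
series = seasonal + rng.normal(0, 5, size=hours.size)
series[500:504] += 120                          # inject a 4-hour anomaly

hour_of_week = hours % (24 * 7)
baseline = np.array([np.median(series[hour_of_week == h])
                     for h in range(24 * 7)])[hour_of_week]
resid = series - baseline
threshold = 3 * np.std(resid)                   # crude deviation cutoff
is_anomalous = np.abs(resid) > threshold

# Group consecutive anomalous hours into events, then report each one.
events, start = [], None
for i, flag in enumerate(is_anomalous):
    if flag and start is None:
        start = i
    elif not flag and start is not None:
        events.append((start, i - 1))
        start = None
if start is not None:
    events.append((start, len(series) - 1))

for s, e in events:
    duration = e - s + 1
    deviation = np.max(np.abs(resid[s:e + 1]))
    print(f"anomaly from hour {s} to {e}: duration={duration}h, peak deviation={deviation:.0f}")
```

A real system would use a far richer model of normal behavior, but the same two numbers – how long and how far – are what decide whether an anomaly deserves attention.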
Correlating Anomalies Across Different Metrics

Outages tend not to occur in a vacuum. A single outage will most likely be related to two or more different anomalies. Understanding how these anomalies occur together makes it much easier to see where the problem originates and how to fix it before it results in unplanned downtime, price glitches or abandoned shopping carts.

It's possible to correlate anomalies based on both normal and abnormal behavior. Metrics that are related will likely be affected at the same time by similar anomalies. To find related metrics, you can look at their normal behavior – if their patterns match at the same times, they're probably related, and an anomaly that affects one will most likely affect the other. On the other hand, metrics with different patterns can also be related. You can find these by looking for apparently unrelated metrics that experience anomalies at the same time.

Simultaneous anomalies are harder to parse, because in an environment with millions of metrics, it's likely that several unrelated metrics are experiencing anomalies at any given moment. The key is to look for repetition – a pair or group of metrics that experience anomalies at the same time, several times in a row, is very likely to be related.

When an anomaly detection system can determine that multiple anomalies in time series data are related to the same issue, the result has a property known as "conciseness": an analyst reading a report needs only a short amount of time to connect the dots and understand what's about to break. For example, a series of related alerts might highlight that:

- There's an abnormal amount of web traffic
- There's an abnormal number of bad requests
- Average latency is increasing

These metrics, viewed at a glance, reveal an ongoing DDoS attack.

Real-time anomaly detection for time series helps companies pivot before outages and other glitches can affect their customers and workers. Unplanned downtime is just one example of a crisis averted. eCommerce companies can use anomaly detection to find and fix pricing errors that cause their customers to overpay (or underpay). Marketers can combine anomaly detection with social listening to spot the latest trend. Engineers can use network sensors to find anomalies in capital equipment and perform proactive maintenance.

Using a different form of analytics, however, one can make decisions not just from moment to moment, but for years into the future.

Time Series Forecasting

Like any company, you are generating a large amount of data every second. Each metric you store data for is valuable, and you can use this warehouse of data to help you make forecasts about the future. Extrapolation is one of the oldest and simplest forms of statistical analysis – but remember that most metrics don't lend themselves to easy extrapolation. Patterns change. Seasonality occurs. Once you understand your data's patterns and seasonality, you'll be able to create a more accurate forecast. You'll know what the ebbs and flows of your business look like over the short term, and you can filter these out to understand your growth over the long term.

Once you do this, you'll find yourself in a position to meet your customers' needs before those needs become urgent. Ridesharing companies can put enough cars on the road to accommodate the demands of their passengers. eCommerce companies can order enough inventory to satisfy their customers. IT administrators can provision enough infrastructure to support increased demand on the corporate website.

If you're starting from zero, the capabilities of real-time analytics for time series data can seem like magic. Fortunately, you can already begin laying the groundwork. Here's how to start implementing real-time analytics in your own workplace.

Implementing a Real-Time Analytics System for Time Series

It all starts with data. Companies collect reams of data, but less than 30 percent of it is ever analyzed. Building real-time analytics for time series means running most or all of this data through your analytics tool as fast as it comes in.

Data I/O

The real problem with data analytics is that most data isn't optimized for analysis. If you want to run analytics on a social listening program, for example, you'll find that tweets and hashtags don't natively convert into time series data. 80 percent of all data is unstructured, and it needs to be converted into a digestible format before analysis.
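To illustrate that conversion step, here is a minimal sketch that turns timestamped events – stand-ins for tweets or any other unstructured records – into a per-minute count series that an analytics tool can ingest. The events and the one-minute bucket size are hypothetical.

```python
# Hedged sketch: turning unstructured, timestamped events (here, fake
# social posts) into a time series by counting events per minute.
from collections import Counter
from datetime import datetime, timedelta

events = [
    ("2024-05-01T12:00:14", "loving the new release #product"),
    ("2024-05-01T12:00:52", "checkout keeps failing #product"),
    ("2024-05-01T12:01:07", "checkout error again #product"),
    ("2024-05-01T12:03:41", "anyone else seeing errors? #product"),
]

def to_minute_bucket(ts: str) -> datetime:
    """Truncate an ISO timestamp to the start of its minute."""
    return datetime.fromisoformat(ts).replace(second=0, microsecond=0)

counts = Counter(to_minute_bucket(ts) for ts, _text in events)

# Emit a dense, zero-filled series so downstream models see every bucket.
start, end = min(counts), max(counts)
bucket = start
while bucket <= end:
    print(bucket.isoformat(), counts.get(bucket, 0))
    bucket += timedelta(minutes=1)
```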
Building an analytics capacity, therefore, requires three foundational steps:

- Data Sourcing: Finding the sources of data you'll use to conduct analytics. If it's your website, you'll be using your CMS. If it's sales data, you'll use your CRM. If it's security data, you may use your SIEM.
- Data Pipeline: Once you've identified your sources, your next step is to transform them into a format that's suitable for analytics. Some of your raw data might be in a format that's immediately digestible by an analytics tool. The rest will have to go through an extract, transform and load (ETL) process. Hopefully, you'll be able to extract application data using built-in API tools; otherwise you'll have to build your own.
- Data Warehouse: Not all analytics tools work in real time. With many, you can expect a certain amount of lag – a delay between receiving the data in an analytics format and being able to process it. The data warehouse is where your analytics data sits until it can be processed.

Once the data is sourced, transformed and waiting in its warehouse, you have two choices: analyze the data using existing manual methods, or use next-generation artificial intelligence.

Manual Analytics

Let's just come right out and say it: manual analytics has few, if any, advantages compared to newer methods involving AI and machine learning. Above all, detailed manual analytics can't be performed in real time. Some manual analytics can be presented in real time, but the resulting data usually isn't detailed enough to be actionable. A lot of manual analytics involves data visualization – essentially throwing a few time series charts onto a collection of monitors. Analysts can see the metrics going up and down in real time, but no analyst can pay attention to as many metrics as they truly need to monitor.

An enterprise can potentially monitor thousands or even millions of metrics. Detecting anomalies means monitoring all of them. Creating an accurate forecast means understanding the ways in which these metrics affect one another. Analysts can do this with manual methods only to a certain extent: it takes a long time to process an entire data warehouse or cluster of metrics as a single chunk, so a process called "slicing and dicing" reduces the data into more manageable portions. For example, instead of producing a forecast for the entire United States, an analyst creates separate forecasts for Boston, Chicago, New York and other large cities. Although this method is faster, it sacrifices accuracy.

AI Anomaly Detection

Earlier, we talked about some of the methods forecasters use to detect anomalies. As it turns out, these methods scale when used in a machine learning context. Remember the example with the spikes in the chart – compiling the patterns your metrics create into a library of archetypal shapes, then raising alerts when those shapes begin to change? This is an ideal use case for artificial intelligence. Machine learning software can both create the library of shapes and match those shapes to various metrics. The advantage of artificial intelligence here is that it can monitor millions of metrics simultaneously and react immediately when a metric begins to change or produce anomalies.
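As a rough illustration of the shape-library idea, here's a hedged sketch that compares a metric's recent window against a few archetypal shapes using normalized correlation. The library, the sample window and the scoring are illustrative assumptions, not a description of how any specific product implements this.

```python
# Hedged sketch of the "library of shapes" idea: match a metric's recent
# window to a few archetypal patterns using normalized correlation.
import numpy as np

N = 64
t = np.linspace(0, 1, N, endpoint=False)
shape_library = {
    "sine":     np.sin(2 * np.pi * 2 * t),
    "sawtooth": (t * 4) % 1.0,
    "square":   np.sign(np.sin(2 * np.pi * 2 * t)),
}

def normalize(x: np.ndarray) -> np.ndarray:
    """Zero-mean, unit-norm so only the shape matters, not scale or offset."""
    x = x - x.mean()
    norm = np.linalg.norm(x)
    return x / norm if norm else x

def best_match(window: np.ndarray, library: dict) -> tuple[str, float]:
    """Return the library shape most correlated with the window."""
    w = normalize(window)
    scores = {name: float(np.dot(w, normalize(shape)))
              for name, shape in library.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]

# A noisy sine-like metric should match the "sine" archetype.
window = 3.0 * np.sin(2 * np.pi * 2 * t) + 0.2 * np.random.default_rng(1).normal(size=N)
name, score = best_match(window, shape_library)
print(f"closest shape: {name} (correlation {score:.2f})")
# A sharp drop in this score over time would suggest the metric has
# stopped behaving like its usual shape, i.e. a candidate anomaly.
```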
There's no slicing and dicing with artificial intelligence – AI gives you the entire picture and automatically highlights the most interesting (or serious) anomalies you need to look at.

For example, think back to June 2019, when Target suffered two massive point-of-sale outages over Father's Day weekend. For over three hours during what was projected to be a $16 billion shopping event, customers in Target stores were unable to complete their purchases – leading to a potential $100 million loss in sales. Big outages like this don't happen on their own; they're usually due to multiple overlapping failures rather than a single cause. Somewhere, a server goes down. A bad patch breaks application dependencies. An overload of activity causes a network to fail. Enough of these things happen together and the whole Jenga tower falls over. Most of this activity is foreshadowed throughout the network and the e-commerce environment – which is where AI-assisted anomaly detection comes into its own. Such a solution can alert on a cluster of related metrics that are all showing anomalies at the same time. Analysts can notice the problem instantly, understand its source and trace it back to a single event – making it that much easier to fix.

AI Forecasting

In terms of forecasting, AI provides similar benefits. Manual forecasting methods cannot scale to incorporate the number of metrics needed to create an accurate forecast. Thousands of seemingly minuscule factors influence future events, and if you don't account for them, your forecast will likely carry wide margins of error. No manual method can account for this scale of influencing metrics.

AI, on the other hand, can produce continuous forecasts. If you need a forecast for what's going to happen next week, tomorrow or an hour from now, you don't need to wait – the forecast is already there. Instead of using input from just a few metrics for the sake of speed, AI can ingest every metric you measure, and it becomes more accurate the more data it has. The biggest advantage of AI is that it can account for thousands of influencing metrics (for example, what sales were like during last year's Cyber Monday) and external measurements (for example, weather or stock market data), along with the full depth of historical data, when determining a forecast – which leads to more accurate results. A human can only account for a handful of factors.

Organizations like the UK's National Grid have begun to incorporate AI forecasting into their clean energy portfolios in order to understand how much energy they'll be able to produce from solar power each day. Here, AI demonstrates its advantage by producing a fast forecast using more variables – one forecast a day, and 80 metrics as opposed to just two. The organization has already increased its forecast accuracy by 33 percent.

Creating real-time analytics for time series may seem like a long journey, but it's one worth taking. Organizations that use these methods are already demonstrating competitive advantages that put them head and shoulders above their peers, and the gap will only grow larger as time goes on. When you start building time-series analytics, you'll be building an opportunity to recapture lost revenue, prevent loyal customers from churning, and meet the needs of your customers before they even know what those needs are. Start building your analytics capability now – request an Anodot demo to get started.
Blog Post 6 min read

Don't Treat Your Business Metrics Like Other Metrics

Why? Because monitoring machines and monitoring business KPIs are completely different tasks.
Documents 1 min read

2021 State of Cloud Cost Report

The survey, conducted in April and May of 2021, opens a window to the ways more than 100 organizations of varying sizes and verticals are progressing in their journey to cloud and cloud cost monitoring.
Blog Post 5 min read

The Top 3 Use Cases for Machine Learning in Analytics and Monitoring

Analytics and monitoring are undergoing widespread change. We offer real-world use cases for how we can truly learn from machine learning today.
Documents 1 min read

Report: The Business Value of AI in Zero-Touch Network Monitoring

This new AI in Zero-Touch Network Monitoring survey report uncovers the motivations, challenges, and business value of investing in AI-based network monitoring and automation.
Blog Post 9 min read

Top Use Cases for Growth Forecasting Using Autonomous Forecast

Forecasting business growth is not easy. In this post we cover several use cases that can help you on the path to achieving long-term growth and success.
Webinars 1 min read

Webinar: The Road to Zero Touch Networks: Why Start Now?

In this webinar, hear how three of the world’s leading telcos, Telefónica, Vodafone, and Deutsche Telekom, with their combined subscriber base of close to 700 million, are tackling the road to zero touch networks.