Anodot Resources Page 48


Blog Post 3 min read

Closing the Loop on Anomalies, Alerts and Dashboards

Team Anodot is always busy working on new features and capabilities for our users. Our most recent version upgrade rolled out yesterday and we've already received great feedback. So what's all the fuss? We just closed the loop between your metrics, anomalies, alerts and dashboards!

Almost every BI and visualization tool provides a dashboard... it's a familiar and logical way to keep track of the metrics you're interested in. Our newest version upgrade takes the dashboard concept to the next level. By showing anomaly alerts directly in your dashboard tiles, we're making it even easier to uncover and access business insights in real time. Not only will you receive traditional email/JSON/webhook alerts on anomalies in the data streams that interest you, you'll now also see these alerts in the context of the relevant dashboards.

Get Started

So how does it work? You've created a dashboard with graphs and meters... now click the settings icon in the upper right corner of a tile to display the options. Clicking "Create Alert" tells the system that you are interested in receiving alerts whenever any of the metrics in the tile are anomalous. Once you've created the alert, a small bell outline icon will appear in the top left corner of the tile (see image below). From now on, if the alert bell is completely black, it means that anomalies occurred within the time frame you're looking at. This is in addition to the regular alert notification you would receive, but may have missed.

Anomalies Can Hide in Plain Sight!

The alert bell will appear even if the anomalies on the dashboard are not obvious to the human eye. In this example, the alert notification icon clearly shows that anomalies occurred in the selected data, but from a quick glance at the dashboard, it is not possible to actually SEE the anomalies.

Drill Down to Investigate Root Cause

To investigate further, you can easily see the full list of alert notifications on the right-hand side. Click each notification to drill down into the Anomap page, where you'll find information about the individual anomalies that were alerted on, along with correlated events for other metrics that may not have been displayed on the dashboard. In the example below, the anomalies that were not obvious in the high-level view are easy to understand when you look more closely at the individual alerts and correlations: an increase in Payment API Failures caused the Revenue metrics to decrease.

For full documentation, visit our Support entry, where you'll find detailed information about creating and editing alerts as well as viewing dashboard tile alert events. Got ideas for new features you'd love to see? Drop us an email at [email protected] and let us know. We'd love to hear from you.
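If you consume the JSON/webhook alerts mentioned above programmatically, handling them can be as simple as the sketch below. The payload field names here ("metric", "significance") are assumptions for illustration, not Anodot's documented alert schema:

```python
import json

# Hypothetical webhook payload handling. The field names "metric" and
# "significance" are illustrative assumptions, not Anodot's actual schema.
def handle_alert(raw_payload: str) -> str:
    alert = json.loads(raw_payload)
    metric = alert.get("metric", "unknown")
    score = float(alert.get("significance", 0))
    if score >= 0.8:
        return f"page on-call: anomaly on {metric} (significance {score})"
    return f"logged: anomaly on {metric} (significance {score})"

example = '{"metric": "payment_api.failures", "significance": 0.92}'
print(handle_alert(example))  # page on-call: anomaly on payment_api.failures ...
```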
Videos & Podcasts 11 min read

Rich Galan of Rubicon Project: The Need for Real-Time Anomaly Detection

Rich Galan of Rubicon Project presents the need for real-time anomaly detection at Innovation Enterprise CTO Conference.
Blog Post 4 min read

Website down? How to Track the Impact on Your Bottom Line

Costs of Unavailability

Availability is one of the key measurements for every company with an online presence. Customer expectations are constantly increasing: today, users expect access to the service at any time, from any device. They expect nothing less than 100% availability. Measuring availability is a difficult but critical task. No matter how difficult it is, I strongly advise that you take the time to define what availability means for your business and start tracking it. The following table will help you understand the effect of different availability service level agreements (SLAs) in terms of potential downtime:

            99.9%          99.95%         99.99%
Daily       1m 26.4s       43.2s          8.6s
Weekly      10m 4.8s       5m 2.4s        1m 0.5s
Monthly     43m 49.7s      21m 54.9s      4m 23.0s
Yearly      8h 45m 57.0s   4h 22m 58.8s   52m 35.7s

Below, I share some of the potential impacts of unavailability. The emphasis you put on each of these factors will depend on the service being offered and your own circumstances.

Lost Revenue

If you conduct business over the internet, every minute of downtime translates directly into lost revenue. There are different ways to calculate lost revenue:

- Determine how much revenue you make per hour, and use this as the cost to the enterprise of unavailability per hour or minute. For example, in this article, Google's cost of downtime was calculated at $108,000 per minute based on its Q2 2013 revenue of $14.1 billion. In another article, Facebook's downtime cost was calculated at $22,453 per minute. This is the simple method, but it is not very accurate, since revenue varies by time of day, day of week, etc.
- Consider seasonality and recovered revenue, comparing actual behavior week over week against the expected behavior from the previous week. This is a more accurate method. In the following example, we see a significant drop in transaction volume for about 10 minutes. Let's assume that revenue dropped by $110,000, and once the service was restored, users retried and completed their transactions, resulting in an increase of $80,000. We can then calculate the real impact as recovered revenue minus lost revenue: $80,000 - $110,000 = -$30,000 for those 10 minutes of downtime.

Contractual Penalties

Some organizations face financial penalties in the event of downtime. If your partners rely on your service being available, there is probably an SLA in place to guarantee certain availability. If this is not met, the provider must compensate the partner.

Negative Brand Impact

Almost every online service, and certainly every mature service, has competition: Uber vs. Lyft, Airbnb vs. VRBO, hotels.com vs. booking.com, and so on. If one service is not available, it is very easy for customers to switch to the competition. Customers today expect the service to be available all the time.

In a previous post, we discussed the different elements of an incident life cycle. Major incidents are detected very easily, even with very basic monitoring in place. The real challenge is getting to the root cause of the issue and fixing it quickly. Even if you have the right set of signals across the entire technology stack, including infrastructure, application and business metrics, the data most likely resides in silos. Because of this, the person who triages the issue doesn't have complete visibility, so different teams must investigate the root cause simultaneously. Adopting machine learning-based anomaly detection enables the processing of all relevant metrics in a single system.
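Returning for a moment to the availability and lost-revenue discussion above: both calculations are easy to automate. Here is a minimal sketch (my own illustration, not Anodot code) that turns an SLA percentage into a downtime budget and computes the real impact as recovered revenue minus lost revenue:

```python
# Minimal sketch (not Anodot code) of the two calculations described above.
PERIOD_SECONDS = {
    "daily": 86_400,
    "weekly": 7 * 86_400,
    "monthly": 365.25 * 86_400 / 12,  # average month; reproduces the table above to within rounding
    "yearly": 365.25 * 86_400,
}

def allowed_downtime_seconds(sla_percent: float, period: str) -> float:
    """Downtime budget implied by an SLA, e.g. 99.9% daily -> ~86.4 seconds (1m 26.4s)."""
    return PERIOD_SECONDS[period] * (1 - sla_percent / 100.0)

def net_revenue_impact(lost: float, recovered: float) -> float:
    """Real impact of an outage: recovered revenue minus lost revenue."""
    return recovered - lost

print(allowed_downtime_seconds(99.95, "weekly"))           # ~302.4s -> 5m 2.4s
print(net_revenue_impact(lost=110_000, recovered=80_000))  # -30000
```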
With all relevant metrics processed in a single system, if an anomaly is detected in one of them, it is easier to correlate it with all the other metrics and uncover the root cause much faster. In fact, a good anomaly detection system not only detects the issue faster and more accurately than traditional threshold-based alerts, it also correlates across all relevant metrics and provides visibility into all other related anomalies.

Let's look at an example of a drop in volume of a specific product in a specific country. In this case, the system sends an alert that an anomaly was detected on conversion rates in that country and provides visibility into signals that may have caused the issue, such as:

- Events that happened at the same time, like a code push
- Another anomaly that occurred at the same time on the DB metrics
- Network metrics that might indicate a DDoS attack

The idea is that along with an anomaly alert, we also receive other correlated events and anomalies that help us get to the root cause much faster, shortening the time it takes to triage the issue and thus reducing the impact on the business.
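To make that correlation step concrete, here is a toy sketch (illustrative only, not Anodot's implementation) that joins an alerted anomaly with events and other anomalies that started within a short time window; the metric and event names are made up:

```python
from datetime import datetime, timedelta

# Illustrative only: correlate an alerted anomaly with events and other
# anomalies that started within +/- 15 minutes of it, to speed up triage.
WINDOW = timedelta(minutes=15)

def correlated(anomaly_start: datetime, candidates: list) -> list:
    """Return candidates whose start time falls within WINDOW of the anomaly."""
    return [c for c in candidates if abs(c["start"] - anomaly_start) <= WINDOW]

alert_start = datetime(2017, 3, 1, 12, 10)
candidates = [
    {"name": "code push",             "start": datetime(2017, 3, 1, 12, 5)},
    {"name": "DB latency anomaly",    "start": datetime(2017, 3, 1, 12, 8)},
    {"name": "network traffic spike", "start": datetime(2017, 3, 1, 9, 0)},
]
print(correlated(alert_start, candidates))  # code push and DB latency anomaly
```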
Videos & Podcasts 30 min read

Disrupt the Static Nature of BI with Predictive Anomaly Detection

Anodot's Uri Moaz discusses how predictive anomaly detection can identify revenue-impacting business incidents in minutes (!), not days or weeks.
Documents 1 min read

PART 3: Ultimate Guide to Anomaly Detection - Correlating Abnormal Behavior

Part 3 of our Ultimate Guide to Anomaly Detection explains the process of identifying, ranking and correlating abnormal behavior in time series. Read on to find out.
Blog Post 2 min read

The More Things Change...The More You Need Anomaly Detection

In my last post, I talked about the importance of considering timeliness and scale in the design of an anomaly detection system. In this post, I will discuss the impact of a company's rate of change on detecting anomalies.

Online Business: A Constantly Changing Ecosystem

A slow rate of change, as pictured below, is normally seen in closed systems, where outside events have no effect and any changes take place slowly. In this example, we see that over the course of a week, the metrics remain relatively stable. This is typical, for example, of automated manufacturing processes. For slow-changing processes, a system can be trained on a year's worth of data to learn its normal behavior, and the resulting model may not have to be updated for a long time.

In contrast, online businesses are constantly making changes to improve and increase revenue and keep up with the demands of their audience. Whether they release new products or new versions of applications, their environment changes rapidly. In the example below, a set of metrics displays a sudden, rapid and drastic change, which is typical after new releases.

Because of this, dynamic online businesses need an anomaly detection system that adapts to changes efficiently and effectively. To achieve this, it is vital that the system have adaptive algorithms, enabling it to keep collecting data and adapting what it considers normal to the needs of the business. In other words, the more things change... the more you need an adaptive anomaly detection solution.

For more information on how anomaly detection is pertinent to your business, and how these systems are designed, see our white paper: Building a Large Scale, Machine Learning-Based Anomaly Detection System, Part 1: Design Principles.
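To make "adaptive" concrete, here is a toy sketch (my own illustration, not Anodot's algorithm) of a baseline that keeps updating as data arrives, so a persistent shift, like the one after a release, gradually becomes the new normal:

```python
# Toy adaptive baseline: exponentially weighted mean/variance that keeps
# updating, so "normal" follows a fast-changing metric instead of staying
# frozen at training time. Illustrative only.
class AdaptiveBaseline:
    def __init__(self, alpha: float = 0.05, threshold: float = 3.0):
        self.alpha = alpha          # higher alpha = faster adaptation
        self.threshold = threshold  # flag if |x - mean| > threshold * std
        self.mean = None
        self.var = 0.0

    def update(self, x: float) -> bool:
        """Ingest one sample; return True if it deviates from the current baseline."""
        if self.mean is None:
            self.mean = x
            return False
        deviation = x - self.mean
        anomalous = self.var > 0 and abs(deviation) > self.threshold * self.var ** 0.5
        # Adapt regardless, so a persistent shift (e.g. after a release) becomes normal.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return anomalous

baseline = AdaptiveBaseline()
for value in [100, 102, 99, 101, 100, 180, 181, 179, 182]:
    # The jump to ~180 is flagged at first, then gradually absorbed as the new normal.
    print(value, baseline.update(value))
```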
Blog Post 5 min read

Who owns the “I” in BI?

[Image: © Kate07lyn / Wikimedia Commons / CC-BY-SA-3.0 / GFDL]

It is well known that when it comes to gaining insights from your BI system, the more granular the data you have, the more accurate the insights you will gain. While most existing BI solutions can process and store a huge amount of data with many dimensions, they don't offer an easy way to get insights from that data. In fact, BI solutions have left the "I" - the intelligence - completely in the hands and minds of the data analysts.

The human brain is limited to processing no more than a few dozen signals, which is why you typically find organizations looking at the big picture and possibly missing issues that impact a specific segment or product, because the root cause gets lost in the average. You've probably seen dashboards like this one, with multiple KPIs showing sales figures, customer satisfaction score, churn, etc. These kinds of dashboards (often called executive dashboards) are designed to provide high-level visibility into different business metrics for senior management. Often, one of the KPIs will show a negative value and the data analysts will be tasked with providing explanations. It is up to the investigation capabilities of the data analyst to ask the right questions and get to the right answers that explain the shift in the KPI. This task is very time consuming and frequently feels like finding a needle in a haystack.

[Image: Missing Hugh II by Little Miss no Name]

I was always a big fan of detective stories, so it's not surprising that my favorite characters are Sherlock Holmes and Doctor Gregory House. For those of you who haven't seen the show, a typical episode of "House" begins with someone getting really sick and rushed to the hospital, where he or she is referred to the diagnostic department headed by Dr. House. This is when the fun begins... House and his team review the different symptoms, make a call on the most probable cause (it's not Lupus!) and start treatment. When the symptoms worsen, they look for more clues by searching the person's home for toxic substances, finding out the family history, running more tests and trying a new treatment, which usually fails. Eventually something unrelated triggers House to correlate the different signals with the huge amount of data he stores in his brain, a light bulb goes on, and he finds the real root cause and is able to order the proper treatment that saves the person's life. House is probably the best example of human correlation and anomaly detection, which makes him an anomaly himself: an unconventional and misanthropic genius. Imagine how the world of medicine could benefit from automating House's diagnostic capabilities... well, with machine learning and automated anomaly detection, this day isn't really so far off anymore.

[Image: PayPal's command center in San Jose, California, by Kristen Fortier]

Let's take a real example from PayPal's command center, which relies heavily on multiple dashboards powered by its homegrown monitoring solution, "Sherlock", to detect major site incidents early. The Technical Duty Officers (TDOs, a.k.a. the "diagnosticians") constantly scan a large wall with seven big HD screens and dozens of signals to identify abnormal behavior.
The PayPal monitoring system collects more than 350 thousand signals per second; the high-value signals are displayed on the wall, while the rest go to a simple alert mechanism. You can read more about this here. In the case of a significant drop in volume or spike in error rate, the TDO on duty will try to figure out whether this is a real issue by correlating the different signals displayed on the screens. In fact, they are applying human correlation to determine whether the different signals are indicative of a real issue or not. This is a time-consuming process that depends on the capabilities of the person on duty, and occasionally, when the signals are not that clear or strong, it can lead to a miss. This is an example of a BI solution based on visualizations that require humans to supply the intelligence and make a decision based on multiple signals.

With the explosion in the complexity of the environment, as organizations move into the cloud (public and private) and design software as microservices deployed in containers, we are experiencing a surge in the number of different signals that are collected and need analysis to yield insights. We are seeing more and more organizations face the challenge of getting real-time business insights from the enormous amount of data they collect. Data analysts simply can't keep up with the increasing demand to crunch all the data to find ways for the business to improve its KPIs. This reality is driving the new paradigm of "monitoring by exception", which surfaces the most relevant anomalies so the business analysts can investigate them further. The human brain is limited in the number of data points it can process and correlate, and this is exactly where the next wave of BI solutions comes in. With highly scalable machine learning-based algorithms, we now have software that can learn the normal pattern of any number of data points and correlate different signals to accurately identify anomalies that require action or investigation.
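To illustrate what "monitoring by exception" might look like in code, here is a rough sketch (my own, not PayPal's "Sherlock" or Anodot's product) that scores each metric's deviation from its baseline and surfaces only the most significant ones; the metric names and numbers are made up:

```python
# Rough sketch of monitoring by exception: instead of asking humans to watch
# every signal, rank metrics by how far they deviate from their baseline and
# surface only the top offenders. Illustrative only.
def surface_exceptions(metrics: dict, top_n: int = 5) -> list:
    """metrics maps name -> {"value", "mean", "std"}; returns top_n by |z-score|."""
    scored = []
    for name, m in metrics.items():
        z = abs(m["value"] - m["mean"]) / (m["std"] or 1e-9)
        scored.append((name, round(z, 1)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

signals = {
    "payments.volume":   {"value": 310.0, "mean": 500.0, "std": 20.0},   # big drop
    "api.error_rate":    {"value": 0.09,  "mean": 0.01,  "std": 0.005},  # big spike
    "logins.per_minute": {"value": 1010,  "mean": 1000,  "std": 50},     # normal
}
print(surface_exceptions(signals, top_n=2))  # error-rate spike and payment drop first
```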
Documents 1 min read

Ultimate Guide to Building a Machine Learning Anomaly Detection System, Part 2 - Learning Normal Time Series Behavior

Part 2 of our Ultimate Guide to Anomaly Detection presents a general framework for learning normal behavior for time series. Can a seasonal pattern be assumed? What is the importance of modeling seasonality? Why does real-time detection at scale require online adaptive learning algorithms? Read on to find out.
Blog Post 4 min read

Two Secrets of Swift and Scalable Anomaly Detection

Streamlining the way your digital business works – ensuring customers get what they need, conversions occur without a glitch, and so on – is the goal of any company with an online presence, but it can be challenging with so many moving parts. As we discussed in a previous post, detecting anomalies is one of the best ways to make sure that everything is running smoothly and to keep up with current trends.

In digital businesses, many processes happen simultaneously, and each activity may be monitored by a different person or team. Changes in one department, or even at an external partner, can show up as an unexpected change in a totally different area, but the association might never be made if the metrics are not analyzed holistically. The solution? An anomaly detection system that can understand all these different types of metrics, identify normal behavior and alert when something has changed.

When designing an anomaly detection system, certain design principles are essential to its success. This post gives an overview of two of those secrets to success: timeliness and scale. In future posts, we'll take a look at the other three key principles: Rate of Change, Conciseness and Definition of Incidents.

How quickly do you need your anomalies detected?

In anomaly detection, there are two types of decision making. First, detection can be done in non-real time, meaning that the results are seen by the user retroactively. In this case, the anomalies are used for a retrospective analysis of what happened, which helps in making decisions about the future. The other option is real-time detection, where you see the results for your metrics as they happen.

When would you want non-real-time decision making? This model is useful for long-term planning, where the data is not relevant to the immediate situation of the company and is not needed for immediate action. Examples include reviewing data from marketing campaigns to plan future strategy, scheduled maintenance, budget planning, and so on. In this situation, data is collected over a period of time, and when that period finishes, a batch machine learning algorithm can be used to find out what anomalies occurred during that time. Viewing these results in non-real time lets your business see the outcome of a longer course of action and make non-urgent decisions about future action.

However, most online businesses are in dire need of real-time decision making. For example, sudden spikes or dips in purchases could present opportunities for action that would generate more sales. Knowing exactly what is going on with your digital business at the moment it is happening enables you to take advantage of real-time trends in furtherance of your business goals. Online machine learning algorithms are the best way to process data in real time. Using these algorithms also helps with our next point.

Scaling for Growth

Online machine learning algorithms are easily scalable, making them ideal for large data sets. They are not without their faults, however: they tend to be more prone to false positives. Still, if your company is continuously growing, scalability is a valid concern, so online machine learning algorithms remain the best option for businesses with many metrics and large data sets.
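To illustrate the difference, here is a small sketch (my own, not Anodot's algorithms): a batch detector that analyzes a completed period after the fact, and an online detector that decides on each point as it arrives using running statistics. The online version scales to streams and answers immediately, but with little history it is more prone to false positives:

```python
import statistics

def batch_anomalies(series: list, k: float = 2.5) -> list:
    """Non-real-time: flag points of a completed period retrospectively."""
    mean = statistics.mean(series)
    std = statistics.pstdev(series) or 1e-9
    return [i for i, x in enumerate(series) if abs(x - mean) > k * std]

def online_anomalies(stream, k: float = 2.5, warmup: int = 5):
    """Real-time: flag each point as it arrives (Welford's running mean/variance)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        if n >= warmup:  # with too few points the running std is unreliable (false positives)
            std = (m2 / n) ** 0.5 or 1e-9
            if abs(x - mean) > k * std:
                yield x  # the decision is available immediately
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)

data = [10, 11, 9, 10, 12, 10, 45, 11, 10]
print(batch_anomalies(data))         # [6] -> index of the spike (45), known only after the period ends
print(list(online_anomalies(data)))  # [45] -> flagged the moment it arrived
```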
There are ways to reduce false positives, which we discuss in our white paper, "Building a Large Scale, Machine Learning-Based Anomaly Detection System, Part 1: Design Principles."

Conclusion

Online machine learning algorithms are a viable solution to the needs of businesses in the digital age. As we've explained, real-time decision making and the ability to scale are two of the secrets of building a successful online machine learning anomaly detection system. For more information about the design principles of an anomaly detection system, read the full white paper: Building a Large Scale, Machine Learning-Based Anomaly Detection System, Part 1: Design Principles.